Situation:
The following strange problem has occurred on a single file server running OmniOS r151018 (95eaa7e) serving files over SMB to Windows and OS X guests.
Saving certain files (.docx, .xlsx, some images) through the "Save as..." dialog window on an SMB share results in a lag of about 3 to 5 seconds, during which the application does not respond at all; afterwards the file is saved normally.
The problem appeared "overnight", without any changes to the server, but it is difficult to pinpoint the exact date, as user complaints only came in some time after the first occurrence. After a reboot of the server, one vdev of the mirrored root pool was unavailable, but closer inspection did not find any faults on the device and it was reattached to the pool. The problem still persists.
Some observations:
- It happens on all Windows 7 clients
- It happens for all file sizes
- It happens on all shares of this machine, regardless of permissions
- It happens for faster storage imported on the host over iSCSI from another server
- Normal copy speed is 110 MB/sec over GBit Ethernet
- Data and root pool seem to be fine
- It does not happen on other file servers
- It does not happen when the file is saved locally, then copied over through explorer
- It does not happen on OS X (could only test it with OpenOffice)
- dmesg shows several occurrences of "NOTICE: bge0: interrupt: flags 0x0 - not updated?" with various values, but this was also the case before and did no harm
Further ideas/plans:
As there is no clear error message to be found, I might need to do some trial-and-error searching for the cause. Some things I will consider (results are marked with =>):
- Replace the Broadcom network card with an Intel card => did not make a difference
- Replace the root pool with SATA SSDs (currently SLC memory USB sticks which worked fine for over 3 years) => did not make a difference
- Check the network in between (hardware, by connecting directly to the server)
- Traffic capture with WireShark: difficult if you don't know what you are looking for exactly
- Revert to a previous OmniOS boot environment/version to rule out software conflicts => did not make a difference
- Roll back Windows/Office updates to rule out bugs
- Remove files with ":" (colon) in filenames from snapshots, as suggested by txgsync on the reddit thread created by ewwhite => did not make a difference
  "I've seen something similar to this when the Windows "previous versions" feature is enabled with automatic snapshots that include a ":" character. Just shooting at the wind with this, but may be worth a look as the ":" character is not allowed in Windows file names."
- Monitoring of file access: as suggested by shodanshok, I used DTrace and this script to monitor file access. I ran it while saving the already open file and removed unrelated output and personal information. The result centers around three files:

    CPU     ID  FUNCTION:NAME
      1  18753  fop_open:entry     Open: Workbook
      0  18181  fop_create:return  Create: temp_1
      0  18753  fop_open:entry     Open: temp_1
      0  18753  fop_open:entry     Open: Workbook
      0  18753  fop_open:entry     Open: Workbook
      0  18753  fop_open:entry     Open: temp_1
      0  18888  fop_rename:entry   Rename: Workbook -> temp_2
      0  18888  fop_rename:entry   Rename: temp_1 -> Workbook
      0  18753  fop_open:entry     Open: Workbook
      0  18753  fop_open:entry     Open: temp_2
      0  18892  fop_remove:entry   Remove: temp_2
      0  18753  fop_open:entry     Open: Workbook
      0  18753  fop_open:entry     Open: Workbook

  The same procedure on another server where the problem does not occur yields a similar result:

    CPU     ID  FUNCTION:NAME
      1  25182  fop_create:return  Create: temp_1
      1  25750  fop_open:entry     Open: temp_1
      1  25750  fop_open:entry     Open: Workbook
      1  25750  fop_open:entry     Open: temp_1
      1  25750  fop_open:entry     Open: Workbook
      1  25750  fop_open:entry     Open: temp_1
      1  25889  fop_rename:entry   Rename: Workbook -> temp_2
      1  25889  fop_rename:entry   Rename: temp_1 -> Workbook
      1  25750  fop_open:entry     Open: Workbook
      1  25750  fop_open:entry     Open: temp_2
      1  25893  fop_remove:entry   Remove: temp_2
      1  25750  fop_open:entry     Open: Workbook
      1  25750  fop_open:entry     Open: Workbook
      1  25750  fop_open:entry     Open: Workbook

  I also added timestamps (walltimestamp) to the script, but in both cases all file operations take place within the same second. => did not make a difference
- Import disks on another host to check if pool fragmentation or disks are faulty => did not make a difference
- Move data and root pool over to an identical machine to rule out cabling, mainboard etc. => the problem persists, so the cause must be either the root pool (software) or specific hardware that is incompatible with the software (or suddenly became incompatible...)
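The file-level sequence visible in the DTrace output above (create temp_1, rename Workbook to temp_2, rename temp_1 to Workbook, remove temp_2) can be replayed from a client with sub-second timing, which might show which step stalls where second-resolution timestamps cannot. The following is only a minimal sketch: the function name `replay_save`, the file names and the payload size are made up for illustration, and it imitates only the order of operations, not the exact SMB requests that Office sends, so it may not trigger the lag at all.

```python
import os
import tempfile
import time


def replay_save(directory, name="Workbook.xlsx", payload=b"x" * 65536):
    """Replay the create/rename/rename/remove pattern from the trace.

    Returns a list of (step label, elapsed seconds) tuples.
    All names here are hypothetical; Office uses its own temp file names.
    """
    target = os.path.join(directory, name)
    temp_1 = os.path.join(directory, "temp_1.tmp")
    temp_2 = os.path.join(directory, "temp_2.tmp")
    timings = []

    def write_file(path, data):
        # Write and flush to the share so the timing covers the real I/O.
        with open(path, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())

    def step(label, func, *args):
        # Time a single operation with a monotonic high-resolution clock.
        start = time.perf_counter()
        func(*args)
        timings.append((label, time.perf_counter() - start))

    if not os.path.exists(target):
        write_file(target, payload)  # make sure an "existing" workbook is there

    step("create temp_1", write_file, temp_1, payload)
    step("rename Workbook -> temp_2", os.replace, target, temp_2)
    step("rename temp_1 -> Workbook", os.replace, temp_1, target)
    step("remove temp_2", os.remove, temp_2)
    return timings


if __name__ == "__main__":
    # Demo against a local temporary directory; point replay_save() at a
    # test folder on the affected share to measure the server instead.
    with tempfile.TemporaryDirectory() as demo_dir:
        for label, seconds in replay_save(demo_dir):
            print(f"{label:<28} {seconds * 1000:8.2f} ms")
```

Running it once against a folder on the affected share and once against a share on a healthy server gives per-step numbers that can be compared directly. If no step takes noticeably longer, the delay is probably in something the script does not imitate.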
Could you suggest anything else that could be the cause of this behavior? Or did you experience something similar? Because I could not find anything helpful online, I suspect it is either a strange hardware problem (because it is limited to one machine) or a problem with Windows/Office.
Solution:
The problem only affects OmniOS r151018, not previous versions. This thread on the omnios-discuss mailing list was exactly about my problem, quote from Geoff:
So, biteCount++; I guess. The problem was solved by applying the fix and doing a fast reboot.
Lessons for the future: before attempting any troubleshooting, just use the advanced search on the official mailing lists, because most likely your problem already occurred on someone else's machine. Also, spin up a quick VM to rule out any software, updates or configuration errors before looking for hardware errors.
How I got there:
After several different tests, as seen in the updated question, I narrowed it down to either software problems or hardware/driver conflicts on the specific hardware. To rule out the latter, I installed two fresh OmniOS virtual machines, r151018 and r151016, on another host and manually configured a basic SMB share on each of them.
The r151018 VM exhibited the problem, while r151016 worked fine. I suspect I did not notice this in my very first tests because I only rolled back some updates on r151018 and did not go back to an earlier release. I think the problem must have existed longer than I assumed.
When looking for a way to update packages one by one, I looked at the mailing list and searched for "smb" in the last 6 months, where the same problem and the correct solution popped up, dating back to May.