We've been having problems with our Samba fileserver running on Debian Etch with Linux kernel 2.6.16. It's an old Dell PowerEdge 2650 server, but it's never had a problem like this before, and the problem started this morning, without any configuration or other changes being made.
The problem manifests in many ways, but all of them could be explained by the open() system call being very slow to complete. Here is an strace of "cat logon.bat", where the file is on a local ext3 filesystem:
$ sudo strace -p 3548 -tt
Process 3548 attached - interrupt to quit
11:20:40.563088 open("logon.bat", O_RDONLY|O_LARGEFILE) = 3
11:21:00.070660 fstat64(3, {st_mode=S_IFREG|0664, st_size=44, ...}) = 0
11:21:00.070923 read(3, "cscript \\\\staff\\netlogon\\logon.v"..., 4096) = 44
11:21:00.085676 write(1, "cscript \\\\staff\\netlogon\\logon.v"..., 44) = 44
11:21:00.085906 read(3, "", 4096) = 0
11:21:00.086053 close(3) = 0
11:21:00.086222 close(1) = 0
11:21:00.086382 exit_group(0) = ?
Process 3548 detached
The timestamps show that the open() call took about 20 seconds. (It actually took much longer, as the strace was started some time after the command was run.) Immediately subsequent runs of the same command don't show the slow open() call, but some time later it is slow again.
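For anyone who wants to check the arithmetic, here is a rough sketch that computes the gap between consecutive timestamps in `strace -tt` output. The function name `syscall_gaps` is my own invention, and the gap to the next call is only an approximation of how long a call took, since it also includes any time spent in userspace (strace's `-T` option reports the time spent in each call directly). It also assumes the trace does not cross midnight.

```python
from datetime import datetime

def syscall_gaps(lines):
    """Return (seconds_until_next_call, syscall_text) pairs from `strace -tt` lines."""
    stamped = []
    for line in lines:
        parts = line.strip().split(" ", 1)
        if len(parts) != 2:
            continue
        try:
            ts = datetime.strptime(parts[0], "%H:%M:%S.%f")
        except ValueError:
            continue  # not a timestamped line, e.g. "Process ... attached"
        stamped.append((ts, parts[1]))
    # Attribute the gap before the next timestamp to the earlier call.
    return [((t1 - t0).total_seconds(), call)
            for (t0, call), (t1, _) in zip(stamped, stamped[1:])]

# First lines of the trace above (raw string so the backslashes stay as-is).
trace = r"""
Process 3548 attached - interrupt to quit
11:20:40.563088 open("logon.bat", O_RDONLY|O_LARGEFILE) = 3
11:21:00.070660 fstat64(3, {st_mode=S_IFREG|0664, st_size=44, ...}) = 0
11:21:00.070923 read(3, "cscript \\\\staff\\netlogon\\logon.v"..., 4096) = 44
"""

# Print the slowest calls first.
for gap, call in sorted(syscall_gaps(trace.splitlines()), reverse=True):
    print(f"{gap:10.6f}s  {call}")
```

On the lines above, the open() call accounts for roughly 19.5 seconds, while the gap after fstat64() is well under a millisecond.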
The server has been restarted, and the problem continues. There's nothing being reported in kern.log, and the hardware isn't reporting any faults.
The server is still partially functioning, so we're not taking it down immediately. Outside of work hours, we'll be able to run more tests, including a forced fsck on the filesystem in question.
But we don't really have a good idea of what the problem might be, so we're looking for any theories of what might be wrong, as well as ideas of what tests to run to further diagnose the problem. Any suggestions?
Update
I should have pointed out that this particular filesystem is on an Apple Xserve RAID device (connected via Fibre Channel). The RAID Admin tool is giving a green status light for all the drives, as well as for the array as a whole, and there aren't any events in the log that indicate any kind of problem.
Is this running on one of Dell's RAID controllers? (It looks like it would probably be a PERC/4 of some kind.) If so, the megaraid kernel driver doesn't seem to react to or report drive problems at all; you need to install Dell's OpenManage tools to see what is going on at the hardware level. This thread suggests once you install it you'd use commands like
Here is Dell's documentation on omreport.
The newer Megaraid SAS controllers (PERC/5) can use MegaCLI alone to manage them.
Holy hard disks, Batman! That's SLOW!
This really does look like a low-level hardware problem with the hard disks. I expect that if you connect a different drive (USB, CD-ROM, local SATA/IDE) you won't see these problems. If you've not already tried that, I recommend you do so.
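To make that comparison a bit more systematic, something like the following could be used to time open() on a file on each disk. This is only a sketch: the names `time_open` and `probe` are made up for illustration, and the paths are whatever you pass on the command line. Since the question says that immediately repeated runs are fast, a non-zero interval between samples is probably needed to catch the slow case.

```python
import os
import sys
import time

def time_open(path):
    """Return the seconds taken by a single read-only open() of path."""
    start = time.monotonic()
    fd = os.open(path, os.O_RDONLY)
    elapsed = time.monotonic() - start
    os.close(fd)
    return elapsed

def probe(path, samples=5, interval=0.0):
    """Time open() on path `samples` times, sleeping `interval` seconds between tries."""
    results = []
    for i in range(samples):
        results.append(time_open(path))
        if interval and i < samples - 1:
            time.sleep(interval)
    return results

if __name__ == "__main__":
    # Usage (paths are examples only): python probe.py /path/on/raid/file /path/on/other/disk/file
    for path in sys.argv[1:]:
        timings = probe(path, samples=5, interval=60.0)
        print(path, " ".join(f"{t:.6f}" for t in timings))
```

If the file on the Xserve RAID volume shows multi-second opens while a file on another disk never does, that points at the storage path rather than the OS.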
If you still see problems with different disks, then a reinstall of the OS might be worth trying (or just boot from a Knoppix image or similar to test). It might also be helpful to see the mount options and the output of 'free'.
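If it helps, the mount options can be pulled out of /proc/mounts with a few lines of Python. This is a minimal sketch; `mount_options` is a hypothetical helper, and the device and mount point in the sample text are invented purely for illustration. It does not handle mount points containing spaces, which the kernel escapes in that file.

```python
import os

def mount_options(mounts_text, mount_point):
    """Return (fstype, [options]) for mount_point from /proc/mounts-style text, or None."""
    for line in mounts_text.splitlines():
        # Fields: device, mount point, filesystem type, options, dump, pass
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[2], fields[3].split(",")
    return None

# Invented sample line, NOT taken from the server in question.
sample = "/dev/sdb1 /srv/share ext3 rw,noatime,data=ordered 0 0\n"
print(mount_options(sample, "/srv/share"))

# On a real Linux system, read the live table instead.
if os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        print(mount_options(f.read(), "/"))
```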