I have a RHEL5 workstation that has recently started to "hiccup". About every thirty seconds, it apparently stops execution completely for about 4 seconds. Seemingly nothing runs during that period. Long-running processes seem to catch up on their input, but new processes simply don't get started.
Concrete examples:
I have this loop running in a shell:
while date; do sleep 0.2; done
Output merely skips over the missing seconds:
Fri Aug 13 15:20:29 EDT 2010 Fri Aug 13 15:20:29 EDT 2010 Fri Aug 13 15:20:29 EDT 2010 Fri Aug 13 15:20:30 EDT 2010 Fri Aug 13 15:20:30 EDT 2010 Fri Aug 13 15:20:30 EDT 2010 Fri Aug 13 15:20:30 EDT 2010 Fri Aug 13 15:20:34 EDT 2010 Fri Aug 13 15:20:34 EDT 2010 Fri Aug 13 15:20:35 EDT 2010 Fri Aug 13 15:20:35 EDT 2010 Fri Aug 13 15:20:35 EDT 2010
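To put a number on each stall rather than eyeballing the date output, the same loop can print epoch timestamps and feed them to a small filter. This is only a sketch: find_gaps is a made-up helper name, and the %N format (nanoseconds) assumes GNU date, which is what RHEL ships.

```shell
# find_gaps LIMIT: read one epoch timestamp per line on stdin (fractional
# seconds allowed) and report every pair of consecutive timestamps that
# are more than LIMIT seconds apart.
# Hypothetical helper for illustration; not part of any package.
find_gaps() {
    awk -v limit="$1" '
        NR > 1 && ($1 - prev) > limit {
            printf "gap of %.1f s between %s and %s\n", $1 - prev, prev, $1
        }
        { prev = $1 }
    '
}

# Intended use (assumes GNU date for %N), left commented out because it
# runs until interrupted:
#   while date +%s.%N; do sleep 0.2; done | find_gaps 1
```

With a limit of 1 second, the normal 0.2 second spacing stays silent and only the stalls are printed, which should make it easier to check whether the 30/4 rhythm really is constant.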
If typing in a terminal, either local console or remote via ssh or telnet, echoback pauses during the unresponsive time, but catches back up when it starts responding again, with apparently no loss of input, just lag.
pings go unanswered during the unresponsive time, but are answered when it comes back:
64 bytes from xxx: icmp_seq=1911 ttl=64 time=0.203 ms
64 bytes from xxx: icmp_seq=1912 ttl=64 time=0.199 ms
64 bytes from xxx: icmp_seq=1913 ttl=64 time=3202 ms
64 bytes from xxx: icmp_seq=1914 ttl=64 time=2196 ms
64 bytes from xxx: icmp_seq=1915 ttl=64 time=1197 ms
64 bytes from xxx: icmp_seq=1916 ttl=64 time=195 ms
64 bytes from xxx: icmp_seq=1917 ttl=64 time=0.201 ms
64 bytes from xxx: icmp_seq=1918 ttl=64 time=0.206 ms
This would seem to imply that it is actually receiving input during the unresponsive period, as those ICMP packets are not being retransmitted.
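The round-trip times back this up: they shrink by roughly one second per packet, which is what you would see if the requests queued up and were all answered at the same instant. A rough way to check, assuming ping was sending one request per second (its default interval); reply_times is a made-up helper name:

```shell
# reply_times: read ping output on stdin and print, for each reply, the
# approximate moment it was answered, in seconds relative to sequence 0.
# Assumption: requests go out exactly 1 s apart (ping's default interval),
# so request N is sent at t=N and answered at t = N + rtt.
reply_times() {
    sed -n 's/.*icmp_seq=\([0-9]*\).*time=\([0-9.]*\) ms.*/\1 \2/p' |
    awk '{ printf "seq %d answered at about t=%.3f s\n", $1, $1 + $2 / 1000 }'
}
```

Feeding it the four delayed replies above (1913 through 1916) gives answer times of about 1916.202, 1916.196, 1916.197 and 1916.195, so all four were answered within a few milliseconds of each other, right when the machine woke up.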
vmstat 1 output also delays, but does not catch up. It's almost as if those few seconds didn't happen. It also shows an uptick in waiting processes, and a downtick in interrupts and context switches:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy  id wa st
 0  0    132 3111220 305540 588012    0    0     0     0 1035  151  1  1  99  0  0
 0  0    132 3111096 305540 588012    0    0     0     0 1019  125  0  0  99  0  0
 0  0    132 3111220 305540 588012    0    0     0    44 1034  154  0  1  99  0  0
 1  0    132 3111096 305540 588012    0    0     0     0 1016  131  0  0  99  0  0
 6  0    132 3111096 305540 588012    0    0     0     0  417   82  0  0 100  0  0
 0  0    132 3111220 305540 588012    0    0     0     0 1041  155  0  1  99  0  0
 0  0    132 3111096 305540 588012    0    0     0     0 1019  123  1  1  99  0  0
 0  0    132 3111220 305540 588012    0    0     0     0 1032  142  0  1  99  0  0
 0  0    132 3111096 305544 588008    0    0     0    44 1019  134  0  0  99  0  0
Rebooting makes the problem go away for a while. This most recent time it took six days to come back. I'm not sure if that's consistent or not.
I had initially suspected that the problem might be related to the nVidia video driver module, but I shut down X Windows and removed the module, without change in the symptoms.
There is nothing in dmesg or /var/log/messages that seems remotely relevant or in any way coincides with the hiccups. It does not appear to be an issue with a hard drive, as I would expect iowait to be prominent during the unresponsive period if that were the case, but it's not. It feels unlikely to be a hardware problem, as the hiccups are pretty regular. I've been unable to time them down to milliseconds, but it's a pretty consistent 30/4/30/4/30/4.
Any ideas?
My money is still on a hard disk failure. I've had similar things occur on personal Windows desktops, and even an old Sun machine exhibited similar freeze issues. However, I won't claim I dug deep enough into the issue to notice the seconds dropping from a sleeping shell. Regardless, you might want to see if you can get any info out of your RAID controller, or otherwise rule out the hard disks.
My server has hiccups, too. I found this tool: http://www.latencytop.org/. Unfortunately my hiccups are not occurring regularly.