My physical server hangs randomly and I do not get any log entries during that time. So I let a script run:
while true; do date >>/tmp/dates; sleep 1; done & disown
After some minutes the file /tmp/dates has "holes". Sometimes the dates are interrupted for 30 seconds, sometimes for 3 minutes in a row:
Thu Feb 13 14:54:39 CET 2014
Thu Feb 13 14:54:40 CET 2014
Thu Feb 13 14:57:45 CET 2014
Thu Feb 13 14:57:46 CET 2014
The server does not show high load or memory utilization. /var/log/messages does not show anything for the times when the server hangs. However it does have some messages about IO problems like a fibrechannel link taking a nap. I am booting from SAN, so / is on a multimapper device. If I call
while true; do date >>/tmp/dates; sleep 1; done & disown
while true; do date >>/dev/shm/dates; sleep 1; done & disown
There will be no holes in /dev/shm/dates, but there will be holes in /tmp/dates so I think it is not a kernel hang situation, but an IO hang situation.
I can see no processes in uninterruptible sleep state in top. No entries in the hardware or storage log that seem relevant.
However iostat shows high latency spikes regarding await:
03/28/14 12:32:00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 214.00 0.00 0.00 100.00
03/28/14 12:32:01
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 214.00 0.00 0.00 100.00
03/28/14 12:32:02
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 214.00 0.00 0.00 100.00
03/28/14 12:32:03
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
dm-4 0.00 0.00 0.00 1219.00 0.00 4.76 8.00 45.61 5251.48 0.82 100.40
03/28/14 12:32:04
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.99 0.00 0.00 99.60
03/28/14 12:32:05
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.01 0.00 0.00 100.40
03/28/14 12:32:06
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.00
Could it be that the kernel is waiting for IO from the storage and does not do anything until the block arrives? What else could it be? How can I find out?
Looking at the FC-switch's error log we found one port with a huge error log. We disabled this port and have not seen the error for a whole weekend now. So I think this error has been caused by a broken fibrechannel path. Usually, to my information, such an error should be detected (e.g. by multipathd), logged and auto-corrected. So I assume we have one of the cases of an undetected fibrechannel path fault here.