Question

Thorsten Staerk

Asked: 2014-02-14 12:53:39 +0800 CST

server hangs randomly without log entries

My physical server hangs at random, and nothing is logged while it hangs. To narrow this down, I ran the following loop:

while true; do date >>/tmp/dates; sleep 1; done & disown

After a few minutes, the file /tmp/dates has "holes": the timestamps stop for 30 seconds at a time, sometimes for as long as 3 minutes:

Thu Feb 13 14:54:39 CET 2014
Thu Feb 13 14:54:40 CET 2014
Thu Feb 13 14:57:45 CET 2014
Thu Feb 13 14:57:46 CET 2014

The server does not show high load or memory utilization. /var/log/messages has no entries for the times when the server hangs, although it does contain some messages about IO problems, such as a Fibre Channel link going down briefly. I am booting from SAN, so / is on a device-mapper multipath device. If I run both of these loops:

while true; do date >>/tmp/dates; sleep 1; done & disown
while true; do date >>/dev/shm/dates; sleep 1; done & disown

then /dev/shm/dates has no holes, but /tmp/dates does. /dev/shm is a RAM-backed tmpfs, so writes there never touch the SAN. I therefore think this is an IO hang rather than a kernel hang.
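Spotting holes by eye gets tedious once the log grows. A minimal sketch of an automated check, assuming the loop is changed to log epoch seconds instead of the default `date` text; the helper names `log_epoch` and `find_gaps` are made up for this illustration:

```shell
# log_epoch FILE: append the current time as epoch seconds.
# Plain integers are easier to compare than the default `date` text.
log_epoch() {
    date +%s >> "$1"
}

# find_gaps THRESHOLD: read one epoch timestamp per line on stdin and
# report consecutive entries that are more than THRESHOLD seconds apart.
find_gaps() {
    awk -v max="$1" '
        NR > 1 && $1 - prev > max {
            printf "gap of %d s between %d and %d\n", $1 - prev, prev, $1
        }
        { prev = $1 }
    '
}
```

Used as `while true; do log_epoch /tmp/dates; sleep 1; done & disown` and later `find_gaps 2 < /tmp/dates`. The hole shown above (14:54:40 to 14:57:45) would be reported as a gap of 185 seconds.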

top shows no processes in uninterruptible sleep (state D), and nothing in the hardware or storage logs seems relevant.
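An interactive top session can easily miss short-lived D states. As a rough sketch, the same check can be sampled once per second with `ps` and logged to a location that does not stall; the function names here are hypothetical, and `wchan:32` assumes the procps version of `ps`:

```shell
# Keep the header row plus any process whose state starts with D
# (uninterruptible sleep). Reads `ps` output on stdin.
filter_d_state() {
    awk 'NR == 1 || $1 ~ /^D/'
}

# Take one snapshot of all processes with their kernel wait channel,
# which hints at what a blocked process is waiting for.
snapshot_d_state() {
    ps -eo state,pid,wchan:32,comm | filter_d_state
}
```

For example: `while true; do date; snapshot_d_state; sleep 1; done >> /dev/shm/dstate.log & disown`. Writing to /dev/shm keeps the sampler itself off the SAN-backed filesystem.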

However, iostat shows large latency spikes in the await column:

03/28/14 12:32:00
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00   214.00    0.00   0.00 100.00

03/28/14 12:32:01
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00   214.00    0.00   0.00 100.00

03/28/14 12:32:02
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00   214.00    0.00   0.00 100.00

03/28/14 12:32:03
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00 1219.00     0.00     4.76     8.00    45.61 5251.48   0.82 100.40

03/28/14 12:32:04
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     2.99    0.00   0.00  99.60

03/28/14 12:32:05
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     3.01    0.00   0.00 100.40

03/28/14 12:32:06
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     3.00    0.00   0.00 100.00

Could it be that the kernel is waiting for IO from the storage and does nothing until the block arrives? What else could it be, and how can I find out?
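In the samples above, dm-4 sits at about 100 %util with 214 requests queued while r/s and w/s are both zero, which suggests requests are outstanding but none complete during that second. A small filter can flag such intervals automatically. This is only a sketch: it assumes the 12-column `iostat -x` layout shown above (newer sysstat releases use different columns), and `flag_stalled` is a made-up name:

```shell
# Read extended iostat output on stdin and print every device sample
# where requests are queued (avgqu-sz > 0) but none completed
# (r/s + w/s == 0). Field numbers match the layout shown above:
# $1 device, $4 r/s, $5 w/s, $9 avgqu-sz.
flag_stalled() {
    awk '
        # Data rows only: header, timestamp and blank lines are skipped.
        NF == 12 && $1 != "Device:" && $4 + $5 == 0 && $9 > 0 {
            printf "%s stalled: %s requests queued, none completed\n", $1, $9
        }
    '
}
```

It could be fed from a saved log or live, for example `iostat -xmt 1 | flag_stalled`.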

1 Answer

Thorsten Staerk · Answer 1 · 2014-04-01T01:33:42+08:00

Best Answer

Looking at the FC switch's error log, we found one port with a very large number of errors. We disabled this port and have not seen the hang for a whole weekend now, so I think it was caused by a broken Fibre Channel path. As far as I know, such a fault should normally be detected (e.g. by multipathd), logged and auto-corrected. So I assume this is one of the cases where a Fibre Channel path fault goes undetected.
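On the host side, the Linux FC transport exposes per-HBA error counters in sysfs, which can be worth checking alongside the switch log. The following is a hedged sketch, not what was done here: `fc_error_counters` is a hypothetical helper, the counter list is a small selection, and a fault that shows up on the switch will not necessarily show up in these host counters:

```shell
# Print selected error counters for every Fibre Channel host adapter.
# The base directory is a parameter so the function can be pointed at
# a copy of the sysfs tree; it defaults to the live one.
fc_error_counters() {
    local base="${1:-/sys/class/fc_host}"
    local host counter file
    for host in "$base"/host*; do
        [ -d "$host/statistics" ] || continue
        for counter in link_failure_count loss_of_sync_count \
                       loss_of_signal_count invalid_crc_count; do
            file="$host/statistics/$counter"
            if [ -r "$file" ]; then
                # Values are printed as found in sysfs (typically hex).
                printf '%s %s %s\n' "${host##*/}" "$counter" "$(cat "$file")"
            fi
        done
    done
}
```

Running `fc_error_counters` twice a few minutes apart and comparing the output shows which counters are still increasing. `multipath -ll` lists the state of each path as multipathd sees it.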
