This is my first post to this forum, which I found through the superb podcast "IT Conversations" from Stack Overflow. I am quite new in my role as server administrator for an exhibition center in London. Basically we have a central file and SQL server to which roughly 40 stations connect to upload and download data used or captured by a set of applications.
Over the last few weeks we have experienced a few random hangups in our applications, and as they always happen to multiple applications simultaneously, I do not believe that the applications are the source of the problem. We also monitor the network using Dartware Intermapper, which indicates that all switches and stations on the network have been reachable during the downtime. Thus, it's all pointing to the server.
I have been looking through all the log files I can think of, and the only suspicious thing I have found so far is the following lines in the syslog, which are from the time of one of the hangups:
Feb 6 17:14:27 es named[5582]: client 127.0.0.1#33721: RFC 1918 response from Internet for 150.0.168.192.in-addr.arpa
Feb 6 17:14:40 es named[5582]: client 127.0.0.1#32899: RFC 1918 response from Internet for 152.0.168.192.in-addr.arpa
Feb 6 17:15:01 es /USR/SBIN/CRON[1956]: (es) CMD (/home/es/apps/es/bin/es_checksum.sh)
Feb 6 17:16:06 es /USR/SBIN/CRON[2031]: (es) CMD (/home/es/apps/es/bin/es_checksum.sh)
Feb 6 17:21:00 es named[5582]: *** POKED TIMER ***
Feb 6 17:21:00 es last message repeated 2 times
Feb 6 17:21:07 es named[5582]: client 127.0.0.1#44194: RFC 1918 response from Internet for 143.0.168.192.in-addr.arpa
Feb 6 17:21:12 es named[5582]: client 127.0.0.1#59004: RFC 1918 response from Internet for 164.0.168.192.in-addr.arpa
I find a few interesting things here:
1) "RFC 1918 response from Internet for 150.1.168.192.in-addr.arpa". I see this a lot in the syslog, and basically every time I do an nslookup for any of the computers in the cluster I get a new, similar line in the syslog. I understand from Google that this has to do with reverse lookup problems, but I do not know how that could affect the systems. Let's say that one of these lines appears every time one of the user stations connects to the server, which may happen several times a second. Could this possibly cause a hangup of the entire server?
2) POKED TIMER. I have googled this quite a lot, but have not found an explanation that I can relate to. What does this mean?
3) The timestamps. It seems like the entire server stopped responding for several minutes. Normally there are many entries written to the syslog per minute on this server. Furthermore, the CRON job is set to run once every minute, which, according to the log, hasn't happened here.
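A minimal sketch for finding such silent periods automatically, assuming the classic "Mon D HH:MM:SS" syslog timestamp shown above. The function names, the 120-second threshold, and the year are placeholders chosen for illustration (syslog does not record the year), not anything taken from the server itself:

```python
from datetime import datetime, timedelta

ASSUMED_YEAR = 2010  # assumption: classic syslog timestamps carry no year


def parse_syslog_time(line, year=ASSUMED_YEAR):
    """Return the timestamp at the start of a syslog line, or None."""
    parts = line.split()
    if len(parts) < 3:
        return None
    try:
        stamp = f"{year} {parts[0]} {parts[1]} {parts[2]}"
        return datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
    except ValueError:
        return None  # not a timestamped line


def find_gaps(lines, threshold=timedelta(seconds=120), year=ASSUMED_YEAR):
    """Return (before, after) pairs where consecutive entries are
    further apart than the threshold."""
    gaps = []
    previous = None
    for line in lines:
        ts = parse_syslog_time(line, year)
        if ts is None:
            continue
        if previous is not None and ts - previous > threshold:
            gaps.append((previous, ts))
        previous = ts
    return gaps


# The two entries on either side of the silent period in the excerpt above.
SAMPLE = [
    "Feb 6 17:16:06 es /USR/SBIN/CRON[2031]: (es) CMD (/home/es/apps/es/bin/es_checksum.sh)",
    "Feb 6 17:21:00 es named[5582]: *** POKED TIMER ***",
]

for before, after in find_gaps(SAMPLE):
    seconds = (after - before).total_seconds()
    print(f"{before:%H:%M:%S} -> {after:%H:%M:%S} ({seconds:.0f}s)")
```

Run against the full syslog (for example by passing an open file object as `lines`), this would list every period in which nothing was logged for more than two minutes, which makes it easier to line the hangups up with what users reported.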
OS: Ubuntu 8.04 Kernel: Linux 2.6.24-24-server x86_64 GNU/Linux. Hardware: Dell R710, RAID1, CPU: 2x XEON E5530. 16GB Memory. Average load is very low, and memory should not be a problem.
Please let me know if you need any additional information.
Best wishes
It's a very strange and bad situation. I have never seen a host stop working for 5 minutes and then work again without any trouble or any records in the logs. Are you really sure that there is nothing in the logs? What does the last command report? I'm not sure, but I don't think that the anomalies you report in the syslog are relevant to your problem. Do you have any data for the period when there are no records in the syslog? Does sysstat tell you anything about those five minutes? If it doesn't because it isn't installed, you could install it. And do the other logs also have a gap between 17:16 and 17:21?
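To check the last question across several logs at once, something like the following could be used. It is only a sketch: it assumes every file uses the same "Mon D HH:MM:SS" syslog timestamp, the year is a placeholder because syslog omits it, and the function names and the file paths in the comment are illustrative examples rather than a list of what exists on that server:

```python
from datetime import datetime

ASSUMED_YEAR = 2010  # assumption: classic syslog timestamps carry no year


def count_in_window(lines, start, end, year=ASSUMED_YEAR):
    """Count syslog-style lines stamped strictly between start and end."""
    count = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        try:
            stamp = f"{year} {parts[0]} {parts[1]} {parts[2]}"
            ts = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading syslog timestamp
        if start < ts < end:
            count += 1
    return count


def scan_logs(paths, start, end, year=ASSUMED_YEAR):
    """Return {path: number of entries inside the window} for each file."""
    results = {}
    for path in paths:
        with open(path, errors="replace") as handle:
            results[path] = count_in_window(handle, start, end, year)
    return results


# Example call (paths are illustrative):
# window = (datetime(2010, 2, 6, 17, 16, 6), datetime(2010, 2, 6, 17, 21, 0))
# print(scan_logs(["/var/log/syslog", "/var/log/kern.log"], *window))
```

If every log reports zero entries for the window, the whole machine was probably stalled. If some logs kept being written, the problem is more likely limited to particular services.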
The problems were related to an incompatibility issue between Ubuntu 8.04 LTS (Hardy) and the Dell PERC 6/i RAID controller, as reported in this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/607167 Upgrading to Ubuntu 10.04 LTS Lucid (kernel 2.6.32) resolved the issue.
In case anyone else runs into the same issues.