I have a CentOS 5.3 based server with kernel 2.6.18-128.2.1.el5. It worked fine for nearly a month, but this week it went down three times. Each time I saw the outage in Nagios and wrote an email asking for the server to be rebooted. After each reboot it ran for 12-36 hours and then went down again.
I looked through the log files. Just before the first fault, /var/log/messages contained this message:
logrotate: ALERT exited abnormally with [1]
After rebooting the server the second time, the sysadmin from the datacenter sent me this screenshot:
http://www.freeimagehosting.net/uploads/bd9fb68d98.png
Before the third fault, /var/log/messages contained this message:
Eeek! page_mapcount(page) went negative (-1)
How should I investigate the problem?
UPD:
Part of the memtester output:
Compare OR :
FAILURE: 0x7e9f90d1 != 0x7e9fd2d1 at offset 0x06222609.
FAILURE: 0x7e9f90d1 != 0x7e9fd0d1 at offset 0x06222621.
FAILURE: 0x7e9f90d1 != 0x7e9fd1d1 at offset 0x06222661.
FAILURE: 0x7e9f90d1 != 0x7e9f92d1 at offset 0x06222681.
FAILURE: 0x7e9f90d1 != 0x7e9fd0d1 at offset 0x062226a1.
FAILURE: 0x7e9f90d1 != 0x7e9fd0d1 at offset 0x062226c1.
FAILURE: 0x7e9f90d1 != 0x7e9f93d1 at offset 0x062226e9.
It is faulty memory. Thank you for the help!
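For anyone who wants to read FAILURE lines like these more closely, one option is to XOR the two values on each line to see which bits differ. The following is only a minimal sketch: the function name flipped_bits and the regular expression are my own, and it assumes the line format shown in the memtester output in this question.

```python
import re

# Matches memtester lines such as:
#   FAILURE: 0x7e9f90d1 != 0x7e9fd2d1 at offset 0x06222609.
# (format assumed from the output quoted in this thread)
FAILURE_RE = re.compile(
    r"FAILURE:\s*0x([0-9a-fA-F]+)\s*!=\s*0x([0-9a-fA-F]+)"
    r"\s*at offset\s*0x([0-9a-fA-F]+)"
)


def flipped_bits(text):
    """Return (offset, xor_mask, differing_bit_positions) per FAILURE line."""
    results = []
    for left, right, offset in FAILURE_RE.findall(text):
        # XOR leaves a 1 only where the two values disagree.
        mask = int(left, 16) ^ int(right, 16)
        bits = [i for i in range(mask.bit_length()) if (mask >> i) & 1]
        results.append((int(offset, 16), mask, bits))
    return results


if __name__ == "__main__":
    sample = (
        "FAILURE: 0x7e9f90d1 != 0x7e9fd2d1 at offset 0x06222609. "
        "FAILURE: 0x7e9f90d1 != 0x7e9f92d1 at offset 0x06222681."
    )
    for offset, mask, bits in flipped_bits(sample):
        print(f"offset 0x{offset:08x}: xor 0x{mask:08x}, differing bits {bits}")
```

In the seven lines quoted here, every difference falls on bits 8, 9 and 14 of the word. Errors confined to the same few bit positions look more like a hardware fault than random software corruption, though that is an interpretation and not something memtester itself reports.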
My first guess is that Nagios has a small memory leak and after months of running ran out of RAM or swap. However, since the machine has crashed a few times in the same day, that suggests a faulty RAM chip. My first step would be to do a memory test or check the bad memory log (if your server supports it).
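If the server has ECC memory and the kernel's EDAC driver is loaded, the "bad memory log" can often be read from sysfs. Here is a rough sketch that collects the corrected (ce_count) and uncorrected (ue_count) error counters. The path /sys/devices/system/edac/mc is the usual EDAC location, but whether it exists on this particular machine and kernel is an assumption, and read_edac_counts is just a name I made up.

```python
import glob
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    An empty dict means no EDAC memory controllers were found under root
    (no ECC support, or the driver is not loaded).
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc_dir, name)) as fh:
                    entry[name] = int(fh.read().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this system.
                entry[name] = None
        counts[os.path.basename(mc_dir)] = entry
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found")
    for controller, entry in found.items():
        print(controller, entry)
```

Non-zero counters, especially ones that keep growing between checks, would point at the RAM. An empty result proves nothing either way, since non-ECC memory has no such log.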
I vote faulty RAM too. I would recommend using memtest86 to do a thorough check of the RAM. Also, are the temperatures in the room nice and cool?
I vote faulty RAM too. If you cannot use memtest86 because the machine is remotely located, you may want to try a userspace tool, memtester, instead. It doesn't work quite as well, but it may be able to pick up some memory errors if they are there.
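As a sketch of how a remote memtester run could be scripted: memtester takes the amount of memory to test and an optional loop count, and its man page describes the exit code as a bitwise OR of failure flags. The helper names below are hypothetical, the 512 MB size is only an example, and the exit code meanings are taken from the man page, so check them against the version installed on your server. memtester generally needs root to lock the memory it tests.

```python
import subprocess

# Exit code flags as described in the memtester man page (verify on your version).
EXIT_FLAGS = {
    0x01: "error allocating or locking memory, or invocation error",
    0x02: "error during stuck address test",
    0x04: "error during one of the other tests",
}


def build_command(megabytes, loops=1):
    """Build the argument list for: memtester <size>M <loops>."""
    return ["memtester", f"{megabytes}M", str(loops)]


def decode_exit(code):
    """Translate a memtester exit code into human-readable messages."""
    return [msg for flag, msg in EXIT_FLAGS.items() if code & flag]


def run_memtester(megabytes, loops=1):
    """Run memtester and return (exit_code, list_of_problems)."""
    result = subprocess.run(build_command(megabytes, loops))
    return result.returncode, decode_exit(result.returncode)


# Example (run as root on the affected server):
#   code, problems = run_memtester(512, loops=3)
#   print(code, problems or "no errors reported")
```

Pick a size that leaves enough free memory for the rest of the system. memtester can only test RAM the kernel is willing to hand out, which is one reason it is less thorough than memtest86.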
At a quick glance, it looks like the process that panicked was Nagios. Has that been consistent every time it has panicked and locked up? If so, I would ask whether the problems started around the time you set up Nagios. If that's the case, you might want to try shutting Nagios down and seeing whether the server becomes stable again. If it does, you have found the culprit and need to look closer to see what's wrong with Nagios.
Google or the CentOS forums/lists are likely to be your best bet. Without a crash dump it's going to be difficult to be sure, so you should look into getting that configured.
You can also search through the Red Hat Bugzilla. This looks like a possibility, based on the little you have from the screenshot.