We started monitoring our web server with Pingdom and found that it goes down for a few minutes every Sunday at 0:00 UTC.
The test runs every minute and checks if a successful HTTP response (code 200) is returned on port 80. The test fails due to a timeout (no response after 30 seconds).
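For anyone who wants to reproduce the check from another machine, it can be approximated with curl. This is only a rough sketch: `check_http` is a hypothetical helper name, not Pingdom's actual implementation, and the 30-second default simply mirrors the timeout described above.

```shell
#!/bin/sh
# check_http URL [TIMEOUT]
# Succeeds only if URL answers with HTTP 200 within TIMEOUT seconds
# (default 30, mirroring the timeout of the monitoring test).
check_http() {
    url=$1
    timeout=${2:-30}
    # -s: silent, -o /dev/null: discard the body, -m: overall time limit,
    # -w '%{http_code}': print only the status code.
    # If curl itself fails (timeout, refused connection), treat it as "000".
    code=$(curl -s -o /dev/null -m "$timeout" -w '%{http_code}' "$url") || code=000
    [ "$code" = "200" ]
}
```

Calling it once a minute, for example `check_http http://localhost/ || date -u >> /tmp/http-failures.log`, would record the UTC time of each failed probe (the log path is just an example).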
Here's what we've already checked – without success:
Since our web server runs behind a load balancer, I set up Pingdom tests against both the load balancer's public DNS name and the web server's public DNS name to find out whether the AWS load balancer is the problem. Both tests return the same result.
We set up Munin on our web server. Everything looked fine, even after the failure. Since the last failure lasted only 2 minutes, I suppose Munin couldn't capture a potential problem (it only samples every 5 minutes).
I have checked /var/log/apache2/error.log and /var/log/syslog for suspicious entries
I have checked /etc/cron.weekly and /etc/crontab for suspicious entries
I have searched for files created or last modified between 0:00 and 0:15 using this method:
touch -t 201209020000 start
touch -t 201209020015 end
find / -newer start -and ! -newer end
(nothing found)
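The same time-window search can be written without the two marker files, assuming a GNU find that supports `-newermt`. The function name `find_modified_between` is made up for this sketch. Any search based on modification time only sees files whose last change falls inside the window, so a log file that keeps growing afterwards will not show up.

```shell
#!/bin/sh
# find_modified_between DIR START END
# Lists entries under DIR whose mtime is later than START and not later
# than END. Timestamps are parsed in the local time zone, so set TZ=UTC
# if the window is given in UTC.
find_modified_between() {
    dir=$1
    start=$2
    end=$3
    # -xdev stays on DIR's filesystem, which skips /proc and /sys when
    # DIR is /, but also skips any other mounted filesystems.
    find "$dir" -xdev -newermt "$start" ! -newermt "$end" -print
}
```

For the window above this would be called as `TZ=UTC find_modified_between / '2012-09-02 00:00' '2012-09-02 00:15'`.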
Has anybody experienced a similar problem? Any proposals on how to find the reason for this behavior?
It's Ubuntu 10.04 LTS running on an AWS m1.large instance.
Thanks!
There are reports that the update-apt-xapi process uses a lot of CPU for a couple of minutes. It runs on a weekly schedule and can take your box down if the regular load is also high. The job runs update-apt-xapian-index to update the index of software packages.
A few hints for workarounds are here: http://empoccz.wordpress.com/2012/01/02/ubuntu-update-apt-xapi-takes-lot-of-cpu-usage-ii/ or https://askubuntu.com/questions/79481/is-100-cpu-usage-harmful-while-update-apt-xapi-runs
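One possible workaround is to stop the weekly job from running at all by clearing its execute bits, since run-parts only executes files that are executable. The sketch below makes two assumptions: that the job lives at /etc/cron.weekly/apt-xapian-index (check this on your system first), and that you can live with a stale package search index. `disable_cron_job` is a hypothetical helper name.

```shell
#!/bin/sh
# Assumed location of the weekly job; verify that it exists before relying on it.
XAPIAN_CRON=${XAPIAN_CRON:-/etc/cron.weekly/apt-xapian-index}

# disable_cron_job FILE
# Clears all execute bits on FILE so that run-parts skips it.
# Returns non-zero if FILE does not exist.
disable_cron_job() {
    job=$1
    if [ ! -f "$job" ]; then
        echo "not found: $job" >&2
        return 1
    fi
    chmod a-x "$job"
}
```

Run as root, `disable_cron_job "$XAPIAN_CRON"` would disable the job; `chmod a+x` on the same file re-enables it. If the downtime persists the following Sunday, this job was not the cause.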