Every day, a website I manage goes offline and back online repeatedly between 12:00 a.m. and 12:25 a.m. I have no idea what is causing the issue, so I am seeking guidance on where to start. It is a WordPress-based site.
So here is what I DO know:
I have a Pingdom account that alerts me when the site goes offline, so I can see that every day, like clockwork, the site goes on and off.
At the time of the ups and downs I see a lot of strain on memory usage. Look at the load average when the site is going online/offline (http://screencast.com/t/BRlfXkqrbJII). I then ran this command to restart httpd (http://screencast.com/t/usVtYWZ2Qi), and the memory usage went down to this (http://screencast.com/t/VdTIy3bgZiQB). An hour after I restarted httpd, the site went offline/online again, so the restart didn't help much.
While the site was going offline/online, I ran the top command and got this (http://screencast.com/t/zEwr7YQj3). Here is the top output when the site is at its lowest (http://screencast.com/t/eaMfha9lbT - so this would be dubbed "normal").
I have removed all cron scripts that were on my server (backups, etc.). I have also removed every single cron job within my WordPress install. So in theory nothing is running at all.
Here is a bandwidth report (http://screencast.com/t/AS0h2CH1Gypq).
The traffic doesn't seem to be that much (http://screencast.com/t/s7hrWNNic1K), but given the times the site goes up and down, could this be one of the reasons?
I have the dvp Nitro package at Media Temple (http://mediatemple.net/webhosting/nitro/).
So at this point I would request some help in trying to figure out what the cause of this is, and how I can go about pinpointing this issue. ANY HELP is greatly appreciated.
You need to look at more logs. Check /var/log/messages at around midnight (and perhaps /var/log/messages.0, /var/log/messages.1, etc. for previous nights). Look at your httpd.conf to find where your Apache logs are stored (that file should be in /etc/httpd/conf). The ErrorLog directive in that file will tell you where your Apache error logging is going (typically an error_log file somewhere). Look at that file to see what it reports around midnight. Check other files in /var/log for unusual activity you can correlate. The log files should tell you why the web server is failing at midnight.
According to the 'hits per hour' graph that you posted, you get 13,000+ requests in the midnight hour. This is your highest hour by far. When you do a 'service httpd restart' you see a warning message about your MaxClients exceeding your ServerLimit and it's lowering your MaxClients to 200. This means that you're allowing 200 httpd clients. Your httpd clients are consuming about 40 MB each. 200 * 40 MB = 8,000 MB, or about 8 GB. MySQL is also taking up 300 MB. The OS needs some too. You have no swap configured. Your disk cache is at zero at this time according to the 'top' output that you've posted, but there is a lot of memory free. That's kinda weird and it's throwing me for a loop.
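
The worst-case arithmetic above can be sketched as a few shell functions. This is only a rough estimate under stated assumptions: the function names are made up for illustration, the 40 MB figure is the approximate per-process size read off the posted top output, and the memory budget passed to `max_clients_for` is whatever you decide to leave for Apache after MySQL and the OS (the 1500 MB below is a hypothetical value, not something from the post).

```shell
#!/bin/bash
# Worst-case Apache memory in MB: every allowed client running at once.
# Arguments: MaxClients value, average resident size per httpd process in MB.
estimate_apache_mb() {
    local max_clients=$1 mb_per_client=$2
    echo $(( max_clients * mb_per_client ))
}

# Largest MaxClients that fits in a memory budget (MB) reserved for Apache.
# Integer division rounds down, which errs on the safe side.
max_clients_for() {
    local budget_mb=$1 mb_per_client=$2
    echo $(( budget_mb / mb_per_client ))
}

# Average resident size (MB) of the running httpd processes, to replace
# the 40 MB guess with a measured value. Prints 0 if none are running.
avg_httpd_mb() {
    ps -C httpd -o rss= |
        awk '{ sum += $1; n++ } END { if (n) printf "%d\n", sum / n / 1024; else print 0 }'
}
```

For example, `estimate_apache_mb 200 40` prints 8000, and `max_clients_for 1500 40` prints 37. Resident size counts shared pages once per process, so the real total is usually somewhat lower than this estimate.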
Linux might be invoking the OOM killer. Check the dmesg output for signs of that. I'd suggest lowering your MaxClients and/or increasing the amount of RAM (or possibly adding CPU power). You can also look in your Apache logs to find out what is hitting your site at this hour. If it is legitimate traffic, then increasing the RAM/CPU is the way to go. If it isn't, then mitigation is the path to take.
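
As a rough sketch of both checks: the first function below searches the kernel log for OOM killer messages, and the second counts requests per client IP for one hour of an Apache access log. The function names are hypothetical, the log path in the usage note is an assumption (use whatever your CustomLog directive points at), and the parsing assumes the common/combined log format, where the fourth field looks like `[10/Oct/2012:00:05:12`.

```shell
#!/bin/bash
# Kernel messages left behind when the OOM killer has fired.
oom_events() {
    dmesg | grep -i -E 'out of memory|oom-killer'
}

# Ten busiest client IPs during one hour of the day.
# Arguments: access log path, two-digit hour (e.g. 00 for midnight).
top_clients_in_hour() {
    local logfile=$1 hour=$2
    awk -v hour="$hour" '
        { split($4, t, ":") }            # t[2] is the hour of the request
        t[2] == hour { hits[$1]++ }      # count requests per client IP
        END { for (ip in hits) print hits[ip], ip }
    ' "$logfile" | sort -rn | head -n 10
}
```

Usage might look like `top_clients_in_hour /var/log/httpd/access_log 00`. A handful of addresses accounting for most of the 13,000 requests would point toward a crawler or abusive client rather than ordinary visitors.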
Are you being spidered too aggressively?
Check your Apache logs and try making some adjustments to your robots.txt:
Cheers
May I suggest that you set up cron jobs that perform periodic monitoring during that window? Set up a script that records the CPU usage, memory usage, etc. of your services at that time. You might also want to add a ping to that periodic script so that you can confirm your server has a working network connection during the outage. The last thing I'd add to that diagnostic script is a wget request to your site during the downtime, across the localhost interface.
It's possible that other systems at your hosting provider may be causing these problems - it may not be your server at all. Hopefully building a script to run server-side can give you additional diagnostic information, and help you to find the cause of the problem.
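
A minimal sketch of such a diagnostic script follows. The function name, the default URL and ping target, and the file paths in the usage note are all assumptions for illustration. Pick a ping target outside your server (your gateway, for instance) so the check actually exercises the network.

```shell
#!/bin/bash
# Print one diagnostic snapshot. Arguments (both optional):
#   $1  URL to fetch over the local interface
#   $2  host to ping to confirm the network is up
snapshot() {
    local url=${1:-http://localhost/}
    local ping_target=${2:-127.0.0.1}

    echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
    cat /proc/loadavg 2>/dev/null                 # load averages
    free -m 2>/dev/null                           # memory and swap in MB
    ps aux --sort=-%mem 2>/dev/null | head -n 6   # header plus 5 largest processes

    # One ping with a 2 second timeout.
    if ping -c 1 -W 2 "$ping_target" >/dev/null 2>&1; then
        echo "ping=ok"
    else
        echo "ping=FAIL"
    fi

    # One HTTP request, 5 second timeout, no retries, body discarded.
    if wget -q -T 5 -t 1 -O /dev/null "$url"; then
        echo "http=ok"
    else
        echo "http=FAIL"
    fi
}
```

Saved as a script that ends with a call such as `snapshot http://localhost/ 192.0.2.1` (the address is a placeholder), it could be run every minute of the midnight hour with a crontab entry like `* 0 * * * /usr/local/bin/snapshot.sh >> /var/log/snapshot.log 2>&1`. If ping and the local wget both succeed while Pingdom reports the site down, the problem is more likely outside your server.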
Is your server virtual? It's possible that your provider performs various snapshotting (from DomU) at that time which may freeze the other domains.
What time do your logs rotate? If they rotate around midnight, and this is a shared hosting server, then the log rotation itself may cause a lot of load and cause your site to go down.
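
One way to check, sketched under the assumption that logrotate runs from /etc/cron.daily and that /etc/crontab starts the daily jobs with a run-parts line (as on older Red Hat style systems; where anacron is used, look at /etc/anacrontab instead). The function name is made up for illustration.

```shell
#!/bin/bash
# Print the HH:MM at which the cron.daily jobs (and so logrotate) start.
# Argument: crontab file to inspect, defaulting to /etc/crontab.
daily_cron_time() {
    local crontab_file=${1:-/etc/crontab}
    # Skip comment lines; field 1 is the minute, field 2 is the hour.
    awk '$1 !~ /^#/ && /cron\.daily/ { printf "%02d:%02d\n", $2, $1 }' "$crontab_file"
}
```

Running `ls /etc/cron.daily` shows whether a logrotate job is actually there. If the function prints nothing, the daily jobs are scheduled some other way on that system.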
Here's an option to look at:
    i=0
    while [ "$i" -lt 1440 ]; do          # 1440 minutes = one day
        top -b -n 1 >> /tmp/top_file     # -n 1: write one snapshot, then exit
        sleep 60
        i=$((i + 1))
    done
This will run top in batch mode once a minute for an entire day and give you a bunch of possibly useful information. You need to look at CPU utilization, disk io utilization and memory/swap usage.
Also, your hosting package appears to be a VPS. Maybe your VPS doesn't have a problem, but your base OS does? A snapshot style daily backup of the virtual disk image may take 5 minutes?
Hmm... if you don't have any cron scripts or other processes that may cause those reboots, how about asking the manager of the physical host to check if the server is having some hiccups at midnight?