A website I manage has been going online and offline all day today. I have no idea what is causing the issue, so I am looking for guidance on where to start. It is a WordPress-based site.
So here is what I DO know:
I use a program that pings the server every minute and emails me when the server does not respond, so I know exactly when the site is online and offline. The site was going up and down between 8pm and 12pm on 12.28, and around the 1am hour in the early morning of 12.29 (New York City timezone; all times below are in the same timezone).
At the time of the ups/downs I see a lot of strain on the memory usage. Look at the load average when the site is going online/offline (http://screencast.com/t/BRlfXkqrbJII). Then I ran this command to restart http (http://screencast.com/t/usVtYWZ2Qi) and the memory usage went down to this (http://screencast.com/t/VdTIy3bgZiQB). An hour after I restarted http, the site went offline/online again, so restarting http didn't help much.
When the site is going offline/online, I run the top command and get this (http://screencast.com/t/zEwr7YQj3). Here is the top output when the site is at its lowest (http://screencast.com/t/eaMfha9lbT - so this would be dubbed "normal").
Here is a bandwidth report (http://screencast.com/t/AS0h2CH1Gypq).
The traffic doesn't seem to be that much (http://screencast.com/t/s7hrWNNic1K), but given the times at which the site is going up/down, could this be one of the reasons?
I have the dvp Nitro package at Media Temple (http://mediatemple.net/webhosting/nitro/).
So at this point I would request some help in trying to figure out what the cause of this is, and how I can go about pinpointing this issue. ANY HELP is greatly appreciated.
The load average in the 30s is high and the CPU is 100% busy. CPU usage is pretty even across your HTTPD processes so it's not one specific rogue process. Basically your server is not capable of handling the number of concurrent HTTP requests it is receiving.
It may be that you could do something to reduce the amount of processing needed to produce a page.
You could review the Apache server logs to try to see why the load is so uneven. Perhaps you are being targeted by a DDoS attack - if so, there are things you can do to mitigate the effects.
Either that or you need a bigger server.
Maybe MediaTemple have a problem - see How do I optimize a high traffic Wordpress website?
If your monitoring program only pings the server, then you're monitoring every network device between you and the system where the website resides. Indeed, you're measuring just about everything except the actual website. Sure, if the computer running the webserver is unable to respond to pings it probably won't respond to HTTP requests either.
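To measure the website itself rather than the network path, the monitor can make a real HTTP request and time it. The following is only a minimal sketch using Python's standard library; the function name check_http and the 10 second timeout are assumptions made for illustration, not part of any existing monitoring tool.

```python
import http.client
import time
import urllib.error
import urllib.request


def check_http(url, timeout=10.0):
    """Fetch url once and return (ok, status_or_error, seconds_elapsed).

    ok is True only when the server answered with a 2xx/3xx status.
    The whole body is read so that slow page generation shows up in the timing.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            status = resp.status
        return 200 <= status < 400, status, time.monotonic() - start
    except urllib.error.HTTPError as exc:
        # The server responded, but with an error status (4xx/5xx).
        return False, exc.code, time.monotonic() - start
    except (urllib.error.URLError, http.client.HTTPException, OSError) as exc:
        # No usable response: connection refused, timeout, DNS failure, etc.
        return False, str(exc), time.monotonic() - start


# Example call (the URL is a placeholder for the site being monitored):
#   ok, status, seconds = check_http("http://www.example.com/")
```

Run from cron every minute and logged with a timestamp, the elapsed time gives a response-time series that can be compared against the load average.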
Looking at the other details provided, it does seem to be the HTTP processing which is causing problems - however, that's one very badly set up server if it's failing to respond to pings due to HTTP processing.
It might be a DoS attack - but I suspect it's more likely that there's a race condition building somewhere. What's happening to your HTTP traffic? Are you getting bursts of activity from a few hosts? Does the response time lead or lag the load average?
The charts and reports you've provided help a little - but there's very little information here to base a diagnosis on - you really need to see your hit rate averaged over intervals of a minute at most, rather than an hour. And what about that huge spike at 0hrs? That looks odd to me.
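One way to check for bursts from a few hosts is to group the access log by minute and list the busiest clients in each minute. This is a rough sketch that assumes Apache's common/combined log format (client address first, timestamp in square brackets); the function name top_clients_by_minute is made up for illustration.

```python
import re
from collections import Counter

# Assumes lines start like: 192.0.2.10 - - [dd/Mon/yyyy:HH:MM:SS +zzzz] "GET / ..."
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')


def top_clients_by_minute(lines, top_n=3):
    """Return {minute: [(client, hits), ...]} with the top_n clients per minute."""
    per_minute = {}
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        client, stamp = match.groups()
        minute = stamp[:17]  # keep "dd/Mon/yyyy:HH:MM", drop seconds and zone
        per_minute.setdefault(minute, Counter())[client] += 1
    return {minute: hits.most_common(top_n) for minute, hits in per_minute.items()}


# Example (the log path is an assumption and varies by distribution):
#   with open("/var/log/httpd/access_log") as fh:
#       for minute, clients in top_clients_by_minute(fh).items():
#           print(minute, clients)
```

If one or two addresses dominate the minutes in which the load spikes, that points towards an abusive client or bot rather than ordinary traffic.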
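A per-minute hit rate can be pulled straight from the access log. The sketch below again assumes Apache's common/combined log timestamp format and an English month abbreviation; hits_per_minute is a hypothetical helper name.

```python
import re
from collections import Counter
from datetime import datetime

# Captures "dd/Mon/yyyy:HH:MM" from a timestamp like [dd/Mon/yyyy:HH:MM:SS +zzzz]
STAMP_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\]')


def hits_per_minute(lines):
    """Return a chronologically sorted list of (minute, request_count) pairs."""
    counts = Counter()
    for line in lines:
        match = STAMP_RE.search(line)
        if match:
            minute = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M")
            counts[minute] += 1
    return sorted(counts.items())


# Example (the log path is an assumption and varies by distribution):
#   with open("/var/log/httpd/access_log") as fh:
#       for minute, hits in hits_per_minute(fh):
#           print(minute.strftime("%Y-%m-%d %H:%M"), hits)
```

Plotting these counts next to the load average for the same minutes shows whether the load follows the request rate or rises on its own.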
While you could start logging %D (the time taken to serve each request, in microseconds) and install/configure mod_log_firstbyte to see where the problem is arising, you can do all of that far less invasively using a PHP auto-prepend (the auto_prepend_file setting), e.g. the following will write a log entry when processing starts and record lots of information about the work done in processing a request when it completes.