Yesterday my web sites were down for a short time. I logged on to my server and my first reaction was to restart the Apache web server. After that, everything was working fine, so I started checking the Ganglia metrics to see what had happened. It was clear that one minute before I restarted Apache, the number of requests to the web server was very high, exceeding Apache's limits and blocking other requests.
I manually checked the Apache logs, filtering the traffic from the minutes before and after the restart. There were no signs of anything wrong. I also analyzed the logs with some tools (AWStats, a bots script, etc.) with similar results. I did the same with the error logs, checking carefully for any strange behaviour. No success.
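One way to put numbers on that kind of manual check is to count requests per minute straight from the access log. A rough sketch, assuming the default combined log format and a Debian-style path (/var/log/apache2/access.log); adjust both to your setup:

    # Requests per minute, busiest minutes first
    awk -F'[][]' '{print substr($2, 1, 17)}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head

A sudden spike should stand out immediately in the top few lines.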
So I'm pretty sure the problem was a sudden spike in requests to the Apache web server, but I don't know how this happened: whether it was an attack, some nasty bug, a problem in the application, or something else entirely. What would you do if something similar happened on your web server? What other tools would you use? What other logs would you check? Was it wrong to restart the web server as the first measure to solve the problem?
Re: restarting the server as a first measure... Another wonderful example of "It Depends" :-)
If this is a system that MUST BE UP, I don't think I'd reboot it first.
I'd go through the logs, maybe keep a tail -f running on the Apache log to see what is hitting it in real time.
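A couple of one-liners along those lines (the log path is an assumption: /var/log/apache2/access.log on Debian/Ubuntu, /var/log/httpd/access_log on RHEL/CentOS):

    # Watch requests as they arrive
    tail -f /var/log/apache2/access.log

    # Rough list of the busiest client IPs in the last 5000 requests
    tail -n 5000 /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -20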
I'd also probably open another window and check whether there's anything suspicious via Wireshark, just to see what traffic is hitting (and leaving) the system.
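If a GUI isn't handy on the server, tshark (Wireshark's command-line counterpart) or tcpdump gives a similar view. A minimal sketch, assuming the interface is eth0 and web traffic on ports 80/443 (both assumptions):

    # Capture 200 packets of web traffic to eyeball who is talking to the server
    tshark -i eth0 -f "tcp port 80 or tcp port 443" -c 200

    # Or count new inbound connections per source address with tcpdump
    tcpdump -i eth0 -nn -c 500 'tcp[tcpflags] & tcp-syn != 0 and dst port 80' | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head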
Otherwise, check system load, drive activity, the process list, and network card activity to verify that it's traffic-related and not software-related. Check memory/swap usage. Check the number of Apache processes and see if they're maxed out.
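A quick checklist in shell form (tool and process names are assumptions: iostat comes from the sysstat package, and the Apache process is apache2 on Debian/Ubuntu but httpd on RHEL/CentOS):

    uptime                               # load averages
    free -m                              # memory and swap usage
    iostat -x 1 3                        # disk activity
    ps aux --sort=-%cpu | head -15       # busiest processes
    ss -s                                # socket/connection summary
    ps -C apache2 --no-headers | wc -l   # number of Apache worker processes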
Rebooting shouldn't be necessary most of the time, and while it apparently cleared the issue up, it doesn't solve what caused the problem, meaning you may get another call (possibly at an even more inconvenient time) to hurry and fix it again. No server should have to be maintained with periodic unscheduled reboots.
The reboot may be a way to get higher-ups or users off your back when the heat is on, but on the other hand you may have lost the chance to figure out what exactly was going haywire. It's odd that an attack would have suddenly stopped with a reboot, unless it was a port scan or web server scanner, in which case your server "disappearing" may have signaled it to move on.
If it's a system that must be up all the time, you may need to consider some sort of failover and load-balancing solution. This would also help with troubleshooting, giving you more flexibility to diagnose issues without losing connectivity (although you'd need more automated monitoring to tell you that system A is having trouble, because the site will keep working thanks to system B and the users won't tell you there's an issue).
I have a crude but somewhat effective measure I use on an Apache web server exhibiting similar symptoms: a cron job which runs every minute and records what the server is doing at that moment (something along the lines of the sketch below).
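A minimal sketch of that kind of snapshot job (this is not the original command, which isn't shown here; the apache2 process name, the log path, and the use of ps/ss are all assumptions):

    # crontab entry, run as root or a user that can write to the log path:
    # every minute, append a timestamped snapshot of Apache processes and established web connections
    * * * * * { date; ps -C apache2 -o pid,pcpu,pmem,stat,start,args; ss -tn state established '( sport = :80 or sport = :443 )'; echo; } >> /var/log/apache-snapshot.log 2>&1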
This way, if the Apache server queue fills up with requests (which it sometimes does for unknown reasons: Apache gets "clogged" with certain requests), I have a log of what happened leading up to them. It's also helpful for troubleshooting moments of high load.
I also highly recommend Cacti and Nagios for monitoring.