I'm not normally a sysadmin, but I've got a production server under heavy load (serving some basic PHP pages and some PHP redirect files that run a few SQL queries; no images) that keeps crashing. Specifically, the load gets up to about 20 and requests time out. There's nothing in the Apache access log or error log indicating unusual activity, but the disk I/O chart shows heavy read/write spikes that correlate with our downtime.
I know it's some combination of these pages and a few hundred thousand hits an hour, but I'm stumped, and I don't know which tools to use. I need to see A) How many hits per second/minute/hour these pages are getting and B) How long it's taking to serve each page. What's available to profile a live server under load? What's best?
The server is Apache 2, PHP 5, Ubuntu Hardy. Any advice at all is greatly appreciated.
EDIT:
Thanks for the ideas. I could edit the PHP, but these are pages that the designers change often (they like to copy/paste/delete things), and I was hoping for something better than duct tape, because it's a recurring issue on a lot of our servers.
Are there really no software packages for monitoring server load per-file on production servers? Do I have to resort to debugging tools and per-code-segment profiling? If my server's already choking on hits, wouldn't adding XDebug royally F*!#-up my S@^&?
The easiest first thing to try would be identifying the most "popular" PHP files requested.
You can do that by examining Apache's access.log file(s), or by using something like apachetop in real time (although it also relies on the log files).
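If you just want a quick count of which URLs are getting hammered, a one-liner over the access log works. This sketch assumes the default combined log format (where the request path is field 7) and the stock Ubuntu log location; adjust both for your vhost:

```
# Top 20 requested paths by hit count
awk '{print $7}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20
```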
You can examine Apache's server status using mod_status; it will also show you exactly what is using Apache's CPU cycles. There is a lot of information out there on using it to identify CPU-intensive requests.
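For reference, a minimal mod_status setup on Apache 2.2 (the version shipped with Hardy) looks roughly like this; the /server-status location and the Allow line are just assumptions, so lock it down to whichever hosts should see it:

```
# In apache2.conf or a conf.d snippet, after enabling the module (a2enmod status)
ExtendedStatus On

<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>
```

Then hit http://yourserver/server-status (or /server-status?auto for a machine-readable version) to see per-worker activity.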
After you have a list of "candidates for optimization", you could indeed use XDebug individually on each one.
As a simpler option, you could install xcache or APC or any other PHP opcode cache. It can significantly speed up PHP script execution.
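For what it's worth, enabling APC is usually just a couple of php.ini lines once the extension is built (e.g. via `pecl install apc`); the shm_size value here is an arbitrary example:

```
extension=apc.so
apc.enabled=1
; shared memory for the opcode cache, in MB for APC of this era
apc.shm_size=64
```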
You can try the Apache module modlogslow to measure how long the current process spends handling each request.
You should try Xdebug's profiling feature. Install it as a module, then turn on profiling to create the profile files. After collecting data you can use WinCacheGrind or another cachegrind-compatible viewer (such as KCacheGrind) to see where the time is being spent. There are other options for PHP profiling as well.
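If you do go the Xdebug route on an already-loaded box, the trigger mode is worth knowing about: it profiles only requests that carry an XDEBUG_PROFILE GET/POST/cookie variable rather than every hit. A rough php.ini sketch for Xdebug 2.x (the .so path is just an example and will differ on your system):

```
zend_extension=/usr/lib/php5/20060613/xdebug.so
xdebug.profiler_enable=0
xdebug.profiler_enable_trigger=1
xdebug.profiler_output_dir=/tmp/xdebug
```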
Regarding "A) How many hits per second/minute/hour these pages are getting" - this information will be in the logs and just about any log parser and/or web analysing stats package will look at this for you. The common free/OSS ones are listed here.
For "B) How long it's taking to serve each page." - this can also be included in the logs if you use a custom log format, though you'll have to check the documentation for the log analysis tool you chose to see if it supports this extra information. Be careful when using this figure to infer things without other facts backing up the inference, as the time will obviously be affected by other load on the system as well as the load imposed by itself.
One of the most likely sources of trouble in the circumstances you describe is the database. You don't say which database server you are using, so we can't be more specific here, but most databases let you log long-running queries, which you can use (much like the Apache "time taken" log field) to find places to look for optimisation opportunities. In particular, look for queries that perform table scans over large datasets.
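As an illustration only (since the question doesn't say which database is in use): if it happens to be MySQL 5.0, the slow query log is enabled with my.cnf settings along these lines:

```
[mysqld]
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time  = 2
log-queries-not-using-indexes
```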
The other main possibility is simply a glut of activity that your machine isn't high-spec enough to cope with; if that's the case, you should see it with an Apache log analyser. A sudden glut of traffic can result in extra Apache processes getting launched and many extra database queries. Either way this can produce a lot of I/O activity, whether from the database access itself or from swapping if the extra processes push the machine past what fits in RAM.

It would be worth looking at memory use and swap activity during one of the busy spots, or, if you can't catch one while it's happening, leave some logging in place so you can review what happened after the fact. I use collectd for this sort of monitoring (there are other options around with similar features if collectd is not to your taste); as well as monitoring system parameters like CPU use, I/O and memory+swap use, it also has modules for logging specific Apache and MySQL/Postgres properties which you may find helpful. You state that you already have an I/O chart, which implies a solution like this is already installed; check what other property-logging options it has, specifically whether it can distinguish I/O to the partitions where your data lives from I/O caused by swap activity.
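A bare-bones collectd.conf for this kind of after-the-fact review might load just the standard system plugins (the plugin names are the stock ones; check what your package actually provides):

```
LoadPlugin cpu
LoadPlugin load
LoadPlugin memory
LoadPlugin swap
LoadPlugin disk
# collectd also ships apache and mysql plugins that can be pointed at
# mod_status / the database for the application-level figures mentioned above.
```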
If gluts of activity are the issue, then you may find that you need more RAM, a better I/O subsystem, or both, in order to serve the site's peak load. There might also be places in your code or database design where optimisation would help: specifically, look at improving the indexing of your data in the database if full table scans are being performed where they shouldn't be necessary, and consider caching techniques to reduce the number of times dynamic content is rebuilt from scratch.
There is an easy way to automatically prepend and append scripts to all the PHP files: http://www.electrictoolbox.com/php-automatically-append-prepend/
Just start a timer in the prepended file, stop it and log the running time in the appended one.
By using this technique, you won't have to worry about other developers/designers overwriting your code.
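Here's a minimal sketch of that idea. The file names, log path and .htaccess wiring are all made up for the example, and note that auto_append_file is skipped if a page script calls exit():

```
<?php
// prepend.php -- wired up via auto_prepend_file, e.g. in .htaccess (mod_php):
//   php_value auto_prepend_file /var/www/profiling/prepend.php
//   php_value auto_append_file  /var/www/profiling/append.php
// Record the request start time.
$GLOBALS['__request_start'] = microtime(true);
```

```
<?php
// append.php -- log the URL and the elapsed time for this request.
$elapsed = microtime(true) - $GLOBALS['__request_start'];
$line = sprintf("%s\t%s\t%.4f\n", date('c'), $_SERVER['REQUEST_URI'], $elapsed);
error_log($line, 3, '/var/log/php_page_times.log'); // type 3 = append to the given file
```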