Looking at the output of top, I notice that intermittently one or two Apache processes consume a high amount of CPU, anywhere between 50% and 90%. The spikes come and go quickly, every 10 seconds or so. The various other Apache processes running consume somewhere between 2% and 4%.
I've researched various ways of tracking down which virtualhost/website is responsible for these processes, but because they come and go so quickly I can't find a reliable method. I've tried lsof, and also looking at the output of server-status, but because the processes are short-lived the process IDs get re-used, and it's not possible to tie a spike back to the virtualhost causing it. For example, if I run lsof on the process ID in question, it lists a dozen different virtualhost log files which have shared that process ID in the last few seconds. I'm convinced there is one virtualhost at fault, but I can't figure out which one.
I've also checked the MySQL slow query log and this doesn't reveal anything of interest.
My recommendation: add response time to your logs.
It's not perfect, as there's no guarantee that the spike-causing requests take longer to serve than others, but it is likely, and gives you a starting point for investigation.
To do this, you'll want to define a new LogFormat and CustomLog that include the %D directive, which records the time taken to serve the request, in microseconds. See the Apache mod_log_config documentation.
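For example (the format name "timed" and the log path are placeholders; %v logs the canonical ServerName of the virtualhost and %D the service time in microseconds, both standard mod_log_config directives):

```apache
# Common Log Format, prefixed with the virtualhost (%v) and
# suffixed with the request service time in microseconds (%D).
# "timed" and the log path below are placeholder names.
LogFormat "%v %h %l %u %t \"%r\" %>s %b %D" timed
CustomLog /var/log/apache2/access_timed.log timed
```

Putting %v first and %D last makes the log easy to aggregate per virtualhost with standard text tools.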
Another option, probably a bit too low-level but one that could give you an idea of the nature of the load, would be to strace the Apache parent process with -f to follow children and -c to summarize time, calls, and errors per system call, e.g.
strace -f -c -p <apache parent pid>
Once you know which system calls are taking the most time, you can trace them directly. For example, if the server is spending a lot of time in write(), you could run
strace -f -e trace=write -p <apache parent pid>
and look at those calls in more detail.
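Once response times are being logged, a quick aggregation can point you at the noisy virtualhost. A rough sketch, assuming a log format whose first field is the virtualhost (%v) and whose last field is %D in microseconds; "access_log" is a placeholder path:

```shell
# Total and maximum response time (microseconds) per virtualhost,
# sorted with the heaviest vhost first. Assumes: field 1 = %v,
# last field = %D; "access_log" is a placeholder file name.
awk '{ total[$1] += $NF; if ($NF > max[$1]) max[$1] = $NF }
     END { for (v in total)
             printf "%s total=%dus max=%dus\n", v, total[v], max[v] }' access_log |
  sort -t= -k2 -rn | head
```

A vhost with a large max but a modest total suggests occasional expensive requests, which matches the "spikes every 10 seconds or so" pattern better than a uniformly slow site.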