I wish I could be more specific about this issue and I am really asking for suggestions on where to look next.
We are running a PHP web app that is integrated with WordPress and that serves roughly a page per second during busy parts of the day. Generally things run very well on a single dedicated server (a quad-core with 16GB of RAM).
We are trialing New Relic, and its tools alerted me to the occasional page load where PHP seems to stall. It reaches a seemingly arbitrary point in the call trace and then spends many seconds (or tens of seconds) on something trivial. The most trivial example was a function consisting of a conditional followed by an echo().
There are no other errors (in the Apache/PHP error log) or slow queries I can see that coincide with these events. They also don't seem to coincide with any particularly heavy CPU load, disk I/O or network I/O.
It happens a few times an hour, and one thing in common between all the functions that stall is that they are doing output to the page. Could something be blocking the output buffer? Or are there any other obvious culprits that might be causing this issue? What would you do next to troubleshoot?
Linux: CentOS 5.6
Apache: 2.2.3
MySQL: 5.5
PHP: 5.3.9
APC: 3.1.9 (cache is healthy, hit rate close to 100%)
The PHP FPM SAPI has a slow log, where scripts that take longer than n seconds can be logged with a traceback. Unfortunately no other SAPI has this functionality. (If you were already using nginx+PHP-FPM you would already have this! It's saved my bacon more than once.)
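For reference, enabling the slow log is two directives in the FPM pool configuration; the threshold and log path below are just example values:

```ini
; php-fpm pool config (e.g. www.conf)
; dump a backtrace for any request that runs longer than 5 seconds
request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm/www-slow.log
```

Reload php-fpm afterwards, then watch the slow log for stack traces of stalled requests.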
The fallback seems to be running Xdebug, but that can get hairy in a production environment. Or, worse, rolling your own "debugging" scripts (see this question on Stack Overflow for bad examples).
This was a configuration issue, as JakeGould suggested.
The solution was to raise the maximum TCP write buffer size to 256KB (it was about 128KB before) and tell Apache to use the maximum via the SendBufferSize directive in the prefork section of httpd.conf. This applies if you are using prefork as your MPM; YMMV with other MPMs.
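Concretely, the change amounted to something like the following; exact values and file locations may differ on your system:

```
# /etc/sysctl.conf -- raise the kernel's maximum TCP send buffer to 256KB
net.core.wmem_max = 262144

# httpd.conf (prefork MPM section) -- have Apache request the full buffer
SendBufferSize 262144
```

Apply the kernel setting with `sysctl -p` and restart Apache for the directive to take effect.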
The reason this solves the problem:
We sometimes serve large pages, and when the page size is larger than the send buffer, Apache fills the buffer, sends it, and then blocks until the client acknowledges enough of that data (via TCP ACKs) to make room for the rest. Even with good network connections this can introduce a second or two of delay, and with bad connections it can be far worse. I suspect that in the truly pathological cases (hitting the 120s limit) the user has moved away from the page in frustration after a few seconds, so the server never gets those acknowledgements.
By setting the send buffer larger than the size of most pages, Apache sends the full page and doesn't wait for a response.
Of course, the same applies to individual images you are serving: if an image is larger than the send buffer, you hit the same problem. Most of our images are served by a CDN, but some large ones that aren't on the CDN were triggering the issue.
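A quick way to spot candidates is to list any locally served assets bigger than the new buffer; the document root here is a placeholder for your actual DocumentRoot:

```shell
# List files under the web root larger than the 256KB send buffer
# (/var/www/html is an assumed path -- substitute your DocumentRoot)
find /var/www/html -type f -size +256k -exec ls -lh {} \;
```

Anything this turns up that isn't already on the CDN is a candidate for moving there or for keeping the buffer larger than its size.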
Although this doesn't really help clients with poor connections to our site, it has reduced our average response time from 200ms+ to under 100ms. And by clearing out all these slow transactions, I can now see the real performance issues in the ones that remain.