I recently switched from Apache mpm-prefork (PHP as an Apache module) to mpm-worker (PHP via mod_fcgid) because of memory issues. I am running a fairly large PHP application that needs roughly 20-30 MB per prefork process.
Overall, the server runs stably and fast. However, from time to time the site becomes unavailable to some users for a few minutes.
Working hypothesis 1 (a rough idea) is that one or more of the processes (usually 2, sometimes up to 5 or 6) hangs, and every client assigned to a hung process (e.g. 50% of the clients) receives an error message.
Working hypothesis 2 is that MaxRequestsPerProcess is responsible. After 500 calls, the process tries to shut down, mod_fcgid does not manage a graceful kill, and while the process waits to be killed, further clients are assigned to (and rejected by) it. But I cannot really imagine that Apache would be that stupid.
My problem is: There is nothing in the error logs except some
[warn] mod_fcgid: process ???? graceful kill fail, sending SIGKILL
I am running out of ideas for where to trace the problem. It appears sporadically, and I have not yet managed to provoke it. Server performance (CPU/RAM) should not be an issue, as overall load has been low in recent weeks.
Thanks for any hints. Any comments on my hypotheses (which have not led me to a solution yet; I tried disabling MaxRequestsPerProcess but do not yet know whether it helped)? I would greatly appreciate ideas on how to trace this problem.
Apache configuration
<Directory /var/www/html>
    ...
    # PHP FCGI
    <FilesMatch \.php$>
        SetHandler fcgid-script
    </FilesMatch>
    Options +ExecCGI
</Directory>
<IfModule mod_fcgid.c>
    FcgidWrapper /var/www/php-fcgi-starter .php
    # Allow requests up to 33 MB
    FcgidMaxRequestLen 34603008
    FcgidIOTimeout 300
    FcgidBusyTimeout 3600
    # PHP_FCGI_MAX_REQUESTS is set to 1200 (> 1000) in the wrapper to avoid problems
    FcgidMaxRequestsPerProcess 1000
</IfModule>
Apache module configuration
<IfModule mod_fcgid.c>
    AddHandler fcgid-script .fcgi
    FcgidConnectTimeout 20
    FcgidBusyTimeout 7200
    DefaultMinClassProcessCount 0
    IdleTimeout 600
    IdleScanInterval 60
    MaxProcessCount 20
    MaxRequestsPerProcess 500
    PHP_Fix_Pathinfo_Enable 1
</IfModule>
Note: The busy timeout was set to 2 hours because, in rare cases, the application needs a long time to run (e.g. the nightly cron job that performs a database optimization).
Starter script
#!/bin/sh
PHP_FCGI_MAX_REQUESTS=1200
export PHP_FCGI_MAX_REQUESTS
export PHPRC="/etc/php5/cgi"
# Currently disabled:
#PHP_FCGI_CHILDREN=10
#export PHP_FCGI_CHILDREN
exec /usr/bin/php5-cgi
Package versions
- System: Ubuntu 12.04.2 LTS
- apache2-mpm-worker: 2.2.22-1ubuntu1.4
- libapache2-mod-fcgid: 1:2.3.6-1.1
- php5-common: 5.3.10-1ubuntu3.7
I'd regard 20-30 MB per process as quite small. It's all relative, really, but most CMS applications, for example, will require at least 100 MB. Also, your maximum upload size will be constrained by the maximum process size, if that matters.
When your server is unavailable, it's likely that the PHP worker processes are all busy; however, that's only a proximate cause. Something is slowing your server down such that, for a while at least, the PHP processes can't keep up with incoming requests. What is slowing it down is hard to judge, but the 'graceful kill fail' makes me think the process that was to be killed was probably waiting on disk.
Have you logged in while this is happening? Does the system feel responsive?
In top, look at the process states and watch for ones in the 'D' state, which are waiting on IO. Are there many of these? The 'wa' figure in the summary at the top is the total amount of time that processes spend waiting on IO (it's shown as a percentage, but that's likely a percentage of one processor's time). Tools like iotop, atop, and vmstat may also be useful for seeing which processes are disk-bound, and the extent to which the disk is limiting your overall performance.
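If it helps, here is a rough set of commands to run while a stall is happening (a sketch, assuming the standard procps tools; iotop and atop are not installed by default on Ubuntu 12.04):

# Processes stuck in uninterruptible IO wait ('D' state)
ps -eo state,pid,cmd | awk '$1 == "D"'

# Run queue, blocked processes and IO wait, sampled every 2 seconds
vmstat 2

# Per-process disk IO (needs root)
sudo iotop -o

# Age, state and resident memory of each FastCGI PHP process
ps -eo pid,etime,state,rss,cmd | grep [p]hp5-cgi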
Your understanding of what happens when a worker process is not available to take new requests is incorrect. New requests will not be assigned to it.
1000 requests before killing the worker is high. I'd suggest dropping it to somewhere between 10 and 50.
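As a sketch of what that could look like against the directives already shown in the question (pick whatever value in that range suits you):

<IfModule mod_fcgid.c>
    # Recycle each PHP process after a small number of requests
    FcgidMaxRequestsPerProcess 50
</IfModule>

Keep PHP_FCGI_MAX_REQUESTS in the wrapper script above whatever value you choose here (as the question's comment already does for 1000/1200), so PHP does not exit before mod_fcgid expects it to.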
I think you're on the right track with Hypothesis 1. mc0e's advice is pretty solid, so I'm mostly adding to it.
Those log messages that you're seeing suggest that individual processes are locking up under the prefork MPM, which gives you much better process isolation than worker. I've seen this in a production environment before and it means that you have some misbehaving code.
Between your high max requests per child and your hanging processes, the stage is set for memory bloat. The documentation specifically notes that a non-zero value helps protect against memory leaks, but if you set the value too high those benefits are lost. Having your processes hang on top of that compounds the overall memory footprint further.
This leaves you with two immediate takeaways:
- Drop MaxRequestsPerChild by a significant margin, as mc0e was suggesting. This helps to prevent the individual processes from living long enough to accumulate significant memory leaks...but as he said, 20-30M probably isn't that big of a deal.
- Track down the misbehaving code. lsof on your large processes may provide a hint depending on what the code is doing (i.e. file handle leakage, and hitting the max file handle ceiling may be related to the process deadlocks), but otherwise you're looking at code debugging.
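As a concrete starting point for that, a sketch (12345 stands in for the PID of one of the suspect php5-cgi processes):

# Count of open file handles held by the process
sudo lsof -p 12345 | wc -l

# Compare against its open-file limit
grep 'open files' /proc/12345/limits

# For a process that seems hung, see what syscall it is blocked in
sudo strace -p 12345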