I recently switched from Apache mpm-prefork (PHP as an Apache module) to mpm-worker (PHP via mod_fcgid) because of memory issues. I am running a fairly large PHP application that needs roughly 20-30 MB per prefork process.
Overall, the server runs stably and fast. However, from time to time the site becomes unavailable to some users for a few minutes.
Working hypothesis 1 (a rough idea) is that one or more of the processes (usually 2, sometimes up to 5 or 6) hangs, and every client assigned to a hung process (e.g. 50% of the clients) receives an error message.
Working hypothesis 2 is that MaxRequestsPerProcess is responsible. After 500 calls, the process tries to shut down, mod_fcgid does not manage a graceful kill, and while the process waits to be killed, further clients are assigned to (and rejected by) it. But I cannot really imagine that Apache would be that stupid.
My problem is: There is nothing in the error logs except some
[warn] mod_fcgid: process ???? graceful kill fail, sending SIGKILL
I am running out of ideas for where to trace the problem. It appears sporadically, and I have not yet managed to provoke it. Server performance (CPU/RAM) should not be an issue, as overall load has been low in recent weeks.
Thanks for any hints. Any comments on my hypotheses (which have not led me to a solution yet; I tried disabling MaxRequestsPerProcess but do not yet know whether it helped)? I would greatly appreciate ideas on how to trace this problem.
Apache configuration
<Directory /var/www/html>
    ...
    # PHP FCGI
    <FilesMatch \.php$>
        SetHandler fcgid-script
    </FilesMatch>
    Options +ExecCGI
</Directory>
<IfModule mod_fcgid.c>
    FcgidWrapper /var/www/php-fcgi-starter .php
    # Allow requests up to 33 MB
    FcgidMaxRequestLen 34603008
    FcgidIOTimeout 300
    FcgidBusyTimeout 3600
    # PHP_FCGI_MAX_REQUESTS is set to 1200 (> 1000) in the wrapper to avoid problems
    FcgidMaxRequestsPerProcess 1000
</IfModule>
Apache module configuration
<IfModule mod_fcgid.c>
    AddHandler fcgid-script .fcgi
    FcgidConnectTimeout 20
    FcgidBusyTimeout 7200
    DefaultMinClassProcessCount 0
    IdleTimeout 600
    IdleScanInterval 60
    MaxProcessCount 20
    MaxRequestsPerProcess 500
    PHP_Fix_Pathinfo_Enable 1
</IfModule>
Note: The busy timeout was set to 2 hours because, in rare cases, the application needs a long time to run (e.g. the nightly cron job that performs a database optimization).
Starter script
#!/bin/sh
PHP_FCGI_MAX_REQUESTS=1200
export PHP_FCGI_MAX_REQUESTS
export PHPRC="/etc/php5/cgi"
# Currently disabled:
#PHP_FCGI_CHILDREN=10
#export PHP_FCGI_CHILDREN
exec /usr/bin/php5-cgi
Package versions
- System: Ubuntu 12.04.2 LTS
- apache2-mpm-worker: 2.2.22-1ubuntu1.4
- libapache2-mod-fcgid: 1:2.3.6-1.1
- php5-common: 5.3.10-1ubuntu3.7
I'd regard 20-30 MB per process as quite small. It's all relative, really, but most CMS applications, for example, will require at least 100 MB. Also, your maximum upload size will be constrained by the maximum process size, if that matters.
When your server is unavailable, it's likely that the PHP worker processes are all busy; however, that's only a proximate cause. Something is slowing your server down such that, for a while at least, the PHP processes can't keep up with incoming requests. What is slowing it down is hard to judge, but the 'graceful kill fail' makes me think the process that was to be killed was probably waiting on disk.
Have you logged in while this is happening? Does the system feel responsive?
In top, look at the process states and watch for ones in the 'D' state, which are waiting on IO. Are there many of these? The 'wa' figure in the summary at the top is the total amount of time that processes spend waiting on IO (it's shown as a percentage, but that's likely a percentage of one processor's time). Tools like iotop, atop, and vmstat may also be useful for seeing which processes are disk-bound, and the extent to which the disk is limiting your overall performance.
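If it helps, here is a rough set of commands to run while a stall is happening (a sketch, assuming the standard procps tools; iotop and atop are not installed by default on Ubuntu 12.04):

# Processes stuck in uninterruptible IO wait ('D' state)
ps -eo state,pid,cmd | awk '$1 == "D"'

# Run queue, blocked processes and IO wait, sampled every 2 seconds
vmstat 2

# Per-process disk IO (needs root)
sudo iotop -o

# Age, state and resident memory of each FastCGI PHP process
ps -eo pid,etime,state,rss,cmd | grep [p]hp5-cgi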
Your understanding of what happens when a worker process is not available to take new requests is incorrect. New requests will not be assigned to it.
1000 requests before killing the worker is high. I'd suggest dropping it to somewhere between 10 and 50.
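As a sketch of what that could look like against the directives already shown in the question (pick whatever value in that range suits you):

<IfModule mod_fcgid.c>
    # Recycle each PHP process after a small number of requests
    FcgidMaxRequestsPerProcess 50
</IfModule>

Keep PHP_FCGI_MAX_REQUESTS in the wrapper script above whatever value you choose here (as the question's comment already does for 1000/1200), so PHP does not exit before mod_fcgid expects it to.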
I think you're on the right track with Hypothesis 1. mc0e's advice is pretty solid, so I'm mostly adding to it.
Those log messages that you're seeing suggest that individual processes are locking up under the prefork MPM, which gives you much better process isolation than worker. I've seen this in a production environment before and it means that you have some misbehaving code.
Between your high max requests per child and your hanging processes, the stage is set for memory bloat. The documentation specifically notes that a non-zero value helps protect against memory leaks, but if you set the value too high those benefits are lost. Having your processes hang on top of that compounds the overall memory footprint further.
This leaves you with two immediate takeaways:
- Drop MaxRequestsPerChild by a significant margin, as mc0e was suggesting. This helps to prevent the individual processes from living long enough to accumulate significant memory leaks...but as he said, 20-30M probably isn't that big of a deal.
- Track down the misbehaving code. lsof on your large processes may provide a hint depending on what the code is doing (i.e. file handle leakage, and hitting the max file handle ceiling may be related to the process deadlocks), but otherwise you're looking at code debugging.
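As a concrete starting point for that, a sketch (12345 stands in for the PID of one of the suspect php5-cgi processes):

# Count of open file handles held by the process
sudo lsof -p 12345 | wc -l

# Compare against its open-file limit
grep 'open files' /proc/12345/limits

# For a process that seems hung, see what syscall it is blocked in
sudo strace -p 12345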