My nginx keeps crashing and reporting "bad gateway" errors in the browser. Nginx and PHP-FPM don't come preconfigured to handle large traffic loads, so I had to put a systemctl restart php7.0-fpm cron job in place, running each hour, just to make sure my sites don't stay down for too long when they go. Let's just get down to it.
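The workaround, for reference, is just an hourly restart (a sketch of the cron entry; the file name is whatever you prefer under /etc/cron.d):

# /etc/cron.d/restart-php-fpm: hourly band-aid, not a fix
0 * * * * root /bin/systemctl restart php7.0-fpm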
Some errors I get from /var/log/php7.0-fpm.log:
[20-Sep-2017 12:08:21] NOTICE: [pool web3] child 3495 started
[20-Sep-2017 12:08:21] NOTICE: [pool web3] child 2642 exited with code 0 after 499.814492 seconds from start
[20-Sep-2017 12:32:28] WARNING: [pool web3] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 7 idle, and 57 total children
Nothing jumps out at me in the nginx log. If I leave PHP-FPM running for too long without restarting it, I get gateway errors. I've tried following tutorials three times now, tweaking settings, but it's still no good. Right now I've probably got all kinds of settings way off, but it never works whichever way I do it.
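For what it's worth, this is roughly how I grep the nginx error log for upstream trouble (the exact message wording may vary by nginx version):

# Surface PHP-FPM upstream failures near the 502s
grep -E 'upstream|connect\(\) to unix' /var/log/nginx/error.log | tail -n 50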
/etc/nginx/nginx.conf:
user www-data;
worker_processes auto;
pid /run/nginx.pid;
worker_rlimit_nofile 100000;

events {
    worker_connections 4096;
    use epoll;
    multi_accept on;
}

http {
    sendfile on;
    reset_timedout_connection on;
    client_body_timeout 10;
    send_timeout 2;
    keepalive_timeout 30;
    keepalive_requests 100000;
    tcp_nopush on;
    tcp_nodelay on;
    types_hash_max_size 2048;
    fastcgi_read_timeout 300000;
    client_max_body_size 9000m;
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
    ssl_prefer_server_ciphers on;
    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log;
    gzip on;
    gzip_disable "msie6";
    gzip_vary on;
    gzip_proxied any;
    gzip_comp_level 6;
    gzip_buffers 16 8k;
    gzip_http_version 1.1;
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
    open_file_cache max=200000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;
    access_log off;
}
/etc/php/7.0/fpm/php-fpm.conf:
[www]
pm = dynamic
pm.max_spare_servers = 200
pm.min_spare_servers = 100
pm.start_servers = 100
pm.max_children = 300
[global]
pid = /run/php/php7.0-fpm.pid
error_log = /var/log/php7.0-fpm.log
include=/etc/php/7.0/fpm/pool.d/*.conf
/etc/php/7.0/fpm/pool.d/www.conf:
[www]
user = www-data
group = www-data
listen = /run/php/php7.0-fpm.sock
listen.owner = www-data
listen.group = www-data
pm = dynamic
pm.max_children = 300
pm.start_servers = 100
pm.min_spare_servers = 100
pm.max_spare_servers = 200
pm.max_requests = 500
One of my sites (/etc/php/7.0/fpm/pool.d/web3.conf):
[web3]
listen = /var/lib/php7.0-fpm/web3.sock
listen.owner = web3
listen.group = www-data
listen.mode = 0660
user = web3
group = client1
pm = dynamic
pm.max_children = 141
pm.start_servers = 20
pm.min_spare_servers = 20
pm.max_spare_servers = 35
pm.max_requests = 500
chdir = /
env[HOSTNAME] = $HOSTNAME
env[TMP] = /var/www/clients/client1/web3/tmp
env[TMPDIR] = /var/www/clients/client1/web3/tmp
env[TEMP] = /var/www/clients/client1/web3/tmp
env[PATH] = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Resource/proc usage from htop: [screenshot omitted]
The issue is with your database access. You have several MySQL processes using CPU, which indicates that database queries are taking a long time to execute. You need to look into your application for slow database queries; enabling the MySQL slow query log is a good way to find them.

The slow database queries cause PHP-FPM to run out of the child processes that serve client requests, which in turn causes 502 Bad Gateway errors. You can try to increase the pm.max_children setting for the web3 pool, since that is the pool producing the errors. This can remove the scalability symptoms, but it does not fix the root cause, which is application/database inefficiency.

If you are not using the www pool, you can remove it to save the resources it uses.
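For instance, a minimal sketch of that change in web3.conf (the numbers are illustrative; size them to your RAM and measured per-worker memory):

; /etc/php/7.0/fpm/pool.d/web3.conf
pm.max_children = 200
pm.start_servers = 30
pm.min_spare_servers = 25
pm.max_spare_servers = 50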
The ideal setting for pm.max_requests is zero, that is, PHP workers are never recycled. If your PHP workers don't leak memory due to badly coded libraries, you can use zero there. Otherwise, use whatever value keeps the memory usage of the workers reasonable; there really isn't any other good advice to give for this setting.
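To pick a sensible value, it helps to know what each worker actually consumes. A quick sketch (php-fpm7.0 is the Debian/Ubuntu process name; adjust for your distro):

# Average resident memory per PHP-FPM worker
ps -o rss= -C php-fpm7.0 | awk '{ sum += $1; n++ } END { if (n) printf "%d workers, %.1f MB average\n", n, sum / n / 1024 }'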
There isn't much you can do with the nginx settings here, since it is PHP-FPM that is intermittently unavailable. You could change gzip_comp_level to 1, which makes nginx spend a little less CPU compressing output, but that has a very small effect compared to application optimisation.
(This should be a comment, but it's a bit long.)

What you describe is not a capacity issue unless your server is so badly configured that the OOM killer is kicking in, and it is not the error you've quoted from your logs. A few observations:

Why do you have half a gig of swap on a box with 12 GB of RAM?
Your keepalive timeout is too high.
You have disabled access logging (your logs are the place to start looking for capacity issues).
The top output hints at problems with MySQL performance.
Your pm.max_requests is too low.
You've not capped the listen.backlog.

Everything you've shown us here has issues, and it's just the tip of the iceberg. Voting to close.
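A rough sketch of the directive-level fixes among those points (values illustrative, not tuned for this box):

# nginx.conf: re-enable access logging and shorten keepalive
access_log /var/log/nginx/access.log;
keepalive_timeout 5;

; web3 pool: cap the socket backlog and recycle workers less aggressively
listen.backlog = 511
pm.max_requests = 5000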
Is it the web3 site that is going offline? This log entry seems to suggest the cause:

[20-Sep-2017 12:32:28] WARNING: [pool web3] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 7 idle, and 57 total children

You've got really high values for start_servers/max_spare_servers in the www pool, but much lower values for web3.

You don't seem to be out of memory, so giving MySQL more memory may help. Unless your PHP app never queries MySQL, leaving MySQL out of your optimization process is a mistake.
To start, you'll want to look at your MySQL config. I believe most distributions are fairly conservative with memory and thread settings. Look for the MySQL example configs, e.g. my-large.cnf and my-medium.cnf, and compare them to yours. Debian-based distros ship them in /usr/share/doc/mysql-server-x.y/examples/ (where x.y is the major version).
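For example (paths vary by distro and version, and newer MySQL packages may not ship the examples at all):

# List the shipped examples, then compare one against the live config
ls /usr/share/doc/mysql-server-*/examples/
diff /etc/mysql/my.cnf /usr/share/doc/mysql-server-5.5/examples/my-large.cnf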
When adjusting the various knobs, I'd recommend small adjustments. For example, change a value from 8M to 16M.
If it's your PHP app, you'll also want to look at the slow query log, as suggested in Tero Kilkanen's answer.
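Enabling the slow query log is a small my.cnf change (threshold and path are illustrative):

[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 1   # log anything slower than one second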
Hope that helps.
In my experience, especially with a large site, PHP-FPM uses a lot of processor power. This happens when there is no cache available: the server has to render the page, cache it, and only then serve the cache. I've had the same issue with large sites before. The best thing to do is use httrack to crawl your site, with speed limits set so you don't overload the server. This builds your nginx cache, and once it's built you'll see pages load instantly with very little CPU or RAM usage.

The root cause usually comes down to page rendering time, which can be caused by too much JS or CSS, or, most likely, too many SQL requests or a poorly configured SQL database. Make sure to index database tables that are queried frequently.
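A sketch of such a rate-limited crawl (example.com, the flags, and the limits are placeholders; check httrack's manual for your version):

# Warm the cache slowly: cap transfer rate and parallel connections
httrack "https://example.com/" -O /tmp/site-mirror --max-rate=100000 --sockets=2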
htop appears to indicate that each of the 15 MySQL-associated PIDs has used more than 1:nn.nn of TIME, and each has at least 1G of VIRT RAM in use. Since you have 12 GB of RAM in total, is it time for you to share your MySQL configuration with us, to allow some reasonable checks on it, even if it is not the problem? An uptime of 1 day, 11 hours is encouraging.

Any idea what PID 6148 was doing that has a TIME of 28:+ invested in the effort?
From an earlier response today from @xendi: "Whenever this happens, all pages on all sites, no matter what scripts or content, error out with the gateway error. This happens to all pages and sites."
Have you looked at the php.ini setting session.gc_maxlifetime = nnnn (garbage collection seconds) as a possible cause?
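The relevant php.ini knobs, for reference (values shown are the common defaults, purely illustrative):

; session garbage collection: GC runs with probability gc_probability/gc_divisor per request
session.gc_maxlifetime = 1440
session.gc_probability = 1
session.gc_divisor = 1000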
09/24/2017: the nginx.conf questions above may have an impact; possibly a helpful link.
This seems to be all about memory. Try decreasing the number of PHP servers and limiting the memory used by the PHP and MySQL servers.
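A minimal sketch of what that could look like (all values illustrative, not tuned for this machine):

; /etc/php/7.0/fpm/pool.d/web3.conf: fewer workers
pm.max_children = 50

; /etc/php/7.0/fpm/php.ini: cap per-request memory
memory_limit = 128M

# /etc/mysql/my.cnf: cap the main InnoDB memory consumer
[mysqld]
innodb_buffer_pool_size = 2G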