I have several servers serving a single site.
The main server runs nginx and php-fpm; all the other servers run only php-fpm. On the main server, nginx connects to the local php-fpm via a Unix socket and to the others via TCP.
Roughly once an hour (not exactly; sometimes more often), something strange happens: every connection from nginx to the php-fpm servers times out. nginx fails to establish a connection at all.
2014/03/24 04:59:09 [error] 2123#0: *925153 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.5:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2124#0: *926742 connect() to unix:/tmp/php-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://unix:/tmp/php-fpm.sock:", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2123#0: *925159 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.2:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2123#0: *923874 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.3:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2123#0: *925164 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.4:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2124#0: *909392 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.3:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2124#0: *923098 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.5:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2125#0: *923309 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.4:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
As this is a fairly busy site, the log fills up with entries like the ones above quite fast.
This lasts for roughly 10 to 15 seconds, and then everything goes back to normal. Besides the connection timeout errors posted here, there don't seem to be any other errors.
I suspect the problem lies with nginx since it happens simultaneously across all the php-fpm servers.
What would cause this? And how could this be resolved?
My nginx config is...
user nginx;
worker_processes 4;
worker_rlimit_nofile 30000;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
    worker_connections 4096;
}
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log main;
    sendfile on;
    keepalive_timeout 5;
    fastcgi_buffers 256 4k;
    gzip on;
    gzip_disable "msie6";
    fastcgi_cache_path /dev/shm/caches/ levels=1:2 keys_zone=zoneone:4000m max_size=4000m inactive=30m;
    fastcgi_temp_path /var/www/tmp 1 2;
    fastcgi_cache_key "$scheme$proxy_host$request_uri";
    fastcgi_connect_timeout 3s;
    limit_req_zone $binary_remote_addr zone=limitone:200m rate=1r/s;
    limit_req_zone $binary_remote_addr zone=limitcomic:500m rate=40r/m;
    upstream partone {
        server unix:/tmp/php-fpm.sock;
    }
    upstream parttwo {
        server 192.168.1.3:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.4:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.5:9000 weight=10 max_fails=0 fail_timeout=2s;
    }
    upstream parttre {
        server 192.168.1.2:9000 weight=8 max_fails=0 fail_timeout=2s;
        server 192.168.1.3:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.4:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.5:9000 weight=10 max_fails=0 fail_timeout=2s;
    }
    ... stuff with server, locations and such...
}
You can see that I don't even use all 5 servers in the same context.
nginx version: nginx/1.4.5
This is an educated guess. The problem could be caused by exhaustion of local TCP ports for connections to the upstream servers.
You can check the range of allowed ports with:
The default on my Debian installation is 32768 - 61000.
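The range check mentioned above can be sketched as a small shell function. This assumes a Linux host, where the kernel exposes the setting at /proc/sys/net/ipv4/ip_local_port_range; the function name and its optional path argument are hypothetical and only there so the sketch can be tried against a sample file.

```shell
# Print the local (ephemeral) port range and how many ports it contains.
# The optional first argument is a hypothetical override for the file to read;
# by default the Linux procfs entry is used.
port_range_size() {
    range_file="${1:-/proc/sys/net/ipv4/ip_local_port_range}"
    # The file holds two whitespace-separated integers: lowest and highest port.
    read -r low high < "$range_file"
    echo "range: $low-$high, ports: $((high - low + 1))"
}
```

Calling `port_range_size` with no argument reads the live value. The same two numbers are reported by `sysctl net.ipv4.ip_local_port_range`. With the Debian default quoted above, the range holds 28233 ports.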
You can expand the range by entering the following command as root:
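A minimal sketch of such a command, wrapped in a function so the bounds are checked first. The function name, the DRY_RUN switch, and the example bounds 1024 and 65535 are assumptions for illustration, not values taken from this answer; `sysctl -w` is the standard way to change a kernel parameter at runtime.

```shell
# Widen the local port range at runtime (needs root unless DRY_RUN=1).
# Bounds are passed as arguments; any concrete values are the caller's choice.
set_port_range() {
    low="$1"; high="$2"
    # Refuse obviously invalid bounds before touching the kernel setting.
    if [ "$low" -lt 1 ] || [ "$high" -gt 65535 ] || [ "$low" -ge "$high" ]; then
        echo "invalid range: $low $high" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Hypothetical dry-run mode: only print the command that would run.
        echo "sysctl -w net.ipv4.ip_local_port_range=\"$low $high\""
    else
        sysctl -w net.ipv4.ip_local_port_range="$low $high"
    fi
}
```

For example, `set_port_range 1024 65535` run as root would apply an illustrative wider range. The change takes effect immediately but is lost on reboot.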
If you are running Debian or a derived distribution, you can persist this setting across reboots by editing /etc/sysctl.d/99-local.conf and entering this into the file:
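The entry is a single key/value line in sysctl syntax. A minimal sketch that appends it to the file follows; the function name and the CONF override are hypothetical (the override only lets the sketch run without root), and the bounds are again whatever you pass in, not values from this answer.

```shell
# Append the port-range setting to a sysctl drop-in file.
# CONF defaults to the path named above; overriding it is an assumption made
# so the sketch can be tried on a scratch file.
persist_port_range() {
    conf="${CONF:-/etc/sysctl.d/99-local.conf}"
    # sysctl configuration files use "key = value" lines.
    printf 'net.ipv4.ip_local_port_range = %s %s\n' "$1" "$2" >> "$conf"
}
```

After `persist_port_range 1024 65535` (illustrative bounds, run as root), the file contains the line `net.ipv4.ip_local_port_range = 1024 65535`. It is read at boot, and `sysctl -p /etc/sysctl.d/99-local.conf` loads it immediately.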