We have never had any problems with nginx. We use 5 nginx servers as load balancers in front of many Spring Boot application servers.
We have been running them for years on Debian 9 with the default nginx package 1.10.3. Recently we switched three of our load balancers to Debian 10 with nginx 1.14.2. At first everything ran smoothly. Then, under high load, we encountered some problems. It starts with:
2020/02/01 17:10:55 [crit] 5901#5901: *3325390 SSL_write() failed while sending to client, client: ...
2020/02/01 17:10:55 [crit] 5901#5901: *3306981 SSL_write() failed while sending to client, client: ...
In between we get lots of:
2020/02/01 17:11:04 [error] 5902#5902: *3318748 upstream timed out (110: Connection timed out) while connecting to upstream, ...
2020/02/01 17:11:04 [crit] 5902#5902: *3305656 SSL_write() failed while sending response to client, client: ...
2020/02/01 17:11:30 [error] 5911#5911: unexpected response for ocsp.int-x3.letsencrypt.org
It ends with:
2020/02/01 17:11:33 [error] 5952#5952: unexpected response for ocsp.int-x3.letsencrypt.org
The problem only exists for 30-120 seconds during high load and disappears afterwards.
In the kernel log we sometimes see:
Feb 1 17:11:04 kt104 kernel: [1033003.285044] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies. Check SNMP counters.
But on other occasions we don't see any kernel log messages.
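That kernel message points at the listen/accept queue, so one thing worth checking during such a spike are the SNMP counters it mentions. A minimal sketch (the counter names below are the usual Linux MIB names; adjust if your kernel exposes different ones):
# cumulative accept-queue overflows/drops and SYN-cookie activity since boot
nstat -az TcpExtListenOverflows TcpExtListenDrops TcpExtTCPReqQFullDoCookies
# roughly the same information, human readable
netstat -s | grep -iE 'listen|syn'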
On both the Debian 9 and Debian 10 servers we use an identical setup and have some TCP tuning in place:
# Kernel tuning settings
# https://www.nginx.com/blog/tuning-nginx/
net.core.rmem_max=26214400
net.core.wmem_max=26214400
net.ipv4.tcp_rmem=4096 524288 26214400
net.ipv4.tcp_wmem=4096 524288 26214400
net.core.somaxconn=1000
net.core.netdev_max_backlog=5000
net.ipv4.tcp_max_syn_backlog=10000
net.ipv4.ip_local_port_range=16000 61000
net.ipv4.tcp_max_tw_buckets=2000000
net.ipv4.tcp_fin_timeout=30
net.core.optmem_max=20480
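These are applied via sysctl. To reload them and double-check the effective values on the running kernel, something like the following works (the drop-in path is only an example, not our actual file name):
# reload all drop-ins, e.g. /etc/sysctl.d/90-tuning.conf, then verify a few values
sysctl --system
sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog net.core.netdev_max_backlog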
The nginx config is exactly the same, so I will just show the main file:
user www-data;
worker_processes auto;
worker_rlimit_nofile 50000;
pid /run/nginx.pid;
events {
    worker_connections 5000;
    multi_accept on;
    use epoll;
}
http {
    root /var/www/loadbalancer;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    types_hash_max_size 2048;
    server_tokens off;
    client_max_body_size 5m;
    client_header_timeout 20s; # default 60s
    client_body_timeout 20s; # default 60s
    send_timeout 20s; # default 60s
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
    ssl_session_timeout 1d;
    ssl_session_cache shared:SSL:100m;
    ssl_buffer_size 4k;
    ssl_dhparam /etc/nginx/dhparam.pem;
    ssl_prefer_server_ciphers on;
    ssl_ciphers 'ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA:ECDHE-ECDSA-DES-CBC3-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:!DSS';
    ssl_session_tickets on;
    ssl_session_ticket_key /etc/nginx/ssl_session_ticket.key;
    ssl_session_ticket_key /etc/nginx/ssl_session_ticket_old.key;
    ssl_stapling on;
    ssl_stapling_verify on;
    ssl_trusted_certificate /etc/ssl/rapidssl/intermediate-root.pem;
    resolver 8.8.8.8;
    log_format custom '$host $server_port $request_time $upstream_response_time $remote_addr "$ssl_session_reused" $upstream_addr $time_iso8601 "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"';
    access_log /var/log/nginx/access.log custom;
    error_log /var/log/nginx/error.log;
    proxy_set_header Host $http_host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_cache_path /var/cache/nginx/ levels=1:2 keys_zone=imagecache:10m inactive=7d use_temp_path=off;
    proxy_connect_timeout 10s;
    proxy_read_timeout 20s;
    proxy_send_timeout 20s;
    proxy_next_upstream off;
    map $http_user_agent $outdated {
        default 0;
        "~MSIE [1-6]\." 1;
        "~Mozilla.*Firefox/[1-9]\." 1;
        "~Opera.*Version/[0-9]\." 1;
        "~Chrome/[0-9]\." 1;
    }
    include sites/*.conf;
}
The upstream timeouts point to some problem with our Java machines. But at the same time the Debian 9 nginx load balancer is running fine and has no problems connecting to any of the upstream servers. And the Let's Encrypt and SSL_write errors suggest to me some problem with nginx or TCP or whatever. I really don't know how to debug this situation. But we can reliably reproduce it most of the times we encounter high load on the Debian 10 servers, and we never saw it on Debian 9.
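Since part of the noise is the OCSP stapling error, one check that can be done from the outside during an incident is whether a stapled response is still being delivered. A sketch (the hostname is a placeholder):
# ask the load balancer for a stapled OCSP response
echo | openssl s_client -connect www.example.com:443 -servername www.example.com -status 2>/dev/null | grep -i -A 3 'OCSP response'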
Then I installed the stable version nginx 1.16 on Debian 10 to see whether this is a bug in nginx that has already been fixed:
nginx version: nginx/1.16.1
built by gcc 8.3.0 (Debian 8.3.0-6)
built with OpenSSL 1.1.1c 28 May 2019 (running with OpenSSL 1.1.1d 10 Sep 2019)
TLS SNI support enabled
configure arguments: ...
But it didn't help.
It seems to be a network-related problem, but we do not encounter it on the application servers. Their load is of course lower, as the load balancer/nginx machine has to handle both external and internal traffic.
It is very difficult to debug as it only happens under high load. We tried to load test the servers with ab, but we could not reproduce the problem.
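The only additional data point I can think of collecting during the next spike is the state of the accept queue of the HTTPS listener itself; for listening sockets, ss reports the current queue length in Recv-Q and the configured backlog limit in Send-Q. A sketch:
# watch the accept queue of the :443 listener (Recv-Q = pending, Send-Q = limit)
watch -n1 "ss -ltn 'sport = :443'"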
Can somebody help me and give me some hints on how to start debugging this situation further?
The default value of accept_mutex changed from on to off. After setting it back to "on", nginx is happily running with 10k requests per second again. I guess it was the combination of multi_accept and accept_mutex that caused my troubles.
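Concretely, the fix is a one-line addition to the events block shown above (the accept_mutex default flipped from on to off in newer nginx releases, which is why Debian 9's 1.10.3 behaved differently):
events {
    worker_connections 5000;
    multi_accept on;
    use epoll;
    accept_mutex on;  # default is off in newer nginx; turning it back on resolved the overload symptoms
}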
These settings are not recommended, by the way, and we have since changed to a more modern setup with reuseport etc.; see the sketch below. Follow the guidelines on the nginx blog for your own setup. Nginx is great.
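For reference, the reuseport variant roughly amounts to this in each server block (a sketch, not our exact config; with reuseport the accept_mutex setting no longer matters because the kernel distributes new connections across the workers' listening sockets):
server {
    # one listening socket per worker; the kernel balances incoming connections
    listen 443 ssl reuseport;
    # ... rest of the vhost unchanged
}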