This is my nginx.conf (I've updated the config to make sure that there is no PHP involved or any other bottleneck):
user nginx;
worker_processes 4;
worker_rlimit_nofile 10240;
pid /var/run/nginx.pid;

events
{
    worker_connections 1024;
}

http
{
    include /etc/nginx/mime.types;
    error_log /var/www/log/nginx_errors.log warn;
    port_in_redirect off;
    server_tokens off;
    sendfile on;
    gzip on;
    client_max_body_size 200M;
    map $scheme $php_https { default off; https on; }
    index index.php;
    client_body_timeout 60;
    client_header_timeout 60;
    keepalive_timeout 60 60;
    send_timeout 60;

    server
    {
        server_name dev.anuary.com;
        root "/var/www/virtualhosts/dev.anuary.com";
    }
}
I am using http://blitz.io/play to test my server (I bought the 10,000 concurrent connections plan). In a 30-second run, I get 964 hits and 5,587 timeouts. The first timeout happened 40.77 seconds into the test, when the number of concurrent users was at 200.
During the test, the server load was (top output):

PID   USER  PR  NI  VIRT  RES  SHR  S  %CPU  %MEM  TIME+    COMMAND
20225 nginx 20  0   48140 6248 1672 S  16.0  0.0   0:21.68  nginx
1     root  20  0   19112 1444 1180 S  0.0   0.0   0:02.37  init
2     root  20  0   0     0    0    S  0.0   0.0   0:00.00  kthreadd
3     root  RT  0   0     0    0    S  0.0   0.0   0:00.03  migration/0
Therefore, it is not a server resource issue. What is it, then?
UPDATE 2011-12-09 GMT 17:36.
So far, I have made the following changes to make sure that the bottleneck is not TCP/IP. I added this to /etc/sysctl.conf:
# These ensure that TIME_WAIT ports either get reused or closed fast.
net.ipv4.tcp_fin_timeout = 1
net.ipv4.tcp_tw_recycle = 1
# TCP memory
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
net.core.netdev_max_backlog = 262144
net.core.somaxconn = 4096
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_max_syn_backlog = 262144
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
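To apply these without a reboot (the standard procedure, shown for completeness):

sysctl -p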
Some more debug info:
[root@server node]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 126767
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
N.B. worker_rlimit_nofile is set to 10240 in the nginx config.
UPDATE 2011-12-09 GMT 19:02.
It looks like the more changes I make, the worse it gets, but here is the new config file.
user nginx;
worker_processes 4;
worker_rlimit_nofile 10240;
pid /var/run/nginx.pid;

events
{
    worker_connections 2048;
    # 1,353 hits, 2,751 timeouts, 72 errors - Bummer. Try again?
    # 1,408 hits, 2,727 timeouts - Maybe you should increase the timeout?
}

http
{
    include /etc/nginx/mime.types;
    error_log /var/www/log/nginx_errors.log warn;

    # http://blog.martinfjordvald.com/2011/04/optimizing-nginx-for-high-traffic-loads/
    access_log off;
    open_file_cache max=1000;
    open_file_cache_valid 30s;
    client_body_buffer_size 10M;
    client_max_body_size 200M;
    proxy_buffers 256 4k;
    fastcgi_buffers 256 4k;
    keepalive_timeout 15 15;
    client_body_timeout 60;
    client_header_timeout 60;
    send_timeout 60;
    port_in_redirect off;
    server_tokens off;
    sendfile on;
    gzip on;
    gzip_buffers 256 4k;
    gzip_comp_level 5;
    gzip_disable "msie6";
    map $scheme $php_https { default off; https on; }
    index index.php;

    server
    {
        server_name ~^www\.(?P<domain>.+);
        rewrite ^ $scheme://$domain$request_uri? permanent;
    }

    include /etc/nginx/conf.d/virtual.conf;
}
UPDATE 2011-12-11 GMT 20:11.
This is the output of netstat -ntla during the test: https://gist.github.com/d74750cceba4d08668ea
UPDATE 2011-12-12 GMT 10:54.
Just to clarify, iptables (the firewall) is off while testing.
UPDATE 2011-12-12 GMT 22:47.
This is the sysctl -p | grep mem dump.
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_mem = 8388608 8388608 8388608
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
net.ipv4.route.flush = 1
net.ipv4.ip_local_port_range = 1024 65000
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_max = 8388608
net.core.wmem_default = 65536
net.core.netdev_max_backlog = 262144
net.core.somaxconn = 4096
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_max_syn_backlog = 262144
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
UPDATE 2011-12-12 GMT 22:49.
I am using blitz.io to run all the tests. The URL I am testing is http://dev.anuary.com/test.txt, using the following command:

--region ireland --pattern 200-250:30 -T 1000 http://dev.anuary.com/test.txt
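For a rough cross-check outside blitz.io, a similar pattern can be approximated locally with ApacheBench (assuming ab is installed; 250 concurrent clients for 30 seconds):

ab -c 250 -t 30 http://dev.anuary.com/test.txt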
UPDATE 2011-12-13 GMT 13:33.
nginx user limits (set in /etc/security/limits.conf):
nginx hard nofile 40000
nginx soft nofile 40000
You will need to dump your network connections during the test. While the server may have near-zero load, your TCP/IP stack could be filling up. Look for TIME_WAIT connections in the netstat output.
If this is the case, then you will want to look into tuning TCP/IP kernel parameters relating to TCP wait states, TCP recycling, and similar settings.
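A quick way to count connections by state during the test (a generic netstat/awk one-liner, nothing specific to this setup):

netstat -ant | awk '{print $6}' | sort | uniq -c | sort -rn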
Also, you have not described what is being tested. I always test several types and sizes of content. This may not apply in your case, but it is something I do when performance testing: testing different types of files can help you pinpoint the bottleneck. Even with static content, testing different sizes of files is important to get timeouts and other metrics dialed in.
We have some static-content Nginx boxes handling 3,000+ active connections, so Nginx can certainly do it.
Update: Your netstat shows a lot of open connections. You may want to try tuning your TCP/IP stack. Also, what file are you requesting? Nginx should close the port quickly.
Here is a suggestion for sysctl.conf: these values are very low, but I have had success with them on high-concurrency Nginx boxes.
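A minimal sketch of what such low values might look like (these exact numbers are assumptions to test, not prescriptions):

net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 1024 65535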
Yet another hypothesis: you have increased worker_rlimit_nofile, but the maximum number of clients is defined in the documentation as

max_clients = worker_processes * worker_connections

(with worker_processes 4 and worker_connections 1024, that is 4,096 clients). What if you try to raise worker_connections to, like, 8192? Or, if there are enough CPU cores, increase worker_processes?
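For example, the relevant part of nginx.conf might look like this (the numbers are illustrative; size them to your cores and traffic):

worker_processes 8;

events
{
    worker_connections 8192;
}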
I was having a very similar issue with an nginx box serving as a load balancer in front of an upstream of Apache servers.
In my case, I was able to isolate the problem as networking-related: the upstream Apache servers became overloaded. I could recreate it with simple bash scripts while the overall system was under load. According to an strace of one of the hung processes, the connect() call was getting ETIMEDOUT.
These settings (on the nginx and upstream servers) eliminated the problem for me. I was getting 1 or 2 timeouts per minute before making these changes (boxes handling ~100 reqs/s) and now get 0.
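A minimal sketch of settings in this spirit (the exact values are assumptions; they target the listen backlog and SYN queue, which is where connect() timeouts usually come from):

# /etc/sysctl.conf, on both the nginx and upstream boxes
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096

It also helps to make the nginx listen backlog match, since the kernel caps listen()'s backlog at somaxconn:

server
{
    listen 80 backlog=4096;
}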
I would not recommend using net.ipv4.tcp_tw_recycle or net.ipv4.tcp_tw_reuse, but if you want to use one, go with the latter. They can cause bizarre issues if there is any kind of latency at all, and the latter is at least the safer of the two.
I think having tcp_fin_timeout set to 1 above may be causing some trouble as well. Try putting it at 20/30 - still far below the default.
Maybe it is not an nginx problem. While you test on blitz.io, watch the PHP-FPM log (that's what I am using to handle the PHP):
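For example (the log path is an assumption; it varies by distribution and FPM version):

tail -f /var/log/php-fpm/error.log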
This triggers an error, and the timeouts start to rise:
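The warning looks roughly like this (pool name and limit are illustrative):

[WARNING] [pool www] server reached max_children setting (5), consider raising it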
So, raise max_children in the FPM conf and it's done! ;D
You have max open files set too low (1024); try changing it and restart nginx (cat /proc/<nginx>/limits to confirm). And increase worker_connections to 10240 or higher.
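A quick way to verify the limit the running master actually got, using the pid file from the config above:

cat /proc/$(cat /var/run/nginx.pid)/limits | grep 'open files'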