I'm investigating replacing a proprietary software load balancer with HAProxy. As part of this investigation, I am testing HAProxy under load. While my HAProxy configuration works fine when testing as a single user, as soon as I put any load on it the speed of the site drops dramatically, and before long (~100 simulated users) our load-testing tool starts to report failures.
It's quite a straightforward configuration; the only notable points are that we're using HAProxy 1.5.4 with OpenSSL and PCRE support compiled in and used, and that we have some ACLs matching on URLs, although that frontend isn't used in this load test.
This is running on a CentOS 6.5 machine.
Our (sanitised) configuration for the frontend/backend combination in the load test, along with global and defaults:
global
    daemon
    tune.ssl.default-dh-param 2048
    maxconn 100000
    maxsessrate 100000
    log /dev/log local6

defaults
    mode http
    option forwardfor
    option http-server-close
    timeout client 61s
    timeout server 61s
    timeout connect 13s
    log global
    option httplog

frontend stats
    bind xxx.xxx.xxx.xxx:80
    default_backend stats-backend

backend stats-backend
    stats enable
    server stats 127.0.0.1:80

frontend portal-frontend
    bind xxx.xxx.xxx.xxx:80
    default_backend portal-backend

frontend portal-frontend-https
    bind xxx.xxx.xxx.xxx:443 ssl crt /path/to/pem
    default_backend portal-backend

backend portal-backend
    redirect scheme https if !{ ssl_fc }
    appsession session len 140 timeout 4h request-learn
    server web1.example.com web1.example.com:80 check
    server web2.example.com web2.example.com:80 check

[...snip...]
During the load test we're getting some information from the logs, but not a huge amount. Relevant snippets:
Sep 4 11:06:12 xxxx haproxy[15609]: xxx.xxx.xxx.xxx:30983 [04/Sep/2014:11:05:42.984] portal-frontend-https~ portal-frontend-https/<NOSRV> -1/-1/-1/-1/28782 408 212 - - cR-- 1840/1840/0/0/0 0/0 "<BADREQ>"
...
Sep 4 11:06:03 xxxx haproxy[15609]: xxx.xxx.xxx.xxx:61502 [04/Sep/2014:11:05:47.810] portal-frontend-https~ portal-frontend-https/<NOSRV> -1/-1/-1/-1/14345 400 187 - - CR-- 1715/1693/0/0/0 0/0 "<BADREQ>"
...
Sep 4 11:06:03 xxxx haproxy[15609]: xxx.xxx.xxx.xxx:43939 [04/Sep/2014:11:05:59.553] portal-frontend portal-backend/<NOSRV> 314/-1/-1/-1/2602 302 181 - - LR-- 1719/22/223/0/3 0/0 "GET /mon/login.php?C=1&LID=15576783&TID=8145&PID=8802 HTTP/1.1"
On the basis of these log entries, we've tried things like adjusting timeout http-request, but without any improvement: the load test runs for longer before our tool reports failures, but the slowdown occurs in the same way.
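For reference, the kind of change we tried lives in the defaults section; the value below is illustrative rather than the exact figure we tested. When this timeout expires, HAProxy returns a 408 and logs a cR / <BADREQ> entry like those above:

defaults
    # illustrative value: how long HAProxy waits for the client to send
    # a complete request before replying 408 (logged as cR / <BADREQ>)
    timeout http-request 10s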
I'm confident HAProxy is capable of far better than this, but I really don't know where to turn next to start diagnosing the problem (or limitation).
Please run dmesg and ensure your iptables conntrack table is not full. If it is, you will see many messages like this one: "ip_conntrack: table full, dropping packet"
If so, tune the sysctl net.ipv4.netfilter.ip_conntrack_max. The default value is very low; you can raise it to 50000, maybe more, depending on your workload.
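For example, something along these lines (a sketch for a CentOS 6 box; the exact sysctl name can vary with the conntrack module in use, e.g. net.netfilter.nf_conntrack_max on newer kernels):

    # look for drops
    dmesg | grep conntrack

    # check the current limit and usage
    sysctl net.ipv4.netfilter.ip_conntrack_max
    cat /proc/sys/net/netfilter/nf_conntrack_count

    # raise the limit at runtime
    sysctl -w net.ipv4.netfilter.ip_conntrack_max=50000

    # make it persistent across reboots
    echo "net.ipv4.netfilter.ip_conntrack_max = 50000" >> /etc/sysctl.conf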
Baptiste
Felix is right. You need maxconn set low on your backend servers, and your global maxconn is way too high; set it to something like 4000.
It is critical that you understand the difference between global and server maxconn.
Willy Tarreau (HAProxy's author) describes it very clearly here: https://stackoverflow.com/questions/8750518/difference-between-global-maxconn-and-server-maxconn-haproxy
I have been using HAProxy for years and my default is 64 maxconn on backend servers.
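As a sketch against the servers in the configuration above (64 is just my default; size it to what each web server can comfortably handle, since connections beyond the limit then queue inside HAProxy instead of piling onto the servers):

global
    maxconn 4000

backend portal-backend
    server web1.example.com web1.example.com:80 check maxconn 64
    server web2.example.com web2.example.com:80 check maxconn 64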
HAProxy is very high performance and is certainly capable of overloading a web server if misconfigured. Take a look at the web servers' network connections and error logs to see if they're hitting max connections; I would not be surprised if that is the case.
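For example, on each web server, something like this (assuming Apache on CentOS; adjust the port and log path to your setup):

    # count established connections on the HTTP port
    netstat -ant | grep ':80 ' | grep -c ESTABLISHED

    # Apache example: look for "server reached MaxClients" warnings
    grep -i maxclients /var/log/httpd/error_log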