We recently had a spike in traffic which, while only moderate in size, caused haproxy to max out one of the CPU cores (and the server became unresponsive). I'm guessing that I'm doing something inefficient in the config, and so would like to ask all the haproxy experts out there if they would be so kind as to critique my config file below (mainly from a performance perspective).
The config is intended to distribute load between a group of HTTP application servers, a group of servers that handle websocket connections (with a number of separate processes on different ports), and a static file webserver. It's working well apart from the performance issue. (Some details have been redacted.)
Any guidance you could offer would be much appreciated!
HAProxy v1.4.8
#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
daemon
maxconn 100000
log 127.0.0.1 local0 notice
#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
log global
mode http
option httplog
option httpclose #http://serverfault.com/a/104782/52811
timeout connect 5000ms
timeout client 50000ms
timeout server 5h #long timeouts to stop WS drops - when v1.5 is stable, use 'timeout tunnel';
#---------------------------------------------------------------------
# FRONTEND
#---------------------------------------------------------------------
frontend public
bind *:80
maxconn 100000
reqidel ^X-Forwarded-For:.* #Remove any x-forwarded-for headers
option forwardfor #Set the forwarded for header (needs option httpclose)
default_backend app
redirect prefix http://xxxxxxxxxxxxxxxxx code 301 if { hdr(host) -i www.xxxxxxxxxxxxxxxxxxx }
timeout client 5h #long timeouts to stop WS drops - when v1.5 is stable, use 'timeout tunnel';
# ACLs
##########
acl static_request hdr_beg(host) -i i.
acl static_request hdr_beg(host) -i static.
acl static_request path_beg /favicon.ico /robots.txt
acl test_request hdr_beg(host) -i test.
acl ws_request hdr_beg(host) -i ws
# ws11
acl ws11x1_request hdr_beg(host) -i ws11x1
acl ws11x2_request hdr_beg(host) -i ws11x2
acl ws11x3_request hdr_beg(host) -i ws11x3
acl ws11x4_request hdr_beg(host) -i ws11x4
acl ws11x5_request hdr_beg(host) -i ws11x5
acl ws11x6_request hdr_beg(host) -i ws11x6
# ws12
acl ws12x1_request hdr_beg(host) -i ws12x1
acl ws12x2_request hdr_beg(host) -i ws12x2
acl ws12x3_request hdr_beg(host) -i ws12x3
acl ws12x4_request hdr_beg(host) -i ws12x4
acl ws12x5_request hdr_beg(host) -i ws12x5
acl ws12x6_request hdr_beg(host) -i ws12x6
# Which backend....
###################
use_backend static if static_request
#ws11
use_backend ws11x1 if ws11x1_request
use_backend ws11x2 if ws11x2_request
use_backend ws11x3 if ws11x3_request
use_backend ws11x4 if ws11x4_request
use_backend ws11x5 if ws11x5_request
use_backend ws11x6 if ws11x6_request
#ws12
use_backend ws12x1 if ws12x1_request
use_backend ws12x2 if ws12x2_request
use_backend ws12x3 if ws12x3_request
use_backend ws12x4 if ws12x4_request
use_backend ws12x5 if ws12x5_request
use_backend ws12x6 if ws12x6_request
#---------------------------------------------------------------------
# BACKEND - APP
#---------------------------------------------------------------------
backend app
timeout server 50000ms #To counter the WS default
mode http
balance roundrobin
option httpchk HEAD /upchk.txt
server app1 app1:8000 maxconn 100000 check
server app2 app2:8000 maxconn 100000 check
server app3 app3:8000 maxconn 100000 check
server app4 app4:8000 maxconn 100000 check
#---------------------------------------------------------------------
# BACKENDs - WS
#---------------------------------------------------------------------
#Server ws11
backend ws11x1
server ws11 ws11:8001 maxconn 100000
backend ws11x2
server ws11 ws11:8002 maxconn 100000
backend ws11x3
server ws11 ws11:8003 maxconn 100000
backend ws11x4
server ws11 ws11:8004 maxconn 100000
backend ws11x5
server ws11 ws11:8005 maxconn 100000
backend ws11x6
server ws11 ws11:8006 maxconn 100000
#Server ws12
backend ws12x1
server ws12 ws12:8001 maxconn 100000
backend ws12x2
server ws12 ws12:8002 maxconn 100000
backend ws12x3
server ws12 ws12:8003 maxconn 100000
backend ws12x4
server ws12 ws12:8004 maxconn 100000
backend ws12x5
server ws12 ws12:8005 maxconn 100000
backend ws12x6
server ws12 ws12:8006 maxconn 100000
#---------------------------------------------------------------------
# BACKEND - STATIC
#---------------------------------------------------------------------
backend static
server static1 static1:80 maxconn 40000
100,000 connections is a lot... Are you pushing that much? If so, consider splitting the frontend so that it binds on one IP for static content and another for app content, then running the static and app variants as separate haproxy processes (assuming you have a second core/CPU on the server).
If nothing else it will narrow the usage down to the app or static flows...
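A rough sketch of what that split could look like (the IP addresses and frontend names here are placeholders, not from the original config; each frontend could also live in its own config file run by its own haproxy process):
frontend public_app
bind 10.0.0.1:80
maxconn 100000
default_backend app
frontend public_static
bind 10.0.0.2:80
maxconn 40000
default_backend static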
If I'm remembering my networking 101 class correctly, HAProxy shouldn't be able to hit 100,000 connections to ws12:8001 or any other backend host:port because of the ~65,536 port limit, which is closer to 28,232 on most systems (cat /proc/sys/net/ipv4/ip_local_port_range). You may be exhausting the local ports, which could in turn cause the CPU to hang as it waits for ports to free up. Perhaps lowering the max connections to each backend to closer to 28,000 would alleviate the problem? Or changing the local port range to be more inclusive?
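A quick way to check the current range and, if you decide to go that route, widen it (the values shown are only illustrative):
cat /proc/sys/net/ipv4/ip_local_port_range
# typically prints something like: 32768   61000
# temporarily widen the ephemeral port range (add to /etc/sysctl.conf to persist)
sysctl -w net.ipv4.ip_local_port_range="1024 65535"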
Take a look at the nbproc setting and see if that helps by utilizing more than one core. For most hardware load balancers the amount of traffic you can handle is capped by the CPU/memory of the load balancer.
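A minimal sketch of what that would look like in the global section (the process count is an assumption and should match the number of cores you want to use; note that with multiple processes, stats and health checks are kept per process):
global
daemon
maxconn 100000
nbproc 2 # run two worker processes so more than one core is used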
Outside of the configuration of haproxy it would help to do some network tuning.
One specific thing that may help is ensuring your network interfaces aren't pinned to a single CPU (assuming you are using multiple interfaces). If you are running haproxy on Linux you can check the balance like so:
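(Something along these lines; the interface names are assumed from the example below and may differ on your system:)
grep eth /proc/interrupts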
For example, this shows that the interrupts for eth0 and eth1 are being handled by different CPUs:
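(Illustrative /proc/interrupts excerpt; the IRQ numbers and counts are made up:)
           CPU0       CPU1
 44:    1234567          0   IR-PCI-MSI-edge   eth0
 45:          0    7654321   IR-PCI-MSI-edge   eth1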
Whereas this shows them being handled by the same CPU:
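(Again illustrative; note that both interfaces' interrupt counts land on CPU0:)
           CPU0       CPU1
 44:    1234567          0   IR-PCI-MSI-edge   eth0
 45:    7654321          0   IR-PCI-MSI-edge   eth1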
You will want to enable smp affinity for these interfaces. For the example above you can do the following:
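(Using the made-up IRQ numbers 44 and 45 from above; /proc/irq/<N>/smp_affinity takes a hexadecimal CPU bitmask:)
echo 1 > /proc/irq/44/smp_affinity   # bitmask 0x1 = CPU0 handles eth0's interrupts
echo 2 > /proc/irq/45/smp_affinity   # bitmask 0x2 = CPU1 handles eth1's interrupts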
I suggest activating "Multi-thread mode" by adding the nbthread option to the global section (see the nbthread entry in the HAProxy manual).
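A minimal sketch of that global section, assuming a 4-core machine; note that nbthread requires HAProxy 1.8 or later, so it is not available on the 1.4.8 version from the question:
global
nbthread 4 # run four threads inside a single process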
We activated "Multi-thread mode" and our web-site started working about 15x faster. You can read more about the multi-process and multi-thread options here: https://www.haproxy.com/blog/multithreading-in-haproxy/