When configuring HAProxy, how do you decide what values to assign to the timeouts? I've read a half dozen samples in various blogs, and everyone uses different timeouts and no one discusses why.
HAProxy seems specifically worried about client, connect, and server, and throws a warning if you leave them completely unset:
While not properly invalid, you will certainly encounter various problems
with such a configuration. To fix this, please ensure that all following
timeouts are set to a non-zero value: 'client', 'connect', 'server'.
The documentation is unhelpful in this regard: it suggests "slightly above multiples of 3 seconds" but not why you'd choose a multiple of 1 vs 100 or 42.
The RPM I'm using (Amazon Linux repository) sets these defaults:
timeout connect 10s
timeout client 1m
timeout server 1m
Two of which are exact multiples of 3 seconds, violating the only official advice I've seen.
If you don't have specific tuning advice, maybe an easier question is: what should I expect to go wrong with really short or really long timeouts?
The TCP RTO (retransmission timeout) starts at three seconds (RFC 1122). If a transmitted packet hasn't had an acknowledgement returned in that time, then it's assumed to be lost and is retransmitted. This is almost certainly what the author is referring to. (Note that the RTO gets tuned up or down dynamically by various algorithms, which are outside the scope of this question.)
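As a rough illustration of the "slightly above multiples of 3 seconds" advice, assume the 3-second initial RTO described above and the usual doubling of the RTO after each loss: the first retransmission goes out at about 3 s, the second at about 3 + 6 = 9 s. The values below are illustrative only, not recommendations, and this is a fragment rather than a complete configuration.

```haproxy
defaults
    mode http
    # About 3 s of initial RTO plus a margin: leaves room for one lost packet.
    # timeout client 4s
    # About 3 s + 6 s (doubled RTO) plus a margin: leaves room for two losses in a row.
    timeout client 10s
```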
Keep in mind that this really only applies to connections between your frontend server and the clients (i.e. web users). In normal scenarios, the connections between HAProxy and your backend servers should be on a LAN and you should use much shorter timeouts, so that malfunctioning backends get taken out of service sooner.
As for your web users, some of them may be on very high latency connections, such as satellite, and may experience higher than normal retransmits due to this. The RTT on a connection where a satellite is in use may exceed 2000 ms even if all is well.
With all this in mind, you will generally want very short timeouts for timeout connect and very long ones for timeout client. For timeout server, this depends on your web application. When setting the timeout, consider the complexity of the web app being served, and how long it might take in the worst case to process a complex request. If in doubt, raise the value.
Foreword
I've been tuning HAProxy for a while and have done a lot of performance testing on it, from 100 HTTP requests/s to 50,000 HTTP requests/s.
My first piece of advice is to enable the statistics page on HAProxy. You NEED monitoring, no exception. You will also need fine tuning if you intend to go past 10,000 requests/s.
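A minimal sketch of a statistics page, in case it helps. The section name, port 8404, and the /stats path are arbitrary choices made for illustration, and the section is assumed to inherit its timeouts from a defaults section defined elsewhere.

```haproxy
# Hypothetical dedicated listener for the built-in statistics page.
listen stats
    # Assumed port; pick any free one.
    bind :8404
    mode http
    stats enable
    # Assumed path for the page.
    stats uri /stats
    # Auto-refresh the page every 10 seconds.
    stats refresh 10s
```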
Timeouts are a confusing beast because they have a huge range of possible values, most of them having no observable difference. I have yet to see something fail because of a number 5% lower or 5% higher. 10000 vs 11000 milliseconds, who cares? Probably not your system.
Configuration
I cannot in good conscience give a couple of numbers as 'best timeouts ever for everyone'.
What I can give you instead are the MOST aggressive timeouts that are still always acceptable for HTTP(S) load balancing. If you encounter values lower than these, it's time to reconfigure your load balancer.
timeout client:
Read: This is the maximum time to receive HTTP request headers from the client.
3G/4G/56k/satellite can be slow at times. Still, they should be able to send HTTP headers in a few seconds, NOT 30.
If someone has a connection so bad that it needs more than 30s to request a page (then more than 10*30s to request the 10 embedded images/CSS/JS), I believe it is acceptable to reject them.
timeout server:
Read: This is the maximum time to receive HTTP response headers from the server (after it received the full client request). Basically, this is the processing time from your servers, before it starts sending the response.
If your server is so slow that it requires more than 30s to start giving an answer, then I believe it is acceptable to consider it dead.
Special Case: Some RARE services doing very heavy processing might take a full minute or more to give an answer. This timeout may need to be increased a lot for this specific usage. (Note: This is likely to be a case of bad design; use asynchronous communication or don't use HTTP at all.)
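Put together, the two limits above might look like the sketch below. The 30s values follow from the reasoning above; the backend name, server address, and the 2m override are hypothetical, shown only to illustrate raising the limit for one slow service rather than for everything. This is a fragment, not a complete configuration.

```haproxy
defaults
    mode http
    # Most aggressive values argued for above; raise them if unsure.
    timeout client 30s
    timeout server 30s

# Hypothetical backend for the rare heavy-processing service.
backend slow_reports
    # Overrides the default for this backend only.
    timeout server 2m
    server report1 10.0.0.21:8080
```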
timeout connect:
Read: The maximum time a server has to accept a TCP connection.
Servers are in the same LAN as HAProxy so it should be fast. Give it at least 5 seconds because that's how long it may take when anything unexpected happens (a lost TCP packet to retransmit, a server forking a new process to take the new requests, spike in traffic).
Special Case: When servers are in a different LAN or over an unreliable link. This timeout may need to be increased a lot. (Note: This is likely to be a case of bad architecture.)
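For example, a sketch assuming most servers sit on the local network and one hypothetical backend is reached over a slower link. The names, addresses, and the 15s value are made up for illustration.

```haproxy
defaults
    mode http
    # Local servers: 5 seconds leaves room for a retransmitted packet.
    timeout connect 5s

# Hypothetical backend reached over a remote or unreliable link.
backend remote_dc
    # Overrides the default for this backend only.
    timeout connect 15s
    server far1 192.0.2.10:8080
```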
timeout check:
Read: When performing a health check, the server has timeout connect to accept the connection, then timeout check to give the response.
All servers MUST have an HTTP(S) health check configured. That's the only way for the load balancer to know whether a server is available. The health check is a simple /isalive page always answering OK.
Give this timeout at least 5 seconds because that's how long it may take when anything unexpected happens (a lost TCP packet to retransmit, a server forking a new process to take the new requests, spike in traffic).
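A sketch of such a health check, assuming the application really serves /isalive and answers with a body containing OK. The backend name and server addresses are hypothetical.

```haproxy
backend web
    mode http
    # Send an HTTP request to /isalive instead of a bare TCP connect check.
    option httpchk GET /isalive
    # Treat the check as passed only if the response body contains "OK".
    http-check expect string OK
    # Time allowed for the check response once the connection is accepted.
    timeout check 5s
    server web1 10.0.0.11:80 check
    server web2 10.0.0.12:80 check
```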
War Story: A lot of people wrongly believe that the server can always answer this simple page in 3 ms. They set an aggressive timeout (< 2000ms) with aggressive failover (2 failed checks = server dead). I have seen entire websites go down because of that. Typically there is a slight spike in traffic, backend servers get slower, the healthchecks are delayed... until suddenly they all time out together, HAProxy thinks ALL servers died at once, and the entire site goes down.
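One way to make failover less trigger-happy is to require several consecutive failures before a server is marked down. The interval and thresholds below are illustrative assumptions, not values from this answer; the backend name and addresses are hypothetical.

```haproxy
backend web
    # Check every 5s; mark a server down after 4 failed checks in a row
    # and bring it back after 2 successful ones.
    default-server inter 5s fall 4 rise 2
    timeout check 5s
    server web1 10.0.0.11:80 check
    server web2 10.0.0.12:80 check
```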