My company is launching a new website that may see large waves of visitors in very short windows (the estimate is around 14k visitors in a 2-minute window).
So I'm reviewing our configuration, and my biggest problem right now is our single-node HTTP frontend, which uses keep-alive. The frontend runs lighttpd 1.4 on CentOS 5.4.
Some assumptions:
- a browser usually opens 6 parallel TCP connections when using keep-alive
- the browser will keep the connection open until the timeout is reached, even if the tab is closed (observed in Firefox; might not be true of every browser)
- on the server side, each connection will consume ~150 KB of kernel memory (I use conntrack and want to keep it; is that estimate correct?)
- all of our servers are hosted on the East Coast; the RTT from a server in Las Vegas is around 80 ms
- with keep-alive, loading the home page uses ~25 TCP connections and ~1500 packets; without keep-alive, this rises to ~210 TCP connections and over 3200 packets
So, 6 * 14,000 = 84,000 TCP connections, and 84,000 * 150 KB ~= 12 GB of memory. Here is the problem: 1. I don't have that amount of memory available on the frontend. 2. lighttpd 1.4 does not handle that many connections comfortably; it hurts the hits/s a lot.
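The back-of-the-envelope sizing above can be scripted; note the 150 KB-per-connection figure is the assumption from the list, not a measured value:

```shell
# Worst-case keep-alive sizing, using the assumptions above:
# 6 connections per browser, 14k visitors, ~150 KB of kernel memory each.
visitors=14000
conns_per_browser=6
kb_per_conn=150

total_conns=$((visitors * conns_per_browser))
total_mem_gb=$((total_conns * kb_per_conn / 1024 / 1024))

echo "connections: $total_conns"     # 84000
echo "memory: ~${total_mem_gb} GB"   # ~12 GB
```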
On the other hand, I'm concerned about the 80 ms RTT if I deactivate keep-alive.
I'm going to mitigate some of these issues with a CDN and a secondary www record pointing at a secondary lighttpd, but the debate concerns the keep-alive feature. I'd like to turn it off, but I'm worried the impact on page load time will be high (high RTT, and double the number of packets).
Once the initial content retrieval is done, we have a lot of AJAX requests for browsing the site that usually fit in a single TCP connection. But I'm not certain the browser will free the other connections and keep just one open.
I know there have been a number of discussions about keep-alive consuming too many resources. I tend to agree, but given the assumptions and the situation (an RTT between 80 ms and 100 ms for half our users), do you think it's wise to deactivate it?
As a side question: do you know where I can find information on per-connection and conntrack-entry sizes in the kernel? (Other than printing sizeof(struct sk_buff).)
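For the side question, one place the kernel exposes this is the slab allocator: /proc/slabinfo lists the conntrack cache with its per-object size. A defensive sketch (paths and cache names vary by kernel; CentOS 5 uses ip_conntrack, newer kernels nf_conntrack):

```shell
# Per-entry size and total memory of the conntrack slab cache.
# OBJ SIZE in the output is the cost of one tracked connection.
grep -E '^(ip|nf)_conntrack ' /proc/slabinfo 2>/dev/null \
  || echo "conntrack slab not visible (module not loaded, or need root)"

# Current vs. maximum number of tracked connections (path varies by kernel):
for f in /proc/sys/net/ipv4/netfilter/ip_conntrack_count \
         /proc/sys/net/netfilter/nf_conntrack_count; do
  [ -r "$f" ] && echo "$f: $(cat "$f")"
done
true  # keep a clean exit status when neither path exists
```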
--- edit: some test results. I configured conntrack to accept 500k connections (given the memory footprint, it shouldn't exceed 200 MB) and launched an ab test:
ab -n 20000 -c 20000 -k http://website.com/banner.jpg
From what I saw in tcpdump, ab establishes all the connections before doing the GETs, so I get an idea of how much memory those 20k connections consume.
slabtop returns
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 40586  40586 100%    0.30K   3122       13     12488K ip_conntrack
and top
 PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  SWAP CODE DATA COMMAND
 15   0  862m 786m  780 S 22.9  3.3 1:44.86   76m  172 786m lighttpd
12 MB for ip_conntrack and 786 MB for lighttpd are fine for my setup; I can easily handle 4x that.
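As a quick cross-check of the slabtop line: cache size is slabs times page size, and the 0.30K object size means a conntrack entry itself costs only ~300 bytes, so nearly all of the ~150 KB-per-connection assumption would be socket buffers rather than conntrack:

```shell
# Cross-check the ip_conntrack line from slabtop above.
slabs=3122     # SLABS column
page_kb=4      # one 4 KB page per slab on this kernel (13 * 307 B < 4096 B)
objs=40586     # OBJS column
obj_bytes=307  # OBJ SIZE of 0.30K, i.e. ~307 bytes per conntrack entry

cache_kb=$((slabs * page_kb))
echo "cache size: ${cache_kb}K"                        # matches slabtop's 12488K
echo "payload of all entries: $((objs * obj_bytes / 1024))K"
```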
So keep-alive it is, with an idle timeout set to 5 seconds.
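For reference, in lighttpd 1.4 that idle timeout maps to the directives below (the 5 s value is the one settled on above; the request cap shown is a hypothetical figure to tune to your page's asset count):

```conf
# lighttpd.conf (1.4.x) -- keep-alive tuning
server.max-keep-alive-idle     = 5    # close keep-alive connections idle for 5 s
server.max-keep-alive-requests = 16   # requests per connection before forcing a close
```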
Why not set the keep-alive timeout to, say, 15 seconds? I don't see a reason to keep every connection open for 2 minutes, and I don't think the browser will keep the connection for 2 minutes anyway; according to http://en.wikipedia.org/wiki/HTTP_persistent_connection#Use_in_web_browsers, a 1-minute timeout seems more realistic.
I'd advise very low (1 s) keep-alive timeouts.
If you've done your basic work on client-side performance (correct ordering of JS/CSS, JS at the bottom or async, etc.), you can set your keep-alive very low (1 s), because there won't be much delay between two reuses of your TCP connections.
If in doubt, test your page with WebPagetest at 5 s and 1 s (connection view); you'll see whether that breaks the reuse of your TCP connections.
I tried setting it to 0 s on an Apache frontend, and that is definitely too small, but 1 s is good.