We were targeted earlier today by a DDoS attack. There were 20x as many connections as normal on our load balancer (HAProxy), and the backend nodes kept going down throughout the attack.
System structure: HAProxy > Squid > Apache (for ModSecurity) > IIS app layer.
During the attack, I noticed a MaxClients reached error in Apache, so I bumped the setting from 150 to 250, which seemed to help to some extent. However, I had to keep restarting Apache manually for the backends to recover. The attack lasted about 50 minutes.
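For reference, the change amounts to something like this in httpd.conf; a sketch assuming the prefork MPM with the other values left at their 2.2 defaults (ServerLimit only needs raising if you go above its default of 256, but it's shown for completeness):

    <IfModule mpm_prefork_module>
        StartServers          5
        MinSpareServers       5
        MaxSpareServers      10
        ServerLimit         250
        MaxClients          250   # raised from 150 during the attack
        MaxRequestsPerChild   0
    </IfModule>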
After the attack began to subside, a final Apache restart on each node brought us into the green, but now I'm looking into why it occurred in the first place. In the error logs in Apache, I see a lot of these:
[Wed Jun 22 11:46:12 2011] [error] [client 10.x.x.x] proxy: Error reading from remote server returned by /favicon.ico
[Wed Jun 22 11:46:13 2011] [error] [client 10.x.x.x] (70007)The timeout specified has expired: proxy: error reading status line from remote server www.example.com
Apache is using the default keep-alive settings (keep-alives enabled, 15-second timeout). After doing some additional reading on HAProxy + keep-alives, is it reasonable to conclude that the DDoS was made worse by keep-alives being enabled?
While HAProxy's maximum connections are set well below the maximums in Apache, perhaps with 20x the normal connection count, connections were being opened in the ol' DoS fashion and Apache was then keeping them open.
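For context, the defaults I'm referring to look like this in httpd.conf; the commented lines are just a sketch of what tightening them might look like, not something I've tested:

    KeepAlive On
    MaxKeepAliveRequests 100
    KeepAliveTimeout 15

    # Possible tightening under attack (untested on our stack):
    # KeepAliveTimeout 2    # free idle workers much sooner
    # KeepAlive Off         # one request per connection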
I think you're going after the wrong potential fix for this scenario. If you're being DDoSed then the only real route of mitigation you have is to talk to your upstream providers and get them to null-route/blackhole the traffic before it gets to your network. Otherwise, no matter what you do, it'll still be reaching the edge of your network, and potentially (probably) saturating the connection at your end.
The only thing to do is to have it blocked before it reaches the edge of your network. Any kind of on-site DDoS mitigation is unlikely to be as useful, because the traffic has to get onto your network before it can be ignored/blocked/dropped; as a result, it will still eat your bandwidth.
In addition, simply increasing the number of available workers can make the problem worse if you don't actually have enough memory available for all those child processes. You'll start swapping to disk and your machine will grind to a halt. I'm surprised that no one mentioned mod_evasive or mod_security, too; having some automated heuristics to block access to computationally expensive resources helps quite a bit in the case where your upstream won't or can't null-route.
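If you do go the mod_evasive route, a minimal sketch looks something like the following; the thresholds are illustrative and need tuning against your real traffic, or you'll start blocking legitimate users:

    <IfModule mod_evasive20.c>
        DOSHashTableSize    3097
        DOSPageCount        5     # same-URI requests allowed per interval
        DOSPageInterval     1     # seconds
        DOSSiteCount        50    # total requests per client per interval
        DOSSiteInterval     1
        DOSBlockingPeriod   60    # seconds an offender keeps getting 403s
    </IfModule>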
EDIT: this was a comment, but I turned it into an answer per @Tom O'Connor's suggestion.
@Tom O'Connor this is not a bandwidth/pps type of DDoS. Sounds to me like a simple service denial.
Keep-alive will make it worse. Your problem here is that Apache can't process requests as fast as it needs to, so it spawns a lot of workers that are unable to keep up with the incoming requests. As this grows, the chances of recovery are pretty much zero if the attack continues.
You can obviously increase the MaxClients directive, but from what you described it will just make you go down a minute or two later.
I'm not sure what stack you are running, but the goal for you is simply to improve Apache's response to a single request (are you running PHP? Is it connecting to MySQL? Are you caching anything?). A page that loads in 0.010 seconds will stand up to a service denial far better than a page that looks up tons of stuff in MySQL and takes 2 seconds per request.
If somebody makes 100 requests, your server has 200 seconds of work to do, but because it tries to do it all at once, contention means that 2 seconds per request can easily balloon to something like 40 seconds per request across all 100 of them. More requests, more load.
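One cheap example of the kind of thing that helps: make sure static assets (the favicon from your error log, images, CSS, JS) are cacheable, so Squid, which is already in your chain, and browsers absorb those hits instead of Apache/IIS. A sketch assuming mod_expires is loaded; the types and lifetimes are illustrative:

    <IfModule mod_expires.c>
        ExpiresActive On
        ExpiresByType image/x-icon "access plus 1 week"
        ExpiresByType image/png    "access plus 1 week"
        ExpiresByType text/css     "access plus 1 day"
        ExpiresByType application/x-javascript "access plus 1 day"
    </IfModule>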
Another way to address this is to identify the top offending connections (by source IP) and simply block them, but this is a bit trickier and requires more knowledge to attempt properly.
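If you want to try it, a rough sketch, run on the HAProxy box since that's where the real client IPs are visible (the address below is a placeholder):

    # Top 20 source IPs by current connection count
    netstat -ntu | awk 'NR>2 {print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -20

    # Drop a confirmed offender at the kernel level
    iptables -I INPUT -s 203.0.113.7 -j DROP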
After the issue cropped up a few more times in the weeks following the initial "attack", I had to dig deeper, as I suspected I might have been using DDoS as a cop-out.
While the access logs and netstat snapshots (the top N IPs ordered by connection count, appended to a log file) definitely showed a very distributed set of IP addresses, I was able to identify a specific page in the access logs that seemed suspicious.
Apparently, the development team had built a "proxy" page in order to serve 3rd party API requests via AJAX. The issue appears to be that this proxy page was using up valuable connection slots on HAProxy, and when the 3rd party service had trouble serving API requests, the page would wait a very long time before timing out. Eventually, the long-winded proxy requests pushed our HAProxy backend to its maximum limit (so all new requests were queued). From that point on, connection counts began to build up on our network, and our public-facing website started timing out normal non-AJAX requests.
The solution, in our case, was to create an additional backend in HAProxy specifically for these AJAX calls. Next time the 3rd party service has issues, it will only time out the AJAX proxy page calls, and the rest of the site will continue to hum along.
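Conceptually, the haproxy.cfg change looked something like this; the names, addresses, and path are simplified/made up for illustration, and it assumes "mode http" in the defaults section:

    frontend www
        bind :80
        acl is_ajax_proxy path_beg /ajax-proxy    # illustrative path
        use_backend ajax_proxy if is_ajax_proxy
        default_backend main_site

    backend main_site
        server app1 10.0.0.11:80 maxconn 100
        server app2 10.0.0.12:80 maxconn 100

    backend ajax_proxy
        # slow 3rd-party calls can now only exhaust this small pool,
        # not the main site's connection slots
        server app1 10.0.0.11:80 maxconn 25
        server app2 10.0.0.12:80 maxconn 25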
Thanks for the answers. I think most of you were spot on for mitigating a "real" DDoS attack, but I think it's helpful for other readers to know that it's worth looking internally to make sure you're not shooting yourself in the foot.