I'm stress testing our system. Currently we have 5 m1.large instances running behind an ELB in the east region. In the west region, there are 3 small instances (with JMeter) that I use to hammer the system.
While running a test that only pushes the app instances to about 80%-90% of their CPU limit (our choke point at the time), I'm seeing odd behavior: the ELB reports that ALL 5 instances are "Out of service - Transient Error - Please check later", all instances stop getting requests, and after about 5-10 seconds everything goes back to normal. This happens every 30 seconds or so. BUT! This doesn't happen every time I run the test. I just ran a half-hour stress test with the same settings and everything worked perfectly. What is going on?
Btw, my health check is:
Ping Target: HTTP:80/index.html
Timeout: 60 seconds
Interval: 300 seconds
Unhealthy Threshold: 10
Healthy Threshold: 2
So there is no way it's failing that. I had never run into this until yesterday.
We were also having a transient "boxes fail health checks for no good reason" problem, and from working with Amazon support it turns out there is an interaction between the ELBs and the Apache KeepAliveTimeout. If the health check interval is larger than the timeout, the health checker can try to reuse a connection that Apache has already closed, so the check fails and your instance gets tossed out of the ELB. They called our 60-second interval "unusually long." We're still experimenting with it, but try setting your interval low and matching it with the keepalive setting in Apache.
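The rule of thumb in this answer can be written down as a quick sanity check. This is only a sketch of the comparison described above: the function name is made up, the 5-second value is Apache's documented default for KeepAliveTimeout (the asker's actual setting is not given), and the second pair of values is purely hypothetical.

```python
def health_check_may_reuse_stale_connection(interval_s, keepalive_timeout_s):
    """Return True when the ELB health check interval exceeds Apache's
    KeepAliveTimeout, meaning Apache may already have closed the idle
    connection by the time the next health check tries to reuse it."""
    return interval_s > keepalive_timeout_s


# Interval from the question (300 s) against Apache's default KeepAliveTimeout (5 s).
print(health_check_may_reuse_stale_connection(300, 5))   # at risk
# Hypothetical tuned setup: 10 s interval, 15 s KeepAliveTimeout.
print(health_check_may_reuse_stale_connection(10, 15))   # not at risk
```

With the question's 300-second interval, almost any realistic keepalive timeout falls on the risky side of this comparison.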
The best way to stress test an ELB is to get the IPs behind the CNAME they provide and use those to hit the load balancer. Make sure there is at least one instance in every AZ you selected for the ELB. Amazon dynamically scales the IPs behind the ELB, so your load test is probably hitting just a single IP. I'm not sure about the sporadic behavior you're experiencing.
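One way to list the IPs behind the ELB's CNAME is a plain DNS lookup. The sketch below uses Python's standard `socket.getaddrinfo`; the function name `elb_ips` and the injectable `resolver` parameter are assumptions added for illustration (the parameter only exists so the function can be tried without network access).

```python
import socket


def elb_ips(hostname, port=80, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses that hostname resolves to."""
    infos = resolver(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (ip, port) tuple.
    return sorted({info[4][0] for info in infos})
```

Calling `elb_ips(...)` with the DNS name shown for your ELB should print one or more addresses. Since Amazon changes the set over time, it is worth re-running the lookup shortly before and during a test rather than relying on an old result.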
It may be due to DNS caching at the JVM or OS level: all your requests hammer one ELB IP instead of being distributed, so the ELB itself becomes a point of failure instead of providing failover.
Starting with JMeter 2.12, the DNS Cache Manager configuration element can be used for testing load-balanced applications.
See the guide "The DNS Cache Manager: The Right Way To Test Load Balanced Apps" for a more detailed explanation and instructions.
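To make the effect of a cached lookup concrete, here is a toy model of how requests land on ELB nodes. It is not JMeter's implementation: the function name is hypothetical, the addresses are made-up documentation-range IPs, and simple round-robin is assumed as the distribution strategy.

```python
import itertools


def spread_requests(ips, request_count):
    """Assign each request to an IP in round-robin order and count hits per IP."""
    hits = {ip: 0 for ip in ips}
    for ip in itertools.islice(itertools.cycle(ips), request_count):
        hits[ip] += 1
    return hits


# Made-up addresses standing in for the ELB's current IPs.
ips = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
print(spread_requests(ips[:1], 9))  # one cached IP: every request hits the same node
print(spread_requests(ips, 9))      # all resolved IPs in use: load is spread evenly
```

The first call models a load generator that resolved the name once and kept the answer; the second models one that uses every address the lookup returned.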