I'm stress testing our system. We currently have 5 m1.large instances running behind an ELB in the East region. In the West region, there are 3 small instances (running JMeter) that I use to hammer the system.
During a test that only pushes the app instances to about 80%-90% of their CPU limit (our choke point at the time), I'm seeing odd behavior: the ELB reports that ALL 5 instances are "Out of service - Transient Error - Please check later" and all instances stop getting requests. After about 5-10 seconds everything goes back to normal. This repeats every 30 seconds or so. BUT it doesn't happen every time I run the test. I just ran a half-hour stress test with the same settings and everything worked perfectly. What is going on?
Btw, my health check is:
Ping Target: HTTP:80/index.html
Timeout: 60 seconds
Interval: 300 seconds
Unhealthy Threshold: 10
Healthy Threshold: 2
So there is no way it's failing that. I had never run into this until yesterday.
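As a rough sanity check on that claim, here is a minimal sketch of the arithmetic. It assumes the ELB only marks an instance out of service after that many consecutive failed checks, and that checks are spaced one interval apart. The function name is made up for illustration and is not part of any AWS tooling.

```python
def min_seconds_to_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Lower bound on the time between the first failed check and the
    failed check that finally trips the unhealthy threshold.

    Assumption: checks run once per interval and the threshold counts
    consecutive failures.
    """
    if interval_s <= 0 or unhealthy_threshold < 1:
        raise ValueError("interval must be positive and threshold at least 1")
    return (unhealthy_threshold - 1) * interval_s


# Values from the health check settings above.
INTERVAL_S = 300
UNHEALTHY_THRESHOLD = 10

# The outage repeats roughly every 30 seconds during the test.
OBSERVED_CYCLE_S = 30

lower_bound = min_seconds_to_unhealthy(INTERVAL_S, UNHEALTHY_THRESHOLD)
print(lower_bound)                     # 2700 seconds, i.e. 45 minutes
print(lower_bound > OBSERVED_CYCLE_S)  # True
```

Under those assumptions, the health check would need about 45 minutes of back-to-back failures to take an instance out, which doesn't fit an outage that comes and goes every 30 seconds.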