I just spent two days this week troubleshooting a dev issue in EC2 logged by our offshore team.
We have run Apache Tomcat version 7.0.21 on multiple dev instances in EC2 for weeks with no problems.
Then we hit major performance issues in the D3 environment. We re-ran the scripts onshore with no problem the first time.
The offshore team again logged defects in the D3 environment; this time they ran the scripts in a D2 clone and had no issues. We ran the scripts in D3 again onshore in the morning and this time had major issues.
I had a feeling it was infrastructure, but had no way to prove it.
We tuned the servlet container in a sandbox environment, looking at garbage collection, the heap, and the JDBC pool - nothing wrong.
Then the scripts passed in a D3 clone image. All the logged defects passed. We had changed nothing.
It looks like an EC2 issue, either on the Xen VMs, the network, or RDS. We have no idea what it was.
How can you isolate a fault in the cloud when you are flying blind? With no visibility into the infrastructure, where do you begin?
Has anyone had a similar problem?
Can EC2 infrastructure be monitored?
Perry, it sounds like you correctly diagnosed the issue: spurious/random/unexpected behaviors on EC2 are almost always the side effect of degraded host hardware. The only way to confirm that is to post to the EC2 forums or open a support ticket and ask them to investigate, at which point the EC2 team can confirm or deny the faulty hardware.
The workaround, whether you get it confirmed or not, is always to shut down and relaunch your VM, which will place it on different hardware. (You can see this advice in the EC2 forums regularly.)
In the future I would make that the expected first step when troubleshooting completely random issues on EC2: restart the instance.
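If it helps, here is a minimal sketch of that stop/relaunch step, assuming boto3 with configured credentials and an EBS-backed instance (an instance-store instance would have to be terminated and replaced instead); the region and instance ID are placeholders:

    # Stop and start the instance; a stopped/started EBS-backed instance
    # typically lands on different host hardware.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region
    instance_id = "i-0123456789abcdef0"                  # placeholder instance ID

    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])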
There is still no way to get real-time alerts on the state of the underlying hardware on EC2. Even the email notifications that go out when hardware fails seem to be hit-or-miss: hardware can fail and you never receive one of those monitoring emails.
You can try pointing a monitoring service like Pingdom or wasitup at your individual instances, but those are simple ping tests and I don't know if they will catch this kind of problem for you.
Alternatively, if you can narrow the failures down to specific operations that were randomly misbehaving (e.g. a certain operation that goes goofy on EC2 when the hardware starts failing), you could write a script/cron job that exercises that exact operation every 1 or 10 minutes and reports an error - see the sketch below.
This is a canary-in-a-coal-mine approach, nothing scientific or exact, but it might help a bit and allow you to catch the problem before your users do.
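A rough canary sketch in Python, meant to be run from cron; the endpoint URL, latency budget, and syslog alerting are all placeholder assumptions - swap in whichever operation was actually misbehaving:

    #!/usr/bin/env python3
    # Canary: exercise one operation that tends to misbehave and log a
    # warning if it is slow or fails. Run from cron, e.g.:
    #   */5 * * * * /usr/local/bin/canary.py
    import syslog
    import time
    import urllib.request

    CANARY_URL = "http://localhost:8080/app/health"  # placeholder endpoint
    MAX_SECONDS = 2.0                                 # placeholder latency budget

    start = time.time()
    try:
        urllib.request.urlopen(CANARY_URL, timeout=10).read()
        elapsed = time.time() - start
        if elapsed > MAX_SECONDS:
            syslog.syslog(syslog.LOG_WARNING,
                          "canary slow: %.2fs for %s" % (elapsed, CANARY_URL))
    except Exception as exc:
        syslog.syslog(syslog.LOG_ERR, "canary failed: %s" % exc)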