We have a web app and API hosted on AWS, with three environments: development (dev), testing/staging (uat), and live. Each environment has a load balancer, two EC2 instances and an RDS database. We're relatively new to AWS and learning as we go to some extent, but on the whole it's working pretty well for us.
At 08:25 on Wednesday morning, we saw a sudden increase in response time from the boxes in the dev environment:
The three environments are running the same code and the same data schema. There was no corresponding increase in network activity, CPU utilisation, or disk read/write activity. None of us has the faintest idea what caused this sudden increase, or what we can do to troubleshoot it. A few people have said "oh, that's just cloud computing for you", but I can't quite accept that hosting at AWS just means sometimes your entire website will slow down by 1 second per request, for no reason, and you just shrug and ignore it.
What are my next steps here? How do I go about troubleshooting an issue like this?
Next step: contact AWS support (open a ticket), explain the issue, and ask them to look at the ELB logs. They might claim there is nothing wrong, but if the issue is repeatable, you're in luck: you can demand live support during an occurrence.
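To have something concrete to attach to the ticket, it may help to time the same request through the load balancer and directly against one instance, so you can show where the extra second appears. The following is only a rough sketch using the Python standard library; the function names, the `timer` parameter and the example URLs in the comment are my own assumptions, not anything from your setup.

```python
import statistics
import time
import urllib.request


def time_request(url, timeout=10.0):
    """Return the wall-clock seconds taken to fetch `url` once."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()  # include body transfer in the measurement
    return time.perf_counter() - start


def summarize(samples):
    """Return min, median and max of a list of latency samples (seconds)."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def compare(urls, attempts=20, timer=time_request):
    """Time each URL `attempts` times and return a summary per label.

    `urls` maps a label to a URL. `timer` is injectable so the logic
    can be exercised without touching the network.
    """
    return {
        label: summarize([timer(url) for _ in range(attempts)])
        for label, url in urls.items()
    }


# Example call (hypothetical addresses, replace with your own):
# compare({
#     "via_elb": "http://dev-elb.example.com/health",
#     "direct_instance": "http://10.0.0.10/health",
# })
```

If the median through the load balancer is roughly a second higher than the direct figure, that points at the ELB rather than your code, which is exactly the kind of evidence support will want. If both are slow, the problem is more likely on the instance or database side.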