We've got a web app hosted on Amazon Web Services. Our database is a Multi-AZ RDS MySQL instance running 5.1.57, and three to four app servers talk to it.
Today, we started seeing a lot of errors along the lines of "Lock wait timeout exceeded; try restarting transaction": almost 1% of POST requests are hitting this.
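As a stopgap while debugging, the error's own advice ("try restarting transaction") can be automated in the app layer. A minimal sketch of a retry wrapper, assuming a DB-API-style driver; `LockWaitTimeout` here is a stand-in for whatever exception your MySQL driver raises for error 1205, and `flaky_txn` is a stub that simulates two timeouts before succeeding:

```python
class LockWaitTimeout(Exception):
    """Stand-in for the driver's 'Lock wait timeout exceeded' error (hypothetical)."""

def run_with_retry(txn, attempts=3):
    """Run a transactional callable, retrying if it hits a lock-wait timeout.

    `txn` is any zero-argument callable; in real code it would open a
    transaction, do the work, and commit, so a retry replays the whole unit.
    """
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except LockWaitTimeout:
            if attempt == attempts:
                raise  # out of attempts, surface the error

# Demo with a stub transaction that times out twice, then commits.
calls = {"n": 0}
def flaky_txn():
    calls["n"] += 1
    if calls["n"] < 3:
        raise LockWaitTimeout()
    return "committed"

print(run_with_retry(flaky_txn))  # prints "committed" on the third attempt
```

This only papers over the symptom (and retries add load of their own), but it can buy breathing room while the underlying cause is found.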
There have been no modifications to the code running on the site. There have been no schema changes. We haven't had a big spike in traffic. I've been looking at the processes running, and none seem out of control.
I tried scaling our RDS instance from a small to a large, with no effect.
Two days ago, Amazon had some outages. As part of the recovery from that, our RDS server and our app servers ended up in different Availability Zones, but all within the same region. But yesterday everything was fine, so I'm not convinced that's related.
The lock timeouts are in different types of requests and occur in different InnoDB tables.
I have noticed that the number of open connections jumped when we started seeing problems, but that may be a symptom rather than a cause.
What are my next steps in debugging this?
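A good first step on MySQL 5.1 is `SHOW ENGINE INNODB STATUS`; its `TRANSACTIONS` section shows which transactions are stuck in `LOCK WAIT`, for how long, and what statement they're running. A minimal sketch that pulls the waiting transactions out of a status dump; the `STATUS` text below is a hypothetical, heavily abbreviated sample (a real dump comes from running that statement against the server):

```python
import re

# Hypothetical, abbreviated sample of `SHOW ENGINE INNODB STATUS` output.
STATUS = """
------------
TRANSACTIONS
------------
---TRANSACTION 0 931251, ACTIVE 48 sec, OS thread id 1141152 starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 368
MySQL thread id 12, query id 4001 app-2 10.0.0.12 webapp Updating
UPDATE orders SET status = 'paid' WHERE id = 42
---TRANSACTION 0 931250, ACTIVE 120 sec, OS thread id 1141151
2 lock struct(s), heap size 368, undo log entries 1
MySQL thread id 11, query id 3990 app-1 10.0.0.11 webapp
"""

def waiting_transactions(status):
    """Return (trx_id, seconds_active, last_line) for transactions in LOCK WAIT."""
    waits = []
    # Each transaction in the dump starts with a '---TRANSACTION' header line.
    for chunk in status.split("---TRANSACTION")[1:]:
        if "LOCK WAIT" not in chunk:
            continue  # this transaction holds locks but isn't blocked
        header = chunk.splitlines()[0]
        m = re.search(r"ACTIVE (\d+) sec", header)
        secs = int(m.group(1)) if m else None
        trx_id = header.split(",")[0].strip()
        # The statement being run, when shown, is the last line of the chunk.
        query = chunk.strip().splitlines()[-1]
        waits.append((trx_id, secs, query))
    return waits

for trx_id, secs, query in waiting_transactions(STATUS):
    print("trx %s waiting %ss: %s" % (trx_id, secs, query))
```

Pairing this with `SHOW FULL PROCESSLIST` (to see what the *blocking* thread is doing) usually identifies the long-running transaction holding the locks. If I/O is the suspect, long `ACTIVE ... sec` times on otherwise trivial statements point that way.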
What probably happened was an I/O loss on one or more of the EBS volumes backing the RDS instance. The reduced I/O during an EBS remirror has a pretty significant effect on databases.
If you pay for Premium Support, that team can dig into gritty details like that for you, or you can try asking on the AWS forums. The RDS engineers could probably confirm whether underlying EBS issues were the cause.