Please help! We've been struggling with this one for months. This week we upgraded our RDS instance to the highest-performing instance type, and although the occurrences have dropped, our DB still suddenly spikes to 100% CPU. It comes out of nowhere: sometimes 2am, sometimes midday.
I've ruled out a DoS - our page access logs show normal traffic.
I've ruled out memcached suddenly dying (hits and misses continue as normal).
SHOW PROCESSLIST during an incident reports about 500 queries queued up. If I kill them off or restart the server, they just keep coming back, and then eventually, out of nowhere, the server returns to normal. Sometimes this lasts up to 3 hours.
Our worst-performing queries take 0.02 seconds to execute once the server has recovered, but while we're in this 100% CPU phase, those same queries never finish executing.
Please help! Does anybody know anything about MySQL query optimization? Could it be the server suddenly deciding to use different indexes, which puts it into a spiral?
http://www.mysqlperformanceblog.com/2010/09/10/cache-miss-storm/
As it turns out, our issue was a cache miss storm, AKA a cache stampede.
We solved it by implementing a 50% cache expiry. For every item in memcache we create a second cache item with a similar key plus an appended 'regenerate' string. This second item expires at exactly 50% of the normal cache expiry time; once it's gone, it signals the next request that the real cache entry is getting close to expiring, and that request takes responsibility for regenerating it.
This prevents a storm of users all trying to regenerate the cache at the same time and gives our cache the best chance of always staying fresh!
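Here's a minimal sketch of that "50% expiry" marker approach in Python with pymemcache. The key names, TTL values, and the rebuild_value() callback are illustrative assumptions, not our actual code:

```python
from pymemcache.client.base import Client

mc = Client(("localhost", 11211))

CACHE_TTL = 600             # main item TTL in seconds (assumed value)
REGEN_TTL = CACHE_TTL // 2  # marker expires at 50% of the main TTL


def get_with_early_regen(key, rebuild_value):
    """Return the cached value, regenerating it early when the marker expires."""
    marker_key = key + ":regenerate"
    value = mc.get(key)

    if value is not None and mc.get(marker_key) is not None:
        # Both the item and its marker are alive: serve straight from cache.
        return value

    if value is not None:
        # The item is still cached but the marker has expired, so this request
        # volunteers to rebuild. Re-set the marker first so concurrent requests
        # keep serving the existing value instead of piling onto the DB.
        mc.set(marker_key, b"1", expire=REGEN_TTL)

    # Rebuild (the expensive DB query) and repopulate both keys.
    value = rebuild_value()
    mc.set(key, value, expire=CACHE_TTL)
    mc.set(marker_key, b"1", expire=REGEN_TTL)
    return value
```

The point of the marker is that only one request pays the regeneration cost while the main entry is still warm, so the entry (almost) never expires outright for everyone at once.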
A tricky one to track down!