We rely heavily on memcache and serve a few billion requests per month across 5 memcache servers. Last night we saw a 25% increase in traffic. The graphs show that requests and data transferred per memcache server climbed until the servers started crashing; as each one went down, the load on the remaining servers rose, which set off a chain reaction and took them down one after another.
We found nothing in syslog, messages, or the memcache log file (verbose logging was off).
I have two questions:
1. How can I find out exactly why this happened? If load is the issue for memcache, is there any documentation on how much traffic a memcache server (on a decent configuration) can handle, and how can I raise that limit?
2. How can I ensure they never go down again? The outage eventually impacted our mysql servers, replication, and a lot of other related services. Do I need more memcache servers?
I start memcache using this init.d script: http://pastebin.com/wfMnB4ta, with ENABLE_MEMCACHE set to yes in /etc/default/memcached.
/usr/share/memcached/scripts/start-memcached: http://pastebin.com/LaUugXye
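For reference, the start script effectively runs memcached with flags along these lines; the listen address, cache size, and connection limit below are placeholders rather than our real values:

    # Roughly what the start script ends up running; values are placeholders.
    #   -d      run as a daemon
    #   -u      user to drop privileges to
    #   -m      memory limit for the item cache, in MB
    #   -c      max simultaneous connections (memcached's default is 1024)
    #   -vv     verbose logging, so a crash at least leaves some trace
    /usr/bin/memcached -d -u memcache -l 10.0.0.11 -p 11211 -m 1024 -c 4096 -vv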
Thanks
I'm going to guess that you're running version 1.4.5 or older, since you mention an increase in traffic followed by a sudden exit.
If you ever experience a crash, the first thing to do is make sure you're on the latest stable release. If you still see crashes after that, contact the mailing list or file a bug report with the relevant details, rather than relying on a maintainer getting lucky and spotting this via a Twitter search.
Doing periodic upgrades to the latest stable release can help you avoid having your whole cluster crash like this in the future.
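A quick way to see what each node is actually running is to ask it over the text protocol; the hostnames below are placeholders for your five servers:

    # Print the memcached version reported by each cache node.
    for host in cache1 cache2 cache3 cache4 cache5; do
        printf 'version\r\nquit\r\n' | nc "$host" 11211
    done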
You should also work out some kind of structural solution for dealing with similar problems. For example, if you notice that response times are climbing, shed load by reducing the number of requests you send. You can do this in various ways, including disabling non-essential services.
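As a rough sketch of that idea, the loop below times a probe request against each node and flags any node whose response time crosses a threshold; the hostnames, the 50 ms threshold, and what "shed load" means for your stack (for example, a flag file your application checks before doing non-essential work) are all assumptions:

    # Time a probe GET against each cache node and warn when it gets slow.
    THRESHOLD_MS=50
    for host in cache1 cache2 cache3 cache4 cache5; do
        start=$(date +%s%N)
        printf 'get probe_key\r\nquit\r\n' | nc "$host" 11211 > /dev/null
        elapsed_ms=$(( ($(date +%s%N) - start) / 1000000 ))
        if [ "$elapsed_ms" -gt "$THRESHOLD_MS" ]; then
            echo "$host answered in ${elapsed_ms}ms, consider shedding load" >&2
            # e.g. touch a flag file the app checks before doing non-essential work
            # (hypothetical mechanism): touch /var/run/shed_load
        fi
    done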
This particular failure probably wasn't avoidable, though; there isn't much you can do about a failure that itself drives the load up.