I have an m1.medium Amazon EC2 instance running Apache and hosting a WordPress blog. WordPress, in turn, talks to a MySQL database on a separate EC2 instance. The site has W3 Total Cache set up and working well, and much of its static content is served from a CDN. Traffic is normally low, but the site occasionally gets huge spikes... and when those spikes occur (more than ~150 people accessing the site), the site goes down. I can also reproduce this every time with load testing tools.
Here is the 'top' when the main server is idle:
top - 23:21:23 up 103 days, 19:40, 3 users, load average: 0.91, 0.60, 0.62
Tasks: 93 total, 1 running, 92 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.9%sy, 0.0%ni, 99.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3844856k total, 1756268k used, 2088588k free, 150132k buffers
Swap: 0k total, 0k used, 0k free, 833740k cached
However, if I do some load tests to simulate hundreds of users accessing a static graphic file (which obviously doesn't trigger Wordpress, PHP or the database), everything is fine: the server load stays low, the graphic file is served quickly, etc.
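For example, here is the sort of comparison I've been running with ApacheBench (the hostname and paths are placeholders, not my real site):

```shell
# 500 requests, 150 concurrent, against a static file -- Apache only,
# no WordPress/PHP/MySQL involved. This stays fast for me:
ab -n 500 -c 150 http://www.example.com/images/logo.png

# The same load against a PHP-rendered page, which exercises the full
# WordPress + MySQL stack. This is what takes the site down:
ab -n 500 -c 150 http://www.example.com/
```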
My Apache settings (3.1 GB of memory on the server / ~8100 kB per httpd instance ≈ 400 MaxClients):
StartServers 5
MinSpareServers 5
MaxSpareServers 10
ServerLimit 400
MaxClients 400
MaxRequestsPerChild 0
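Spelling out that arithmetic (the 50 MB busy-child figure below is a hypothetical worst case, not something I've measured -- under PHP load, httpd children can grow well beyond their idle RSS, which would make a MaxClients sized on idle RSS far too high):

```shell
#!/bin/sh
# Back-of-the-envelope MaxClients sizing for the prefork MPM.
avail_kb=$((3100 * 1024))   # ~3.1 GB left over for Apache, in kB
idle_rss_kb=8100            # per-httpd RSS when idle (measured)
busy_rss_kb=51200           # ~50 MB per child under PHP load (assumed)

echo "MaxClients sized on idle RSS: $((avail_kb / idle_rss_kb))"  # prints 391
echo "MaxClients sized on busy RSS: $((avail_kb / busy_rss_kb))"  # prints 62
```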
So based on all that, it seems like the problem has to do with when PHP or MySQL are used.
Over on the MySQL server, no matter what I do, the load stays pretty much at 0, and the slow query log stays empty... so I think things are healthy there. Here is my 'top' for the SQL server:
top - 23:20:21 up 103 days, 19:12, 5 users, load average: 0.08, 0.03, 0.05
Tasks: 115 total, 1 running, 114 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3844856k total, 1076912k used, 2767944k free, 158412k buffers
Swap: 0k total, 0k used, 0k free, 638092k cached
This all leads me to think that one of these unlikely scenarios is occurring:
- The machine simply doesn't have enough raw horsepower to serve 150+ concurrent users, and I should move up to a Large or XL EC2 instance. Maybe, but... really? Is 150 users too many for a relatively powerful m1.medium server?
- Apache simply isn't built to handle this kind of traffic. Doubtful.
- There is a problem in the communication between the web and database servers. But I doubt that since this is between two Amazon EC2 instances.
I feel like I've checked everything I can and still no luck. What else should I check? What else can I try?
An m1.medium's 'horsepower' equates to one CPU in this case, and your top output shows a load average of 0.91 while the server is idle. A load of 0.91 means, roughly, "right now, 91% of one CPU's capacity is being requested by processes." In short, something seems to be starving your CPU even while you're idle.
Assuming that's the issue, I'd pare down whatever service is eating your CPU. If that's not an option on your main server, I'd make an AMI of the existing machine, spin up two more hosts of instance type t1.micro, and ensure only the bare minimum services run on them (Apache, in this case). Then round-robin DNS your website's address. This effectively triples your burst CPU capacity while giving 2x the baseline CPU.
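Roughly, with the AWS CLI (the instance and AMI IDs here are placeholders):

```shell
# Snapshot the existing web server into an AMI:
aws ec2 create-image --instance-id i-0123456789abcdef0 --name wp-web-clone

# Launch two t1.micro copies from the resulting AMI:
aws ec2 run-instances --image-id ami-0123456789abcdef0 \
    --instance-type t1.micro --count 2
```

Then add all three instances' public IPs as A records for the site's hostname; most resolvers will rotate through them, spreading the load.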
We serve a larger number of users on a smaller EC2 instance (and have used ApacheBench for up to 1,000 concurrent sessions).
The good news is that you can reproduce it.
There are a lot of things you can check:
Put a tool like New Relic on both servers (use the free version to log CPU/memory history, for starters). This is good for trending anyway.
Start with the communication between the two servers. Do some file transfers and check the speed.
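For example (hostnames are placeholders; dd just generates a throwaway test file):

```shell
# Create a 100 MB test file and time a copy from the web host to the DB host:
dd if=/dev/zero of=/tmp/testfile bs=1M count=100
time scp /tmp/testfile user@db-host:/tmp/

# Or measure raw throughput with iperf, if it's installed:
#   on the DB host:   iperf -s
#   on the web host:  iperf -c db-host
```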
We had an issue with wp-cron.php killing the WordPress server from time to time. Try disabling it and moving it to a cron job that runs every few minutes.
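Concretely, that means one line in wp-config.php (so page loads never trigger wp-cron.php inline):

```php
// wp-config.php: disable WordPress's built-in page-load cron trigger
define('DISABLE_WP_CRON', true);
```

Then fire it from real cron instead -- e.g. a crontab entry like `*/5 * * * * wget -q -O - http://www.example.com/wp-cron.php >/dev/null 2>&1` (the domain and 5-minute interval are just examples).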
Report back what you find and I can suggest a few more ideas.