I should note that I'm not a sysadmin. You'll figure that out very shortly. :)
In a nutshell: Apache keeps taking a breather during heavy loads and all processes go idle. This is a polling server that is used by applications. The polls come from a lot of different endpoints. From time to time (every 4-5 minutes) if I'm watching top, HTTPD processes go idle all at the same time, stalling traffic for 10 seconds or so. It then recovers. The delay is problematic.
- Server is serving a lot of traffic. These are application polls via HTTPS, not web pages (though I doubt Apache knows the difference)
- The pauses noted above cause the traffic to become lopsided: after some time, I get a WHOLE BUNCH OF TRAFFIC, then a lull, then a WHOLE BUNCH OF TRAFFIC again
- Each poll requires a small database dip
Apache logs
Sometimes, but not always (mostly after a restart), I get the messages below in error_log. Most of the time when a stall happens, I see nothing in error_log.
    [Mon Jun 30 17:55:17 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 31 idle, and 98 total children
    [Mon Jun 30 17:55:18 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 16 children, there are 14 idle, and 98 total children
    [Mon Jun 30 17:55:44 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 74 idle, and 99 total children
    [Mon Jun 30 17:55:54 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 61 idle, and 99 total children
    [Mon Jun 30 17:56:00 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 0 idle, and 97 total children
    [Mon Jun 30 17:56:02 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 16 children, there are 36 idle, and 99 total children
    [Mon Jun 30 17:56:03 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 32 children, there are 39 idle, and 99 total children
    [Mon Jun 30 18:08:17 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 18 idle, and 99 total children
    [Mon Jun 30 18:08:18 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 16 children, there are 63 idle, and 98 total children
    [Mon Jun 30 18:08:19 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 32 children, there are 74 idle, and 97 total children
Apache Config (old config commented out)
Just showing the config items that I suspect are relevant:
    #Timeout 60
    Timeout 20
    KeepAlive on
    MaxKeepAliveRequests 1000
    KeepAliveTimeout 2
    <IfModule prefork.c>
    StartServers 85
    MinSpareServers 85
    MaxSpareServers 100
    ServerLimit 100
    MaxClients 100
    #StartServers 60
    #MinSpareServers 60
    #MaxSpareServers 85
    #ServerLimit 85
    #MaxClients 85
    MaxRequestsPerChild 1000
    </IfModule>
Note that there's no difference between old and new configs in behavior.
Environment
EC2, c1.medium, mod_perl, persistent database connections, separate RDS server, no errors showing in MySQL error logs and no errors showing in Apache logs
As an aside, I've seen suggestions to install mod_status, but I haven't figured out how to do so, and I don't know what to look for if I do.
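For reference, a minimal sketch of enabling mod_status. It assumes Apache 2.2 (which the MaxClients directive and the log format above suggest) and that the module file sits at modules/mod_status.so; many packaged builds already load it, in which case the LoadModule line is redundant. The /server-status path and the localhost-only access rule are illustrative choices, not something taken from this server.

```apache
# Assumption: module path as shipped by typical httpd 2.2 packages.
LoadModule status_module modules/mod_status.so

# Adds per-request detail (client, request line) to the status page.
ExtendedStatus On

<Location /server-status>
    SetHandler server-status
    # Apache 2.2 access-control syntax; restricts the page to the local machine.
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>
```

With this in place, requesting http://localhost/server-status?auto from the server itself returns a machine-readable summary that includes the scoreboard. In the scoreboard, `_` is a worker waiting for a connection, `R` is reading a request, `W` is sending a reply, `K` is keepalive, and `.` is an open slot. Capturing it during a stall would show which of those states the workers are actually in.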
Mystery solved.
In case this happens to anyone else: The network connection (inside VPC via LAN interface) between Apache and database server was getting congested. Upgrading the database server to a larger instance solved the problem (for the time being).
Background: Amazon takes snapshots of your database every 5 minutes for its point-in-time restore feature. It downloads the binary log on your RDS instance to do so.
Every 5 minutes, the binary log gets transmitted (presumably to another EBS volume), and in my case that transmission congested the LAN interface. Apache stalled while it waited for the network connection every five minutes, and connections piled up, with some ultimately aborting.
I'd up the MaxClients setting to about 200...
Also, I am curious as to why the Min and Max spare servers are so high. I'd probably set MinSpareServers to something like 20 and MaxSpareServers to something like 30. These settings control how many idle processes Apache keeps around; it will create more as needed, up to the MaxClients setting, and reduce the number of spare processes as demand lessens.
Finally, why are you creating so many servers initially? I'd start by setting StartServers to about 50.
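Put together, the values suggested in this answer would look roughly like the sketch below. ServerLimit is not mentioned in the answer; it is raised here only because prefork caps MaxClients at ServerLimit, and the question's config sets it to 100. MaxRequestsPerChild is carried over unchanged from the question.

```apache
<IfModule prefork.c>
    StartServers          50
    MinSpareServers       20
    MaxSpareServers       30
    # Assumption: raised to match MaxClients, otherwise the value of 100
    # from the question's config would cap MaxClients at 100.
    ServerLimit          200
    MaxClients           200
    MaxRequestsPerChild 1000
</IfModule>
```

A change to ServerLimit only takes effect after a full stop and start, not a graceful restart. Whether 200 mod_perl children fit in memory on a c1.medium is not something this thread establishes, so it is worth checking per-process memory use before going that high.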
I had the same issue; it turned out to be caused by an undersized RDS instance. CPU and memory usage were always below 15%, so I didn't bother to upgrade it until reading the OP's answer.
Changing the RDS instance from t2.micro to t2.medium solved my problem.
It was really hard to troubleshoot because the problem isn't evident from the instance stats; the only thing I could notice was small peaks on the input and output bandwidth graphs.