When running a large mod_perl application on EC2, we've noticed that the CPU usage on the instance climbs gradually between restarts or graceful restarts.
Current setup: CentOS 5.4, m2.4xlarge instances, apache 1.3 with mod_perl.
We first noticed this when tracking the speed of memcached requests from the application. As each apache child process gets older, it takes longer to read/write to a memcached instance running on the same host. We found that doing an apachectl graceful every hour would prevent this happening, at the expense of a load spike each time.
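For reference, the hourly graceful is just a cron entry along these lines (the apachectl path is an example; adjust for your install):

# Hypothetical cron entry (e.g. /etc/cron.d/apache-graceful) for the hourly graceful restart
0 * * * * root /usr/sbin/apachectl graceful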
Whatever is causing this slowdown is also noticeable in our ganglia monitoring. We've been running one server without the hourly restarts, and although its speed serving requests is the same as the other servers, the CPU usage is always higher. The "load" figures are about the same, but the system CPU figure is higher.
I'm scratching my head to work out what is going on here, as restarting apache children every hour means that we miss out on in-process caching benefits.
Has anyone seen something similar? This doesn't seem to affect our application when it's run on real hardware in one of our data centres, although there we use SUSE.
UPDATE 1: Thanks voretaq7. We chose the m2.4xlarge instance type, which has 68GB of RAM. Our current apache tuning (160 children running at all times) uses only about half of that, so we have swap turned off. There isn't any wait CPU or stolen CPU, as the instance size means we're not sharing the underlying host with anyone else. We're seeing user and system CPU, with more system CPU than on the boxes where we're doing the graceful restarts every hour.
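For context, keeping a fixed pool of children alive is done with the usual Apache 1.3 prefork directives, roughly like this (the values here are illustrative, not our exact config):

# Illustrative Apache 1.3 tuning: keep a constant pool of children running
StartServers        160
MinSpareServers     160
MaxSpareServers     160
MaxClients          160
MaxRequestsPerChild 0    # 0 = unlimited; children are only recycled by a (graceful) restart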
UPDATE 2: I'm currently running another trial with three servers. One is running an apachectl graceful every hour, one is set with MaxRequestsPerChild = 512, and the third with MaxRequestsPerChild = 64. This is to try and work out if it's the graceful restart fixing the parent somehow, or if it's that the children just aren't running as long. I'll run with this setup for 12 hours and compare the stats.
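In config terms, the two non-restarting hosts differ only in this one directive:

# Host 2: recycle each apache child after 512 requests
MaxRequestsPerChild 512
# Host 3: recycle each apache child after 64 requests
MaxRequestsPerChild 64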
UPDATE 3: Running the children with a smaller value for MaxRequestsPerChild did slightly improve things. However, the host doing the graceful restarts still performed better.
UPDATE 4: Each host is running three apache instances (totalling 160 children) and three memcacheds. With only eight CPU cores, I was wondering about the cost of context switching. I ran a trial where one host had all memcacheds pinned to CPU0 and apache pinned to CPU1-7. This made a marked improvement to performance. I still don't know exactly what is causing CPU usage to degrade between apache restarts, but it looks as if a combination of CPU affinity, occasional graceful restarts, and shorter child lifetimes can speed things up.
# Start the three memcacheds as follows
/bin/taskset --cpu-list 0 /usr/local/bin/memcached -d -p 12345
# Start apache as follows
/bin/taskset --cpu-list 1-7 apachectl start
Any CPU affinity applied to the apache parent process will apply to all children it spawns.
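You can confirm the inherited affinity on the running children with taskset, e.g. (assuming the apache binary is named httpd):

# Print the CPU affinity list for every running httpd process
for pid in $(pgrep httpd); do /bin/taskset --cpu-list -p $pid; done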
It looks as if context switching is a major contributor to the slowness of apache. Pinning memcached to one CPU and apache to all the others sped things up considerably. See UPDATE 4 in the question for details.