I'm getting a bunch of Apache errors that I'm having trouble tracking down. They're on a RHEL system that runs a very high-volume Drupal website:
[Mon Sep 14 12:48:44 2009] [info] [client xx.xx.xxx.xx] (70007)The timeout specified has expired: core_output_filter: writing data to the network
[Mon Sep 14 12:50:19 2009] [info] [client xx.xxx.xx.xx] (104)Connection reset by peer: core_output_filter: writing data to the network
[Mon Sep 14 12:51:28 2009] [info] [client xx.xxx.xx.xx] (32)Broken pipe: core_output_filter: writing data to the network
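Since individual requests can't be matched to errors at this volume, one option is to aggregate the error_log instead and see whether these three errno values bunch up in particular minutes. A minimal Python sketch, assuming the lines have exactly the format quoted above (they are only logged at LogLevel info); `tally_errors` and `report` are hypothetical helper names, and /var/log/httpd/error_log is the usual RHEL location rather than something confirmed in this thread:

```python
import re
from collections import Counter

# Matches lines shaped like the ones quoted above, e.g.
# [Mon Sep 14 12:48:44 2009] [info] [client 1.2.3.4] (104)Connection reset by peer: ...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] \[client (?P<client>[^\]]+)\] "
    r"\((?P<errno>\d+)\)(?P<msg>[^:]+):"
)


def tally_errors(lines):
    """Count matching entries by (errno, message) and by minute."""
    by_errno, by_minute = Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a per-client error line (startup notices, etc.)
        by_errno[(int(m.group("errno")), m.group("msg"))] += 1
        # Apache pads the day to a fixed width, so the first 16 characters
        # are always "Www Mmm dd hh:mm", i.e. the timestamp truncated to the minute.
        by_minute[m.group("ts")[:16]] += 1
    return by_errno, by_minute


def report(path, top=10):
    """Print the error mix and the busiest minutes for one log file."""
    with open(path) as f:
        by_errno, by_minute = tally_errors(f)
    for (errno, msg), n in by_errno.most_common():
        print("%8d  (%d) %s" % (n, errno, msg))
    print("busiest minutes:")
    for minute, n in by_minute.most_common(top):
        print("%8d  %s" % (n, minute))
```

For example, report("/var/log/httpd/error_log") run after a spike should show whether the timeouts and resets pile up just before the server stops responding, rather than being spread evenly through the day.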
Occasionally (every 24 to 36 hours) there is a load spike and the site becomes completely unresponsive. The load average climbs from a normal 1-1.5 to 200. Most of the running httpd processes show as 'D' (uninterruptible sleep, usually waiting on I/O), and the only way to get the server back to an interactive state is to three-finger-salute it, or to wait until you get a prompt and run killall -9 httpd.
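On Linux, processes in state D (uninterruptible sleep) are counted in the load average, which is how a pile of httpd children blocked on I/O can push the load to 200. A low-impact alternative to strace is to sample /proc periodically and record which kernel wait channel (/proc/<pid>/wchan) the D-state processes are sleeping in. A minimal Python sketch, assuming the standard Linux /proc layout; `process_states` is a hypothetical helper, and the exact wchan names vary by kernel version:

```python
import os
from collections import Counter


def process_states(name="httpd", proc_root="/proc"):
    """Count processes called `name` by state, plus where D-state ones sleep.

    Returns (states, wchans): states maps a state letter such as "S" or "D"
    to a count; wchans counts the kernel wait channel of each D-state process.
    """
    states, wchans = Counter(), Counter()
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, pid, "stat")) as f:
                data = f.read()
        except OSError:
            continue  # the process exited between listdir() and open()
        # /proc/<pid>/stat starts "pid (comm) state ...". comm may contain
        # spaces or ')', so locate it via the first '(' and the LAST ')'.
        lparen, rparen = data.find("("), data.rfind(")")
        if data[lparen + 1:rparen] != name:
            continue
        state = data[rparen + 2:rparen + 3]
        states[state] += 1
        if state == "D":
            try:
                with open(os.path.join(proc_root, pid, "wchan")) as f:
                    wchans[f.read().strip() or "?"] += 1
            except OSError:
                wchans["?"] += 1  # wchan unreadable or process already gone
    return states, wchans
```

Logging the result every few seconds (from cron or a small loop) would show whether the D count climbs before the server locks up. If most wchan entries name NFS or RPC functions, that would point at the NFS mount rather than at the network hardware.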
Obviously, the site can't be taken down so that I can do a bunch of strace work. I've checked the Apache configuration (again) and, as far as I can tell, EnableMMAP and EnableSendfile are both disabled. The files are on an NFS v3 mount, but neither the NFS server, nor the MySQL server, nor anything else is reporting errors. There is nothing relevant in the system log or dmesg. The site is also under too much load to match individual requests to the errors they produce.
At this point, I'm thinking network hardware error and I'd prefer to bring the site up on a second machine. Anyone have any thoughts before I do this?
This is a wild-ass guess, but have you checked how many on-disk temporary tables Drupal is creating?
I have seen this cause iowait (load) problems.
mysqladmin -u root -p ext -ri 30 | grep Created_tmp_disk
The first sample tells you how many on-disk temporary tables have been created since MySQL was last restarted. After that, each sample tells you how many were created in the last 30-second window (until you press Control-C).
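To put the Created_tmp_disk_tables count in context, it helps to compare it with Created_tmp_tables, which counts all internal temporary tables; the ratio is the fraction that went to disk. A minimal Python sketch that parses one snapshot of the table printed by mysqladmin -u root -p ext (without -ri, so the values are cumulative since the last restart); `parse_status` and `tmp_disk_ratio` are hypothetical names:

```python
def parse_status(text):
    """Turn mysqladmin's '| Variable_name | Value |' table into a dict."""
    status = {}
    for line in text.splitlines():
        # Border lines like "+----+----+" split into one part and are skipped.
        parts = [p.strip() for p in line.strip().strip("|").split("|")]
        if len(parts) != 2 or parts[0] == "Variable_name":
            continue
        status[parts[0]] = parts[1]
    return status


def tmp_disk_ratio(status):
    """Fraction of internal temporary tables that were created on disk."""
    disk = int(status.get("Created_tmp_disk_tables", 0))
    total = int(status.get("Created_tmp_tables", 0))
    return disk / total if total else 0.0
```

For example, save a snapshot with mysqladmin -u root -p ext > status.txt and call tmp_disk_ratio(parse_status(open("status.txt").read())). A ratio that stays high suggests the on-disk temporary tables are a real source of I/O rather than an occasional event.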
The (band-aid) solution is to put MySQL's tmpdir on a RAM-based file system (e.g. tmpfs).
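When moving tmpdir, it is worth confirming which filesystem the path really lives on (SHOW VARIABLES LIKE 'tmpdir' in the mysql client shows the configured path); the same check also confirms which directories sit on the NFS mount. A minimal Python sketch that picks the longest matching mount point from /proc/mounts; `fs_type_for` is a hypothetical helper, and /var/lib/mysqltmp in the usage note is an example path, not a MySQL default:

```python
import os


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is the content of /proc/mounts. `path` should already be
    absolute with symlinks resolved (e.g. via os.path.realpath) by the caller.
    """
    path = os.path.normpath(path)
    best_mnt, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        # /proc/mounts escapes spaces in mount points as \040.
        mnt, fstype = fields[1].replace("\\040", " "), fields[2]
        inside = path == mnt or path.startswith(mnt.rstrip("/") + "/")
        # Longest mount point wins; on a tie the later (overmounting) entry wins.
        if inside and len(mnt) >= len(best_mnt):
            best_mnt, best_type = mnt, fstype
    return best_type
```

For example, fs_type_for("/var/lib/mysqltmp", open("/proc/mounts").read()) should return "tmpfs" once the move is in place, and the Drupal document root should come back as "nfs".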
I guess what I'm suggesting is that this is what starts the cascade, and the messages you're seeing are just abandoned connections.
Cheers
In short, try the following in your Apache config:
EnableMMAP Off
EnableSendfile Off
In more detail:
Apache apparently mmaps files and, where it is available, uses Linux's sendfile (http://linux.die.net/man/2/sendfile) to improve performance. However, according to the Apache docs, this can cause stability problems on network file systems if it fails to read the file, see:
http://httpd.apache.org/docs/2.0/mod/core.html#enablesendfile
The FAQ covers this specific problem here:
http://httpd.apache.org/docs/2.0/faq/all_in_one.html#error.sendfile
You can find info on the EnableMMAP and EnableSendfile directives here:
http://httpd.apache.org/docs/2.0/mod/core.html#enablemmap
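Since the question says both directives are disabled "as far as I can tell", a quick scan for every active occurrence can settle it. A rough Python sketch; `find_directives` is a hypothetical helper, and it does not follow Include files or interpret <Directory>/<VirtualHost> scoping, so each hit still has to be read in context:

```python
import re

# Matches an active (uncommented) EnableMMAP or EnableSendfile line.
DIRECTIVE_RE = re.compile(r"^\s*(EnableMMAP|EnableSendfile)\s+(\S+)", re.IGNORECASE)


def find_directives(config_text):
    """List (line_number, directive, value) for every active occurrence."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        m = DIRECTIVE_RE.match(line)
        if m:
            hits.append((lineno, m.group(1), m.group(2)))
    return hits
```

For example, run it over open("/etc/httpd/conf/httpd.conf").read() (the usual RHEL path) and over each file pulled in by Include. If nothing is found, the compiled-in default applies, which the 2.0 documentation linked above lists as On for both directives. A line such as `Sendfile Off` is not matched, because Apache has no directive by that name.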
We managed to get the site stable by switching to InnoDB across the board and configuring the key cache properly, as well as adding a lot of memcache, among other things. All of the errors I quoted above were apparently caused by clients canceling long-running requests, because as soon as we got the database tuned, the errors went away.
Put nginx in front of Apache as a reverse proxy and have it serve static content directly, or even replace Apache completely. This will bring Apache's load down considerably.