===SOLVED===
This issue is solved. It turns out that ImageMagick has trouble with multiple CPUs. Compiling ImageMagick to use a single CPU fixed the problem.
================
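For anyone who cannot rebuild ImageMagick, a runtime alternative may be worth testing. This is not what I did, so treat it as a rough sketch: ImageMagick documents a MAGICK_THREAD_LIMIT environment variable that caps its worker threads, and the snippet below only shows how to build an environment with that cap. The helper name single_thread_env is made up for illustration, and I have not verified that the ImageMagick build on this box honors the variable.

```python
import os


def single_thread_env(base_env=None):
    """Return a copy of the environment with ImageMagick capped at one thread.

    MAGICK_THREAD_LIMIT is a documented ImageMagick variable; whether a
    particular build honors it is an assumption to verify.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MAGICK_THREAD_LIMIT"] = "1"
    return env


# Build the environment from a small, explicit base for demonstration.
env = single_thread_env({"PATH": "/usr/bin"})
print(env["MAGICK_THREAD_LIMIT"])

# To apply it to a command-line conversion (file names are placeholders):
#   subprocess.run(["convert", "in.jpg", "-resize", "200x200", "out.jpg"], env=env)
```

For ImageMagick calls made from inside PHP under Apache, the variable would have to be set in the web server's environment instead, which is a separate thing to test.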
I added a new web server as an upgrade but it falls over within seconds.
The old box has 8 Xeon cores at 2.33GHz. The new machine has 16 Xeon cores at 2.40GHz. Memory is 8G on the old box and 32G on the new machine.
The other major difference is a leap from 32-bit to 64-bit.
OS is CentOS 5.6 on both and Apache is 2.2.3-45 on both as well.
PHP is 5.2.10 and compiled by hand. The configure options are identical except for the architecture bits.
From all of this, you would expect the new machine to scream. Instead, the old box handles the load and only falls over occasionally, while the new machine dies every time in less than a minute.
Memory is fine and I/O is good, but CPU is pegged hard. Here's the mpstat output from both:
old box
09:14:18 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
09:14:20 PM all 31.34 0.00 2.62 9.68 0.12 1.00 0.00 55.24 11163.50
09:14:20 PM 0 53.00 0.00 5.50 16.00 0.50 6.50 0.00 18.50 10249.50
09:14:20 PM 1 36.68 0.00 2.51 11.06 0.00 0.00 0.00 49.75 126.00
09:14:20 PM 2 17.41 0.00 1.99 7.96 0.00 0.00 0.00 72.64 125.50
09:14:20 PM 3 41.00 0.00 3.00 9.00 0.00 0.00 0.00 47.00 125.50
09:14:20 PM 4 30.00 0.00 2.00 7.50 0.00 0.50 0.00 60.00 143.00
09:14:20 PM 5 28.50 0.00 2.00 12.00 0.00 0.00 0.00 57.50 142.50
09:14:20 PM 6 22.61 0.00 1.51 7.54 0.00 0.00 0.00 68.34 125.50
09:14:20 PM 7 21.50 0.00 2.50 6.50 0.00 0.00 0.00 69.50 125.50
new box
09:13:41 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
09:13:43 PM all 98.69 0.00 0.81 0.00 0.03 0.47 0.00 0.00 4723.50
09:13:43 PM 0 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1000.50
09:13:43 PM 1 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:13:43 PM 2 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:13:43 PM 3 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:13:43 PM 4 98.01 0.00 1.49 0.00 0.00 0.50 0.00 0.00 0.00
09:13:43 PM 5 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:13:43 PM 6 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:13:43 PM 7 98.51 0.00 1.49 0.00 0.00 0.00 0.00 0.00 0.00
09:13:43 PM 8 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:13:43 PM 9 99.50 0.00 0.50 0.00 0.00 0.00 0.00 0.00 0.00
09:13:43 PM 10 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:13:43 PM 11 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:13:43 PM 12 95.50 0.00 4.00 0.00 0.00 0.50 0.00 0.00 84.50
09:13:43 PM 13 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:13:43 PM 14 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:13:43 PM 15 87.56 0.00 4.98 0.00 0.50 6.97 0.00 0.00 3640.0
Traffic comes in through a load balancer and is split 50/50 between the two. As soon as I turn on the new machine, load spikes to 200 and I have to turn it off as it stops taking requests.
Running strace against httpd doesn't seem that revealing, but here's the output from strace -c -f -p:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
73.52 2.763912 419 6594 1663 futex
8.65 0.325110 55 5869 4099 open
5.35 0.201250 107 1873 381 stat
3.12 0.117305 67 1748 165 lstat
2.30 0.086434 2010 43 wait4
1.64 0.061543 7 8825 769 read
1.31 0.049158 125 394 clone
0.77 0.028874 53 543 chdir
0.75 0.028356 29 973 munmap
0.34 0.012783 35 370 times
0.30 0.011298 257 44 madvise
0.24 0.008897 7 1312 fstat
0.22 0.008225 1 9341 2 poll
0.18 0.006682 2 2777 14 write
0.14 0.005358 5 1184 mmap
0.13 0.005020 19 262 set_robust_list
0.13 0.004990 3 1688 30 writev
0.13 0.004799 7 671 598 access
0.08 0.003194 0 6531 recvfrom
0.06 0.002404 4 673 8 sendto
0.06 0.002398 4 578 getcwd
0.06 0.002367 5 491 mprotect
0.05 0.002013 4 457 brk
0.05 0.001965 2 883 semop
0.05 0.001924 3 760 lseek
0.04 0.001622 2 845 setitimer
0.04 0.001525 4 412 epoll_wait
0.04 0.001486 1 2595 close
0.04 0.001430 3 412 accept
0.04 0.001429 3 433 231 connect
0.04 0.001388 1 1185 rt_sigaction
0.03 0.000999 2 594 rt_sigprocmask
0.03 0.000963 0 2325 fcntl
0.02 0.000935 1 690 setsockopt
0.01 0.000393 1 534 socket
0.01 0.000380 1 393 12 shutdown
0.00 0.000158 1 127 setuid
0.00 0.000156 0 411 getsockname
0.00 0.000156 2 70 46 unlink
0.00 0.000080 0 254 epoll_ctl
0.00 0.000000 0 64 ioctl
0.00 0.000000 0 38 6 select
0.00 0.000000 0 10 alarm
0.00 0.000000 0 230 getsockopt
0.00 0.000000 0 3 rename
0.00 0.000000 0 22 getrusage
0.00 0.000000 0 127 setgid
0.00 0.000000 0 254 geteuid
0.00 0.000000 0 127 setgroups
0.00 0.000000 0 127 epoll_create
------ ----------- ----------- --------- --------- ----------------
100.00 3.759359 67166 8024 total
========== EDIT / UPDATE ==========
I found that when I limited traffic to 10% on the load balancer, as suggested, the new machine still crumbled. When I beat on it with siege and 400 connections, it held up really nicely. Load increased but hovered around 6, and it served all requests.
I normally have access logs disabled, but I enabled them for a bit and told the load balancer to start sending traffic again. I let this run until load hit 200, which took about 3 minutes, and saved the log.
I parsed the log into a list of requests to replay with siege. This should give me a more accurate benchmark than synthetic traffic.
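For reference, the log-to-siege step can be sketched roughly like this. It assumes the access log is in Apache's common or combined format; the host name newbox.example and the sample log lines are placeholders, not real data from my server.

```python
import re

# Matches the quoted request in an Apache common/combined log line,
# e.g. "GET /index.php HTTP/1.1"
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"')


def log_to_siege_urls(log_lines, base_url):
    """Return one full URL per GET request, in the order they were logged."""
    urls = []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match.group("method") == "GET":
            urls.append(base_url.rstrip("/") + match.group("path"))
    return urls


# Made-up sample lines for illustration only.
sample = [
    '10.0.0.1 - - [01/Jan/2011:21:13:41 -0500] "GET /index.php HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2011:21:13:42 -0500] "POST /login.php HTTP/1.1" 302 0',
    '10.0.0.3 - - [01/Jan/2011:21:13:43 -0500] "GET /thumb.php?id=7 HTTP/1.0" 200 2048',
]
urls = log_to_siege_urls(sample, "http://newbox.example")
print("\n".join(urls))
```

Writing the resulting list to a file, one URL per line, gives something siege can read with its -f option.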
Sure enough, with no live traffic and just me replaying those requests, load spiked to 200. I started chopping the file in half and testing the top and bottom halves separately. I'm continuing this until I find the specific request or requests that break the server.
So far it looks like the culprits are requests that make heavy use of ImageMagick, but I've whittled 10K GET requests down to 50 and am still going.
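The halving I'm doing by hand could be automated along these lines. This is only a sketch: breaks_server is a hypothetical callback that would replay a subset of requests (for example with siege) and report whether load spiked, and the demo below fakes it with a simple membership check.

```python
def bisect_requests(urls, breaks_server):
    """Narrow a request list down by repeated halving.

    breaks_server(subset) must replay the subset and return True
    if the server falls over.
    """
    candidates = list(urls)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        top, bottom = candidates[:mid], candidates[mid:]
        if breaks_server(top):
            candidates = top
        elif breaks_server(bottom):
            candidates = bottom
        else:
            # Neither half reproduces the problem alone, so the failure
            # needs requests from both halves; stop and inspect by hand.
            break
    return candidates


# Demo with a fake predicate: one made-up "bad" request hidden in the list.
bad = "/resize.php?img=big.jpg"
requests = ["/page%d.php" % i for i in range(9)]
requests.insert(6, bad)

culprit = bisect_requests(requests, lambda subset: bad in subset)
print(culprit)
```

Note that this only converges on a single request when one half reproduces the failure on its own. If the spike comes from several heavy requests hitting at once, the loop stops early and returns the smallest set that still breaks things.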