I run a reasonably busy site (700,000 page views/day, PHP/MySQL) that gets steady traffic, normally without spikes. On each of the last two days, around peak usage time, the site suddenly went from very fast to unresponsive for about an hour, and then went back to being fast.
The load average jumps dramatically at 2:10 AM, and the process count (plist-sz) jumps with it:
12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
12:10:01 AM         1       270      2.54      3.56      4.00
12:20:01 AM        10       270      5.58      5.09      4.61
12:30:01 AM         9       297     10.06      9.63      7.22
12:40:01 AM         7       296      3.42      5.17      6.15
12:50:02 AM         8       291      4.36      4.57      5.43
01:00:02 AM        20       297      9.38      7.57      6.49
01:10:01 AM         6       279      5.83      6.86      6.90
01:20:01 AM        11       263      5.77      5.43      5.98
01:30:01 AM         2       291      6.70      5.56      5.66
01:40:01 AM         2       285      3.73      5.09      5.37
01:50:01 AM         6       285      3.84      4.65      5.11
02:00:01 AM         8       283      2.56      3.72      4.45
02:10:01 AM         2       431     14.67     10.88      7.34
02:20:01 AM         1       425      7.10     11.48      9.73
02:30:01 AM         4       453     10.30     12.79     11.23
02:40:01 AM         2       440     14.12     16.13     13.41
Here are my server details and what I observed:
- Hostgator VPS Level 7, 2 x 2GHz CPU, 3.2G RAM, CentOS 5.9, Apache 2.2.19, MySQL
- MySQL did not show any abnormal load during this time
- Apache was showing all workers in the "W" state
- Rebooting, restarting MySQL, and restarting Apache did not resolve the issue
- Nothing abnormal in the Apache error log (except lots of 503 errors during this time)
I'm really not sure where to start investigating this issue. I'd appreciate any pointers on:
1 - how to fully diagnose this issue now
2 - or what tools to install / commands to run to capture extra data when it happens again.
Thanks in advance.
How to diagnose:
- Plot graphs. Use Munin, Cacti, or another external monitoring system to find out exactly which resource was exhausted.
- Use atop or sar to get detailed per-process activity over time. When your server goes down, review the recorded data, working backward from the time of the outage.
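As a rough illustration of working backward through recorded data, here is a minimal Python sketch that scans load-queue samples in the same layout as the table in the question and flags the interval where the process count jumps. The names `parse_sar_q` and `find_jumps` and the 1.3 ratio threshold are made up for this example, and it assumes Python 3 and the 12-hour AM/PM timestamp format shown above.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    time: str
    runq_sz: int
    plist_sz: int
    ldavg_1: float


def parse_sar_q(text: str) -> List[Sample]:
    """Parse rows like '02:10:01 AM 2 431 14.67 10.88 7.34'."""
    samples = []
    for line in text.splitlines():
        parts = line.split()
        # Skip the header row and anything that is not a 7-field data row.
        if len(parts) != 7 or not parts[2].isdigit():
            continue
        samples.append(
            Sample(
                time=parts[0] + " " + parts[1],
                runq_sz=int(parts[2]),
                plist_sz=int(parts[3]),
                ldavg_1=float(parts[4]),
            )
        )
    return samples


def find_jumps(samples: List[Sample], plist_ratio: float = 1.3) -> List[Sample]:
    """Return samples whose process count grew by more than plist_ratio
    compared with the previous sample (threshold is an arbitrary choice)."""
    jumps = []
    for prev, cur in zip(samples, samples[1:]):
        if cur.plist_sz > prev.plist_sz * plist_ratio:
            jumps.append(cur)
    return jumps
```

Fed the table from the question, `find_jumps(parse_sar_q(text))` should single out the 02:10:01 AM sample, where plist-sz goes from 283 to 431. That is the window to inspect in the atop or sar process records.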
The problem turned out to be a misbehaving cPanel system cron job that was using up all the CPU, which in turn left Apache unable to serve requests.
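For anyone who wants to catch this kind of runaway process the next time it happens, here is a small sketch that ranks processes by CPU usage from `ps` output. It is only an illustration: the names `top_cpu` and `snapshot` are invented here, it assumes Python 3.5 or newer is installed (the stock Python on CentOS 5 is older), and note that the `pcpu` value from `ps` is averaged over each process's lifetime rather than being an instantaneous reading.

```python
import subprocess
from typing import List, NamedTuple


class Proc(NamedTuple):
    pcpu: float
    pid: int
    user: str
    command: str


def top_cpu(ps_output: str, n: int = 5) -> List[Proc]:
    """Parse the output of `ps -eo pcpu,pid,user,args` and return the
    n processes with the highest %CPU."""
    procs = []
    for line in ps_output.splitlines()[1:]:  # first line is the header
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        try:
            procs.append(
                Proc(float(parts[0]), int(parts[1]), parts[2], parts[3].strip())
            )
        except ValueError:
            continue  # ignore rows that do not parse cleanly
    return sorted(procs, key=lambda p: p.pcpu, reverse=True)[:n]


def snapshot(n: int = 5) -> List[Proc]:
    """Run ps once and return the current top CPU consumers."""
    result = subprocess.run(
        ["ps", "-eo", "pcpu,pid,user,args"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return top_cpu(result.stdout, n)
```

Calling `snapshot()` from a script run every minute by cron and appending the result to a log file would leave a record showing which command was at the top when the load spiked.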