I'm seeing a high load average at regular times on one of my sites. I have alerts set up, but it's not obvious what's causing the high load, so I'd like to capture the state of the system when the alert goes off.
What's a good way to capture the relevant pieces of information so I can post-facto determine what's causing the load?
This is Linux (Ubuntu), Apache, mod_python/Django, MySQL.
I like to use a program called atop (http://www.atoptool.nl/). It is similar to top, but it also records snapshots of the atop display at user-defined intervals. Set
INTERVAL=60
in /etc/atop/atop.daily to get 1-minute snapshots. Run
atop -r /var/log/atop/atop_20100214
to view the 1-minute intervals for a particular date. Use the t and T keys to step forward and backward through time. These file paths are for CentOS; yours may be slightly different.
If Ubuntu has sar, it can capture system disk usage, VM activity, and so on. Once you set up the machine to collect data, you can run reports for both busy and non-busy times and compare the activity. Apache has mod_status and MySQL has some statistics tools; you could probably pull something from them periodically through cron.
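For the "collect something periodically through cron, or when the alert fires" idea, a minimal Python sketch might look like the following. It is only an illustration, not a finished tool: the function names, the output file naming, and the choice of commands are all assumptions, and the mysqladmin entry assumes credentials are already available to the user running it (for example in ~/.my.cnf). Commands that are not installed are noted in the output rather than aborting the run.

```python
#!/usr/bin/env python3
"""Write the output of a few diagnostic commands to a timestamped file."""
import subprocess
import sys
from datetime import datetime
from pathlib import Path

# Hypothetical default command set; adjust to what is installed on the host.
DEFAULT_COMMANDS = {
    "uptime": ["uptime"],
    # The STAT column helps spot processes in state D (uninterruptible sleep),
    # which count toward the Linux load average.
    "processes by cpu": ["ps", "-eo", "pid,ppid,user,stat,pcpu,pmem,etime,args",
                         "--sort=-pcpu"],
    "vmstat": ["vmstat", "1", "5"],
    "memory": ["free", "-m"],
    # Assumes MySQL credentials are configured for the calling user.
    "mysql processlist": ["mysqladmin", "processlist"],
}


def run_command(argv, timeout=30):
    """Run one command and return its combined output, or a note on failure."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True,
                                timeout=timeout)
    except OSError as exc:
        return f"(could not run {argv[0]}: {exc})\n"
    except subprocess.TimeoutExpired:
        return f"(timed out after {timeout}s)\n"
    return result.stdout + result.stderr


def capture_snapshot(out_dir, commands=None):
    """Run every command and write the results to one timestamped file."""
    commands = DEFAULT_COMMANDS if commands is None else commands
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    out_file = out_dir / f"snapshot-{stamp}.txt"
    with out_file.open("w") as fh:
        for label, argv in commands.items():
            fh.write(f"===== {label}: {' '.join(argv)} =====\n")
            fh.write(run_command(argv))
            fh.write("\n")
    return out_file


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: snapshot.py OUTPUT_DIR")
    else:
        print(capture_snapshot(sys.argv[1]))
```

Saved as, say, snapshot.py (a hypothetical name), it could be called from the alert handler or from cron with an output directory such as /var/log/load-snapshots, and the resulting files compared between busy and quiet periods.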
New Relic has excellent tools for monitoring the causes of server load, from both an application and a server perspective:
Application monitoring, such as slow SQL queries and error rates.
Server monitoring, with metrics such as network, disk, RAM, and CPU utilization.
User monitoring, such as performance by page, location, and browser, plus a load-time breakdown between app, network, DOM, and rendering.
We've used it here for nearly 12 months now and it's been invaluable. And you get a free shirt.