From time to time, something happens to our website that makes it slow and unresponsive. Inevitably, this happens at like 3 AM, on a day when all the devs have gone to bed early.
Are there any good tools for taking a "snapshot" of the state of the webserver and the db server at that moment? I want to get an email with a full report -- what was the CPU doing? any process thrashing disks? ASP.NET worker process queue out of control? long-running db queries?
This is for a Windows Server 2008 R2 box running IIS, and a SQL Server 2008 R2 instance.
Basically, I want to be able to see enough stuff that I have some hope of figuring out what was making things slow.
My guess is that your CPUs go into sleep state. So try to monitor the state of the CPUs...
If you want a complete quiesced copy of your server, the most obvious way to get one is to run the machine as a VM on a decent hypervisor. They all support snapshotting, so this would be easy to do.
Sounds like you could use a monitoring program with trending information. I don't know what solution will work best for you in the Windows world, but I will describe in general terms how I would tackle a similar problem on a Linux box using a monitoring program called Zabbix (Zabbix can monitor Windows, but the server must run on Linux). The methods will be different for you, but the concepts will be the same and hopefully they can be useful to you as a guide.
First I would configure Zabbix to monitor the CPU load on my host, along with memory and so forth. I would consider having it monitor the syslog on the host as well, though I'm always able to review the syslog on the local system.
I would then set up a trigger which would activate when the CPU utilization rose above 90% after hours. I would associate an action with the trigger: a remote command that runs a script on the monitored host. The script would pull a dump of the currently running processes, along with some other text data, and push it to the Zabbix server, where it would land in a log item set up on that host specifically to trap this data. Alternatively, I could have the remote script do a larger dump of system data and email it to specific users.
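To make the trigger-plus-action idea concrete, here is a rough Python sketch. It is not Zabbix's API or configuration format, just the same logic written out by hand. The names CpuTrigger, process_dump and build_report are made up for illustration, the 22:00 to 06:00 "after hours" range is an assumption, and so is the three-sample window, which I added so that a single spike does not fire the trigger.

```python
import platform
import subprocess
from collections import deque
from datetime import datetime, time
from email.message import EmailMessage


class CpuTrigger:
    """Fires when CPU stays above a threshold during after-hours.

    Hypothetical helper: threshold, window and quiet hours are assumptions.
    """

    def __init__(self, threshold=90.0, window=3,
                 quiet_start=time(22, 0), quiet_end=time(6, 0)):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # most recent CPU readings
        self.quiet_start = quiet_start
        self.quiet_end = quiet_end

    def after_hours(self, now):
        # The quiet period wraps past midnight, hence "or" rather than "and".
        t = now.time()
        return t >= self.quiet_start or t < self.quiet_end

    def observe(self, cpu_percent, now):
        """Record one reading (now is a datetime); return True to fire."""
        self.samples.append(cpu_percent)
        window_full = len(self.samples) == self.samples.maxlen
        return (window_full
                and min(self.samples) > self.threshold
                and self.after_hours(now))


def process_dump():
    """Return a text listing of the processes running on this host."""
    if platform.system() == "Windows":
        cmd = ["tasklist", "/V", "/FO", "CSV"]  # verbose listing as CSV
    else:
        cmd = ["ps", "aux"]
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout


def build_report(host, cpu_percent, dump, sender, recipient):
    """Wrap the snapshot in an email message; sending is left to the caller."""
    msg = EmailMessage()
    msg["Subject"] = f"[{host}] CPU at {cpu_percent:.0f}% after hours"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(dump)
    return msg
```

A polling loop would call observe() with each new CPU reading and, when it returns True, pass process_dump() to build_report() and hand the message to smtplib.SMTP(...).send_message() using whatever mail relay you have. Where the CPU reading comes from is deliberately left out here; in the setup described above, the monitoring agent supplies it.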
The best way to approach this issue would be performance data collection, as boring as it may sound.
For Windows hosts, in my humble opinion, Perfmon still provides the best way to do it. This is what I would do: run Perfmon with basic counters before and during the problem period (a few hours), covering the main counters for RAM, disk, network and CPU. Then use the Performance Analysis of Logs (PAL) tool, http://pal.codeplex.com/releases/view/51623, to analyze the resulting logs. The PAL report, an HTML page, should give you some graphs and warn you if you have performance problems. PAL also has a profile for SQL Server performance.
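If you would rather script the collection than click through the Perfmon UI, the typeperf command-line tool that ships with Windows can log the same counters. Below is a small Python sketch that only builds the command line. The function name, the counter selection, the 15-second interval and the output path are my own assumptions, so adjust them to your box. The BIN format writes a binary .blg log, which as far as I know PAL can read, but check that against the PAL version you use.

```python
# Basic counters for CPU, RAM, disk and network (an assumed starting set).
BASIC_COUNTERS = [
    r"\Processor(_Total)\% Processor Time",
    r"\Memory\Available MBytes",
    r"\PhysicalDisk(_Total)\Avg. Disk Queue Length",
    r"\Network Interface(*)\Bytes Total/sec",
]


def build_typeperf_command(counters, interval_s, hours, out_path):
    """Build a typeperf command line that samples `counters` to a log file."""
    samples = int(hours * 3600 / interval_s)  # total number of samples to take
    return (["typeperf"] + list(counters) +
            ["-si", str(interval_s),   # seconds between samples
             "-sc", str(samples),      # stop after this many samples
             "-f", "BIN",              # binary (.blg) log format
             "-o", out_path,           # output file
             "-y"])                    # answer yes to prompts, e.g. overwrite
```

For example, build_typeperf_command(BASIC_COUNTERS, 15, 4, r"C:\PerfLogs\baseline.blg") gives a command that samples every 15 seconds for four hours. You can pass the result to subprocess.run() on the server, or start it from a scheduled task shortly before the usual problem window.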
To better understand the Perfmon results and fix the underlying problems, I suggest reading the following articles:
http://www.grumpyolddba.co.uk/monitoring/monitoring.htm (section about counters)
http://www.brentozar.com/sql/sql-server-performance-tuning/
http://www.sqlservercentral.com/blogs/sqlmanofmystery/2009/09/14/the-fundamentals-of-storage-systems-introduction/