I have Windows Server 2008 running under VMware.
Recently, it's started to crash roughly every day, with continuous 100% CPU utilization and no response in the GUI.
Is there a step-by-step technique to track down the source of this problem?
What logs would I look at?
P.S. The problem appeared around the time I tried to uninstall Acronis and it blue-screened. However, I'm not sure if the current faults are related to Acronis at all.
You can also use the "Reliability and Performance Monitor" that is available under Windows Server 2008.
It automatically keeps a record of the reliability of the server, and assigns it a "reliability score" out of 10. This score starts at 10, and drops if the server experiences any crashes or unexpected shutdowns.
It even keeps a record of which programs were installed, and when, so you can diagnose if an installed program seemed to cause more faults.
You can also set it up to continuously log the CPU usage of programs, to see which program is causing the 100% CPU utilization.
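If you'd rather script this than set up a data collector, here is a minimal Python sketch of the same idea. It assumes you capture the output of `tasklist /v /fo csv` twice, some time apart, and that the column names are the English ones ("Image Name", "PID", "CPU Time"). The sample rows are made up and trimmed to the three columns the sketch reads.

```python
import csv
import io


def parse_cpu_seconds(text):
    """Convert a tasklist 'H:MM:SS' CPU Time field to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_times(tasklist_csv):
    """Map (image name, PID) to CPU seconds from tasklist CSV text."""
    reader = csv.DictReader(io.StringIO(tasklist_csv))
    return {
        (row["Image Name"], row["PID"]): parse_cpu_seconds(row["CPU Time"])
        for row in reader
    }


def busiest(before_csv, after_csv, top=3):
    """Rank processes by CPU seconds consumed between two snapshots."""
    before = cpu_times(before_csv)
    after = cpu_times(after_csv)
    # A process missing from the first snapshot is counted from zero.
    deltas = {key: secs - before.get(key, 0) for key, secs in after.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:top]


# Made-up snapshots, for illustration only.
BEFORE = (
    '"Image Name","PID","CPU Time"\n'
    '"sqlservr.exe","1200","0:10:00"\n'
    '"svchost.exe","800","0:01:30"\n'
)
AFTER = (
    '"Image Name","PID","CPU Time"\n'
    '"sqlservr.exe","1200","0:10:05"\n'
    '"svchost.exe","800","0:02:30"\n'
)

print(busiest(BEFORE, AFTER, top=1))
```

The process with the largest jump in CPU time between snapshots is the one to look at first.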
If there is a crash dump such as c:\windows\memory.dmp, you can use WinDbg to analyze it. Usually you want to look for third-party drivers in the dump. Step-by-step instructions can be found here.
The System event log. The Application Event log. Google the message of the BSOD. Check the disk's integrity with chkdsk.
You have two options:
Logs are a good start for looking back at the history of the system, if you know the time when the problems started or the logs are quiet enough for you to notice a pattern leading up to the pegged CPU. If the system BSODs, you can load the .dmp files into WinDbg.
If you're looking for things that could lead to the CPU spikes:
Once you have a good candidate for the problem, you can turn on Process Monitor from Sysinternals. It records every file and registry interaction that every process on the system performs, in real time. It can even be configured to load at boot and capture everything until you next run the GUI (be warned: this is A LOT of data, so it's only advisable if you can replicate the problem quickly after boot).
There are a bunch of rabbit holes that a root cause analysis can take you down, so feel free to let us know how it goes.
If it is blue screening, check out the minidump file: http://support.microsoft.com/kb/315271
... this will tell you (usually) the driver or piece of software that caused the crash.
2009-07-06 - I'm thinking it's the hard drive.
I did a chkdsk, and it crashed with the same symptoms as before, halfway through the chkdsk. I'm using a Solid State Drive (SSD), the "PQI DK9128GD6R000A03 128GB SATA 2.5" SSD", with an MTBF of 1,500,000 hours. Despite having an MTBF of roughly 171 years, it seems to have died after 2 weeks of normal use! To check my theory, I copied the VMware files to a standard hard drive, ran chkdsk, and it worked like a charm. I'll see if the system survives a week of uptime, and if it does I can officially defenestrate my PQI SSD.
2009-07-07 - System crashed again. Back to the drawing board.
2009-07-08 - Rolled back a further 20 days to before I installed the SSD. We'll see if it crashes again (it did).
2009-07-09 - uninstalled OpenVPN, upgraded to the latest version of Skype, upgraded SQL 2008 to SP1, removed TeamViewer. We'll see if it crashes again (it did, in the middle of an Acronis backup).
2009-07-09 - suspect that the amount of virtual memory available to the VMware machine that runs the server is too small; I've got it at 4GB at the moment. Increasing it (this had no effect).
2009-07-09 - discovered that if the VMware container running Windows Server 2008 crashes with 100% CPU utilization, and I pause/restart it, then it uncrashes and resumes operation! This tends to point to a problem with VMware or its host OS (which is XP), rather than a problem within the Windows Server 2008 itself. Getting very close to the heart of the problem now.
2009-07-09 - Windows Server 2008 only crashes when the host OS is under very heavy load. Increased the number of CPUs it can utilize to 2; this seems to have fixed the problem.
In conclusion:
Problem solved, thanks guys!
Could you please explain what you mean by crash? Is the server encountering a BSOD, or is it just hanging at 100% CPU?
For troubleshooting, you can make the server log to a syslog server, and run a script at intervals that lists processes and their resource usage and writes its output to a network share.
If the server blue-screens, try googling the error code shown on the BSOD.
Also, the Acronis uninstall may have left an error log with some information in the installation folder.
Does it crash exactly every 24 hours (at the same time each day)?
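A rough Python sketch of such a script, in case it helps. It assumes `tasklist /fo csv` as the process listing and leaves the scheduling to Task Scheduler. The function names and the share path in the comment are made up for illustration.

```python
import datetime
import subprocess


def capture_tasklist():
    """Return the current process list as CSV text (Windows only)."""
    result = subprocess.run(
        ["tasklist", "/fo", "csv"], capture_output=True, text=True, check=True
    )
    return result.stdout


def append_snapshot(log_path, capture=capture_tasklist, now=None):
    """Append one timestamped process snapshot to log_path."""
    stamp = (now or datetime.datetime.now()).isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"--- {stamp} ---\n")
        log.write(capture())
        log.write("\n")


# Example call with a hypothetical share path (not run here):
# append_snapshot(r"\\fileserver\logs\procs.log")
```

Run it every minute or so from a scheduled task. Because the log lives on another machine, the last snapshot before a hang is still readable even if the server has to be reset.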
If so, there is possibly a scheduled process that causes the crash.
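One quick way to check this is to note the time of each crash and see how tightly the times cluster on the clock. Here is a small Python sketch. The timestamps are invented for illustration, and the 15-minute tolerance is an arbitrary assumption.

```python
from datetime import datetime

MINUTES_PER_DAY = 24 * 60


def clock_spread(crash_times):
    """Smallest clock window (in minutes) that contains every crash time.

    Times are treated as points on a 24-hour circle, so crashes at
    23:55 and 00:05 count as 10 minutes apart.
    """
    minutes = sorted(t.hour * 60 + t.minute for t in crash_times)
    gaps = [later - earlier for earlier, later in zip(minutes, minutes[1:])]
    # Gap that wraps around midnight, from the last time back to the first.
    gaps.append(minutes[0] + MINUTES_PER_DAY - minutes[-1])
    return MINUTES_PER_DAY - max(gaps)


def looks_scheduled(crash_times, tolerance_minutes=15):
    """True if at least two crashes fall within the tolerance on the clock."""
    return len(crash_times) >= 2 and clock_spread(crash_times) <= tolerance_minutes


# Invented crash times, for illustration only.
crashes = [
    datetime(2009, 7, 6, 3, 0),
    datetime(2009, 7, 7, 3, 2),
    datetime(2009, 7, 8, 3, 5),
]
print(clock_spread(crashes), looks_scheduled(crashes))
```

If the crashes do cluster, compare that time against the scheduled tasks, backup jobs, and antivirus scans on both the guest and the host.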