Question

Contango

Asked: 2009-07-04 03:38:45 +0800 CST

How to track down the cause of Windows Server 2008 crashing?

772

I have Windows Server 2008 running under VMware.

Recently, it has started to crash roughly every day, with continuous 100% CPU utilization and no response in the GUI.

Is there a step-by-step technique to track down the source of this problem?

What logs would I look at?

p.s. The problem appeared around the time I tried to uninstall Acronis, and it blue screened. However, I'm not sure if the current faults are related to Acronis at all.

8 Answers

Contango · Answer 1 · 2009-07-11T02:07:19+08:00

You can also use the "Reliability and Performance Monitor" that is available under Windows Server 2008.

It automatically keeps a record of the reliability of the server, and assigns it a "reliability score" out of 10. This score starts at 10, and drops if the server experiences any crashes or unexpected shutdowns.

It even keeps a record of which programs were installed, and when, so you can diagnose if an installed program seemed to cause more faults.

You can also set it up to continuously log the CPU usage of programs, to see which program is causing the 100% CPU utilization.
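As a rough sketch of the CPU-logging idea: the built-in typeperf tool can write per-process CPU counters to a CSV file, and a few lines of Python can then pick out the process with the highest average. The counter path, interval and file name in the comment are only examples, and busiest_process is a hypothetical helper name, not part of any Windows tool.

```python
import csv
import io

# Example capture command (run on the server; values are illustrative):
#   typeperf "\Process(*)\% Processor Time" -si 5 -sc 720 -f CSV -o cpu.csv

def busiest_process(typeperf_csv):
    """Return (counter_name, average_value) for the busiest process counter.

    Assumes the first row is the header and the first column is the timestamp.
    """
    rows = list(csv.reader(io.StringIO(typeperf_csv)))
    header, samples = rows[0], rows[1:]
    totals, counts = {}, {}
    for row in samples:
        for name, value in zip(header[1:], row[1:]):
            # Skip the pseudo-processes that would otherwise dominate.
            if "(Idle)" in name or "(_Total)" in name:
                continue
            try:
                number = float(value)
            except ValueError:
                continue  # blank cell: no sample was taken for this counter
            totals[name] = totals.get(name, 0.0) + number
            counts[name] = counts.get(name, 0) + 1
    averages = {name: totals[name] / counts[name] for name in totals}
    best = max(averages, key=averages.get)
    return best, averages[best]
```

To use it, read the CSV file written by typeperf into a string and pass it in. The function raises ValueError if the file contains no usable samples.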

6

Peter Hahndorf · Answer 2 · 2009-07-05T14:19:43+08:00

If there is a crash dump such as c:\windows\memory.dmp, you can use WinDbg to analyze it. Usually you want to look for third-party drivers in the dump. Step-by-step instructions can be found here.

4

Dave Markle · Answer 3 · 2009-07-04T04:01:59+08:00

Check the System event log and the Application event log. Google the message shown on the BSOD. Check the disk's integrity with chkdsk.
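For the System event log, a minimal sketch of pulling recent crash-related entries from the command line with wevtutil. The choice of event IDs is my assumption: 6008 (unexpected shutdown) and 1001 (bugcheck) are the usual ones to start with. The function names are hypothetical.

```python
import subprocess

def crash_event_command(max_events=20):
    """Build a wevtutil command that lists recent crash-related System events."""
    # Assumed event IDs: 6008 = unexpected shutdown, 1001 = bugcheck (BSOD).
    query = "*[System[(EventID=6008 or EventID=1001)]]"
    return [
        "wevtutil", "qe", "System",
        "/q:" + query,
        "/c:" + str(max_events),  # maximum number of events to return
        "/rd:true",               # newest events first
        "/f:text",                # human-readable output
    ]

def query_crash_events(max_events=20):
    """Run the command on the server itself and return its text output."""
    result = subprocess.run(
        crash_event_command(max_events), capture_output=True, text=True
    )
    return result.stdout
```

Calling print(query_crash_events()) on the server should show the timestamps of recent unexpected shutdowns, which you can then line up against the other logs.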

3

Bob · Answer 4 · 2009-07-06T00:13:08+08:00

You have two options:

Look at records to try and figure out what caused past problems
Look for signs of things that could lead to the CPU spikes in an attempt to replicate the problem

Logs are a good start for looking back at the history of the system, if you know the time when the problems started or the logs are quiet enough for you to notice a pattern leading up to the pegged CPU. If the system blue screened, you can load the .dmp files into WinDbg.

If you're looking for things that could lead to the CPU spikes:

Process Explorer from Sysinternals: look for odd processes or open handles to files or network shares that don't exist anymore. It may point you in the right direction to replicate the problem.
Windows Reliability and Performance Monitor / Perfmon: You can see how each process is behaving with regard to Disk/CPU/Memory/Network usage, as well as hundreds of other counters. They may give you a clue as to what is running away with the VM before it happens.

Once you have a good candidate for the problem, you can turn on Process Monitor from Sysinternals. It will dump every file and registry interaction that every process on the system performs, in real time. It can even be configured to load at boot and capture everything until you next run the GUI (be warned this is A LOT of data, so it's only advisable if you can replicate the problem quickly after boot).

There are a bunch of rabbit holes that a root cause analysis can take you down, so feel free to let us know how it goes.

Adam Brand · Answer 5 · 2009-07-04T06:49:29+08:00

If it is blue screening, check out the minidump file: http://support.microsoft.com/kb/315271

... this will tell you (usually) the driver or piece of software that caused the crash.

2

Gravitas · Answer 6 · 2009-07-07T13:49:31+08:00

2009-07-06 - I'm thinking it's the hard drive.

I ran chkdsk, and it crashed halfway through with the same symptoms as before. I'm using a Solid State Drive (SSD), the "PQI DK9128GD6R000A03 128GB SATA 2.5" SSD", with an MTBF of 1,500,000 hours. Despite having an MTBF of roughly 171 years, it seems to have died after 2 weeks of normal use! To check my theory, I copied the VMware files to a standard hard drive, ran chkdsk, and it worked like a charm. I'll see if the system survives a week of uptime, and if it does I can officially defenestrate my PQI SSD.

2009-07-07 - System crashed again. Back to the drawing board.

2009-07-08 - Rolled back a further 20 days to before I installed the SSD. We'll see if it crashes again (it did).

2009-07-09 - uninstalled OpenVPN, upgraded to the latest version of Skype, upgraded SQL 2008 to SP1, removed TeamViewer. We'll see if it crashes again (it did, in the middle of an Acronis backup).

2009-07-09 - suspect that the amount of virtual memory available to the VMware machine that runs the server is too small; I've got it at 4GB at the moment. Increasing it (this had no effect).

2009-07-09 - discovered that if the VMware container running Windows Server 2008 crashes with 100% CPU utilization, and I pause/restart it, then it uncrashes and resumes operation! This tends to point to a problem with VMware or its host OS (which is XP), rather than a problem within Windows Server 2008 itself. Getting very close to the heart of the problem now.

2009-07-09 - Windows Server 2008 only crashes when the host OS is under very heavy load. Increased the number of CPUs it can utilize to 2; this seems to have fixed the problem.

In conclusion:

Original problem was caused by a bad hard drive with bad sectors (it was actually a 128GB SSD from PQI - wouldn't expect a Solid State Drive (SSD) to fail two weeks after purchase but this one did).
Next problem was caused by the host OS that was running VMware coming under high load. Fixed this by allocating more RAM and increasing the size of the page file.
If it happens again, I have a workaround (just pause/restart VMware v6.5 to "unfreeze" Windows Server 2008 running inside of it).

Problem solved, thanks guys!

Maxwell · Answer 7 · 2009-07-04T04:26:51+08:00

Could you please explain what you mean by "crash"? Is the server hitting a BSOD, or is it just hanging at 100% CPU?

For troubleshooting, you can make the server log to a syslog server, and run a script at intervals that lists processes and their resource usage and writes its output to a network share.
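A minimal sketch of such a script, assuming Python is available on the server and the English column names that tasklist /V /FO CSV prints (they are localized on other language versions). The function names and the log path are hypothetical; point the path at your network share and run the script from Task Scheduler every few minutes.

```python
import csv
import io
import subprocess
from datetime import datetime

def parse_tasklist(csv_text):
    """Parse `tasklist /V /FO CSV` output into dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def cpu_seconds(cpu_time):
    """Convert tasklist's H:MM:SS 'CPU Time' value into seconds."""
    hours, minutes, seconds = (int(part) for part in cpu_time.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def top_by_cpu(rows, count=5):
    """Return the processes with the most accumulated CPU time, skipping Idle (PID 0)."""
    real = [row for row in rows if row["PID"] != "0"]
    return sorted(real, key=lambda row: cpu_seconds(row["CPU Time"]), reverse=True)[:count]

def append_snapshot(log_path):
    """Append a timestamped snapshot of the top CPU consumers to log_path."""
    output = subprocess.run(
        ["tasklist", "/V", "/FO", "CSV"],
        capture_output=True, text=True, check=True,
    ).stdout
    with open(log_path, "a", encoding="utf-8") as log:
        log.write("# " + datetime.now().isoformat() + "\n")
        for row in top_by_cpu(parse_tasklist(output)):
            log.write("{},{},{},{}\n".format(
                row["Image Name"], row["PID"], row["CPU Time"], row["Mem Usage"]))
```

Note that CPU Time is cumulative since the process started, so compare consecutive snapshots to see which process was climbing just before a hang.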

If the server is blue screening, try googling the error code shown on the BSOD.

Also, the Acronis uninstall may have left an error log with some useful information in the installation folder.

1

Gamecat · Answer 8 · 2009-07-04T04:04:30+08:00

Does it crash exactly every 24 hours (at the same time each day)?

If so, there is possibly a scheduled process that causes the crash.
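If you have the crash times written down (from the event log, say), a quick way to test this is to check whether the gaps between them are whole days. This is only a sketch: crashes_look_scheduled and the 10-minute tolerance are my own assumptions.

```python
from datetime import datetime, timedelta

def crashes_look_scheduled(crash_times, tolerance=timedelta(minutes=10)):
    """Return True if every gap between consecutive crashes is close to a whole number of days."""
    ordered = sorted(crash_times)
    if len(ordered) < 2:
        return False  # not enough data to see a pattern
    day = timedelta(days=1)
    for earlier, later in zip(ordered, ordered[1:]):
        gap = later - earlier
        whole_days = round(gap / day)
        if whole_days == 0 or abs(gap - whole_days * day) > tolerance:
            return False
    return True
```

If it returns True, look through Task Scheduler, backup jobs and antivirus scans for anything that starts at that time of day.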

1
