We set up a new server here a few weeks ago that I am informally responsible for managing.
Almost everything works perfectly except for one thing: Every so often it hangs without warning.
Some facts about this hang:
- It is not a single application or service; the entire system is non-responsive.
- Nothing is displayed (monitor acts as though there's no VGA signal).
- The power LED is on and the fans are running.
- Pressing the power button does nothing (normally it would shut the machine down).
- Pings generally time out; once it did respond, another time I got "destination host unreachable".
- Event logs show nothing (literally nothing at all) from before the hang until the hard reboot.
- There are no performance problems, strange errors, or other obvious signs of impending doom leading up to the eventual hang.
- The machine is generally not heavily loaded (it's for development, not production), and the hangs appear to be occurring at non-peak times of day (between midnight and 6 AM).
Some additional facts about the machine/environment:
- Windows Server 2008 R2
- Running SQL Server 2008 and IIS (not much else)
- All drivers up to date, patches installed, etc.
- No vendor-supplied diagnostics (not "top tier").
- The machine is completely new, not merely reformatted or repurposed. No recent changes although the machine is less than a month old to start with.
I don't expect any easy answers here. What I'd like to know is how I can methodically determine the root cause of this problem, be it a misbehaving service, defective hardware, or something else.
Is there any kind of logging I can set up that will help me get to the bottom of this? Any hardware diagnostics or remote monitoring? Anything else I can do to help me discover what's actually happening, or at least be able to eliminate what isn't wrong?
Just to reiterate, I really don't want to start speculating about possible causes and take a trial-and-error approach, because it's going to be at least several days at a time before I would have conclusive results. I'm looking for solutions to reliably trace the problem to its source.
A good place to start:
http://blogs.technet.com/b/askperf/archive/2007/09/25/troubleshooting-server-hangs-part-one.aspx
With nothing in the logs at all, and no way to reproduce the problem, you've got a lot less to go on, so it will be tougher to be methodical as you are requesting.
If this is hardware from a top-tier vendor, run their diagnostics. IBM, HP, and Dell all have diagnostic suites, and free monitoring suites as well (Director, SIM, and OpenManage, respectively).
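If no vendor monitoring suite is available, a very rough substitute is to probe the server from a second machine and log when it stops answering. The following is only a sketch under assumptions: the function names and the log file name are made up for illustration, and it checks TCP ports (80 for IIS, 1433 for a default SQL Server instance) rather than ICMP ping.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def monitor(host, ports, interval=30.0, log_path="probe.log", rounds=None):
    """Probe each port every `interval` seconds and append one line per round.

    rounds=None means keep going until interrupted.
    """
    done = 0
    while rounds is None or done < rounds:
        stamp = datetime.now().isoformat(timespec="seconds")
        status = " ".join(
            f"{port}={'up' if probe(host, port) else 'DOWN'}" for port in ports
        )
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(f"{stamp} {host} {status}\n")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Running something like `monitor("devserver", [80, 1433])` on another box (the host name is hypothetical) would at least give a timestamp for when the services stopped answering, which the empty event log cannot.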
Chronologically, when did this start happening, and did anything change in or near this server before that point? New hardware installed (and/or drivers), update to AV software, new RAM? You said it's a new server - is it new to you, or new to the organization entirely?
Can you P2V it in a sandbox and see if the problem persists?
Is it possibly related to increased load - can you cause it to happen, or take a guess (or show some graphs) to see if more people are using it at the times it happens?
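To pin down when the hangs actually start (and so what else was running at that time), one option is a heartbeat file written on the server itself. This is a sketch, not a tested tool: the function names are hypothetical, and it assumes something like a scheduled task calls `write_heartbeat` once a minute.

```python
import os
from datetime import datetime, timedelta


def write_heartbeat(path, now=None):
    """Append one ISO timestamp and force it to disk so it survives a hard reset."""
    now = now or datetime.now()
    with open(path, "a", encoding="utf-8") as f:
        f.write(now.isoformat(timespec="seconds") + "\n")
        f.flush()
        os.fsync(f.fileno())


def find_gaps(path, expected=timedelta(seconds=60), slack=2.0):
    """Return (last_seen, next_seen) pairs where the heartbeat went quiet.

    A gap is any pair of consecutive timestamps more than slack * expected apart.
    """
    with open(path, encoding="utf-8") as f:
        stamps = [datetime.fromisoformat(line.strip()) for line in f if line.strip()]
    limit = expected * slack
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > limit]
```

The first timestamp of each gap is roughly when the machine stopped doing work, which can then be lined up against scheduled jobs, backups, or anything else that runs overnight.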
It's pretty paradoxical: you say you have no hardware diagnostics, but you want a methodical way to proceed... Hardware diagnostics are the methodical way to proceed for hardware faults.
Otherwise, if it's a low-level software fault, there might (should?) be a memory dump somewhere, and Microsoft provides tools to analyse it. They don't provide much documentation for understanding low-level processes, though, so it might be a dead end.
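A quick way to check whether any dump exists is to look in the default locations, `%SystemRoot%\MEMORY.DMP` and `%SystemRoot%\Minidump`. The snippet below is a sketch that assumes the dump paths have not been changed from those defaults; the function name is made up. Note that a dump is normally only written when Windows bugchecks, so a silent hard hang may well leave nothing behind.

```python
import os
from datetime import datetime
from pathlib import Path


def find_crash_dumps(system_root=None):
    """List crash dump files in the default Windows locations, newest first.

    Returns (path, size_in_bytes, modified_time) tuples.
    """
    root = Path(system_root or os.environ.get("SystemRoot", r"C:\Windows"))
    candidates = [root / "MEMORY.DMP"]
    minidump_dir = root / "Minidump"
    if minidump_dir.is_dir():
        candidates.extend(minidump_dir.glob("*.dmp"))
    found = []
    for path in candidates:
        if path.is_file():
            info = path.stat()
            found.append((path, info.st_size, datetime.fromtimestamp(info.st_mtime)))
    return sorted(found, key=lambda item: item[2], reverse=True)
```

If this turns up a file whose modified time matches one of the hangs, that file is what you would open in WinDbg or hand to Microsoft support.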
Might, should, would... it's been a long time since I experimented with such stuff! The problem is usually that you're dealing with closed source, so you're virtually on your own!
Maybe support from Microsoft?