We have a Server 2003 R2 Standard machine (which I'll refer to as SRV01) that's knocking on a bit now, but it still acts as a file, print and SQL server on our company's network. SRV01 hosts user profiles, home directories and pretty much all our business data. Note that our AD is currently at the 2008 R2 level.
This server is due to be upgraded in the next 12 months, but I've no budget to spend on it just yet.
A bit of history of this server follows:
When SRV01 was first commissioned, it acted as a domain controller (with the same 2003 R2 install it has today), paired with another server that ran Server 2003 R2 SBS.
A few years ago, we purchased a pair of dedicated DCs (2008 R2) and at this point we decommissioned the 2003 SBS server, and SRV01 was DCPROMOed out of the AD.
Up until very recently, SRV01 ran Exchange 2003; however, we've recently purchased a dedicated server for Exchange 2010 and upgraded (following the Microsoft-recommended upgrade path). Exchange 2003 was then uninstalled, cleanly to the best of my knowledge.
Ever since Exchange was removed from SRV01, I'm finding that after a few days of uptime, when I attempt to log on, pressing CTRL-ALT-DEL just hides the Welcome to Windows Server 2003 banner and never presents the logon dialog. All I see is a moveable mouse pointer and a blank background.
It's a similar story with an admin TS session: the RDP client connects and gives me a blank background, but no logon dialog is presented. The RDP session hangs indefinitely until I give up and close it.
The only way I've been able to gain access to the server is to pull the plug on it. Whilst the server does have a battery backed up RAID 5 controller, I'm unhappy about having to do this, so as a temporary measure, I've created a scheduled job to reboot SRV01 each night.
Not only do I not like the idea of scheduling a reboot of a server like this, but it is also causing problems for users who leave their desktop PCs logged on overnight. Users complain of 'Delayed Write Failures', and a number of users have also started to complain about account lockout problems, as well as not being able to connect to shares on SRV01 until they reboot their desktop PCs.
I've examined event logs on SRV01 and on the DCs looking for clues as to what the problem is, but there really is nothing untoward being logged. How could I begin to investigate this problem when nothing of any relevance is being logged? Is there some additional logging that can be enabled that might give some clues as to what could be causing this problem? Could Performance Monitor help me out here, and if so, what counters would you consider monitoring?
It's worth mentioning that whilst the server is unresponsive via the console and TS, it does still respond to clients connecting to shares without problems for several days. After about a week, though, I start to hear users reporting problems accessing shares, although this seems quite sporadic.
I've also tried leaving the console logged on (and locked). When I notice I can no longer log on via TS, I can unlock the server console without a problem, but the server refuses to reboot or shut down; subsequent attempts to reboot report that a system shutdown is already in progress, and the system then hangs completely.
I've tried playing the waiting game for several hours thinking that a timeout might allow the shutdown to continue, but to no avail.
I'm guessing this is x86 (32-bit). I would be inclined to run the Windows Debugger on the next occurrence and display the amount of memory used, in particular the system kernel memory (paged pool and non-paged pool).
If you have the Windows Debugger copied to a folder, run windbg.exe, and the command to enter is:
!vm
What you may find is that paged or non-paged pool is depleted, and possibly set too low. Out of the box, the settings for kernel memory on Windows 2003 x86 are ridiculously low and easily depleted.
You should also verify that you do not have the /3GB switch set in boot.ini - that just makes the kernel memory depletion issue worse.
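If it is easier to check a copy of boot.ini offline than to eyeball it on the server, a minimal Python sketch along these lines could flag the switch. The function name and the sample file contents are hypothetical and purely illustrative; it assumes switches are separated by whitespace, as they normally are in boot.ini.

```python
# Hypothetical sample only; a real boot.ini will differ.
SAMPLE_BOOT_INI = r"""[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Standard" /noexecute=optout /fastdetect /3GB
"""


def find_3gb_entries(boot_ini_text):
    """Return the boot.ini lines that carry the /3GB switch (case-insensitive)."""
    hits = []
    for line in boot_ini_text.splitlines():
        # Assumption: switches are whitespace-separated tokens on the entry line.
        tokens = line.upper().split()
        if "/3GB" in tokens:
            hits.append(line.strip())
    return hits


if __name__ == "__main__":
    for entry in find_3gb_entries(SAMPLE_BOOT_INI):
        print("Entry with /3GB:", entry)
```

An empty result means no boot entry in the supplied text uses /3GB.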
This may also point to some offending driver that is consuming kernel memory, such as a network driver.
If your pagefile on C:\ is large enough to hold all of physical memory, you can also force a blue screen from the keyboard by enabling the CrashOnCtrlScroll registry setting (see the second link below). The resulting memory dump can be examined in the debugger. Forcing a blue screen is useful if you cannot get the debugger to run at all.
Debugging Tools for Windows
http://msdn.microsoft.com/en-us/windows/hardware/gg463009.aspx
Forcing a System Crash from the Keyboard
http://msdn.microsoft.com/en-us/library/ff545499%28v=vs.85%29.aspx
I'm inclined to go with what @Greg Askew says: it sounds like a classic kernel memory pool exhaustion scenario.
I'd take the route of using the poolmon.exe tool from the Windows Support Tools rather than the debugger. Windows Server has pool tags enabled out-of-the-box and the poolmon tool is fairly easy to use. In another Server Fault question I talk a bit about interpreting the output. I've also had good luck diagnosing these situations (particularly handle leaks, which are similar) using the Performance Analysis of Logs tool, too.
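One way to spot a leaking pool tag is to take two poolmon snapshots some hours apart and see which tags grew the most. The Python sketch below does that comparison. It is only a rough illustration: the row layout (Tag, Type, Allocs, Frees, Diff, Bytes, Per Alloc) is an assumption about the snapshot file, the sample rows and the "Abcd" tag are invented, and tags containing embedded spaces are not handled.

```python
import re

# Assumed row layout of a poolmon snapshot:
#   Tag  Type  Allocs  Frees  Diff  Bytes  Per Alloc
ROW = re.compile(
    r"^\s*(\S{1,4})\s+(Nonp|Paged)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)

# Hypothetical snapshots; the numbers are made up for illustration.
SNAPSHOT_BEFORE = """
 Tag  Type     Allocs            Frees            Diff   Bytes       Per Alloc

 MmSt Paged      1000              900             100    50000        500
 Abcd Nonp       2000             1500             500   100000        200
"""

SNAPSHOT_AFTER = """
 Tag  Type     Allocs            Frees            Diff   Bytes       Per Alloc

 MmSt Paged      1100             1000             100    50000        500
 Abcd Nonp       6000             4500            1500   300000        200
"""


def parse_snapshot(text):
    """Return {(tag, pool_type): bytes_in_use} for each data row found."""
    usage = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            tag, pool_type = match.group(1), match.group(2)
            usage[(tag, pool_type)] = int(match.group(6))
    return usage


def top_growth(before, after, limit=5):
    """Rank tags by byte growth between two snapshots, largest first."""
    old, new = parse_snapshot(before), parse_snapshot(after)
    growth = [(key, new[key] - old.get(key, 0)) for key in new]
    growth = [item for item in growth if item[1] > 0]
    growth.sort(key=lambda item: item[1], reverse=True)
    return growth[:limit]


if __name__ == "__main__":
    for (tag, pool_type), delta in top_growth(SNAPSHOT_BEFORE, SNAPSHOT_AFTER):
        print(f"{tag} ({pool_type}) grew by {delta} bytes")
```

A tag that keeps climbing across several snapshots, especially in non-paged pool, is the one to trace back to its owning driver.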