We have a single standalone 2008 R2 server, running on ESXi 5.5, that provides file shares to clients.
Last night it mysteriously stopped allowing inbound share access. I'm assuming it was blocking or dropping SMB, but what concerns me is that there is no record of what was happening.
As the calls started to come in, I was able to ping and RDP to the server successfully. At the same time, attempting to open \\SERVER\Share as a domain administrator simply hung before coming back to say that the share was unavailable.
Once I'd connected over RDP to the server, everything appeared normal: RAM and CPU use were low, and all expected services were running. Event Viewer showed literally nothing of use: just the usual informational entries and a few errors where my RDP session had attempted to map unknown printers.
The security log, where I expected to see a load of 'Windows Filtering Platform dropped packet' entries (something we've seen before), was clear: just the usual logon events and audit logs.
Connections outbound to other shares were fine. In short, I couldn't see anything to fix.
Out of desperation I tried to restart the Server service, at which point the whole box froze and I had to press the (virtual) power button until it powered off. It came back up fine (thank god) but I'm confused.
My client is asking the obvious questions and so far I have, embarrassingly, not been able to provide an answer.
Any thoughts? I have little hope of going back in time and locating a root cause for this issue, but is there anything else I could be doing as far as logging or future tests for these kinds of issues?
So you state you had to force the server to power off manually, and a hard power-off at that. And all signs pointed to the system running smoothly when you RDP’ed in, correct? Yet the system still choked overnight for no apparent reason.
First, you need to tell the client that sometimes there are no solid answers. Maybe there was a power surge? Maybe it was just a hiccup. I have had servers that ran for months without a reboot and then suddenly choked. Why? No idea. A reboot clears things up. And sometimes the logs might help.
That said, my best guess based on the info you are providing is that there is some hardware-level problem at play. Could be RAM, hard drive, related hardware or something else. Heck, even the CPU itself.
The best thing to do in a case like this is to schedule an off-hours maintenance window to run a thorough hardware check of the system itself. If you don’t do that, you run the risk that this issue comes up again or that the hardware fails completely.
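On the logging side of the question, one cheap option is to probe the SMB port from another machine and keep a timestamped record, so there is at least a timeline if this happens again. The following is only a minimal sketch: the function names, the log file name and the 60-second interval are made up for illustration, and SMB is assumed to be listening on TCP 445. A successful TCP connect only shows that the port accepts connections; it does not prove that shares can actually be opened.

```python
import socket
import time
from datetime import datetime, timezone


def probe_tcp(host, port=445, timeout=5.0):
    """Try one TCP connection and return (ok, detail)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts and
        # name-resolution failures) if the connection cannot be made.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
            return True, f"connected in {elapsed:.3f}s"
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"


def format_result(host, port, ok, detail, when=None):
    """Build one log line with a UTC timestamp."""
    when = when or datetime.now(timezone.utc)
    status = "OK" if ok else "FAIL"
    return f"{when.isoformat()} {host}:{port} {status} {detail}"


def monitor(host, port=445, interval=60, logfile="smb_probe.log"):
    """Probe forever, appending one line per attempt (hypothetical defaults)."""
    while True:
        ok, detail = probe_tcp(host, port)
        with open(logfile, "a", encoding="utf-8") as fh:
            fh.write(format_result(host, port, ok, detail) + "\n")
        time.sleep(interval)
```

Calling monitor("SERVER") from a workstation or another server would leave a log showing when connections started to fail or slow down, which can then be lined up against the event logs on the server and the host.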
This was eventually tracked down to the virtual Ethernet adapter we were using. The VM was running on an E1000 adapter. I swapped this over to the VMXNET3 adapter and the issue went away.
FWIW we were on ESXi 5.1
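For anyone wanting to check their own VMs, here is a rough sketch of how the adapter type could be read out of a VM's .vmx file. It assumes the file contains lines of the form ethernet0.virtualDev = "e1000" and that a paravirtual adapter appears as "vmxnet3". The function names are made up for illustration, and a NIC with no virtualDev line is simply not reported, since its default type depends on the guest OS setting.

```python
import re

# Matches lines such as: ethernet0.virtualDev = "e1000"
_NIC_LINE = re.compile(
    r'^\s*(ethernet\d+)\.virtualDev\s*=\s*"([^"]*)"\s*$', re.IGNORECASE
)


def nic_types(vmx_text):
    """Map each ethernetN device found in .vmx text to its virtualDev value."""
    nics = {}
    for line in vmx_text.splitlines():
        match = _NIC_LINE.match(line)
        if match:
            nics[match.group(1).lower()] = match.group(2).lower()
    return nics


def emulated_e1000_nics(vmx_text):
    """Return the NICs whose type starts with 'e1000' (e1000 or e1000e)."""
    return sorted(
        name for name, dev in nic_types(vmx_text).items()
        if dev.startswith("e1000")
    )


# Made-up sample content, not taken from the server in question.
sample = '''
ethernet0.present = "TRUE"
ethernet0.virtualDev = "e1000"
ethernet1.virtualDev = "vmxnet3"
'''
print(emulated_e1000_nics(sample))
```

With the sample text above, only ethernet0 is reported, because ethernet1 is already on vmxnet3.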