I have a Dell PowerEdge 2850 running Windows Server 2003. It is the primary file server for one of my clients. I have another server also running Windows Server 2003 that acts as the core media server for Symantec Backup Exec 12.
I recently upgraded from Backup Exec 11d to 12. This upgrade was necessary because we also just upgraded from Exchange 2003 to Exchange 2007. After the upgrade I had to push-install the new version 12 Backup Exec Remote Agents to each of the servers I am backing up (about 6 total). 5 of my servers are doing just fine, faithfully completing backups every night. My file server routinely crashes.
Observations:
- When the server crashes, it does not blue screen, it just locks up completely. Even the mouse is unresponsive. If you leave the server locked up long enough, it will eventually reboot itself and hang on the Windows splash screen.
- There is absolutely zero useful Event Viewer evidence of a problem. The logs go from routine logging to an Unexplained Shutdown Event the next morning when I have to hard reset the server to get it to boot.
- 90% of the time the server does not boot cleanly; it hangs on the Windows splash screen. I don't have any light to shed here. When the server hangs all I can do is hard reset it and try again. Even after a successful boot and chkdsk /r operation, if you reboot the machine, you have a 90% chance it won't come back up again cleanly.
The back story:
This server started crashing during nightly backups about a month ago. I tried everything I could think of to troubleshoot the problem and eventually had to give up because I could not keep coming to the office at 4 AM to try to get the server back online. One Friday I got lucky and the server stayed up for its entire full backup. I took this opportunity to restore the full backup to a temporary server I set up and switched all my users to the temporary. Then I reloaded the ailing file server.
I kept all my users on the temporary file server for about 3 weeks. I installed the same Backup Exec Remote Agent and Trend Micro A/V client on the temporary server that I was using on the regular file server. During this time, I had absolutely no problems backing up the temporary server.
I tested the reloaded file server extensively. I rebooted the server once an hour every day for 3 weeks trying to make it fail. It never did. I felt confident that the reload was the answer to my problems. I moved all of the data from the temporary server back to the regular server. I got 3 nightly backups out of it before it locked up again and started the familiar failure to boot cleanly behavior.
This weekend I decided to monitor the file server through the entire backup job. I RDPd into the file server and also into the server running Backup Exec. On the file server I opened the Task Manager so I could view the processes and watch CPU and memory usage. Everything was running smoothly for about 60GB worth of backup. Then I noticed that the byte count of the backup job in Backup Exec had stopped progressing. I looked back over at my RDP session into the file server, and I was getting real time updates about CPU and memory usage still - both nearly 0%, which is unusual. Backups usually hover around 40% usage for the duration of the backup job.
Let me reiterate this point: The screen was refreshing and I was getting real time Task Manager updates - until I clicked on the Start menu. The screen went black and the server locked up. In truth, I think the server had already locked up, the video card just hadn't figured it out yet.
I went back into my bag of tricks: driving to the office and hard resetting the server over and over again when it hangs at the Windows splash screen. I did this for 2 hours without getting a successful boot. I started panicking because I did not have a decent backup to use to get everything back onto the working temporary file server.
Once I had exhausted everything I knew to do, I took a deep breath, booted to the Windows Server 2003 CD and performed a repair installation of Windows. The server came back up fine, with all of my data intact. I can now reboot the server at will and it will come back up cleanly. The problem is that I'm afraid as soon as I try to back that data up again I will be back at square one.
So let me sum things up:
Here is what I've done so far to troubleshoot this server:
- Deleted and recreated the RAID 5 sets. Initialized the drives. Reloaded the server with a fresh Server 2003 install.
- Confirmed with Dell that I have installed the latest, Dell approved BIOS and NIC drivers.
- Uninstalled / reinstalled the Backup Exec Remote Agent.
- Uninstalled the Trend Micro A/V client.
- Configured the server not to reboot itself after a blue screen so I can see any stop error. I used to think the server was blue screening, but since I enabled this setting I now know that the server just completely locks up.
- Run chkdsk /r from the Windows Recovery Console. Several errors were found and corrected, but did not help my problem.
Help confirm or deny the following assumptions:
- There are two problems at work here. Why the server is locking up in the first place, and why the server won't boot cleanly after a lockup.
- This is ultimately a software problem. The server works fine and can be rebooted cleanly all day long - until the first lockup - following a fresh OS load or even a Repair installation.
- This is not a problem with Backup Exec in general. All of my other servers back up just fine. For the record, all of the other servers run Server 2003, and some of them house more data than the file server in question here.
Any help is appreciated. The irony is almost too much to bear. Backing up my data is what is jeopardizing it.
The hanging at the Windows splash screen makes me pretty suspicious of your RAID controller firmware or drivers. Is it a Dell PERC? Are you current on firmware and drivers?
Is there anything special about the last few files and directories that are being successfully backed-up (i.e. something uncharacteristic of the files up to that point in the backup)?
You could turn on debug logging in the Backup Exec remote agent on the file server, though if the filesystem or disk driver is falling down and crashing you probably won't get a debug log written. Stop the remote agent service and start it with the "-debug" parameter specified in the "Start parameters" text-box on the service properties (assuming you're using the "Services" MMC snap-in to do this starting / stopping). If you'd prefer the "-debug" setting to be permanent, add it to the ImagePath value in "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BackupExecAgentAccelerator".
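If you go the registry route, it may help to see what the new ImagePath would look like before you edit anything. Here is a minimal Python sketch of that, not anything that ships with Backup Exec. The helper names `with_debug_flag` and `read_image_path` are made up for illustration, and it assumes a Python 3 interpreter is available on the file server. It only reads the value; it never writes to the registry.

```python
# Key name taken from the paragraph above (relative to HKEY_LOCAL_MACHINE).
SERVICE_KEY = r"SYSTEM\CurrentControlSet\Services\BackupExecAgentAccelerator"


def with_debug_flag(image_path, flag="-debug"):
    """Return image_path with flag appended, unless it is already present."""
    if flag in image_path.split():
        return image_path
    return image_path.rstrip() + " " + flag


def read_image_path():
    """Read the current ImagePath value (Windows only, read-only)."""
    import winreg  # imported here so with_debug_flag works on any platform

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SERVICE_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, "ImagePath")
    return value
```

On the file server, `print(with_debug_flag(read_image_path()))` would show the value to paste into the registry editor. Export the key first so you can put it back.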
Posted November 2011 - Try this:
1) Right-click the file C:\program files\symantec\SYMEVENT.SYS, choose Properties > Version (tab), and note the version info.
2) Download the SymEvent installer / updater: ftp://ftp.symantec.com/public/english_us_canada/symevnt/Sevinst.exe
3) Update SymEvent, as per the following article: http://www.symantec.com/business/support/index?page=content&id=TECH98521
Excerpt:
To update the Symevent files on Windows 2003/XP/2000/NT (Including Server versions):
A. Download Sevinst.exe from the Symantec FTP site. Save the file to a folder on the hard drive.
B. Open a command prompt, and change to the folder where you downloaded the Sevinst.exe file.
C. Depending on the program version, do one of the following:
On computers that run Symantec AntiVirus 9.x or later, type the following command:
sevinst.exe /log SAVCE
On computers that run Symantec AntiVirus 8.x or earlier, type the following command:
sevinst.exe /log NAVNT
D. Restart the computer
The only things that jump to mind that you didn't mention testing are RAM and system load levels.
RAM should be easy, but I'm not sure if there's anything about backup that would cause use of a bad area that wouldn't be triggered in regular use - it just doesn't fit.
The other thing is load levels on the hardware. When backing up, it's going to be moving a lot of information both from disk and through the NIC.
You already have one suggestion of checking the RAID controller; I'd add to that checking it by doing some high-volume transfers attempting to simulate the load of the backup. Also, does it die at the start of the backup or after some period of sustained throughput?
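One rough way to simulate that sustained read load without Backup Exec in the picture is to read every file on the volume yourself. The sketch below is only an illustration under assumptions: the function name `read_tree` and the 1 MiB chunk size are my own choices, and it assumes Python 3 is available. It writes each path to a log and syncs the log before reading the file, so if the box locks up, the last line of the log should name the file it was on. Keep the log on a different disk from the one being read.

```python
import os

CHUNK = 1024 * 1024  # read in 1 MiB pieces, roughly like a streaming backup


def read_tree(root, log_path):
    """Read every file under root, logging each path before it is read.

    Returns (files_read, bytes_read). Unreadable files are noted in the
    log and skipped rather than aborting the whole run.
    """
    files_read = 0
    bytes_read = 0
    with open(log_path, "a", encoding="utf-8", errors="backslashreplace") as log:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Log first and force it to disk, so the entry survives a hang.
                log.write(path + "\n")
                log.flush()
                os.fsync(log.fileno())
                try:
                    with open(path, "rb") as fh:
                        while True:
                            chunk = fh.read(CHUNK)
                            if not chunk:
                                break
                            bytes_read += len(chunk)
                except OSError as exc:
                    log.write("  skipped: %s\n" % exc)
                    continue
                files_read += 1
    return files_read, bytes_read
```

Running it locally on the file server exercises only the disks and the RAID controller. Running it from another machine against a share on the file server pulls the same data through the NIC as well, which gives you two separate data points.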
For the NIC load, I'd try a few things - another NIC, forcing it down to 100MBit, pushing large amounts of data through it (again, to simulate backup load).
The biggest headaches with testing those may end up being in testing them independently. I'd start with the NIC(s) as the easiest item to test. If you can throw one or more additional drives into the system independent of the RAID controller, that may give you a good way to isolate whether the RAID controller itself is the source of the problem - copy everything to the non-RAID drives and see if you can back those up cleanly.
For the continuing/repeating lockups after the first - does completely removing power from the system resolve the issue? Remember that a powered-down server isn't fully off - in particular the network interface may well remain live for wake-on-LAN. If some internal state in the hardware is incorrect, just restarting may not actually clear it.
I've had a similar problem with Backup Exec (albeit a much older version, 10). I installed the latest update and my server started BSODing randomly at or a little after the scheduled backup. I never determined the exact cause of the problem, but it seemed to be related to Trend Micro too, and all together it caused memory protection faults.
My solution was to revert to the older Backup Exec version and update Trend Micro (if you use OfficeScan, there's a new major release that came out recently).
I would suspect a driver issue. I had a similar experience: a legacy application used an ISDN modem. I moved it to a new computer and downloaded the latest modem drivers.
The ISDN connection kept dropping and I thought it was the modem or the line... but after all that searching I replaced the newest drivers with ones 6(!) years older, and since then it has worked without problems. So the latest drivers aren't always the best - don't fix it if it ain't broken.
Good luck!
This may be an open file issue, and the open file may be getting corrupted. Try backing everything up EXCEPT the Windows directory (and everything below it). See if backing up just data freezes the sucker. Also, if you have the disk space, do a disk-to-disk backup with NTBackup, then back up that file to tape. Make a current rescue disk. Also manually back up the AD files.
If it backs up data without hanging, it's an open system file issue. If it still blows up, unless you run Exchange or SQL Server, I would suspect drivers or possibly hardware.