I have a Windows Server 2003 box on my small network. In it is a Promise FastTrak RAID controller and two parallel ATA Western Digital drives in a RAID-1 (mirror) configuration. When I set this up, I expected it to be a reliable storage system, and I expected the RAID controller to tell me when there was an issue so I could react.
However, it is failing on both counts now. When I copy files from that server, I find that large files have been corrupted. For instance, I was recently copying the XP SP3 network install (~320MB) to another PC, and the extraction failed. I thought that odd, since I'd used that executable before. So I copied it from the network twice more, and using FileAlyzer I discovered that the MD5 and SHA1 hashes of the three copies all differed. I ran similar tests from other PCs on my network and could replicate the fault. Worse, the RAID BIOS never complained about anything being wrong, which leads me to believe the controller itself may be bad. (Note: I don't think it's the network, since other PCs can reliably copy files to each other.)
But my question is: What kind of tools exist for Windows to "certify" a file system is behaving reliably, RAID or otherwise?
For instance, I purchased a tool called GoldMemory to run an exhaustive memory test when I build a new PC. I won't trust a new PC until it survives 24 hours under GoldMemory with no memory errors. I also purchased Steve Gibson's SpinRite to test individual ATA disks.
Is there a tool I can run within Windows to test an NTFS file system, whether RAID-based or not, that will repeatedly read and write and check for corruption?
I can't trust my current server as is, and if I swap out components to try to repair, or else build a new system, I'd like to be reasonably sure my file systems are operating reliably before betting the farm. While I'd like to trust that a brand-name RAID controller and decent hard drives would be reliable, I now need to take a Horatio Caine approach: "Trust, but verify."
Thanks for your help! :-)
UPDATE:
So, I ran some local tests on the server (within cygwin) to rule out the network as the problem. This should give you an idea of what I am contending with. The problem happens most of the time with BIG files. (The one below is 462MB.)
$ md5sum VMware-workstation-6.5.2-156735.exe
7bf6145eb7d3e4fbcc945d87017fb6bd *VMware-workstation-6.5.2-156735.exe
$ for (( c=1; c<=50; c++ )); do md5sum VMware-workstation-6.5.2-156735.exe; done
545c2f8e9363823af3aa703a1cbd35e3 *VMware-workstation-6.5.2-156735.exe
b47d4aa75aae27264cfd6396fbfe646a *VMware-workstation-6.5.2-156735.exe
b47d4aa75aae27264cfd6396fbfe646a *VMware-workstation-6.5.2-156735.exe
... etc... (repeats)
$ for (( c=1; c<=50; c++ )); do md5sum VMware-workstation-6.5.2-156735.exe; done
9d2fbb3fa46194f6915d6328f0881a24 *VMware-workstation-6.5.2-156735.exe
9d2fbb3fa46194f6915d6328f0881a24 *VMware-workstation-6.5.2-156735.exe
... etc... (repeats)
$ for (( c=1; c<=50; c++ )); do md5sum VMware-workstation-6.5.2-156735.exe; done
512181c3838e91a02a92280462e2f4c3 *VMware-workstation-6.5.2-156735.exe
512181c3838e91a02a92280462e2f4c3 *VMware-workstation-6.5.2-156735.exe
...(repeats a dozen or so times, then changes!)
7a84da59a83f203506244e23507bb4df *VMware-workstation-6.5.2-156735.exe
7a84da59a83f203506244e23507bb4df *VMware-workstation-6.5.2-156735.exe
... aargh!
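For anyone who wants to reproduce this, the loops above can be condensed into a rough sketch like the one below. The function name count_distinct_hashes is made up for illustration, and it assumes md5sum, cut, sort and wc are on the PATH (as they are under cygwin). Keep in mind that repeated reads of the same file may be served from the OS cache, so a result of 1 does not prove the disk path is healthy.

```shell
# Hypothetical helper: hash one file repeatedly and print how many distinct
# MD5 values were seen. Anything other than 1 means the reads are inconsistent.
# The body runs in a subshell, so its variables stay private to the function.
count_distinct_hashes() (
    file=$1
    runs=${2:-50}          # number of passes, defaults to 50
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Reading via stdin keeps the file name out of md5sum's output
        md5sum < "$file" | cut -d' ' -f1
        i=$((i + 1))
    done | sort -u | wc -l
)

# Example (not run here):
#   count_distinct_hashes VMware-workstation-6.5.2-156735.exe 50
```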
It should be easy to set up a shell script that repeatedly copies a file on the server and recalculates the checksum of each copy. Once it has filled your server, you check all the checksums by hand.
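A minimal sketch of such a script might look like the following. The name copy_and_verify and the copy.N file naming are my own inventions for illustration, and it does the comparison for you rather than leaving it to be done by hand. It also trusts the first hash of the source file as the reference; on a box that is already misreading data, it would be safer to take the reference hash from a known-good machine.

```shell
# Hypothetical helper: copy SRC into DESTDIR COUNT times and compare each
# copy's MD5 with the source's. Prints one line per mismatch plus a summary,
# and returns non-zero if any copy differs.
# The body runs in a subshell, so its variables stay private to the function.
copy_and_verify() (
    src=$1
    destdir=$2
    count=${3:-10}         # number of copies, defaults to 10
    # Reference hash (assumption: this first read of the source is good)
    want=$(md5sum < "$src" | cut -d' ' -f1)
    bad=0
    i=1
    while [ "$i" -le "$count" ]; do
        copy="$destdir/copy.$i"
        cp "$src" "$copy"
        got=$(md5sum < "$copy" | cut -d' ' -f1)
        if [ "$got" != "$want" ]; then
            echo "MISMATCH $copy: $got (expected $want)"
            bad=$((bad + 1))
        fi
        i=$((i + 1))
    done
    echo "$bad of $count copies differ from the source"
    [ "$bad" -eq 0 ]
)

# Example (not run here):
#   copy_and_verify bigfile.exe /cygdrive/d/stress 100
```

Raising the count until the volume is nearly full exercises more of the disk surface, and a source file larger than the machine's RAM makes it less likely that the reads are answered from cache.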
My experience is that RAID controllers that have Promise written on the outside are broken on the inside. Get rid of it. Some Promise controllers are not even hardware RAID, just driver-based software RAID. Try Areca or similar.
If you plan for RAID, put a price tag on your data. Then put a price tag on not being able to work for a few days. Then check the prices of good RAID controllers.
You don't need to cough up money for RAM testing tools, because memtest86+ rules and is free. To test filesystem integrity you could use afick. It works fine for me (though I haven't used it much on Windows).
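The general idea behind that kind of integrity checker can be sketched with plain md5sum. This is not afick itself and says nothing about how afick works internally; make_manifest and check_manifest are hypothetical names, and the sketch assumes GNU md5sum (for -c and --quiet) and find, as shipped with cygwin.

```shell
# Hypothetical helpers, not afick: record a checksum for every file under a
# directory, then re-check the tree against that record later.
# Each body runs in a subshell, so the cd does not affect the caller.

make_manifest() (
    # $1 = directory to scan
    # $2 = manifest file to write (use an absolute path outside $1)
    cd "$1" && find . -type f -exec md5sum {} + > "$2"
)

check_manifest() (
    # $1 = directory to check
    # $2 = manifest written earlier by make_manifest (absolute path)
    # --quiet suppresses the OK lines, so only failures are printed;
    # the exit status is non-zero if any file fails verification.
    cd "$1" && md5sum -c --quiet "$2"
)

# Example (not run here):
#   make_manifest  /cygdrive/d/data /cygdrive/c/manifests/data.md5
#   check_manifest /cygdrive/d/data /cygdrive/c/manifests/data.md5
```

Keeping the manifest on a different disk from the data means that corruption on the suspect volume cannot silently change both the files and their recorded checksums.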
What is the make of your drives? A priori, I'd suspect the Promise card. They have a very long and painful history of absolutely shitty products, with abysmal performance, data corruption, buggy drivers, and various combinations of all of these.
Are you sure it's the RAID controller? I've experienced similar problems that had to do with the network drivers / card failing.
You say that other PCs can copy files to each other, but that doesn't mean the server network card (or driver) isn't fritzy.
Chkdsk has always been my first-line tool for fixing NTFS. Comes in the box, and works like a charm. Full disclosure: I very seldom have the need to verify filesystems, so I've never needed another tool.
Out of the 100 or so servers I manage, I've needed to use it once, and that was because of problems caused by a bad data sync on the SAN, not the RAID card. I'm with everyone else saying ditch the Promise card and get something better.
Try Robocopy to copy large files.