We have set up an inexpensive physical server with a bunch of 3TB disks to use as a backup staging area before we push to tape. We've installed Windows Server 2012 R2 and set up Storage Spaces/Pools. We back up using Veeam to a faster server running on Fibre Channel, and then use scripts to move backups that are older than x days to our Storage Spaces server.
We had some failures originally, as we found that using Robocopy to move the data by UNC path didn't gracefully close out the SMB connection. We resolved this by adding net use and then net use /delete to the script (and using the mapped drive letter as the Robocopy target). This worked beautifully for the last week or two.
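For anyone wanting to picture the workaround, here is a rough sketch of the map, move, unmap sequence, written in Python rather than as our actual script. The paths, the share name, the drive letter Z: and the 7-day age are all placeholder assumptions, and the Robocopy switches shown (/E, /MOV, /MINAGE) are just one plausible way to move files older than a given number of days, not necessarily what we run.

```python
import subprocess


def build_move_commands(source, share, drive="Z:", min_age_days=7):
    """Return the three commands: map the share, move old files, unmap.

    All argument values are hypothetical placeholders for illustration.
    """
    return [
        # Map the remote share to a drive letter.
        ["net", "use", drive, share],
        # /E: include subdirectories, /MOV: delete source files after copying,
        # /MINAGE:n: skip files newer than n days.
        ["robocopy", source, drive + "\\", "/E", "/MOV",
         f"/MINAGE:{min_age_days}"],
        # Remove the mapping so the SMB connection is closed explicitly.
        ["net", "use", drive, "/delete"],
    ]


def run_move(commands):
    """Run the sequence (Windows only); always try to unmap the drive."""
    map_cmd, copy_cmd, unmap_cmd = commands
    subprocess.run(map_cmd, check=True)
    try:
        result = subprocess.run(copy_cmd)
        # Robocopy exit codes of 8 or higher indicate at least one failure.
        return result.returncode < 8
    finally:
        subprocess.run(unmap_cmd, check=False)
```

Calling `run_move(build_move_commands(r"E:\Backups", r"\\staging\backups"))` would run the three steps in order. The `finally` block is there so the mapping is removed even if Robocopy itself fails.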
This morning, though, the scripts reported failure. Upon investigation I found a series of event ID 51 warnings, followed by event ID 134 (from source ReFS). This looks to me like a physical disk in the storage pool has failed. However, Server Manager showed the virtual disk (or volume; I'm not quite sure what to call it) as 'offline'. Simply bringing it back online worked, and there are no failed physical disks in the storage pool. There are also two hot spares, and neither of these has been swapped in.
I'm curious as to what happened here, and why the volume went offline. I thought the whole point of ReFS and Storage Pools was to provide resilience in the event of these kinds of failures?
EDIT: Adding all relevant logs below.
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="disk" />
<EventID Qualifiers="32772">51</EventID>
<Level>3</Level>
<Task>0</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2014-12-23T22:13:12.704827200Z" />
<EventRecordID>23901</EventRecordID>
<Channel>System</Channel>
<Computer>****</Computer>
<Security />
</System>
<EventData>
<Data>\Device\Harddisk25\DR25</Data>
<Binary>040080000100000000000000330004802D0100006B0400C000000000000000000000000000000000FC8F470200000000FFFFFFFF0100000058000030020000000020101280032040000080003C000000000020AB09E0FFFF783583D201E0FFFF0000000000000000507383D201E0FFFF30C99FC108E0FFFF6B0400C0000000008A00000000027C288D60000008000000000000000000000000000000000000000000000000000000</Binary>
</EventData>
</Event>
An error was detected on device \Device\Harddisk25\DR25 during a paging operation.
FYI: Disk25 is the virtual disk created by Storage Spaces, not one of the physical disks.
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="ReFS" Guid="{036647D2-2FB0-4E32-8349-3F5C19C16E5E}" />
<EventID>134</EventID>
<Version>0</Version>
<Level>2</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2014-12-23T22:13:13.329846900Z" />
<EventRecordID>23902</EventRecordID>
<Correlation />
<Execution ProcessID="4" ThreadID="31267444" />
<Channel>System</Channel>
<Computer>*****</Computer>
<Security UserID="S-1-5-18" />
</System>
<EventData>
<Data Name="VolumeIdLength">2</Data>
<Data Name="VolumeId">D:</Data>
<Data Name="FailureReason">0xc000000e</Data>
</EventData>
</Event>
The file system was unable to write metadata to the media backing volume D:. A write failed with status "A device which does not exist was specified." ReFS will take the volume offline. It may be mounted again automatically.
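As a side note on reading the event above: the FailureReason value 0xc000000e is the NTSTATUS code STATUS_NO_SUCH_DEVICE, which is where the "A device which does not exist was specified." text comes from. A minimal Python sketch for pulling those fields out of an exported event follows. It assumes the XML has been saved or pasted as text; the sample is a trimmed copy of the ReFS event, and the lookup table covers only this one code.

```python
import xml.etree.ElementTree as ET

# Namespace used by Windows event XML.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Deliberately tiny lookup: only the status code seen in this thread.
NTSTATUS_NAMES = {0xC000000E: "STATUS_NO_SUCH_DEVICE"}

# Trimmed copy of the ReFS event 134 shown above.
EVENT_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="ReFS" Guid="{036647D2-2FB0-4E32-8349-3F5C19C16E5E}" />
<EventID>134</EventID>
</System>
<EventData>
<Data Name="VolumeIdLength">2</Data>
<Data Name="VolumeId">D:</Data>
<Data Name="FailureReason">0xc000000e</Data>
</EventData>
</Event>"""


def summarize_event(xml_text):
    """Extract provider, event ID, volume and failure status from one event."""
    root = ET.fromstring(xml_text)
    provider = root.find("e:System/e:Provider", NS).get("Name")
    event_id = int(root.find("e:System/e:EventID", NS).text)
    # Collect the named <Data> elements into a dict.
    data = {d.get("Name"): d.text
            for d in root.findall("e:EventData/e:Data", NS)}
    status = int(data["FailureReason"], 16)
    return {
        "provider": provider,
        "event_id": event_id,
        "volume": data["VolumeId"],
        "status": status,
        "status_name": NTSTATUS_NAMES.get(status, "unknown"),
    }


if __name__ == "__main__":
    print(summarize_event(EVENT_XML))
```

The same function can be pointed at other exported events that carry a FailureReason field, though any status code not in the table will simply come back as "unknown".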
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-StorageSpaces-Driver" Guid="{595F7F52-C90A-4026-A125-8EB5E083F15E}" />
<EventID>304</EventID>
<Version>0</Version>
<Level>3</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2014-12-30T23:43:40.519688500Z" />
<EventRecordID>21</EventRecordID>
<Correlation />
<Execution ProcessID="4" ThreadID="3723912" />
<Channel>Microsoft-Windows-StorageSpaces-Driver/Operational</Channel>
<Computer>****</Computer>
<Security UserID="S-1-5-18" />
</System>
<EventData>
<Data Name="Id">{DE94C7EF-6A25-11E4-80B7-647002019326}</Data>
</EventData>
</Event>
The virtual disk {de94c7ef-6a25-11e4-80b7-647002019326} is in a degraded state. This can happen when a physical disk hosting the virtual disk fails, is disconnected, or experiences a write error.
Windows will attempt to repair the virtual disk. No action is needed at this time.
Assuming you are definitely using a fault-tolerant layout such as parity or mirror, that error should not be possible. I was able to reproduce it in a striping setup with a disk that I know is bad. So either you're set up for striping, or you've found a bug. I would involve Microsoft at this point, if you haven't already.
After a lengthy email discussion with a Microsoft support engineer, we ended up installing the following rollup update:
http://support.microsoft.com/kb/2887595
This includes an update which seems to specifically target this issue:
https://support.microsoft.com/en-us/kb/2897150
Since installing the rollup update, the volume has consistently remained online without any issues.