We are currently working on implementing a DR strategy for a Windows file server. We have ruled out Storage Replica because it is still a preview feature, and Failover Clustering is designed for high availability, not DR. DFSR also has deficiencies in replicating open/locked files, making it a poor fit for the task.
SAN to SAN replication of the file server VM seems to be the best method to me, though I've been cautioned against it because the replication is a raw copy that is not coalesced at a higher level, which could cause inconsistencies in the filesystem or corrupted files. However, this is true of any server replicated this way, and it is the method being used for the other servers in our DR plan. VSS/Previous Versions could always be used to restore any corrupted files as well.
Do the benefits of SAN replication outweigh the risk that files may be corrupted? Or is there a better method of DR for a file server? Perhaps there's a product that performs a higher-level replication/snapshot that minimizes logical inconsistencies in the data?
Note: the cluster is running vSphere 5.5
SAN to SAN replication is your best bet for bringing the file server back online as quickly as possible, with as little loss as possible, after declaring a disaster. Note that this type of DR protection doesn't cover the same things as local backups: you can't use a replicated SAN volume to, say, undelete a file from last month.
Corrupted files are not a danger of SAN to SAN replication unless the file server at the main site corrupts them first. Every SAN that replicates block-based storage (LUNs) has some mechanism to prevent corruption and guarantee consistency. It's a trickier problem than most people realize, because writes are often applied to disk out of order for optimization reasons, even without replication in the picture. This is why the write cache on most storage has some sort of power-failure safety net (a battery or a UPS): with the latest writes held only in cache, the underlying disk is likely inconsistent at any given instant. Normally that's fine, but if you lose power, the array must ensure that every write it has acknowledged reaches disk, so the volume is consistent when it comes back up.
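To see why write ordering matters, here is a minimal, purely illustrative Python sketch (this is not SAN code; the file names and the journal-then-commit pattern are just stand-ins) of the rule a consistent array follows: never publish a reference to data that hasn't reached stable storage yet.

```python
import os

def fsync_write(path: str, data: bytes) -> None:
    """Write data and force it to stable storage before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # durability barrier: data is on disk when this returns

# Journal-then-commit: flush the new data block first...
fsync_write("block.new", b"updated file contents")

# ...and only then flush the pointer that makes it live. If power dies
# between these two calls, the old pointer still references intact data,
# so a "crash" leaves the state consistent rather than corrupt.
fsync_write("current.ptr", b"block.new")
```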
Replication handles this differently depending on how you're replicating, but all of these mechanisms provide "crash consistency": the disk is in the same state it would be in if you had abruptly cut power to the server. It takes a little work to get filesystems and databases running from a crash-consistent copy, but it's always doable. If you want something more (that "higher level" you mention in the question), you need to integrate your replication with your applications. This normally means pausing writes at the application, waiting until everything has been destaged to storage, then kicking off a consistency point for replication. This is called "application consistency". It generally delivers a slightly older recovery point, but a slightly lower recovery time, than crash consistency.
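On vSphere, for example, you can ask VMware Tools to quiesce the guest when a snapshot is taken (which invokes VSS inside a Windows guest), giving you an application-consistent point to replicate from. Here is a minimal pyVmomi sketch; the vCenter address, credentials, and VM name are placeholders:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; validate certs in production
si = SmartConnect(host="vcenter.example.com",           # placeholder
                  user="administrator@vsphere.local",   # placeholder
                  pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "FILESRV01")  # placeholder name
    view.Destroy()

    # quiesce=True asks VMware Tools to run VSS inside the Windows guest, so
    # the snapshot is application-consistent rather than merely crash-consistent.
    WaitForTask(vm.CreateSnapshot_Task(
        name="pre-replication-consistency-point",
        description="App-consistent point before array replication",
        memory=False,
        quiesce=True))
finally:
    Disconnect(si)
```

Array-based and product-based replication do the equivalent internally; the only real variable is where in the stack the quiesce happens.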
You need to be prepared for multiple levels and kinds of disasters, including a total malicious breach (hackers) and a total loss of all hardware (epic weather). This will require that you offload some data via sneaker-net methods (read: external storage such as tapes or hard drives), some form of write-once solution, or an online backup service (expensive).
Disaster recovery is a different beast than simple replication. Before you decide anything, you need to answer this: "How much data can I lose?" Don't think in terms of gigabytes; think in terms of TIME. Can I lose four hours' worth of data? A full day's? The method you choose will depend on your answer. We all want a zero-loss solution, but that is generally not a feasible investment for the risk being mitigated. You'll also need to keep your monthly/annual backups around for a good while, because some disasters (users deleting crap they need) can go unnoticed for an extremely long time.
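A quick back-of-the-envelope way to frame that question; every number below is an assumption to replace with your own:

```python
# Worst-case data loss is roughly the replication/backup interval plus the
# time it takes to ship one consistency point offsite.
replication_interval_min = 15   # assumed: how often a consistency point is taken
transfer_time_min = 5           # assumed: time to destage and ship the delta

print(f"Worst-case RPO: ~{replication_interval_min + transfer_time_min} minutes")

# Long-tail disasters (a user deleted something months ago) are bounded by
# retention, not by replication frequency:
monthlies_kept, annuals_kept = 12, 7
print(f"Oldest restorable point: ~{annuals_kept} years "
      f"({monthlies_kept + annuals_kept} archived copies)")
```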
SAN to SAN replication is the fastest way to recover from a site disaster, but I have lived through a SAN corruption caused by a firmware bug, and it can get ugly.
You forgot to mention which hypervisor you use, but if you're on ESX, I'd suggest the vReplicator product alongside the SAN replication. It replicates every 15 minutes by default, and your remote VM sits in a ready-to-boot state. vReplicator needs a vCenter license and a physical host to hold the replicated VM (which can cost less than another SAN, but as @IceMage said, it depends on how much time you can afford to lose).
Veeam and other backup products that rely on snapshots run up against VMware's best practice of not taking them that frequently. Frequent snapshots can bring a server to its knees and leave it nearly unresponsive. Imagine 50 servers each snapshotted every 15 minutes: that's 4,800 snapshots a day. Hard to manage, and a lot of storage. A CDP technology like Zerto solves this for VMware and Hyper-V.
I'd suggest using Veeam for low-RPO replication of your file server's virtual machines. It's VSS-aware and can replicate locally as well as to WAN and cloud targets, with multiple retention points.
Set up rolling 15-minute snaps and ship hourlies or dailies offsite. It's pretty robust for the cost.
If you have a remote hypervisor, you can configure a partial run-book that brings up a replicated VM with the appropriate network and IP settings.
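As a sketch of what one such run-book step could look like with pyVmomi (the vCenter address, credentials, VM name "FILESRV01-replica", and port group "DR-PortGroup" are all placeholders): re-point the replica's NIC at the DR network, then power it on.

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; validate certs in production
si = SmartConnect(host="dr-vcenter.example.com",         # placeholder
                  user="administrator@vsphere.local",    # placeholder
                  pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine, vim.Network], True)
    objs = list(view.view)
    view.Destroy()
    vm = next(o for o in objs if isinstance(o, vim.VirtualMachine)
              and o.name == "FILESRV01-replica")          # placeholder name
    dr_net = next(o for o in objs if isinstance(o, vim.Network)
                  and o.name == "DR-PortGroup")           # placeholder name

    # Re-wire the first NIC to the DR port group before power-on.
    nic = next(d for d in vm.config.hardware.device
               if isinstance(d, vim.vm.device.VirtualEthernetCard))
    nic.backing = vim.vm.device.VirtualEthernetCard.NetworkBackingInfo(
        network=dr_net, deviceName="DR-PortGroup")
    change = vim.vm.device.VirtualDeviceSpec(
        operation=vim.vm.device.VirtualDeviceSpec.Operation.edit, device=nic)
    WaitForTask(vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change])))

    WaitForTask(vm.PowerOnVM_Task())
    # Guest IP changes would follow via guest customization or an in-guest script.
finally:
    Disconnect(si)
```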