We need to regularly transfer large (60GB) Hyper-V virtual machine images around our WAN (UK to USA) over 50Mbps leased lines. We also use DFS-R between the sites. Historically, I've used 7-zip to zip up the virtual machine (into smaller 100MB chunks) and then dropped the files into a DFS-R transfer folder. When the backlog clears, we unzip at the other end.
I wonder if I'm wasting my time and might as well drop the entire VM (mainly VHDX files) into the transfer folder and let DFS-R compress it during the transfer.
So the question is - how efficient is the DFS-R compression algorithm compared to 7-zip's native 7z format? 7-zip packs the image down to about 20GB, roughly a two-thirds saving.
I get the feeling that the extra time to pack and unpack outweighs any possible higher compression ratio of the 7-zip algorithm. That said, transferring 100MB chunks feels "better" than one big 50GB VHDX file.
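For rough context, here's my back-of-envelope for raw line time only (assuming the transfer gets the full 50Mbps and ignoring protocol overhead and pack/unpack time):

```python
# Back-of-envelope line time only, ignoring pack/unpack time and protocol overhead.
def transfer_hours(size_gb, link_mbps=50):
    return size_gb * 8 * 1000 / link_mbps / 3600  # GB -> megabits, then seconds -> hours

print(f"raw 60 GB image:   {transfer_hours(60):.1f} h")  # ~2.7 h
print(f"~20 GB 7z archive: {transfer_hours(20):.1f} h")  # ~0.9 h
```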
DFS-R uses something called Remote Differential Compression (RDC).
Instead of comparing and transferring an entire file, the algorithm compares the signatures of sequential chunks of data between the source and the target replica. This way, only the differing chunks of data need to be transferred across the wire in order to "reconstruct" the file at the target location.
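As a loose illustration only (fixed-size chunks and SHA-1 signatures are a simplification; the real RDC derives chunk boundaries from the content itself and uses recursive signatures), the idea looks roughly like this:

```python
import hashlib

CHUNK = 64 * 1024  # illustrative fixed chunk size; real RDC picks boundaries from the content

def signatures(data, size=CHUNK):
    # Per-chunk signatures, computed locally on each side
    return [hashlib.sha1(data[i:i + size]).digest() for i in range(0, len(data), size)]

def chunks_to_send(source, target):
    # Chunks of the source file whose signature isn't already present at the target
    have = set(signatures(target))
    return [i for i, sig in enumerate(signatures(source)) if sig not in have]

# Example: a 10 MB file where 128 KB in the middle was modified in place
old = bytes(10 * 1024 * 1024)
new = old[:4_000_000] + b"\xff" * (128 * 1024) + old[4_000_000 + 128 * 1024:]
print(chunks_to_send(new, old))  # only a few chunk indices -- the rest never crosses the wire
```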
As such, RDC is not really comparable to the compression algorithms used in 7-zip. Although they use similar techniques (building signature dictionaries over ranges of data), the 7-zip algorithm is designed to rearrange bytes into a lossless container format where all data is "squeezed" together, whereas RDC's purpose is to identify differences between similar files or file versions in order to minimize the volume of data transferred to keep the replicas in sync.
If you already have similar VHDX files at the target location, there's no need to split the file into 100MB chunks. Just be sure to always use the same compression algorithm(s) when zipping the images.
This behavior (comparing similar files, not just distinct versions of the same file, and extracting chunks from them) is known as "cross-file RDC". The publicly available documentation is pretty sparse, but the AskDS blog team has a short but pretty good clarification in this Q&A post.
As Mathias already noted, DFS-R employs the "remote differential compression" algorithm, similar to rsync's, to only transmit the changed/appended portions of a file already present on the remote side. Additionally, the data is compressed before transfer using the XPRESS compression algorithm (reference: Technet blog) and has been since the very first appearance of DFS-R in Server 2003 R2. I could not find any details on the actual variant of XPRESS used, but since the compression has to happen on the fly, it might be LZNT1 (basically LZ77 with reduced complexity), as this is what NTFS uses for the very same purpose.
If you want to monitor the compression ratios, consider enabling DFS-R debug logging and evaluating the log files.
The compression ratio for any of the XPRESS algorithms is likely to be lower (probably even by a factor as large as 2) than what you get with 7zip, which uses algorithms optimized for file size reduction, not CPU usage reduction. But then again, with RDC transmitting only the changed portions of the file, you are likely to see significantly less data go over the wire than your 20 GB archive.
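As a rough, synthetic illustration of that speed-versus-ratio trade-off (LZNT1/XPRESS aren't exposed in Python's standard library, so zlib at level 1 stands in for a speed-oriented codec and lzma for a 7z-style one; real VM images will behave differently):

```python
import lzma, random, time, zlib

random.seed(0)
# ~4 MB of synthetic, moderately compressible data
vocab = [bytes(random.randrange(256) for _ in range(16)) for _ in range(4096)]
data = b"".join(random.choice(vocab) for _ in range(262_144))

candidates = [
    ("zlib level 1 (speed-oriented)", lambda d: zlib.compress(d, 1)),
    ("zlib level 9 (size-oriented)",  lambda d: zlib.compress(d, 9)),
    ("lzma (7z-style)",               lzma.compress),
]
for name, compress in candidates:
    start = time.perf_counter()
    out = compress(data)
    print(f"{name}: {len(out) / len(data):.0%} of original size, "
          f"{time.perf_counter() - start:.2f} s")
```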
Pre-creating a 7zip archive to be transferred with RDC might seem like a good idea to get the best of both worlds - only transmit changes, but with a higher compression ratio for the changed portions - but it isn't. Compression would mangle the entire file, and even a single byte changed at the beginning of the data stream would cause the compressed file to look entirely different than before. There are compression algorithm modifications to mitigate this problem (e.g. the --rsyncable mode some gzip builds offer), but 7zip does not seem to implement them so far.
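To see the effect, here is a rough sketch (zlib standing in for the 7z format, fixed 64 KB chunks standing in for RDC's signature blocks - both simplifications): a single changed byte at the start of the input leaves the raw file almost entirely chunk-identical, but changes most of the compressed stream.

```python
import hashlib, random, zlib

CHUNK = 64 * 1024  # illustrative fixed chunk size, not the real RDC chunking

def signatures(data):
    return [hashlib.sha1(data[i:i + CHUNK]).hexdigest() for i in range(0, len(data), CHUNK)]

def identical_chunks(a, b):
    return sum(x == y for x, y in zip(signatures(a), signatures(b)))

random.seed(42)
# ~4 MB of moderately compressible synthetic data
vocab = [bytes(random.randrange(256) for _ in range(8)) for _ in range(256)]
payload = b"".join(random.choice(vocab) for _ in range(500_000))
modified = b"X" + payload[1:]  # flip a single byte at the very start

print("raw:       ", identical_chunks(payload, modified), "of", len(signatures(payload)))
print("compressed:", identical_chunks(zlib.compress(payload), zlib.compress(modified)),
      "of", len(signatures(zlib.compress(payload))))
# Typically: nearly all raw chunks still match, while few (if any) compressed chunks do.
```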
All in all, you are likely to save significantly on bytes transmitted over the wire when using DFS-R to transfer file modifications. But it is rather unlikely you are going to save any time, and you are inducing significant I/O and CPU load on both the source and the destination, as both copies of the file need to be read and checksummed before the actual transmission can start.
Edit: if you have new files, RDC indeed would be of little help - there is no counterpart to rsync's --fuzzy parameter, which would look for similar files at the destination and take them as a baseline for differential transfers. If you know you have a similar file (e.g. a baseline image of the transferred VM HD), you could pre-seed the destination directory with this one, though.