I am running a (Linux based) rsync server for software distribution. A (Windows based) source repository server which is outside my control pushes software packages to it via rsync, and about a hundred satellite servers worldwide pull from it, also via rsync.
The source repository contains many big duplicate files. I want to reduce disk space and bandwidth consumption on the satellite servers by replacing those duplicates with hardlinks. The administrator of the source repository is unwilling or unable to do so at the source, so I'm trying to do it after the fact on the distribution server. I have created a simple bash script based on the fdupes command which finds groups of duplicates and replaces them with hardlinks to a single file.
The rsync transfers to the satellite servers preserve these hardlinks as desired thanks to the -H option. The transfer from the source repository, however, produces inconsistent results. Sometimes the deduplication is preserved. Sometimes the source server retransmits all of the files of a deduplicated group and the deduplication is broken even though the source files did not change.
Hence my question: What is the official behaviour of rsync when it is asked to sync two identical but separate files that already exist at the destination with the correct content, but as hardlinks to the same file? What are the exact criteria for retransmitting a file? Is there a way to ensure that the hardlink at the destination is preserved in that situation even though the hardlink does not exist at the source?
tl;dr: To preserve file-level deduplication via hard links at the destination, run rsync with the --checksum option.
Full answer, according to a series of experiments I did:
If two files are not hardlinked at the source, rsync will sync each of them individually to the destination. It does not care whether the files happen to be hardlinked at the destination. If one of the files (or both of them) ends up being retransmitted, the hard link at the destination will be broken; otherwise it will be left untouched. That is, even with the --hard-links option, rsync will not break a hardlink at the destination just because the files are not hardlinked at the source.
The criteria for retransmitting a file depend on the --checksum (-c) and --ignore-times (-I) options:
- If --checksum is given, only files that differ in size or checksum between source and destination are retransmitted. Consequently, if the file content hasn't changed, a hard link at the destination will be preserved even if it doesn't exist at the source.
- If --ignore-times is given, all files are retransmitted, breaking any hard link at the destination that doesn't exist at the source.
- If neither option is given, rsync will use the modification timestamps (together with the file sizes) of the source and destination files for its decision. In that case, if the timestamps of the two source files differ, a hard link at the destination will always be broken, because the linked destination file has only one timestamp and so can match at most one of the two source files.
It preserves source hard links if you use the -H or --hard-links option.
That will not create hard links -- you'll have to do that after the fact by looking for files with the same checksum, deleting one, and adding a hard link to replace it. After all, you wouldn't want rsync turning every file with duplicate content into a hard link to the same file. Imagine if every zero-length file was a hard link -- you add content to one, and you've changed the content for all of them.
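The after-the-fact step described above (find files with the same checksum, drop one copy, put a hard link in its place) could be sketched roughly as follows. This is only an illustration, not the fdupes-based script from the question: dedup_dir is a made-up helper name, and the sketch assumes GNU coreutils and file names without newlines or backslashes. It skips empty files, so the zero-length case mentioned above never gets linked.

```sh
#!/bin/sh
# Sketch only: replace byte-identical files under a directory with hard links.
# dedup_dir is a hypothetical helper name; it is not part of rsync or fdupes.
# Assumes GNU coreutils and file names without newlines or backslashes.
set -eu

dedup_dir() {
    root=$1
    prev_sum=
    keeper=
    # Hash every non-empty regular file, then sort so that files with equal
    # content end up on adjacent lines.
    find "$root" -type f -size +0 -exec sha256sum {} + | LC_ALL=C sort |
    while IFS= read -r line; do
        sum=${line%% *}      # checksum: everything before the first space
        file=${line#*  }     # path: everything after the two-space separator
        if [ "$sum" != "$prev_sum" ]; then
            # First file of a new group: later duplicates link to this one.
            prev_sum=$sum
            keeper=$file
        elif [ ! "$file" -ef "$keeper" ] && cmp -s "$file" "$keeper"; then
            # Same content but separate inodes: replace the copy with a link.
            ln -f "$keeper" "$file"
        fi
    done
}
```

It would be called as dedup_dir /path/to/repository (the path is a placeholder). Files that are linked this way share a single inode, so they also share one modification timestamp and one set of permissions, namely those of the first file in each group.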