I want to compare two large binary files which are stored on different Linux machines with limited bandwidth between them and then back up only the blocks which differ – on the command line. To simplify the task, we can assume the files are not going to change during the comparison process, and the files are the same size.
This is almost like what I believe rsync does, only I don't want to modify the target file; I want to keep diffs which I can apply to the base image so I can recreate a copy at various points (i.e. when the diffs are taken).
I'm also aware of xdelta, but that appears to only compare files on the same machine.
The "process" I roughly envisage (hopefully all done by a script/program) might be -
- (On each machine) produce a list of hashes for each block.
- Compare the 2 sets of hashes.
- Produce a file which pulls only the changed blocks in the source in such a way as they can be "merged" back with the target file.
Is anyone aware of a program, script or elegant method to do this without me having to cut code?
I recommend examining rsync's batch mode. The --only-write-batch option in particular seems to accomplish your goal: it records the changes in a batch file without updating the destination, and the batch can be applied later with --read-batch.
Efficient comparison usually requires having both files on the same machine, because if you want to account for shifted offsets you need to do a lot of range-checking in the process. For example, if I added one character to a text file, a simple block-by-block check might treat everything after that character as "new" and transmit it all.
One example of a very simple checking method is the one implemented by BitTorrent. Each block has a checksum, and each file is made of a series of blocks. Blocks might span the end / start of two or more files, but block verification will also check those spans. Only the blocks that don't match the description of the file will be changed. Thus, if you start a client with some of the files accurately written and some differing (by corruption or change), only the blocks needed to fix the difference will be transferred. Block size is configurable per torrent description file in powers of 2, and there are tons of open source clients you can grab this code from.
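To make the block-checksum idea concrete, here is a minimal Python sketch, not taken from rsync or any BitTorrent client. It assumes the files are the same size and do not change during the run, as stated in the question. The function names, the 4096-byte block size, and the choice of SHA-256 are all my own assumptions for illustration. In practice you would run `block_hashes` on each machine, ship only the hash list across the link, and then ship the patch.

```python
import hashlib
import shutil

BLOCK_SIZE = 4096  # assumed block size; any fixed size works


def block_hashes(path, block_size=BLOCK_SIZE):
    """Return one SHA-256 hex digest per fixed-size block of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests


def changed_blocks(source_hashes, target_hashes):
    """Return the indices of blocks whose hashes differ.

    Assumes both lists come from files of equal size.
    """
    pairs = zip(source_hashes, target_hashes)
    return [i for i, (s, t) in enumerate(pairs) if s != t]


def make_patch(source_path, indices, block_size=BLOCK_SIZE):
    """Read the changed blocks from the source as (index, bytes) pairs."""
    patch = []
    with open(source_path, "rb") as f:
        for i in indices:
            f.seek(i * block_size)
            patch.append((i, f.read(block_size)))
    return patch


def apply_patch(base_path, out_path, patch, block_size=BLOCK_SIZE):
    """Copy the base image to out_path, then overwrite the patched blocks.

    The base image itself is left untouched.
    """
    shutil.copyfile(base_path, out_path)
    with open(out_path, "r+b") as f:
        for i, data in patch:
            f.seek(i * block_size)
            f.write(data)
```

Because the patch is just a list of (block index, data) pairs, it can be stored per snapshot and applied to a copy of the base image whenever that point in time is needed. Note that this only handles in-place changes; it does not detect inserted or shifted data.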