Essentially, I am looking for a GNU/Linux tool that could recursively iterate through two directories, find file changes/additions/deletions, and for all changed files output a diff. This can already be done efficiently by diff for text files, but not for large binary files - I would also like efficient "diffs" between binary files to be included in the final patch (which, I gather, is known as binary delta compression (Binary delta compression - Wikipedia), a subset of delta encoding (Delta encoding - Wikipedia)). Or, in other words, do something like described in this "wishlist" comment (from CommandLineSyntax - xdelta - Google Project Hosting):
It would be very nice if xdelta supported multiple files patching. Something like:
xdelta3 -r /path/folder1 /path/folder2 >allfilesrecursivepatch.xdelta
For recursive comparing all files on folder1 and folder2 and creating a single patch-file for all of them. And:
xdelta3 -r -d /path/folder1 <allfilesrecursivepatch.xdelta
For applying the patch on all files contained on folder1
This facility does not exist, judging by "Issue 21 - xdelta - Recursive directory diff support - Google Project Hosting", although there are ways around it: the issue page has several suggestions for script wrappers, but I'd rather keep things self-contained in a single tool.
Most important for me would be the "patching a live filesystem directory" part shown above, due to my intended use case - described in more detail below, and illustrated with a bash script that uses git.
I'd want to be able to update a static website on a cheapish Linux host/webfarm, which only allows FTP transfer (so no rsync and such) with a rather low transfer speed, and only allows PHP script execution. Typically I'd need to sync from local/client/home to server/webhost, but of course I do not want to upload 200 MB each time I want to update a page :)
I could use outlandishideas/sync · GitHub to "Sync directory contents over HTTP using PHP", but besides only syncing from server to local, it also sends only whole files: "No attempt is made to send diffs; this is not rsync". Similarly, I could use GNU FTPsync; it could probably handle file creation, modification and deletion, but it has the same problem - only whole files will be sent.
In principle, git could be used too - the script below generates directories testdir_old and testdir_new, and shows that git can encode the differences between them (in this case "removed 1024; added 1024; modified/added 19; modified inline 1200", or 3267 bytes of change in total) as a "sneakernet" git bundle file 4470 bytes in size. But even if I could persuade the host to install git up there, I'd still have to maintain a .git repo on the webhost for the bundle to apply cleanly - and I definitely do not want to do that, as I couldn't spare the extra file size usage; also, it seems that managing large binary files with git (see "Managing large binary files with git - Stack Overflow") requires git annex or git bup. And passing around a script like the one below would be problematic, because git would create new revision hashes each time, making the bundle not apply cleanly.
Also, since in PHP I can apparently "untar-gz without exec()" (see "untar-gz without exec()? - Stack Overflow"), maybe it's worth trying to determine the changes in the directory, then pack only the changed files in a tar.gz, and send that to a PHP script on the server, which would unpack it over the target directory. This would still send entire files, but at least they'd be compressed - but deletions on the server would be difficult to handle.
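A minimal sketch of the local half of that idea (paths are placeholders; it assumes I keep a local copy of the last-uploaded version to compare against, and it does not handle deletions):
#!/usr/bin/env bash
## hypothetical sketch: pack only changed/added files into changed.tar.gz
LAST=/path/to/last_uploaded_copy    ## local copy of what is currently on the server
CURR=/path/to/current_site
cd "$CURR"
changed=()
while IFS= read -r -d '' f; do
  rel=${f#./}
  ## cmp -s exits non-zero if the file is new or its content differs
  cmp -s "$LAST/$rel" "$rel" 2>/dev/null || changed+=("$rel")
done < <(find . -type f -print0)
[ ${#changed[@]} -gt 0 ] && tar -czf /tmp/changed.tar.gz "${changed[@]}"
## /tmp/changed.tar.gz would then be uploaded and unpacked by the PHP script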
Finally, the binary file diff utilities suggest that one could pack each directory in a .tar(.gz), and then run the utility on those archives - e.g. (via ExternalCompression - xdelta - Google Project Hosting):
gzip release-1.tar
gzip release-2.tar
xdelta3 -e -s release-1.tar.gz release-2.tar.gz delta-1-2.xd3
xdelta3 -d -s release-1.tar.gz delta-1-2.xd3 release-2.tar.gz
... probably also doable with JojoDiff / jdiff
jdiff archive0000.tar archive0001.tar archive0001.jdf
jptch archive0000.tar archive0001.jdf archive0001b.tar
... or with bsdiff.
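A hedged guess at the equivalent bsdiff/bspatch invocations (working on the uncompressed tars) would be:
bsdiff release-1.tar release-2.tar delta-1-2.bsdiff
bspatch release-1.tar release-2.tar delta-1-2.bsdiff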
However, that requires me to also maintain a tar archive of the entire site on the webhost, so the patches can apply cleanly to it - and space is an issue here again. It would also force me to ask the webhost to allow me to install and use at least the patching parts of those tools; that may be worth a try, if those tools didn't also require me to keep an additional tar'd copy of the site on the host.
Anyways, below is the script that demonstrates extraction of a git .bundle as a recursive diff between two directories (or rather, two versions of the same directory); relevant terminal output is included in the comments:
#!/usr/bin/env bash
## comments with double ##; (relevant) terminal output with single #
## uses git, ImageMagick, tree
set -x
cd /tmp
rm -rf testdir export_compare
mkdir testdir
cd testdir
git init
# Initialized empty Git repository in /tmp/testdir/.git/
git config user.name "test"
git config user.email "[email protected]"
## generate files - revision 1
## - text
cat /dev/urandom | tr -dc '[ -~]' | fold -w 80 -s | head -c 1024 > test_01.txt
mkdir subdir
cat /dev/urandom | tr -dc '[ -~]' | fold -w 80 -s | head -c 1024 > subdir/subtest_01.txt
cat /dev/urandom | tr -dc '[ -~]' | fold -w 80 -s | head -c 1024 > subdir/subtest_02.txt
## - binary
convert -depth 8 -size 200x150 xc:blue rgb:subdir/rgbimage.dat
## check:
## - files:
tree -s --dirsfirst .
# .
# ├── [ 4096] subdir
# │ ├── [ 90000] rgbimage.dat
# │ ├── [ 1024] subtest_01.txt
# │ └── [ 1024] subtest_02.txt
# └── [ 1024] test_01.txt
#
# 1 directory, 4 files
## - view image (press "q" to quit)
display -depth 8 -size 200x150 rgb:subdir/rgbimage.dat
git add *
git commit -m "initial commit"
## check usage
du -ba --max-depth=1 .
# 1024 ./test_01.txt
# 96144 ./subdir
# 99947 ./.git
# 201211 .
## change files - revision 2
## remove file:
REP="removed 1024;"
git rm subdir/subtest_02.txt
## add file
REP="$REP added 1024;"
cat /dev/urandom | tr -dc '[ -~]' | fold -w 80 -s | head -c 1024 > test_02.txt
git add test_02.txt
## change files:
## - text:
REP="$REP modified/added 19;"
echo "a new changed line" >> test_01.txt
## - binary
REP="$REP modified inline 1200"
convert -depth 8 -size 1x200 xc:red rgb:/dev/stdout | dd of=subdir/rgbimage.dat bs=1 seek=$((200*50*3)) count=$((200*3)) conv=notrunc
convert -depth 8 -size 1x200 xc:red rgb:/dev/stdout | dd of=subdir/rgbimage.dat bs=1 seek=$((200*100*3)) count=$((200*3)) conv=notrunc
## check:
## - files:
tree -s --dirsfirst .
# .
# ├── [ 4096] subdir
# │ ├── [ 90000] rgbimage.dat
# │ └── [ 1024] subtest_01.txt
# ├── [ 1043] test_01.txt
# └── [ 1024] test_02.txt
#
# 1 directory, 4 files
## - view image (press "q" to quit)
display -depth 8 -size 200x150 rgb:subdir/rgbimage.dat
git add *
git commit -m "second commit with changes"
# [master 2b243fb] second commit with changes
# 4 files changed, 16 insertions(+), 19 deletions(-) ...
## check usage
du -ba --max-depth=1 .
# 1043 ./test_01.txt
# 1024 ./test_02.txt
# 95120 ./subdir
# 123355 ./.git
# 224638 .
## go back to parent dir (/tmp) and make a new directory for "clean" exports:
cd /tmp
mkdir export_compare
mkdir export_compare/testdir_new
mkdir export_compare/testdir_old
## from git, export each revision "cleanly"
cd testdir
git archive HEAD | tar -x -C /tmp/export_compare/testdir_new
git archive HEAD^1 | tar -x -C /tmp/export_compare/testdir_old
## create git bundle, containing the changes between new and old revision
git bundle create ../commits_testdir.bundle HEAD HEAD^1
# ... Writing objects: 100% (13/13), 4.30 KiB, done.
# Total 13 (delta 2), reused 0 (delta 0)
## check
cd /tmp
echo $REP
# removed 1024; added 1024; modified/added 19; modified inline 1200
du -b commits_testdir.bundle
# 4470 commits_testdir.bundle
cd export_compare
du -bs testdir_old testdir_new
# 101264 testdir_old
# 101283 testdir_new
tree -s --dirsfirst .
# .
# ├── [ 4096] testdir_new
# │ ├── [ 4096] subdir
# │ │ ├── [ 90000] rgbimage.dat
# │ │ └── [ 1024] subtest_01.txt
# │ ├── [ 1043] test_01.txt
# │ └── [ 1024] test_02.txt
# └── [ 4096] testdir_old
# ├── [ 4096] subdir
# │ ├── [ 90000] rgbimage.dat
# │ ├── [ 1024] subtest_01.txt
# │ └── [ 1024] subtest_02.txt
# └── [ 1024] test_01.txt
#
# 4 directories, 8 files
# + set +x
set +x
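(For reference, applying the bundle produced above on the receiving side would look roughly like the following - and this is exactly the part that requires a full .git repository with the matching old revision to already exist there; the receiving path is a placeholder:)
cd /path/to/receiving_copy_of_testdir           ## must be a git repo already at the old revision
git bundle verify /tmp/commits_testdir.bundle   ## check that the bundle applies to this history
git pull /tmp/commits_testdir.bundle HEAD       ## bring the working tree up to the new revision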
I would go for a low-key solution. If I understand you correctly, you want to update web pages and don't expect most of them to change, in which case I would just upload the changed files, whole.
This could be achieved e.g. in mc: connect one panel over FTP to your web host, let the other panel show the local version. Then select everything and copy, and choose to overwrite only newer files (you can choose that for all files at once). Or use another file manager's synchronize facility; I believe Krusader has one. Unless you've got big files which change only locally (what are they? databases [see note 1]? executables maybe, but not compressed?), binary deltas won't give you much IMO.
Note 1: Synchronizing databases in this way is a bad idea.
Give git-ftp a try; it was made entirely for that purpose.
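A typical workflow might look roughly like this (URL and credentials are placeholders; see the git-ftp documentation for details):
git config git-ftp.url "ftp://ftp.example.com/public_html"
git config git-ftp.user "ftpuser"
git config git-ftp.password "secret"
git ftp init   # first run: uploads all files tracked in the repository
git ftp push   # later runs: uploads only the files changed since the last push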
I would use something like curlftpfs to mount the server's FTP share on the local machine and then run rsync locally on those two folders.
This would have the advantage that you don't need to keep the server's state as a copy on the local machine - as would be necessary if you wanted to create binary diffs first, send them to the server, and then start patching there.
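Roughly like this (hostname, credentials and paths are placeholders):
mkdir -p /mnt/webhost
curlftpfs ftp://user:[email protected] /mnt/webhost
# --size-only, since FTP mounts often don't preserve modification times reliably
rsync -rv --size-only --delete /path/to/local/site/ /mnt/webhost/
fusermount -u /mnt/webhost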
If you still would like to go for the diff/patch solution, you could try to use xargs to create multiple patch files (with xdelta3 or whatever tool you like), tar and gzip them into one big blob, transfer that to the server, and "reverse" the whole process there.
You would still need to find out first which files need patching at all, so the process would probably look like this:
- use cmp to check whether the files differ at all
- create patches only for the files that do differ (driving the delta tool via xargs or params of find); a rough sketch follows below
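A rough sketch of such a pipeline on the local side (all paths are placeholders, xdelta3 is used as the delta tool, and deleted files would still need to be listed separately):
#!/usr/bin/env bash
## hypothetical sketch: create per-file xdelta3 patches between an old snapshot
## and the current tree, then pack them into a single tarball for transfer
OLD=/path/to/old_snapshot   # what the server currently has
NEW=/path/to/new_tree       # what it should become
OUT=/tmp/patches
rm -rf "$OUT"; mkdir -p "$OUT"
cd "$NEW"
while IFS= read -r -d '' f; do
  rel=${f#./}
  if [ -f "$OLD/$rel" ]; then
    # existing file: only create a delta if cmp says it actually differs
    if ! cmp -s "$OLD/$rel" "$rel"; then
      mkdir -p "$OUT/$(dirname "$rel")"
      xdelta3 -e -s "$OLD/$rel" "$rel" "$OUT/$rel.xd3"
    fi
  else
    # new file: nothing to delta against, so copy it whole
    mkdir -p "$OUT/$(dirname "$rel")"
    cp "$rel" "$OUT/$rel.new"
  fi
done < <(find . -type f -print0)
tar -czf /tmp/allpatches.tar.gz -C "$OUT" .
## on the receiving side the process would be reversed: unpack, then for each
## *.xd3 run "xdelta3 -d -s <currentfile> <patch> <newfile>" and swap the result in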
But be aware: if you implement that patching stuff, you are developing a product that you have to maintain; that's no easy "just configure it" solution anymore.