I have a simple rsync line in my crontab that fetches backup files from the prod server to another server.
It looks like it is touching the files that already exist in the destination folder, so the backup takes longer with each run.
Please take a look at the dates and times of the files below.
How do I get rsync not to touch (and re-download?) the files it already has? I don't need any checksums calculated either: once the backups are created, they won't change anymore.
rsync -vzre 'ssh' stor@server:/backup/system/ /storage/share/Backup/Server
The files to be fetched:
-rw-r-x--- 1 root stor 896K Jun 22 05:02 giant-140622-etc.zip
-rw-r-x--- 1 root stor 620K Jun 22 05:02 giant-140622-sql.zip
-rw-r-x--- 1 root stor 84M Jun 22 05:02 giant-140622-www.zip
-rw-r-x--- 1 root stor 899K Jun 25 05:00 giant-140625-etc.zip
-rw-r-x--- 1 root stor 603K Jun 25 05:00 giant-140625-sql.zip
-rw-r-x--- 1 root stor 84M Jun 25 05:00 giant-140625-www.zip
-rw-r-x--- 1 root stor 899K Jun 28 05:00 giant-140628-etc.zip
-rw-r-x--- 1 root stor 620K Jun 28 05:00 giant-140628-sql.zip
-rw-r-x--- 1 root stor 86M Jun 28 05:00 giant-140628-www.zip
-rw-r-x--- 1 root stor 899K Jun 30 05:00 giant-140630-etc.zip
-rw-r-x--- 1 root stor 617K Jun 30 05:00 giant-140630-sql.zip
-rw-r-x--- 1 root stor 86M Jun 30 05:00 giant-140630-www.zip
The destination:
-rw-r-x--- 1 stor stor 896K Jun 30 06:06 giant-140622-etc.zip
-rw-r-x--- 1 stor stor 620K Jun 30 06:06 giant-140622-sql.zip
-rw-r-x--- 1 stor stor 84M Jun 30 06:06 giant-140622-www.zip
-rw-r-x--- 1 stor stor 899K Jun 30 06:06 giant-140625-etc.zip
-rw-r-x--- 1 stor stor 603K Jun 30 06:06 giant-140625-sql.zip
-rw-r-x--- 1 stor stor 84M Jun 30 06:06 giant-140625-www.zip
-rw-r-x--- 1 stor stor 899K Jun 30 06:06 giant-140628-etc.zip
-rw-r-x--- 1 stor stor 620K Jun 30 06:06 giant-140628-sql.zip
-rw-r-x--- 1 stor stor 86M Jun 30 06:06 giant-140628-www.zip
-rw-r-x--- 1 stor stor 899K Jun 30 06:07 giant-140630-etc.zip
-rw-r-x--- 1 stor stor 617K Jun 30 06:08 giant-140630-sql.zip
-rw-r-x--- 1 stor stor 86M Jun 30 06:10 giant-140630-www.zip
Update:
When I run the rsync command (with the --ignore-existing argument) from the shell, it only downloads new files and skips the files it already has.
When the exact same command is run by a cronjob, however, the already existing files do change every cycle and the whole job takes longer each cycle (compare the times above: the cronjob starts at 06:00 and takes about 2 minutes per file, even for files that already exist).
rsync -vzr --ignore-existing -e 'ssh -i /path/id_rsa -l backup' [email protected]:/backup/system/ /nfs/share-private/Backup/Server
Update:
Here are the files from July. I put an extra blank line in between the groups; please look at the times, which start at 06:01 and increase with each new file.
-rw-r-x--- 1 stor stor 899K Jul 4 06:01 giant-140702-etc.zip
-rw-r-x--- 1 stor stor 621K Jul 4 06:01 giant-140702-sql.zip
-rw-r-x--- 1 stor stor 86M Jul 4 06:03 giant-140702-www.zip
^-- 01 to 03
-rw-r-x--- 1 stor stor 899K Jul 4 06:04 giant-140704-etc.zip
-rw-r-x--- 1 stor stor 634K Jul 4 06:05 giant-140704-sql.zip
-rw-r-x--- 1 stor stor 86M Jul 8 06:02 giant-140704-www.zip
^-- ???
-rw-r-x--- 1 stor stor 899K Jul 8 06:03 giant-140706-etc.zip
-rw-r-x--- 1 stor stor 629K Jul 8 06:03 giant-140706-sql.zip
-rw-r-x--- 1 stor stor 86M Jul 8 06:06 giant-140706-www.zip
^-- 03 - 06
-rw-r-x--- 1 stor stor 899K Jul 8 06:07 giant-140708-etc.zip
-rw-r-x--- 1 stor stor 629K Jul 8 06:07 giant-140708-sql.zip
-rw-r-x--- 1 stor stor 86M Jul 8 06:10 giant-140708-www.zip
^-- 07 - 10
If this goes on for another month, I imagine the times would look like this:
-rw-r-x--- 1 stor stor 899K Jul 8 06:32 giant-140808-etc.zip
-rw-r-x--- 1 stor stor 629K Jul 8 06:32 giant-140808-sql.zip
-rw-r-x--- 1 stor stor 86M Jul 8 06:35 giant-140808-www.zip
^-- what I imagine to happen
By default rsync compares only the size and modification time of each file; when those differ, it will read the entire file on both source and destination to verify that they are identical. This does not consume much network bandwidth, as it only compares hash values, but it does spend time reading from the disk.
In one usage scenario, I found this to be terribly inefficient because the source files were only being appended to. I used --size-only, which worked well for me.
There are a few other options that look like they may be applicable, --append and --append-verify, but I haven't tested those myself.
It does not look like you have a directory with a lot of small files, so the time to read the directory listing from disk and stat each file shouldn't be much of a problem.
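A minimal local sketch of the effect described above, assuming rsync is installed. The scratch directory and the file name giant-demo.zip are made up for illustration, and -n (dry run) is added so nothing is actually copied. It compares the default size-and-mtime check with --size-only on a file whose content matches but whose modification time differs:

```shell
# Scratch directories (hypothetical, for illustration only).
work=$(mktemp -d)
mkdir "$work/src" "$work/dst"

# Same content on both sides, but different modification times,
# which is the state a copy made without -t ends up in.
printf 'backup data\n' > "$work/src/giant-demo.zip"
cp "$work/src/giant-demo.zip" "$work/dst/giant-demo.zip"
touch -t 201406220502 "$work/src/giant-demo.zip"

# Default comparison (size + mtime): the dry run lists the file as needing an update.
default_out=$(rsync -rn --itemize-changes "$work/src/" "$work/dst/")

# --size-only: the sizes match, so the file is skipped.
size_only_out=$(rsync -rn --size-only --itemize-changes "$work/src/" "$work/dst/")

printf 'default:   %s\n' "$default_out"
printf 'size-only: %s\n' "$size_only_out"
rm -rf "$work"
```

Note that --size-only would also skip a file whose content changed without changing its size, so it only fits backups that are never rewritten in place.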
I added the --ignore-existing option and it looks like it won't change anything and will only download new files.
Edit: When there are new files it still takes longer each cycle.
I think adding -t to your argument list will help.
To verify this you could add --itemize-changes to the arguments (without -t). If I understood you correctly, this would show the T flag on every line; according to man 1 rsync, a T means the modification time will be set to the transfer time.
After this, add -t to the command (keep --itemize-changes) and you will get a t flag on every line. In the next run the list will only contain the new files.
This is my example run:
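The sequence described above can be reproduced locally with a small sketch (this is not the answerer's original run). It assumes rsync is installed; the scratch directory and the file name giant-demo.zip are hypothetical. The same source is synced four times, twice without -t and twice with it:

```shell
# Scratch directories (hypothetical, for illustration only).
work=$(mktemp -d)
mkdir "$work/src" "$work/dst"
printf 'backup data\n' > "$work/src/giant-demo.zip"
touch -t 201406220502 "$work/src/giant-demo.zip"

# Runs 1 and 2 without -t: the destination mtime becomes the transfer time,
# so the second run still sees a mismatch and updates the file again.
run1=$(rsync -r --itemize-changes "$work/src/" "$work/dst/")
run2=$(rsync -r --itemize-changes "$work/src/" "$work/dst/")

# Runs 3 and 4 with -t: run 3 copies the source mtime over,
# after which run 4 has nothing left to do for this file.
run3=$(rsync -rt --itemize-changes "$work/src/" "$work/dst/")
run4=$(rsync -rt --itemize-changes "$work/src/" "$work/dst/")

printf 'run 1 (no -t):   %s\n' "$run1"
printf 'run 2 (no -t):   %s\n' "$run2"
printf 'run 3 (with -t): %s\n' "$run3"
printf 'run 4 (with -t): %s\n' "$run4"
rm -rf "$work"
```

If the file still shows up in run 2 but no longer in run 4, the missing -t is the likely reason the cron job keeps re-processing existing files.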
Why do you say it takes longer each time? How is that possible?
Maybe it's the program generating the files that is touching them?
Try with --checksum (skip based on checksum, not mod-time & size) and see if that changes anything. I wouldn't keep this option, because it reads every file from the disk every time, which is too expensive; I'm only suggesting it to find the problem.
Maybe also try to debug with the -t option, which preserves modification times.
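One way to try the --checksum suggestion without changing anything is a dry run. The sketch below is a local illustration only, assuming rsync is installed; the scratch directory and the file names same.zip and diff.zip are made up, and -n plus --itemize-changes are additions for the sake of the demo:

```shell
# Scratch directories (hypothetical, for illustration only).
work=$(mktemp -d)
mkdir "$work/src" "$work/dst"

# same.zip: identical content on both sides, different mtimes.
printf 'identical\n' > "$work/src/same.zip"
cp "$work/src/same.zip" "$work/dst/same.zip"
touch -t 201406220502 "$work/src/same.zip"

# diff.zip: same size on both sides, different content.
printf 'aaaa\n' > "$work/src/diff.zip"
printf 'bbbb\n' > "$work/dst/diff.zip"

# -n makes this a dry run; --checksum decides by content rather than mtime.
checksum_out=$(rsync -rn --checksum --itemize-changes "$work/src/" "$work/dst/")
printf '%s\n' "$checksum_out"
rm -rf "$work"
```

Files whose content really matches should drop out of the list here, so anything still listed on the real backup directory has genuinely different content on the two machines.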