I'm looking at using S3 as an offsite backup repo for my Subversion database. When I dump my SVN database, it's about 10 gigabytes. I would like to avoid the charge of uploading that data repeatedly.
The anatomy of this large file is such that new changes to Subversion modify the tail of the file, with everything else staying the same. Because Amazon S3 does not allow you to "patch" files with changes, I will have to upload all ten gigabytes every time I take a backup, even after a single small commit to Subversion.
Here are the options as I see them:
Option 1
I am looking at duplicity, which has a --volsize option that splits the backup data into volumes of a given number of megabytes. Is it possible to split the Subversion dump using this so that further incremental backups are measured in megabytes rather than gigabytes?
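If duplicity works out, the workflow might look roughly like the following Python sketch. Everything here is an assumption for illustration: the repository path, staging directory, bucket name, and 50 MB volume size are placeholders, svnadmin and duplicity are assumed to be on the PATH with AWS credentials in the environment, and the exact S3 target URL scheme depends on the duplicity version.

    #!/usr/bin/env python3
    """Sketch: dump the repository into a staging directory, then let duplicity
    do a volume-split (and, on later runs, incremental) backup of that
    directory to S3.  All names below are placeholders."""
    import subprocess

    REPO = "/var/svn/myrepo"                               # hypothetical repository
    STAGING = "/var/backups/svn-staging"                   # directory duplicity backs up
    TARGET = "s3://s3.amazonaws.com/my-backup-bucket/svn"  # S3 URL form varies by duplicity version

    # Write the full dump into the staging directory; duplicity's rsync-style
    # deltas should then only ship the changed tail of this file on later runs.
    with open(f"{STAGING}/myrepo.dump", "wb") as out:
        subprocess.run(["svnadmin", "dump", "--quiet", REPO], stdout=out, check=True)

    # --volsize N splits the archive into roughly N-megabyte volumes, so the
    # incremental uploads to S3 stay small.
    subprocess.run(["duplicity", "--volsize", "50", STAGING, TARGET], check=True)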
Option 2
Can I just back up the hot Subversion repository? This seems like a bad idea if it is in the middle of writing a commit. However, I have the option of taking the repo offline between midnight and 4 AM. Each revision in my Berkeley DB uses a file as its record.
Why not convert your repo to use the FSFS format instead of BDB?
That way each revision will be stored as a separate file, so incremental backups will just send the revisions which have been committed since the last backup.
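As a rough illustration of that incremental idea, here is a minimal Python sketch that uploads only the FSFS revision files the bucket does not already have. It assumes boto3 with credentials configured; the repository path, bucket, and key prefix are placeholders, and a complete backup would also need db/revprops and the rest of the repository layout.

    """Sketch: upload only FSFS revision files that are not already in S3."""
    import os
    import boto3

    REPO = "/var/svn/myrepo"       # hypothetical FSFS repository
    BUCKET = "my-backup-bucket"
    PREFIX = "svn/revs/"

    s3 = boto3.client("s3")

    # Revision files already present in the bucket, by file name (the revision number).
    existing = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            existing.add(os.path.basename(obj["Key"]))

    # Walk db/revs (sharded into subdirectories on newer FSFS formats) and
    # upload anything new.  This shows only the increment, not a full backup.
    for dirpath, _dirs, files in os.walk(os.path.join(REPO, "db", "revs")):
        for name in files:
            if name not in existing:
                s3.upload_file(os.path.join(dirpath, name), BUCKET, PREFIX + name)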
You could put up a small Amazon EC2 instance and back up to an Elastic Block Store (EBS) volume via rsync or whatever tool you prefer. Once the backup is complete, take a snapshot of the volume, which is persisted to S3.
It's a somewhat more complex solution in some respects, but it works around some of the limitations and complexities of backing up directly to S3.
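The snapshot step can be scripted; a minimal boto3 sketch might look like the following, where the region and volume ID are placeholders.

    """Sketch: snapshot the EBS volume that holds the rsync'd repository.
    EBS snapshots are stored in S3 and are incremental at the block level."""
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    snapshot = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",   # hypothetical backup volume
        Description="nightly svn backup",
    )
    print("Started snapshot", snapshot["SnapshotId"])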
I know this isn't really an answer, but why not use a hosted SVN provider and not worry about this stuff?
Another solution is to use Git, where each user has a full copy of the whole repository history, so you can recover from a server failure (since every clone is equal).
Since I had to do this recently, I'd like to add that backup-manager did the trick. It can bzip the dump and rotate it on S3. I used this for reference.
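I can't reproduce the exact backup-manager configuration here, but the same idea (dump, bzip2, upload, keep only the newest few copies) can be sketched in Python with boto3. The repository path, local file, bucket, prefix, and retention count below are all placeholders.

    """Sketch: dump, compress, upload, and rotate, keeping only the newest
    KEEP dumps in the bucket.  All names are placeholders."""
    import datetime
    import os
    import subprocess
    import boto3

    REPO = "/var/svn/myrepo"
    BUCKET = "my-backup-bucket"
    PREFIX = "svn/dumps/"
    KEEP = 7   # number of dumps to retain

    # svnadmin dump | bzip2 > local file, streamed so the 10 GB dump never
    # has to fit in memory.
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    local = f"/var/backups/myrepo-{stamp}.dump.bz2"
    with open(local, "wb") as out:
        dump = subprocess.Popen(["svnadmin", "dump", "--quiet", REPO],
                                stdout=subprocess.PIPE)
        subprocess.run(["bzip2", "-c"], stdin=dump.stdout, stdout=out, check=True)
        dump.wait()

    s3 = boto3.client("s3")
    s3.upload_file(local, BUCKET, PREFIX + os.path.basename(local))

    # Rotation: delete everything except the KEEP most recent objects.
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    for obj in sorted(objects, key=lambda o: o["LastModified"])[:-KEEP]:
        s3.delete_object(Bucket=BUCKET, Key=obj["Key"])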