I'm hosting 200 GB of product images on S3 (this is my primary file host).
Do I need to back that data up somewhere else, or is S3 safe as it is?
I have been experimenting with mounting the S3 bucket to an EC2 instance and then making a nightly rsync backup. The problem is that it's about 3 million files, so just generating the list of differences rsync needs takes a while; the backup actually takes about 3 days to complete.
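For reference, the setup is roughly this — a minimal sketch assuming the bucket is mounted with something like s3fs-fuse (bucket name and paths are just placeholders):

# Mount the bucket on the instance (credentials file and mount point are examples)
s3fs my-product-images /mnt/s3 -o passwd_file=${HOME}/.passwd-s3fs

# Nightly rsync from the mounted bucket to a backup directory on the instance
rsync -a --delete /mnt/s3/ /backup/product-images/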
Any ideas how to do this better? (if it's even necessary?)
I've been doing research on this, funny enough.
Your backups to S3 can fail, depending on your region, because of eventual consistency: the basic warning is that if you do this often enough, at some point you'll hit errors opening or finding files while Amazon's storage back end syncs objects among its servers, so your backups may not be reliable.
As for whether you need to save them another way, this depends on your risk management. Do you trust Amazon to hold your data?
It's possible they may lose something or have a larger failure of their storage system; they no doubt have clauses in their contracts specifying that if they lose your data, that's your problem, not theirs. Also, since your data is housed somewhere else, you don't know what they will do with it; if law enforcement wants your data, you may not even know someone else accessed it.
Do you trust it? If the data isn't key to your business and you're willing to accept this risk, then there's no need to download it to offsite storage. If you're not willing to bet that your data will be safe on Amazon's storage servers, you should make arrangements to periodically dump it to your own storage.
In other words, I don't think there's a straight answer to this; it depends on your risk tolerance and business needs. Many people wouldn't stake their income entirely on cloud storage, and personally I feel a little wary of that...
As for doing this better: another approach that has come up in discussions and my own research is to create an EBS volume large enough to hold the data, attach it to the EC2 instance, copy your data onto it, then detach the volume and save that data to S3. I'm still researching whether that means saving the volume itself to S3 (as a snapshot) or just its contents... but either way you can delete the EBS volume when done to save storage costs.
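If you go the snapshot route, a minimal sketch with the AWS CLI would look something like this (the volume ID is made up; EC2 stores the snapshot in S3-backed storage for you):

# Snapshot the detached EBS volume; the snapshot is kept by AWS in S3
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "nightly product-image backup"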
EDIT: I see on re-reading that you're saving FROM S3 TO the EC2 instance, not vice versa (although I don't know whether the eventual consistency issue could still cause problems there). You're trying to save data to an EC2 instance as a backup? I would think that, cost-wise, that's not a sound tactic; once you factor in long-term storage of that kind of data along with VM time, it may be cheaper to back things up to a local drive. With drive prices what they are, you could copy the data down to a local disk as a backup.
I'd still stand by the warnings about trusting Amazon and their storage. If you want to keep everything in Amazon S3 but have more redundancy, duplicate your S3 buckets across regions; if they have an outage affecting one region, it shouldn't knock out all of them. You'd hope. Anything is possible, though.
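One way to keep a second copy of a bucket in another region is a periodic server-side sync with the AWS CLI; this is just a sketch, and the bucket names and regions below are made up:

# Copy objects from one bucket to a second bucket in a different region
aws s3 sync s3://product-images-us-east-1 s3://product-images-eu-west-1 --source-region us-east-1 --region eu-west-1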
It comes down to how much you value your data, how much you're willing to pay for it and how much risk you want to tolerate.
I've used s3cmd's s3cmd sync to do this. It's a bit rsync-like in its operation, and it can push and pull whole directories between S3 and another Linux system of your choice. I don't see any reason why you couldn't s3cmd sync to a running EC2 instance, or even your own developer workstation (or a storage server).
You might want to set up a VPC instance, and then you could assign a small node inside your VPC the role of backup server, giving it both an IP inside Amazon's network and one inside your local subnet.
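Either way, pulling a bucket down to a backup box might look like this; the bucket name and destination path are placeholders, and --delete-removed (which mirrors deletions) is optional:

# Pull the whole bucket to a local directory, mirroring deletions
s3cmd sync --delete-removed s3://my-product-images/ /backup/product-images/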
My advice: your data is your responsibility, not Amazon's. If losing the data is not a big deal, then don't do your own backup. If it is, then take your own backup to (at the very least) a cheap JBOD, and verify it regularly, as I do.
You'll find out how much responsibility Amazon is willing to assume for your data, the day they lose it.
If you can afford it, do what I do: keep all my data stored on my own server as well, pulled from Amazon S3. That way, if Amazon goes down for any reason (touch wood), I can simply serve all my data straight from my server. From that server I make monthly backups to a local drive, since my website is over 2 TB in size.
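As a sketch of that kind of schedule (bucket, paths, and times are all placeholders), the crontab on the server might look like:

# Nightly pull from S3 to the server, then a monthly copy to the local backup drive
0 3 * * * aws s3 sync s3://my-product-images /srv/site-data --delete
0 4 1 * * rsync -a /srv/site-data/ /mnt/backup-drive/site-data/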
Although this is an old thread, it's the first thing that comes up when Googling S3 backup, so I thought I'd add to it...
Doing some research on this myself, I discovered Rclone https://rclone.org/ - it's rsync-like software designed to copy between cloud storage services, and it supports most of them. No affiliation, and I haven't used it yet so I can't say if it's good or bad, but I thought it might help someone.
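For what it's worth, basic usage would look something like this; the remote name, bucket, and local path below are made up, and the remote has to be defined first with rclone config:

# Mirror the bucket down to a local directory using a preconfigured S3 remote
rclone sync s3remote:my-product-images /backup/product-images --progress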
It seems to me that there is opportunity for a hosted service that does 'offsite' backups of cloud-hosted files (S3, Google Storage, Rackspace Cloud Files, etc)....