How to move files between two S3 buckets with minimum cost?
I have millions of files in an Amazon S3 bucket and I'd like to move these files to other buckets and folders with minimum cost or no cost if possible. All buckets are in the same zone.
Millions is a big number - I'll get back to that later.
Regardless of your approach, the underlying mechanism needs to be copying directly from one bucket to another - in this way (since your buckets are in the same region) you do not incur any charge for bandwidth. Any other approach is simply inefficient (e.g. downloading and reuploading the files).
Copying between buckets is accomplished with 'PUT copy', that is, a PUT request that includes the 'x-amz-copy-source' header; I believe this is classed as a COPY request. This copies the file and, by default, its associated metadata. You must include an 'x-amz-acl' header with the correct value if you want to set the ACL at the same time (otherwise, it defaults to private). You will be charged for your COPY requests ($0.01/1,000 requests). You can delete the unneeded files after they have been copied (DELETE requests are not charged). One point I am not quite clear on is whether a COPY request also incurs the charge of a GET request, since the object must first be fetched from the source bucket; if it does, the charge will be an additional $0.01/10,000 requests.
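As a rough sketch of the 'PUT copy' described above: the AWS CLI exposes the same operation as 'aws s3api copy-object', which sends the copy-source header for you, and its '--acl' option corresponds to 'x-amz-acl'. The function name copy_one and the bucket and key names are placeholders of mine, not something from the original answer.

```shell
# Server-side copy of a single object; the data never passes through
# the machine running this command.
# Usage: copy_one SRC_BUCKET DEST_BUCKET KEY [ACL]
copy_one() {
  src_bucket=$1
  dest_bucket=$2
  key=$3
  acl=${4:-private}   # canned ACL such as private or public-read

  # Note: keys containing special characters must be URL-encoded
  # in the --copy-source value.
  aws s3api copy-object \
    --copy-source "$src_bucket/$key" \
    --bucket "$dest_bucket" \
    --key "$key" \
    --acl "$acl"
}

# Example (placeholder names):
# copy_one my-source-bucket my-destination-bucket photos/cat.jpg public-read
```

Each call is one COPY request, so this is the unit that gets billed and the unit you would want to run in parallel.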
The above charges are seemingly unavoidable: for a million objects you are looking at around $10 (or $11 if the GET charge also applies). Since in the end you must actually create the files in the destination bucket, other approaches (e.g. tar-gzipping the files, Amazon Import/Export, etc.) will not get around this cost. Nonetheless, it might be worth contacting Amazon if you have more than a couple of million objects to transfer.
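The arithmetic behind those figures can be checked with a small helper. This is only a sketch using the prices quoted in this answer ($0.01 per 1,000 COPY requests and, possibly, $0.01 per 10,000 GET requests), which may not match current S3 pricing; the function name estimate_copy_cost is made up for illustration.

```shell
# Estimate request charges for copying N objects.
# Usage: estimate_copy_cost N
estimate_copy_cost() {
  awk -v n="$1" 'BEGIN {
    copy = n / 1000 * 0.01     # COPY requests, price as quoted above
    get  = n / 10000 * 0.01    # GET requests, only if they are billed too
    printf "COPY only: $%.2f, COPY + GET: $%.2f\n", copy, copy + get
  }'
}

# estimate_copy_cost 1000000
# prints: COPY only: $10.00, COPY + GET: $11.00
```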
Given the above (unavoidable price), the next thing to look into is time, which will be a big factor when copying 'millions of files'. All tools that can perform the direct copy between buckets will incur the same charge. Unfortunately, each file requires one request to copy, one request to delete, and possibly one request to read the ACL data (if your files have varied ACLs). The best speed will come from whatever can run the most parallel operations.
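To illustrate the point about parallel operations, here is a minimal sketch that issues one copy request per key and runs several at once with xargs. It assumes a hypothetical file of object keys, one per line, and GNU or BSD xargs (the -0 and -P flags are not POSIX); parallel_copy and the default of 16 jobs are my own illustrative choices, not a recommendation from the tools listed below.

```shell
# Copy every key listed in KEY_FILE (one key per line) from SRC to DEST,
# running up to JOBS copy requests at the same time.
# Usage: parallel_copy SRC_BUCKET DEST_BUCKET KEY_FILE [JOBS]
parallel_copy() {
  src_bucket=$1
  dest_bucket=$2
  key_file=$3
  jobs=${4:-16}

  # Converting newlines to NULs and using xargs -0 keeps keys that
  # contain spaces or quotes intact.
  tr '\n' '\0' < "$key_file" |
    xargs -0 -P "$jobs" -I {} \
      aws s3api copy-object \
        --copy-source "$src_bucket/{}" \
        --bucket "$dest_bucket" \
        --key "{}"
}

# Example (placeholder names):
# parallel_copy my-source-bucket my-destination-bucket keys.txt 32
```

This starts a new CLI process per object, so it is more of a demonstration of the request pattern than a fast tool.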
There are some command line approaches that might be quite viable:
s3cmd-modification (that specific pull request) includes parallel cp and mv commands and should be a good option for you.
The AWS console can perform the copy directly - I can't speak for how parallel it is though.
Tim Kay's aws script can do the copy - but it is not parallel - you will need to script it to run the full copy you want (probably not the best option in this case - although, it is a great script).
CloudBerry S3 Explorer, Bucket Explorer, and CloudBuddy should all be able to perform the task, although I don't know how the efficiency of each stacks up. I believe though that the multi-threaded features of most of these require the purchase of the software.
Script your own using one of the available SDKs.
There is some possibility that s3fs might work: it is quite parallel and supports copies within the same bucket. It does NOT support copies between different buckets, but might support moves between different buckets.
I'd start with s3cmd-modification and see if you have any success with it or contact Amazon for a better solution.
The following commands will tell the AWS CLI to use 1,000 threads to execute jobs (each a small file or one part of a multipart copy) and look ahead 100,000 jobs:
aws configure set default.s3.max_concurrent_requests 1000
aws configure set default.s3.max_queue_size 100000
After running these, you can use the simple sync command as follows (the bucket names are placeholders):
aws s3 sync s3://source-bucket s3://destination-bucket
On an m4.xlarge machine (in AWS; 4 cores, 16 GB RAM), for my case (3-50GB files) the sync/copy speed went from about 9.5MiB/s to 700+MiB/s, a speed increase of roughly 70x over the default configuration.
Old topic, but this is for anyone investigating the same scenario, along with the time it took me for 20,000+ objects. I ran this on AWS Linux/CentOS; the objects were mostly images, along with some video and various media files.
Using the AWS CLI tools to copy the files from Bucket A to Bucket B:
A. Create the new bucket
B. Sync the old bucket with new bucket
Copying 20,000+ objects...
Started 17:03
Ended 17:06
Total time for 20,000+ objects = roughly 3 minutes
Once the new bucket is correctly configured (i.e. permissions, policy, etc.) and you wish to remove the old bucket:
C. Remove/delete the old bucket
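The commands for steps A to C are not shown above, so here is a hedged sketch of how they might look with the AWS CLI's 'aws s3 mb', 'aws s3 sync' and 'aws s3 rb' subcommands. The bucket names and the function names are placeholders I have introduced, and this is not necessarily what the original poster ran.

```shell
# Placeholder bucket names; override them in the environment.
OLD_BUCKET=${OLD_BUCKET:-my-old-bucket}
NEW_BUCKET=${NEW_BUCKET:-my-new-bucket}

# A. Create the new bucket
create_new_bucket() {
  aws s3 mb "s3://$NEW_BUCKET"
}

# B. Sync the old bucket with the new bucket
sync_buckets() {
  aws s3 sync "s3://$OLD_BUCKET" "s3://$NEW_BUCKET"
}

# C. Remove the old bucket. --force deletes the objects in it first,
#    so only run this once the new bucket has been verified.
remove_old_bucket() {
  aws s3 rb "s3://$OLD_BUCKET" --force
}

# Typical order:
# create_new_bucket && sync_buckets
# ...check permissions and policy on the new bucket, then:
# remove_old_bucket
```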
The AWS CLI provides a way to copy one bucket to another in parallel processes. The max_concurrent_requests and max_queue_size settings shown earlier in this thread, together with the sync timings that follow them, are taken from https://stackoverflow.com/a/40270349/371699.
I am not sure it is the best approach, but the AWS management console has a cut/copy/paste feature. It is very easy to use and efficient.
I'd imagine you've probably found a good solution by now, but for others who are encountering this problem (as I was just recently), I've crafted a simple utility specifically for the purpose of mirroring one S3 bucket to another in a highly concurrent, yet CPU and memory efficient manner.
It's on GitHub under an Apache License here: https://github.com/cobbzilla/s3s3mirror
If you decide to give it a try please let me know if you have any feedback.
In the source bucket, select the file(s) that you would like to copy over to another bucket.