I need to set up an SFTP server that, essentially, has very large capacity. I need to give one of our partners SFTP login details to a server where they will upload millions of files, totalling a few hundred terabytes. I will then selectively, and quite rarely, read some of these files. This is the only actual requirement; any technology choice is up for grabs.
What comes to mind as the easiest way is to have some sort of EC2 instance running the SFTP server in such a way that anything uploaded is either directly sent to S3, or some sort of process discovers new files when they get uploaded, copies them to S3, and deletes them from disk.
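Something like this rough sketch is what I'm imagining, assuming inotify-tools and the AWS CLI are available (the paths and bucket name are just placeholders):

    #!/bin/bash
    # Watch the SFTP upload directory and move each completed upload to S3.
    # /srv/sftp/uploads and my-archive-bucket are placeholders.
    UPLOAD_DIR=/srv/sftp/uploads
    BUCKET=s3://my-archive-bucket

    inotifywait -m -r -e close_write --format '%w%f' "$UPLOAD_DIR" |
    while read -r file; do
        # "aws s3 mv" copies the object to S3 and deletes the local file on success.
        aws s3 mv "$file" "$BUCKET/${file#$UPLOAD_DIR/}"
    done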
Is this the best way? Is there any other way of getting a server that essentially has "infinite and magically growing disk space"?
Thanks for your help! Daniel
I answered this same question on Stack Overflow.
s3fs is indeed a reasonable solution, and in my case, I've coupled it with proftpd with excellent results, in spite of the theoretical/potential problems.
At the time I wrote the answer, I had only set this up for one of my consulting clients... but since then, I've also started drinking my own kool-aid and am using it in production at my day job. Companies we exchange data with upload and download files all day long on my sftp server, which is storing everything directly on S3.

As a bonus, my report exporting system -- which writes Excel spreadsheets directly to S3 -- can export reports "to the FTP server" by simply putting them directly into the ftp server's bucket, with appropriate metadata to show the uid, gid, and mode of each file. (s3fs uses x-amz-meta-uid, -gid, and -mode headers to emulate filesystem permissions.) When the client logs on to the server, the report files are just... there.
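My export system talks to the API directly, but roughly the same upload with the AWS CLI would look like this (bucket, path, and uid/gid values are placeholders; s3fs reads the mode value as the decimal st_mode, so 33188 means a regular file with 0644 permissions):

    # Put a report straight into the SFTP server's bucket with ownership/mode
    # metadata that s3fs will honor when the client lists the directory.
    aws s3 cp report.xlsx s3://my-sftp-bucket/reports/report.xlsx \
        --metadata uid=1001,gid=1001,mode=33188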
I do think the ideal solution would probably be an sftp to S3 gateway service, but I still haven't gotten around to designing one, since this solution works really well... with some caveats, of course:
Not all of the default values for s3fs are sane. You will probably want to specify these options:
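For example, a mount command along these lines (the option values are illustrative placeholders, not a definitive list -- tune them for your setup):

    # Example s3fs mount; bucket name, mount point, and cache path are placeholders.
    # allow_other  lets non-root users (e.g. the sftp account) see the mounted files
    # use_cache    keeps a local disk cache of objects being transferred
    # passwd_file  points at the access-key-id:secret-access-key file
    s3fs your-bucket /srv/s3fs \
        -o allow_other \
        -o use_cache=/tmp/s3fs-cache \
        -o passwd_file=/etc/passwd-s3fs \
        -o default_acl=private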
It's probably best to use a region other than US-Standard, because that's the only region that doesn't offer read-after-write consistency on new objects. (Or, if you need to use US-Standard, you can use the almost undocumented hostname your-bucket.s3-external-1.amazonaws.com from the us-east-1 region to prevent your requests from being geo-routed, which may improve consistency.)

I have object versioning enabled on the bucket, which s3fs is completely unaware of. The benefit of this is that even if a file should get "stomped," I can always go to bucket versioning to recover the "overwritten" file. Object versioning in S3 was brilliantly designed in such a way that S3 clients that are unaware of versioning are in no way disabled or confused, because if you don't make versioning-aware REST calls, the responses S3 returns are compatible with clients that have no concept of versioning.
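As a sketch of both points (the endpoint override only matters for US-Standard buckets, and the bucket name is a placeholder):

    # Mount through the non-geo-routed us-east-1 endpoint.
    s3fs your-bucket /srv/s3fs -o url=https://s3-external-1.amazonaws.com -o allow_other

    # Enable object versioning; s3fs neither knows nor cares, but "stomped" files
    # can then be recovered from their prior versions.
    aws s3api put-bucket-versioning --bucket your-bucket \
        --versioning-configuration Status=Enabled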
Note also that transferring data into S3 is free of data transfer charges. You pay only the per-request pricing. Transferring data out of S3 into EC2 within a region is also free of data transfer charges. It's only when you transfer out of S3 to the Internet, to CloudFront, or to another AWS region that you pay transfer charges. If you want to use the lower-priced reduced-redundancy storage, s3fs supports that with -o use_rrs.

As an amusing aside, you'll always get a warm fuzzy feeling when you see the 256 terabytes of free space (and 0 used, since a real calculation of sizes is impractical because S3 is an object store, not a filesystem).
Of course, you can mount the bucket anywhere. I just happen to have it in /srv/s3fs.
Check out the SFTP Gateway on the AWS Marketplace.
We experienced reliability issues with s3fs, so we developed a custom solution specifically for this purpose. We've been using it in production for several years without issue and have recently released it to the AWS Marketplace.
There are two options. You can use a native managed SFTP service recently added by Amazon (which is easier to set up). Or you can mount the bucket to a file system on a Linux server and access the files over SFTP like any other files on the server (which gives you greater control).
Managed SFTP Service
In your Amazon AWS Console, go to AWS Transfer for SFTP and create a new server.
On the SFTP server page, add a new SFTP user (or users).
Permissions of users are governed by an associated AWS role in the IAM service (for a quick start, you can use the AmazonS3FullAccess policy).
The role must have a trust relationship to transfer.amazonaws.com. For details, see my guide Setting up an SFTP access to Amazon S3.
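Roughly, creating such a role with the AWS CLI might look like this (the role name is a placeholder, and AmazonS3FullAccess is only for a quick start -- scope it down for production):

    # Trust policy that lets the Transfer service assume the role.
    cat > transfer-trust.json <<'EOF'
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": { "Service": "transfer.amazonaws.com" },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    EOF

    aws iam create-role --role-name sftp-s3-access \
        --assume-role-policy-document file://transfer-trust.json

    aws iam attach-role-policy --role-name sftp-s3-access \
        --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess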
Mounting Bucket to Linux Server
As @Michael already answered, just mount the bucket using the s3fs file system (or similar) on a Linux server (Amazon EC2) and use the server's built-in SFTP server to access the bucket.

Here are basic instructions:
Install s3fs.
Add your security credentials in the form access-key-id:secret-access-key to /etc/passwd-s3fs.
Add a bucket mounting entry to fstab (see the sketch below).
For details, see my guide Setting up an SFTP access to Amazon S3.
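Roughly, those steps might look like this (the keys, bucket name, and mount point are placeholders):

    # Credentials file read by s3fs; must not be world-readable.
    echo 'ACCESS-KEY-ID:SECRET-ACCESS-KEY' > /etc/passwd-s3fs
    chmod 600 /etc/passwd-s3fs

    # fstab entry so the bucket is mounted at boot.
    echo 'your-bucket /mnt/your-bucket fuse.s3fs _netdev,allow_other,passwd_file=/etc/passwd-s3fs 0 0' >> /etc/fstab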
Use S3 Client
Or use any free "FTP/SFTP client" that is also an "S3 client", and you do not have to set up anything on the server side. For example, my WinSCP, or Cyberduck.
AWS now provides an SFTP over S3 service called AWS Transfer for SFTP. It has the benefits of S3 (highly durable, available, distributed storage) combined with the well-known and established SFTP protocol.
By default, users authenticate using private/public key pairs, and using IAM policies you can set up permissions for SFTP users on S3 buckets. You can add custom authentication schemes by implementing your own identity provider with AWS API Gateway and AWS Lambda.
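As a rough example with the AWS CLI (the server id, role ARN, key, and bucket path are placeholders):

    # Create a Transfer for SFTP server with service-managed users.
    aws transfer create-server --identity-provider-type SERVICE_MANAGED

    # Add a user whose S3 access is governed by an IAM role, authenticating
    # with an SSH public key.
    aws transfer create-user \
        --server-id s-1234567890abcdef0 \
        --user-name partner-upload \
        --role arn:aws:iam::123456789012:role/sftp-s3-access \
        --home-directory /your-bucket/uploads \
        --ssh-public-key-body "ssh-rsa AAAA... partner@example.com"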
We've wrapped AWS Transfer for SFTP in a Heroku add-on called SFTP To Go, to provide both flexible authentication schemes and a lower TCO (a service endpoint has a fixed cost on AWS, but can be shared by many users without any security or performance compromise).