The web application I'm working on will be used to upload and download a large number of relatively small files: I'm looking at close to 1 billion files with a total size of more than 10 PB. I'm currently struggling to decide on a scalable architecture that would support such volumes. So here's my question: is there a way to build some sort of storage that a Windows server would see as one huge (10 PB and up) network drive, so that I can write all the files to subfolders of that virtual drive? And how would it perform?
Right now I'm trying to understand whether that's even possible, or whether I have to implement software-level sharding, i.e. writing files to different drives based on some key.
I'm a developer, not a sysadmin, so I apologize if this is a naive question, and thanks in advance for your patience in explaining possibly trivial things to me.
Andrey
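For reference, the software-level sharding mentioned in the question could look roughly like the sketch below. It is only an illustration: the mount points in `SHARD_ROOTS` and the `shard_path` helper are hypothetical names, and a simple hash-modulo scheme is assumed rather than anything a particular product does.

```python
import hashlib
from pathlib import PurePosixPath

# Hypothetical mount points; in practice each would be a separate volume or server.
SHARD_ROOTS = ["/mnt/store00", "/mnt/store01", "/mnt/store02", "/mnt/store03"]


def shard_path(file_key: str, roots=SHARD_ROOTS) -> PurePosixPath:
    """Map a file key to a deterministic path on one of the shard roots."""
    digest = hashlib.sha256(file_key.encode("utf-8")).hexdigest()
    # The first 8 hex digits of the hash select the shard.
    root = roots[int(digest[:8], 16) % len(roots)]
    # Two levels of fan-out directories keep any single folder from growing huge.
    return PurePosixPath(root) / digest[8:10] / digest[10:12] / file_key


if __name__ == "__main__":
    print(shard_path("user42/photo.jpg"))
```

Note that with plain modulo hashing, changing the number of shards remaps most keys, so a real deployment would likely need a lookup table or consistent hashing instead.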
as a 'normal but huge' fileserver:
with a file-like application level library:
generic key-value:
Check out how Backblaze stores its data. It's a very good read, and they have a blog about the new 3 TB drives. This probably won't answer the question about the file system, since I'm not sure how Backblaze structures its files, but it's good information nevertheless.
Before you continue looking, you need to decide more precisely what kind of semantics you need. For instance, you say they're files: do you need POSIX file semantics (mostly concerned with consistency and locking) on the storage, or is the "eventual consistency" of various distributed datastores enough? What are your I/O requirements, and how much concurrent access do you expect? What are your redundancy requirements? Also, what kind of hardware are you going to use? 10 PB arrays don't grow on trees, and just managing them is a full-time job: that much hardware means failure is a normal event, so constant repair and replacement are needed.
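To put rough numbers on "failure is a normal event", here is a back-of-the-envelope sketch. The file count and total size come from the question and the 3 TB drive size from the Backblaze answer; the replication factor and the annual drive failure rate are assumptions chosen only for illustration.

```python
PB = 10**15
TB = 10**12


def estimate(total_bytes, file_count, drive_bytes, replicas, annual_failure_rate):
    """Return (average file size in bytes, drives needed, expected drive failures per year)."""
    avg_file_bytes = total_bytes // file_count
    # Ceiling division: a partially filled drive still has to be bought.
    drives = -(-(total_bytes * replicas) // drive_bytes)
    failures_per_year = drives * annual_failure_rate
    return avg_file_bytes, drives, failures_per_year


if __name__ == "__main__":
    avg, drives, failures = estimate(
        total_bytes=10 * PB,       # from the question
        file_count=1_000_000_000,  # from the question
        drive_bytes=3 * TB,        # drive size mentioned in the Backblaze answer
        replicas=3,                # assumption: three copies of every file
        annual_failure_rate=0.05,  # assumption: 5% of drives fail per year
    )
    print(f"average file size: {avg / 10**6:.0f} MB")
    print(f"drives needed: {drives}")
    print(f"expected failures: {failures:.0f} per year, {failures / 52:.1f} per week")
```

Under these assumptions the average file is about 10 MB, the system needs on the order of 10,000 drives, and roughly 500 of them fail per year, i.e. several every week.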
From what you've said ("web application... storing files..."), I think an OpenStack- or S3-style solution should work for you. Since you're mostly a developer, I'd suggest using Amazon, Rackspace, or another provider unless you really want to get into the hardware management business.
These days you might consider HDFS and the general Hadoop ecosystem.