At my organization we have a processing and storage system spread across two dozen linux machines that handles over a petabyte of data. The system right now is very ad-hoc; processing automation and data management is handled by a collection of large perl programs on independent machines. I am looking at distributed processing and storage systems to make it easier to maintain, evenly distribute load and data with replication, and grow in disk space and compute power.
The system needs to be able to handle millions of files, varying in size between 50 megabytes to 50 gigabytes. Once created, the files will not be appended to, only replaced completely if need be. The files need to be accessible via HTTP for customer download.
Right now, processing is automated by perl scripts (that I have complete control over) which call a series of other programs (that I don't have control over because they are closed source) that essentially transforms one data set into another. No data mining happening here.
Here is a quick list of things I am looking for:
Reliability: These data must be accessible over HTTP about 99% of the time so I need something that does data replication across the cluster.
Scalability: I want to be able to add more processing power and storage easily and rebalance the data on across the cluster.
Distributed processing: Easy and automatic job scheduling and load balancing that fits with processing workflow I briefly described above.
Data location awareness: Not strictly required but desirable. Since data and processing will be on the same set of nodes I would like the job scheduler to schedule jobs on or close to the node that the data is actually on to cut down on network traffic.
Here is what I've looked at so far:
Storage Management:
GlusterFS: Looks really nice and easy to use but doesn't seem to have a way to figure out what node(s) a file actually resides on to supply as a hint to the job scheduler.
GPFS: Seems like the gold standard of clustered filesystems. Meets most of my requirements except, like glusterfs, data location awareness.
Ceph: Seems way to immature right now.
Distributed processing:
- Sun Grid Engine: I have a lot of experience with this and it's relatively easy to use (once it is configured properly that is). But Oracle got its icy grip around it and it no longer seems very desirable.
Both:
Hadoop/HDFS: At first glance it looked like hadoop was perfect for my situation. Distributed storage and job scheduling and it was the only thing I found that would give me the data location awareness that I wanted. But I don't like the namename being a single point of failure. Also, I'm not really sure if the MapReduce paradigm fits the type of processing workflow that I have. It seems like you need to write all your software specifically for MapReduce instead of just using Hadoop as a generic job scheduler.
OpenStack: I've done some reading on this but I'm having trouble deciding if it fits well with my problem or not.
Does anyone have opinions or recommendations for technologies that would fit my problem well? Any suggestions or advise would be greatly appreciated.
Thanks!
Sounds like you're most of the way to what you need. The technologies out there (GlusterFS, GPFS) have the features you're looking for but not the data-locality one. Depending on what you're doing with the data, this could be built into your job dispatcher.
To me it sounds like you need to build in an indexing stage to your processing pipeline that determines data-locality. This can be parallelized and then re-serialized in a database, though this step is likely custom code (you know your data better than I). Once you have data-locality, packaging processing work-units for worker nodes should be fairly straight-forward; build work-units for node-local data first, and then node-adjacent (if that concept applies in your case), and finally a global context used when most processing is done but a few work-units seem to be taking an age and their node-local processor is busy chewing on them.
That's a high-level view. Focusing closer to the bolts of the problem. Judging on what you've said so far, it sounds like you're working on larger chunks of data, and you want processing to be done on local storage for bandwidth reasons. I see some options:
First idea, when data is ingested from your source it gets copied into the Gluster/GPFS/whatever distributed filesystem. Then, run the indexing process I described above. Then as workers process data, processed datasets get reported back to another group of servers whose role is to serve up processed data over HTTP. The report-back method can even be done via HTTP PUTs, which then drop the data onto another replicated file-system. The disadvantage of this method is that it does store your data twice (original and modified), but I don't know if that's something you're already doing. This allows you to scale out your processing infrastructure quite far, while keeping your client-serving infrastructure sanely small.
Second idea, as above, but when workers get done processing a work-unit the saved data gets stored to the Gluster/GPFS/whatever filesystem. The HTTP servers then serve data directly from those repositories, but they're not as concerned with node-local as the processing nodes are. For this it's probably a good idea to have separate client-serving and processing networks to limit the double-transit problem with those large data-sets.
Third idea, if figuring out data-locality of GPFS/Gluster isn't really doable (I haven't used them, so I'm not sure) you may want to look into building your own kind of storage. It's a lot of work, bit if you really need locality, it may be worth it for you. As you ingest data, each dataset gets indexed in a database and HTTP PUTed to multiple nodes as needed. When processing happens, jobs are created for individual nodes for data that is node-local to themselves first. When a worker receives a job, it HTTP GETs the data from the node specified by the database (which should be itself, but doesn't have to be). When work is completed, it notifies the database and receives instructions about where to PUT the results.
For serving processed datasets to clients, you'll probably have to introduce some application code to translate files to fetch to proxied HTTP GETs from your nodes.
This does introduce a high-bandwidth part of the process in the form of that database. It can have multiple load-balanced web-servers in front of it for processing logic, but the database itself does end up being a single-point-of-failure (though people more knowledgeable in the ways of databases may know ways around that). The database is in essence acting as the file allocation table for a large HTTP-based file-system. Since your processing appears to need very simple filesystem semantics (fetch/put, possibly lock/unlock for a node that's processing a dataset) that can be mediated by such a database. Obviously this DB will be very large, so some of the NoSQL technologies may be a better fit for performance reasons.
I know this isn't the specific technologies you're looking for, it's more about techniques for working around deficiencies in the market. Data-locality and replication is something of an edge case, it seems. As it happens, we do something similar to you just with smaller datasets, so it's a topic that's on my mind as well.