Disclaimer: Yes, I'm asking you to design a system for me :)
I've been tasked with designing a system to store about 10 TB / day with a retention time of 180 days.
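Back-of-the-envelope numbers (decimal units, no replication or RAID overhead counted yet):

    echo '10 * 180' | bc                              # 1800 TB = 1.8 PB raw over the retention window
    echo 'scale=1; 10 * 10^12 / 86400 / 10^6' | bc    # ~115.7 MB/s average ingest, around the clock

Peaks will obviously sit above that average, and any replication multiplies the capacity figure.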
My first approach would be to go with GlusterFS and use a HW setup like this:
Single Node in the System:
- 1 HP ProLiant DL180 G6 with HP Smart Array P812 Controller
- 8 HP D2600 w/12 2 TB 6G SAS 7.2K LFF Dual Port MDL HDD 24 TB Bundle
- 106 disks for storage plus 2 OS disks (10 data disks in the server itself, 96 distributed over 8 shelves)
I'd need 9 nodes to get enough net storage (without replication or RAID on the local disks) to hold the data.
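For reference, a minimal sketch of how such a volume could be created (host names, brick paths and the volume name are made up; a purely distributed volume with no replication, matching the "net storage" figure above):

    # run on node1 once the other nodes are installed
    gluster peer probe node2          # repeat for node3 .. node9
    gluster volume create logvol transport tcp \
        node1:/bricks/b1 node2:/bricks/b1 node3:/bricks/b1 \
        node4:/bricks/b1 node5:/bricks/b1 node6:/bricks/b1 \
        node7:/bricks/b1 node8:/bricks/b1 node9:/bricks/b1
    gluster volume start logvol

Without a "replica N" in the create command, a dead brick takes its part of the namespace with it, which is something to settle before worrying about the exact hardware.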
Pros:
- I can start with a single server without shelves
- Grow by adding shelves to a single server, or by adding servers (just put some thought into whether to scale by adding nodes first, adding shelves first, or some mix of both)
- scales "infinitely" (for certain definitions of "infinite")
Cons:
- In general: I actually have no idea how to properly verify whether this will be a viable setup once I reach the final stage of expansion (1.8 PB estimated); the only rough check I have is sketched below
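The ~116 MB/s aggregate ingest spread over 9 nodes is only about 13 MB/s per node on average, so as a first, very rough check (sequential write throughput only, nothing about metadata load, rebalancing or failure handling) something like this per brick:

    # crude sequential write test on one brick filesystem (path and file name made up)
    dd if=/dev/zero of=/bricks/b1/ddtest bs=1M count=10240 oflag=direct
    rm /bricks/b1/ddtest

It says nothing about how ls and recursive grep behave against millions of files on a distributed volume, so that part is worth testing with realistic file counts before committing to the final expansion.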
I don't have a strong preference for any particular direction, just some prior experience with GlusterFS: I already run a 4 TB system (distributed, replicated, 4 nodes) on it.
I'm pretty sure there isn't much of a difference whether this setup runs Hadoop/Gluster/NetApp/EMC/Hitachi/EveryoneElse, but the use case is (drumroll):
ls -ltr | grep 'something' | xargs grep somethingelse
Yes, that is scary. I tried to convince people to actually run real analytical jobs over that data, but it seems that won't happen. (OK, it's not quite that bad, but those people will use a simple SSH session on some "analysis" system to manually go to some directory, recursively look through some files and then determine whether the data is OK or not, which sounds even worse now that I've written it down.)
I'm open to any ideas. I do have people who run "big storage" within our company (one backup system holds 2 PB, for example), and I'd love to go with whatever they already have working. But I also have to prove that they are doing the right thing (please don't ask about this, it's a political thing; I'd trust my data to the storage team, and I have no idea why I have to duplicate their work).
Thinking about how to actually run analysis on the data is explicitly out of scope.
There have been countless meetings, and I brought up everything from Splunk to analysis jobs developed in-house (with and/or without a Map/Reduce system). There's no interest in that. All people care about is:
- 10 TB / day
- Keep the data for 180 days
- Make it highly available (not yet fully defined, but something along the lines of 99.9%, 99.99%, ...)
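For what it's worth, translated into allowed downtime (assuming availability is measured over a full year):

    awk 'BEGIN { h = 365.25 * 24;
                 printf "99.9%%  -> %.1f hours/year\n",   h * 0.001;
                 printf "99.99%% -> %.1f minutes/year\n", h * 60 * 0.0001 }'
    # 99.9%  -> 8.8 hours/year
    # 99.99% -> 52.6 minutes/year

The gap between those two is roughly the gap between "fix it next business day" and "needs replication across nodes", so it is worth pinning down before sizing anything.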
Well, you didn't mention a budget... so buy this now. Data at that scale should probably be left in the hands of a team with experience in that realm; it's nice having support and someone to yell at :)
http://www.racktopsystems.com/products/brickstor-superscalar/
http://www.racktopsystems.com/products/brickstor-superscalar/tech-specs/
Since the application you describe really doesn't seem to be in the realm of clustered storage (given the use case), use ZFS. You'll get effectively infinite scalability, you'll get a chance to offload some of the compression to the storage system, and you can tell all of your friends about it :)
More than that, the L2ARC caching (using SSDs) will keep the hot data available for analysis at SSD speed.
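A minimal sketch of what that could look like (device names are made up; raidz2 vdevs roughly per shelf, a couple of SSDs as L2ARC, compression on for text-heavy data; use compression=on instead of lz4 if your ZFS release doesn't support it yet):

    zpool create tank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
        raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
        cache  c3t0d0 c3t1d0              # SSDs for L2ARC
    zfs set compression=lz4 tank
    zfs create tank/logs

Log data usually compresses well at the pool level, though I wouldn't promise a ratio without seeing yours.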
Edit: Another ZFS-based solution - http://www.aberdeeninc.com/abcatg/petarack.htm
Also, Red Hat is now in the scale-out storage industry.
See: http://www.redhat.com/products/storage/storage-software/
As MDMarra mentions, you need Splunk for this. I'm a big user and fan, for volumes very similar to what you describe, and right away it'll save you from having to buy anywhere near that much storage and will reduce all the complexity. One decent-sized server (maybe 150-200 TB max) will do the job if used with Splunk; its on-the-fly indexing is perfect for this kind of thing, and its search capabilities far outstrip anything you'll manage yourself. It's not free, of course, but I'd not consider anything else.