Disclaimer: Yes, I'm asking you to design a system for me :)
I've been tasked with designing a system to store about 10 TB / day with a retention time of 180 days.
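Back-of-the-envelope numbers (decimal units, no replication or RAID overhead counted yet):

    echo '10 * 180' | bc                              # 1800 TB = 1.8 PB raw over the retention window
    echo 'scale=1; 10 * 10^12 / 86400 / 10^6' | bc    # ~115.7 MB/s average ingest, around the clock

Peaks will obviously sit above that average, and any replication multiplies the capacity figure.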
My first approach would be to go with GlusterFS and use a HW setup like this:
Single Node in the System:
- 1 HP ProLiant DL180 G6 with HP Smart Array P812 Controller
- 8 HP D2600 w/12 2 TB 6G SAS 7.2K LFF Dual Port MDL HDD 24 TB Bundle
- 106 disks for storage plus 2 OS disks (10 data disks in the server itself, 96 distributed over 8 shelves)
I'd need 9 nodes to get enough net storage (without replication or RAID on the local disks) to hold the data.
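For reference, a minimal sketch of how such a volume could be created (host names, brick paths and the volume name are made up; a purely distributed volume with no replication, matching the "net storage" figure above):

    # run on node1 once the other nodes are installed
    gluster peer probe node2          # repeat for node3 .. node9
    gluster volume create logvol transport tcp \
        node1:/bricks/b1 node2:/bricks/b1 node3:/bricks/b1 \
        node4:/bricks/b1 node5:/bricks/b1 node6:/bricks/b1 \
        node7:/bricks/b1 node8:/bricks/b1 node9:/bricks/b1
    gluster volume start logvol

Without a "replica N" in the create command, a dead brick takes its part of the namespace with it, which is something to settle before worrying about the exact hardware.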
Pros:
- I can start with a single server without shelves
- Grow by adding shelves to a single server, or by adding servers (just put some thought into whether to scale by adding nodes first, adding shelves first, or some mix of both)
- scales "infinitely" (for certain definitions of "infinite")
Cons:
- In general: I actually have no idea how to properly verify whether this will be a viable setup once I reach the final stage of expansion (1.8 PB estimated); the only rough check I have is sketched below
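The ~116 MB/s aggregate ingest spread over 9 nodes is only about 13 MB/s per node on average, so as a first, very rough check (sequential write throughput only, nothing about metadata load, rebalancing or failure handling) something like this per brick:

    # crude sequential write test on one brick filesystem (path and file name made up)
    dd if=/dev/zero of=/bricks/b1/ddtest bs=1M count=10240 oflag=direct
    rm /bricks/b1/ddtest

It says nothing about how ls and recursive grep behave against millions of files on a distributed volume, so that part is worth testing with realistic file counts before committing to the final expansion.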
I don't have a strong preference for any particular direction, just some prior experience with GlusterFS: I already run a 4 TB system (distributed, replicated, 4 nodes) on it.
I'm pretty sure there isn't much of a difference whether this setup runs Hadoop/Gluster/NetApp/EMC/Hitachi/EveryoneElse, but the use case is (drumroll):
ls -ltr | grep 'something' | xargs grep somethingelse
Yes, that is scary. I tried to convince people to actually run real analytical jobs over that data, but it seems that won't happen. (OK, it's not quite that bad, but those people will use a simple SSH session on some "analysis" system to manually go to some directory, recursively look through some files and then determine whether the data is OK or not, which sounds even worse now that I've written it down.)
I'm open to any ideas. I do have people who run "big storage" within our company (one backup system holds 2 PB, for example), and I'd love to go with whatever they already have working. But I also have to prove that they are doing the right thing (please don't ask about this, it's a political thing; I'd trust my data to the storage team, and I have no idea why I have to duplicate their work).
Thinking about how to actually run analysis on the data is explicitly out of scope.
There have been countless meetings, and I brought up everything from Splunk to analysis jobs developed in-house (with and/or without a Map/Reduce system). There's no interest in that. All people care about is:
- 10 TB / day
- Keep the data for 180 days
- Make it highly available (not yet fully defined, but something along the lines of 99.9%, 99.99%, ...)
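For what it's worth, translated into allowed downtime (assuming availability is measured over a full year):

    awk 'BEGIN { h = 365.25 * 24;
                 printf "99.9%%  -> %.1f hours/year\n",   h * 0.001;
                 printf "99.99%% -> %.1f minutes/year\n", h * 60 * 0.0001 }'
    # 99.9%  -> 8.8 hours/year
    # 99.99% -> 52.6 minutes/year

The gap between those two is roughly the gap between "fix it next business day" and "needs replication across nodes", so it is worth pinning down before sizing anything.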
Well, you didn't mention a budget... so buy this now. Data at that scale should probably be left in the hands of a team with experience in that realm; it's nice having support and someone to yell at :)
http://www.racktopsystems.com/products/brickstor-superscalar/
http://www.racktopsystems.com/products/brickstor-superscalar/tech-specs/
Since the application you describe really doesn't seem to be in the realm of clustered storage (given the use case), use ZFS. You'll get effectively infinite scalability, you'll get a chance to offload some of the compression to the storage system, and you can tell all of your friends about it :)
More than that, the L2ARC caching (using SSDs) will keep the hot data available for analysis at SSD speed.
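A minimal sketch of what that could look like (device names are made up; raidz2 vdevs roughly per shelf, a couple of SSDs as L2ARC, compression on for text-heavy data; use compression=on instead of lz4 if your ZFS release doesn't support it yet):

    zpool create tank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
        raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
        cache  c3t0d0 c3t1d0              # SSDs for L2ARC
    zfs set compression=lz4 tank
    zfs create tank/logs

Log data usually compresses well at the pool level, though I wouldn't promise a ratio without seeing yours.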
Edit: Another ZFS-based solution - http://www.aberdeeninc.com/abcatg/petarack.htm
Also, Red Hat is now in the scale-out storage industry.
See: http://www.redhat.com/products/storage/storage-software/
As MDMarra mentions, you need Splunk for this. I'm a big user and fan, for volumes very similar to what you describe, and right away it'll save you from having to buy anywhere near that much storage and will reduce all the complexity. One decent-sized server (maybe 150-200 TB max) will do the job if used with Splunk; its on-the-fly indexing is perfect for this kind of thing, and its search capabilities far outstrip anything you'll manage yourself. It's not free, of course, but I'd not consider anything else.