A bunch of new files with unique filenames regularly "appears"¹ on one server. (On the order of hundreds of GB of new data daily; the solution should be scalable to terabytes. Each file is several megabytes in size, up to several tens of megabytes.)
There are several machines that process those files. (Tens now; the solution should be scalable to hundreds.) It should be possible to easily add and remove machines.
There are backup storage servers to which each incoming file must be copied for archival storage. The data must not be lost: every incoming file must end up delivered to the backup storage server.
Each incoming file must be delivered to a single machine for processing, and should be copied to the backup storage server.
The receiving server does not need to store files after it has sent them on their way.
Please advise a robust solution for distributing the files in the manner described above. The solution must not be based on Java. Unix-way solutions are preferable.
Servers are Ubuntu-based and located in the same data center. Everything else can be adapted to the solution's requirements.
¹ Note that I'm intentionally omitting information about the way files are transported to the filesystem, because the files are currently sent by third parties via several different legacy means (strangely enough, via scp and via ØMQ). It seems easier to cut the cross-cluster interface at the filesystem level, but if a particular solution requires a specific transport, the legacy transports can be upgraded to it.
Here is one solution to what you're looking for. No Java is involved in the making of this system, just readily available open-source pieces. The model presented here can also work with technologies other than the ones I'm using as examples.
This setup should be able to ingest files at very high rates given enough servers. Reaching 10GbE aggregate ingestion speeds should be doable if you size it up enough. Of course, processing that much data that fast will require even more servers in your Processing machine class. This setup should scale to a thousand nodes, and probably beyond (though how far depends on what, exactly, you're doing with all of this).
The deep engineering challenges will be in the workflow management hidden inside the AMQP process. That's all software, and probably custom-built to your system's demands. But it should be well fed with data!
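As a rough sketch of what that glue could look like (assuming RabbitMQ as the broker and the amqp-tools command-line utilities; the broker host, credentials, exchange name, queue name, and handler script below are illustrative placeholders, not part of the design above), each processing node could announce every file it has safely archived, and the workflow manager could consume those announcements:

```bash
# On a processing node: announce that a file has been archived.
# Assumes the amqp-tools package (amqp-publish/amqp-consume) and a
# RabbitMQ broker; all names and credentials here are placeholders.
FILE="$1"
amqp-publish \
    --url="amqp://worker:secret@broker.example.local/%2f" \
    -e incoming-files \
    -r archived \
    -b "$(hostname -s) ${FILE}"

# On the workflow-management host: run a handler once per message,
# with the message body delivered on the handler's stdin.
amqp-consume \
    --url="amqp://workflow:secret@broker.example.local/%2f" \
    -q archived-files \
    ./handle-archived-file.sh
```

The useful property is that the queue, not the publisher, decides which consumer picks up each piece of work, so workflow consumers can be added or removed without reconfiguring the processing nodes.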
Given that you've clarified that files will arrive via scp, I don't see any reason for the front-end server to exist at all, as the transport mechanism is something that can be redirected at layer 3.
I'd put an LVS director (pair) in front, with a processing server pool behind it and a round-robin redirection policy. That makes it very easy to add and remove servers in the pool, increases reliability because there's no front-end server to fall over, and means we don't have to address the pull/push question of getting the files from the front-end to the processing servers, because there is no front-end.
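For concreteness, here is a minimal sketch of the director side using plain ipvsadm with NAT forwarding; the virtual IP and pool addresses are placeholders, and in practice the director pair would usually be driven by something like keepalived, which keeps the virtual-server table in sync and handles failover between the two directors.

```bash
# On the LVS director: round-robin virtual service for the scp traffic,
# with the processing pool as real servers. Addresses are placeholders.
VIP=192.0.2.10                                # address third parties scp to

ipvsadm -A -t ${VIP}:22 -s rr                 # virtual service, round-robin
ipvsadm -a -t ${VIP}:22 -r 10.0.0.11:22 -m    # pool member 1 (NAT)
ipvsadm -a -t ${VIP}:22 -r 10.0.0.12:22 -m    # pool member 2

# Growing or shrinking the pool is a one-liner per server:
ipvsadm -a -t ${VIP}:22 -r 10.0.0.13:22 -m    # add a server
ipvsadm -d -t ${VIP}:22 -r 10.0.0.12:22       # remove a server
```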
Each pool server should then do two things when it receives a file: first, copy it to archival storage, then process the file and send it on its way.
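A minimal per-file sketch of that ordering, triggered for example by an inotify watcher (incron or inotifywait) on the directory scp writes into; the archive host, paths, and the process-file command are hypothetical stand-ins for your own pipeline:

```bash
#!/bin/bash
# Runs on a pool server for each newly received file.
# Order matters: archive first, process second, and delete the local
# copy only after both steps have succeeded.
set -euo pipefail

FILE="$1"
ARCHIVE_HOST="backup01.example.local"          # placeholder archive server
ARCHIVE_DIR="/srv/archive/$(date +%Y/%m/%d)"   # placeholder layout

# 1. Copy to archival storage; a failure aborts the script so the file
#    stays put locally and can be retried.
ssh "${ARCHIVE_HOST}" "mkdir -p '${ARCHIVE_DIR}'"
rsync -a --checksum "${FILE}" "${ARCHIVE_HOST}:${ARCHIVE_DIR}/"

# 2. Process the file and send it on its way.
#    process-file stands in for whatever your pipeline actually does.
process-file "${FILE}"

# 3. The receiving server doesn't need to keep the file afterwards.
rm -f -- "${FILE}"
```

Because the archive copy happens before processing or deletion, a crash at any point leaves the file either still on the pool server or already on the backup server, which covers the requirement that no incoming file be lost.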