A bunch of new files with unique filenames regularly "appears"¹ on one server. (On the order of hundreds of GB of new data daily; the solution should be scalable to terabytes. Each file is several megabytes in size, up to several tens of megabytes.)
There are several machines that process those files. (Tens now; the solution should be scalable to hundreds.) It should be possible to easily add and remove machines.
There are backup storage servers to which each incoming file must be copied for archival storage. The data must not be lost: every incoming file must end up delivered to the backup storage server.
Each incoming file must be delivered to a single machine for processing, and should be copied to the backup storage server.
The receiving server does not need to store files after it has sent them on their way.
Please advise a robust solution for distributing the files in the manner described above. The solution must not be based on Java. Unix-way solutions are preferable.
Servers are Ubuntu-based and located in the same data center. Everything else can be adapted to the solution's requirements.
¹ Note that I'm intentionally omitting information about the way files are transported to the filesystem, because the files are currently sent by third parties via several different legacy means (strangely enough, via scp and via ØMQ). It seems easier to cut the cross-cluster interface at the filesystem level, but if a particular solution requires a specific transport, the legacy transports can be upgraded to it.
Here is one solution to what you're looking for. No Java is involved in the making of this system, just readily available open-source pieces. The model presented here can also work with technologies other than the ones I'm using as examples.
This setup should be able to ingest files at very high rates given enough servers. Reaching 10GbE aggregate ingestion speeds should be doable if you size it up enough. Of course, processing that much data that fast will require even more servers in your Processing machine class. This setup should scale to a thousand nodes, and probably beyond (though how far depends on what, exactly, you're doing with all of this).
The deep engineering challenges will be in the workflow management hidden inside the AMQP process. That's all software, and probably custom-built to your system's demands. But it should be well fed with data!
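As a rough sketch of what that glue could look like (assuming RabbitMQ as the broker and the amqp-tools command-line utilities; the broker host, credentials, exchange name, queue name, and handler script below are illustrative placeholders, not part of the design above), each processing node could announce every file it has safely archived, and the workflow manager could consume those announcements:

```bash
# On a processing node: announce that a file has been archived.
# Assumes the amqp-tools package (amqp-publish/amqp-consume) and a
# RabbitMQ broker; all names and credentials here are placeholders.
FILE="$1"
amqp-publish \
    --url="amqp://worker:secret@broker.example.local/%2f" \
    -e incoming-files \
    -r archived \
    -b "$(hostname -s) ${FILE}"

# On the workflow-management host: run a handler once per message,
# with the message body delivered on the handler's stdin.
amqp-consume \
    --url="amqp://workflow:secret@broker.example.local/%2f" \
    -q archived-files \
    ./handle-archived-file.sh
```

The useful property is that the queue, not the publisher, decides which consumer picks up each piece of work, so workflow consumers can be added or removed without reconfiguring the processing nodes.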
Given that you've clarified that files will arrive via scp, I don't see any reason for the front-end server to exist at all, as the transport mechanism is something that can be redirected at layer 3.
I'd put an LVS director (pair) in front, with a processing server pool behind it and a round-robin redirection policy. That makes it very easy to add and remove servers in the pool, increases reliability because there's no front-end server to fall over, and means we don't have to address the pull/push question of getting the files from the front-end to the processing servers, because there is no front-end.
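For concreteness, here is a minimal sketch of the director side using plain ipvsadm with NAT forwarding; the virtual IP and pool addresses are placeholders, and in practice the director pair would usually be driven by something like keepalived, which keeps the virtual-server table in sync and handles failover between the two directors.

```bash
# On the LVS director: round-robin virtual service for the scp traffic,
# with the processing pool as real servers. Addresses are placeholders.
VIP=192.0.2.10                                # address third parties scp to

ipvsadm -A -t ${VIP}:22 -s rr                 # virtual service, round-robin
ipvsadm -a -t ${VIP}:22 -r 10.0.0.11:22 -m    # pool member 1 (NAT)
ipvsadm -a -t ${VIP}:22 -r 10.0.0.12:22 -m    # pool member 2

# Growing or shrinking the pool is a one-liner per server:
ipvsadm -a -t ${VIP}:22 -r 10.0.0.13:22 -m    # add a server
ipvsadm -d -t ${VIP}:22 -r 10.0.0.12:22       # remove a server
```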
Each pool server should then do two things when it receives a file: first, copy it to archival storage, then process the file and send it on its way.
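A minimal per-file sketch of that ordering, triggered for example by an inotify watcher (incron or inotifywait) on the directory scp writes into; the archive host, paths, and the process-file command are hypothetical stand-ins for your own pipeline:

```bash
#!/bin/bash
# Runs on a pool server for each newly received file.
# Order matters: archive first, process second, and delete the local
# copy only after both steps have succeeded.
set -euo pipefail

FILE="$1"
ARCHIVE_HOST="backup01.example.local"          # placeholder archive server
ARCHIVE_DIR="/srv/archive/$(date +%Y/%m/%d)"   # placeholder layout

# 1. Copy to archival storage; a failure aborts the script so the file
#    stays put locally and can be retried.
ssh "${ARCHIVE_HOST}" "mkdir -p '${ARCHIVE_DIR}'"
rsync -a --checksum "${FILE}" "${ARCHIVE_HOST}:${ARCHIVE_DIR}/"

# 2. Process the file and send it on its way.
#    process-file stands in for whatever your pipeline actually does.
process-file "${FILE}"

# 3. The receiving server doesn't need to keep the file afterwards.
rm -f -- "${FILE}"
```

Because the archive copy happens before processing or deletion, a crash at any point leaves the file either still on the pool server or already on the backup server, which covers the requirement that no incoming file be lost.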