Running multiple scp threads simultaneously:
Background:
I often find myself mirroring a set of server files that includes thousands of little 1 KB to 3 KB files. All the servers are connected to 1 Gbps ports and are generally spread across a variety of data centers.
Problem:
SCP transfers these little files ONE by ONE, which takes ages, and I feel like I'm wasting the beautiful network resources I have.
Solution?:
I had an idea: create a script that divides the files into equal batches and starts up 5-6 scp threads, which in theory would then finish 5-6 times faster, no? But I don't have any Linux scripting experience!
Question(s):
- Is there a better solution to the mentioned problem?
- Is there something like this that exists already?
- If not, is there someone who would give me a start, or help me out?
- If the answer to 2 or 3 is no, where would be a good place to start learning Linux scripting, such as bash?
I would do it like this:
tar -cf - /manyfiles | ssh dest.server 'tar -xf - -C /'
tar strips the leading / from member names, so extracting with -C / recreates /manyfiles on the destination.
Depending on the files you are transferring, it can make sense to enable compression in the tar commands:
tar -czf - /manyfiles | ssh dest.server 'tar -xzf - -C /'
It may also make sense to choose a CPU-friendlier cipher for the ssh command (like arcfour):
tar -cf - /manyfiles | ssh -c arcfour dest.server 'tar -xf - -C /'
Or combine both of them, but it really depends on what your bottleneck is.
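If you want to try the tar pipe before pointing it at a real server, here is a minimal sketch in shell. The function name tar_stream and its optional host argument are my own illustration, not an existing tool. Unlike the commands above, it packs the contents of the source directory (via -C on the sending side), so the destination directory has to exist already. With no host it copies between two local directories.

```shell
# tar_stream SRC_DIR DEST_DIR [HOST]
# Hypothetical helper: packs the contents of SRC_DIR into a single tar
# stream and unpacks it into DEST_DIR. If HOST is given, the unpacking
# side runs over ssh on that host; otherwise everything stays local.
# The body is a subshell, so its variables do not leak into the caller.
tar_stream() (
    src=$1
    dest=$2
    host=${3:-}
    if [ -n "$host" ]; then
        # Assumption: DEST_DIR contains no single quotes, because it is
        # quoted that way for the remote shell.
        tar -cf - -C "$src" . | ssh "$host" "tar -xf - -C '$dest'"
    else
        tar -cf - -C "$src" . | tar -xf - -C "$dest"
    fi
)
```

For a remote copy this would be called as tar_stream /manyfiles /manyfiles dest.server. The thousands of small files travel as one continuous stream, so there is no per-file round trip.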
Obviously rsync will be a lot faster if you are doing incremental syncs.
Use rsync instead of scp. You can use rsync over ssh as easily as scp, and it supports "pipelining of file transfers to minimize latency costs".
One tip: If the data is compressible, enable compression. If it's not, disable it.
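As a rough sketch of what that could look like, here is a small shell wrapper. The name rsync_mirror is made up for illustration; the only rsync options it relies on are -a (archive mode) and, optionally, -z (compression). It assumes rsync is installed on both ends.

```shell
# rsync_mirror SRC_DIR DEST [extra rsync options...]
# Hypothetical helper: mirrors the contents of SRC_DIR into DEST.
# DEST can be a local path or host:/path; for a remote DEST, rsync
# connects over ssh by default.
rsync_mirror() (
    src=$1
    dest=$2
    shift 2
    # -a recurses and preserves permissions and timestamps.
    # The trailing slash on the source means "copy what is inside the
    # directory", not the directory itself.
    rsync -a "$@" "${src%/}/" "$dest"
)
```

A remote run with compression would be rsync_mirror /manyfiles dest.server:/manyfiles -z. Leave out -z when the data is already compressed. On a second run only changed files are sent, which is where the incremental speed-up comes from.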
I was about to suggest GNU Parallel (which still requires some scripting work on your part), but then I found pscp (which is part of pssh). That may just fit your need.
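For a sense of the scripting work involved, here is a minimal sketch that uses xargs -P rather than GNU Parallel, since xargs is usually already installed. The function name parallel_copy and its arguments are hypothetical. It assumes a find and xargs that support -maxdepth, -print0, -0 and -P (GNU and BSD versions do), and it only handles the files directly inside one directory, not subdirectories.

```shell
# parallel_copy SRC_DIR DEST [JOBS] [COPY_CMD]
# Hypothetical helper: copies the regular files directly inside SRC_DIR
# to DEST using up to JOBS concurrent copy commands (default 5).
# COPY_CMD defaults to scp; pass cp to try the script locally.
parallel_copy() (
    src=$1
    dest=$2
    jobs=${3:-5}
    copy=${4:-scp}
    # -print0 / -0 keep unusual file names intact.
    # -n 100 gives each worker batches of up to 100 files.
    # -P runs up to $jobs workers at the same time.
    # The inner sh receives: copy command, destination, then the files.
    find "$src" -maxdepth 1 -type f -print0 |
        xargs -0 -n 100 -P "$jobs" sh -c '
            copy=$1
            dest=$2
            shift 2
            [ "$#" -gt 0 ] || exit 0   # nothing to copy
            "$copy" "$@" "$dest"
        ' sh "$copy" "$dest"
)
```

A call such as parallel_copy /manyfiles dest.server:/manyfiles/ 6 would run up to six scp processes at once. Each scp still sends its own batch one file at a time and opens its own SSH connection, so the gain comes only from overlapping the waiting time across workers and is unlikely to be a clean 6x.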
Not scp directly, but an option for multi-threaded transfer (even on single files) is bbcp - https://www2.cisl.ucar.edu/resources/storage-and-file-systems/bbcp.
Use the -s option to set the number of threads you want transferring data. Great for high-bandwidth but laggy connections, since latency caps how much each thread can push through its TCP window.
Possibly unrelated, but if you want something more real-time you could try GlusterFS. It works well, but requires some tuning if you want to read small files efficiently.