I have a list of files I need to copy on a Linux system - each file ranges from 10 to 100GB in size.
I only want to copy to the local filesystem. Is there a way to do this in parallel - with multiple processes each responsible for copying a file - in a simple manner?
I can easily write a multithreaded program to do this, but I'm interested in finding out if there's a low level Linux method for doing this.
If your system is not thrashed by it (e.g. maybe the files are in cache), then GNU Parallel http://www.gnu.org/software/parallel/ may work for you:
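A minimal sketch, assuming the files to copy are listed one per line in files.txt and destdir/ is the target directory (both are placeholder names, not from the original answer):

cat files.txt | parallel -j10 cp {} destdir/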
This will run 10 concurrent cp processes.
Pro: It is simple to read.
Con: GNU Parallel is not standard on most systems - so you probably have to install it.
If you want to keep the directory structure:
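A sketch, assuming you run it from inside the source tree and that /path/to/destdir is the target (both placeholders); cp --parents recreates the relative directory path under the destination:

find . -type f -print0 | parallel -0 -j10 cp --parents {} /path/to/destdir/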
Watch the intro video for more info: http://www.youtube.com/watch?v=OpaiGYxkSuQ
See also https://oletange.wordpress.com/2015/07/04/parallel-disk-io-is-it-faster/ for a discussion of parallel disk I/O.
There is no low-level mechanism for this for a very simple reason: doing this will destroy your system performance. With platter drives each write will contend for placement of the head, leading to massive I/O wait. With SSDs, this will end up saturating one or more of your system buses, causing other problems.
As mentioned, this is a terrible idea. But I believe everyone should be able to implement their own horrible plans, sooo...
for FILE in *; do cp "$FILE" <destination> & done
The asterisk can be replaced with a shell glob matching your files, or with $(cat <listfile>) if you've got them all listed in a text file. The ampersand kicks off each command in the background, so the loop continues immediately, spawning more copies. As mentioned, this will completely annihilate your I/O. So... I really wouldn't recommend doing it.
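If you go the list-file route, here is a slightly safer sketch that reads the list line by line (so file names with spaces survive word splitting) while still backgrounding each cp; <listfile> and <destination> are the same placeholders used above:

# read one file name per line from <listfile>, start each copy in the background
while IFS= read -r FILE; do
  cp "$FILE" <destination> &
done < <listfile>
wait    # block until every background cp has finished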
--Christopher Karel
The only answer that will not trash your machine's responsiveness isn't exactly a 'copy', but it is very fast. If you won't be editing the files in the new or old location, then a hard link is effectively like a copy, and (only if you're on the same filesystem) they are created very, very fast. Check out cp -l and see if it will work for you.
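A minimal sketch of the hard-link approach, assuming the source files sit in srcdir/ and destdir/ is on the same filesystem (both names are placeholders):

cp -l srcdir/* destdir/    # create hard links in destdir instead of copying data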
Here's a distributed/parallel and decentralized file copy tool that will chunk up each file and copy all of the chunks in parallel. It'll probably only help you if you have an SSD that supports multiple streams or some sort of setup with multiple disk heads.
https://github.com/hpc/dcp
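dcp is MPI-based, so it is normally launched through an MPI runner; a usage sketch with hypothetical paths, assuming dcp is built and mpirun is available (check the project's README for the exact options):

# four MPI ranks cooperating on the same copy, splitting files into chunks
mpirun -np 4 dcp /path/to/source /path/to/destination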
For the people who think that's not a great idea, I would say it depends. You can have a big RAID system or a parallel filesystem that will deliver far better performance than one cp process can handle. Then yes, you need to use a "parallel tool".
Let's take this example:
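The command itself was not preserved here; a sketch that would produce this kind of measurement, assuming a large source file named bigfile, is to count cp's write() syscalls over a 10-second window:

# hypothetical reconstruction: tally the write() calls made by cp for 10 s
strace -f -c -e trace=write timeout 10 cp bigfile /dev/null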
The syscall trace shows 166222 write calls, and each write made by cp in this case is 64 KiB, so over those 10 s my system delivers this bandwidth: 65536 * 166222 / 10 = 1089352499 B/s =~ 1.08 GB/s.
Now, let's launch this workload with 2 processes (I have 4 cores, but my desktop is used for other stuff, and here it's just an example):
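Again, the exact commands are missing; a sketch of the same measurement run twice concurrently, assuming two large source files bigfile1 and bigfile2:

# hypothetical reconstruction: two cp processes measured at the same time
strace -f -c -e trace=write timeout 10 cp bigfile1 /dev/null &
strace -f -c -e trace=write timeout 10 cp bigfile2 /dev/null &
wait    # each strace prints its own write() count after the 10 s window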
So we see that we are able to nearly double the performance by using 2 cores to run this.
So if we are in a context other than one hard drive copying to another, such as a RAID array (or multiple NVMe drives; not the most common case, I agree, but I work with this every day), running multiple cp commands in parallel definitely performs better.
You should try this:
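The original command is not preserved; one way to do it, sketched with xargs (the destination names passwd.1 through passwd.3 are hypothetical):

# copy /etc/passwd three times into $HOME, with up to 3 cp processes in parallel
seq 1 3 | xargs -P 3 -I{} cp /etc/passwd "$HOME"/passwd.{}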
This will copy the file passwd 3 times from the /etc/ directory to your $HOME.
Or, if the file is already in your home directory:
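Same idea, sketched under the assumption that the source file passwd already sits in $HOME:

# three parallel copies of a file that is already in $HOME
seq 1 3 | xargs -P 3 -I{} cp "$HOME"/passwd "$HOME"/passwd.{}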
This will copy the file passwd 3 times into your $HOME