Question

aosho235

Asked: 2016-05-31 21:06:02 +0800 CST

GNU parallel doesn't fully utilize my CPUs

I'm running a command like this on my 36-core server (EC2 c4.8xlarge, Amazon Linux):

find . -type f | parallel -j 36 mycommand

There are about 1,000,000 files to process, and the run takes dozens of minutes. It should run 36 processes simultaneously. However, top shows at most about 10 processes at a time, and the CPU is 70% idle. ps shows more processes, but most of them are defunct.

My guess was that each mycommand finishes so quickly that parallel cannot keep up with spawning new processes. So I tried parallel --nice 20 to leave more CPU time for parallel itself, but this didn't help.

Does anyone have an idea how to improve this?

$ parallel --version
GNU parallel 20151022

3 Answers

Ole Tange · Answer 1 · 2016-06-01T14:19:27+08:00

Best Answer

The number of files to process is ~1,000,000, and it takes dozens of minutes.

So you are running around 600 jobs per second. The overhead for a single GNU Parallel job is on the order of 2-5 ms, so once you need more than about 200 jobs per second, GNU Parallel will not go any faster without tweaking.
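The arithmetic behind those figures can be checked with a few lines of shell. This is only a sketch: the 30-minute run time is an assumption standing in for "dozens of minutes", and max_rate is a hypothetical helper name, not part of GNU Parallel.

```shell
#!/bin/sh
# Assumed figure: "dozens of minutes" is taken as 30 minutes.
files=1000000
minutes=30

# Observed throughput in jobs per second (integer division).
observed=$(( files / (minutes * 60) ))
echo "observed rate: about ${observed} jobs/s"

# max_rate MS: how many jobs per second a single GNU Parallel can start
# if every job costs MS milliseconds of overhead (hypothetical helper).
max_rate() {
  echo $(( 1000 / $1 ))
}

echo "ceiling at 5 ms overhead: $(max_rate 5) jobs/s"
echo "ceiling at 2 ms overhead: $(max_rate 2) jobs/s"
```

With these assumed numbers the observed rate is about 555 jobs per second, against a ceiling of 200 to 500 jobs per second for a single GNU Parallel, which is why one instance cannot keep 36 cores busy.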

The tweak is to have several GNU Parallel instances spawning jobs in parallel. From https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround

cat myinput | parallel --pipe -N 100 --round-robin -j50 parallel -j100 your_prg

This way you will have 50 GNU Parallel instances, each of which can spawn 100 jobs per second.
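Applied to the pipeline from the question, the workaround might look like the sketch below. The split into 6 outer instances with 6 jobs each (36 concurrent jobs in total) is an assumption chosen to match the 36 cores, not something stated in the answer, and run_nested is a hypothetical wrapper name. The flags --pipe, -N, --round-robin and -j are the same ones used in the man page example.

```shell
#!/bin/sh
# run_nested CMD...: feed every file name under the current directory to CMD
# through two layers of GNU Parallel (hypothetical wrapper).
#
# Outer parallel: --pipe reads the file list from stdin, -N 100 cuts it into
# blocks of 100 lines, and --round-robin deals the blocks out to 6 inner
# instances. Inner parallel: runs CMD once per line, 6 jobs at a time.
# The 6 x 6 split is an assumption; tune it for your own workload.
run_nested() {
  find . -type f |
    parallel --pipe -N 100 --round-robin -j 6 parallel -j 6 "$@"
}

# Usage (mycommand is the command from the question):
#   run_nested mycommand
```

Each inner instance pays the per-job overhead independently, so the total spawn rate should scale roughly with the number of inner instances.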

3

Hristo Mohamed · Answer 2 · 2016-05-31T21:34:54+08:00

Eh, if I understood your question, you want to process all the files simultaneously?
parallel will launch multiple instances of mycommand, not multiple find instances.

0

Morpheu5 · Answer 3 · 2016-05-31T23:40:08+08:00

You are trying to open a million files, 36 at a time. Even if your command could run at full power on one CPU, you would still incur the overhead of opening those files in the first place. I/O is one of the most time-expensive operations on a computer. Your best bet would be to load as many of those files as possible into your machine's RAM beforehand, and to work in RAM as much as you can. Depending on how much RAM you have, this may improve performance significantly, because once a read has started, subsequent reads tend to benefit from caching if they follow immediately one after the other. You may also want to make sure that your filesystem lays files out in a cache-efficient way, and that it handles many consecutive reads well.
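One simple way to get files into RAM ahead of time, sketched below, is to read each file once and discard the bytes, so that later reads can be served from the operating system's page cache. This assumes Linux and enough free memory to hold the data; warm_cache is a hypothetical helper name, and whether it helps at all depends on whether the workload is really I/O-bound.

```shell
#!/bin/sh
# warm_cache DIR: read every regular file under DIR once and throw the
# contents away. On Linux the data stays in the page cache while memory
# allows, so a later pass over the same files avoids most disk reads.
# (Hypothetical helper; assumes the files fit in free RAM.)
warm_cache() {
  # "-exec cat {} +" passes many file names to each cat invocation,
  # which keeps the number of spawned processes small.
  find "$1" -type f -exec cat {} + > /dev/null
}

# Usage:
#   warm_cache .
```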

I don't think parallel is going to help you much with this refactoring.

0
