tee
forwards its stdin to every file specified, while pee
does the same for pipes. Both programs send every line of their stdin to each of the files/pipes specified.
However, I was looking for a way to "load balance" the stdin to different pipes, so one line is sent to the first pipe, another line to the second, etc. It would also be nice if the stdout of the pipes are collected into one stream as well.
The use case is simple parallelization of CPU-intensive processes that work on a line-by-line basis. I was running sed
on a 14GB file, and it could have finished much faster if I could have used multiple sed
processes. The command was like this:
pv infile | sed 's/something//' > outfile
To parallelize, the ideal would be if GNU parallel supported this functionality, like so (I made up the --demux-stdin
option):
pv infile | parallel -u -j4 --demux-stdin "sed 's/something//'" > outfile
However, there's no such option, and parallel
always uses its stdin as arguments for the command it invokes, like xargs
. So I tried this, but it's hopelessly slow, and it's clear why: a new sed process has to be spawned for every single line of input:
pv infile | parallel -u -j4 "echo {} | sed 's/something//'" > outfile
I just wanted to know if there's any other way to do this (short of coding it up myself). If there was a "load-balancing" tee
(let's call it lee
), I could do this:
pv infile | lee >(sed 's/something//' >> outfile) >(sed 's/something//' >> outfile) >(sed 's/something//' >> outfile) >(sed 's/something//' >> outfile)
Not pretty, so I'd definitely prefer something like the made up parallel
version, but this would work too.
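In the meantime, one crude way to approximate lee with standard tools is to pre-split the input on line boundaries, run one sed per chunk, and stitch the outputs back together in order. A sketch (split -n l/N needs GNU coreutils 8.8 or newer; the small generated infile here just stands in for the real 14GB file):

```shell
# Generate a small stand-in for the real 14GB input file.
seq 1 1000 | sed 's/$/ something/' > infile

# Split on line boundaries into 4 chunks (GNU split -n l/N),
# run one sed per chunk concurrently, then reassemble in order.
N=4
split -n l/$N infile part.
for p in part.*; do
  sed 's/ something//' "$p" > "$p.out" &
done
wait
cat part.*.out > outfile
rm -f part.*
```

Because split cuts on line boundaries and the chunk names sort in input order, concatenating the per-chunk outputs reconstructs the stream in the original line order.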
We are discussing how to implement exactly this feature on the mailing list for GNU Parallel right now: http://lists.gnu.org/archive/html/parallel/2011-01/msg00001.html
Feel free to join: http://lists.gnu.org/mailman/listinfo/parallel
A prototype is now ready for testing: http://lists.gnu.org/archive/html/parallel/2011-01/msg00015.html
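The feature discussed there has since shipped in GNU Parallel as the --pipe option, which chops stdin into blocks on line boundaries and spreads the blocks over the jobs. A minimal sketch (a small generated file stands in for the real input; -k preserves output order, and --block sets the approximate chunk size — check your version's man page for details):

```shell
# Stand-in input; with the real file this would be: pv infile | parallel ...
seq 1 1000 | sed 's/$/ something/' > infile

# --pipe splits stdin into line-aligned blocks and feeds one block to each
# sed instance; -k keeps the output in input order.
cat infile | parallel -k --pipe -j4 --block 8k "sed 's/ something//'" > outfile
```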
I'd look at implementing this in Perl with Parallel::ForkManager. You could do the line splitting in the script and then feed the resulting lines into Parallel::ForkManager processes. Then use the
run_on_finish
callback to collect the output. For your sed example, of course, you could just do the text operation in Perl instead, and maybe use something like AnyEvent to handle the parallelism.