I need to use a command like the following:
$ cat large.input.file | process.py > large.output.file
The problem is, won't the hard disk have a hard time jumping between reading the input file and writing the output file?
Is there a way to tell bash to use a large memory buffer when doing this kind of pipe?
The OS will buffer the output to a certain amount, but there may still be a lot of head flipping if both the input and output files are on the same drive, unless your process.py does some buffering of its own.

You could replace cat in your example with pipe viewer (pv) (available in most standard repositories, and easily compiled if it isn't in your distribution's repo), which allows you to tell it to buffer more (with the -B/--buffer-bytes options) and displays a progress bar (unless you ask it not to), which could be very handy for a long operation if your process.py doesn't output its own progress information. For passing data from one place on a drive to another place on the same drive this can make quite a difference, unless the overall process is primarily CPU bound rather than I/O bound.

So for a 1 MB buffer you could do:
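(A sketch, assuming pv accepts the usual size suffixes for -B:)

$ pv -B 1m large.input.file | process.py > large.output.file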
I use pv all the time for this sort of thing, though mainly for the progress indicator more than the tweakable buffer size.

Another option is to use the more "standard" dd (standard in terms of being generally available by default; its command-line format is a little different from most common commands), though this does not have the progress bar facility:
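(A sketch using a 1 MiB block size; GNU dd accepts the M suffix for bs:)

$ dd if=large.input.file bs=1M | process.py > large.output.file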
Edit: P.S. Pedants may point out that cat is not needed in your example, as the following will work just as well and will be very slightly more efficient:
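(That is, redirecting the input file straight into process.py:)

$ process.py < large.input.file > large.output.file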
Some people refer to the removal of unnecessary calls to cat as "demogification", though these people should probably not be encouraged...

Isn't there an old Unix tool called "buffer"? Not that this would be needed with today's caching techniques, but it is there.
Don't worry. The OS will do the buffering for you, and it's usually very good at it.
That being said: if you can change process.py, you can implement your own buffering. If you can't change process.py, you could write your own buffer.py and use it like this:
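(A sketch of what such a pipeline might look like, with buffer.py being the hypothetical buffering script just mentioned:)

$ cat large.input.file | buffer.py | process.py | buffer.py > large.output.file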
Probably much easier would be to read and write from a RAM disk.
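(A sketch of the RAM-disk approach; the mount point and size here are assumptions:)

$ sudo mkdir -p /mnt/ramdisk
$ sudo mount -t tmpfs -o size=2G tmpfs /mnt/ramdisk
$ cat large.input.file | process.py > /mnt/ramdisk/large.output.file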
I believe the problem the user is alluding to is how input/output generally works in the UNIX/Linux world. Each UNIX/Linux process can basically have only one I/O operation pending at a time. Thus, in the case of the cat command in the example, cat first reads some data and waits for the read to complete, then writes the data and waits for the write to complete before continuing. There is no concurrent I/O within a process, so the buffer is used only to hold some data temporarily between the reads and the writes.
To speed things up, the input and output can be broken down over two different processes: one reader process and one writer process, and a lot of shared memory used as the buffer between the two processes. This results in the concurrent I/O that one desires and can speed up the file transfer process.
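To make the idea concrete, here is a minimal Python 3 sketch of that reader/writer split, using two threads and a bounded queue rather than the separate processes and shared memory the real buffer utility uses; the chunk size and queue depth are arbitrary assumptions:

#!/usr/bin/env python3
# Read ahead from stdin in one thread while another drains to stdout,
# so the read and the write can overlap instead of strictly alternating.
import sys
import threading
import queue

CHUNK = 1 << 20              # 1 MiB per read (assumption, tune as needed)
q = queue.Queue(maxsize=64)  # at most ~64 MiB held in memory at once

def reader():
    while True:
        data = sys.stdin.buffer.read(CHUNK)
        q.put(data)
        if not data:         # empty bytes means stdin is exhausted
            break

threading.Thread(target=reader, daemon=True).start()

while True:
    data = q.get()
    if not data:
        break
    sys.stdout.buffer.write(data)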
The utility program buffer, mentioned by another user, implements the concurrent method I described. I have used the buffer program with a fairly large shared-memory buffer when interfacing with a tape drive for backups. This resulted in about a 20% decrease in wall-clock transfer time.
Using the buffer program as a replacement for the 'cat' command might result in some definite improvements ... depending.
Enjoy!
Try using this little Python 2 program I just put together:
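(A minimal sketch of what such a script could look like, written here for Python 3; the size-suffix parsing and the script name buffer.py are assumptions, and the behaviour it implements is the one described further down:)

#!/usr/bin/env python3
# buffer.py: copy stdin to stdout in huge blocks so reads and writes happen
# in large sequential bursts instead of small alternating ones.
import sys

SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

def parse_size(text):
    # "2M" -> 2 * 1024**2; a bare number is taken as bytes
    if text[-1].upper() in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1].upper()]
    return int(text)

block_size = parse_size(sys.argv[1]) if len(sys.argv) > 1 else 1024 ** 2

while True:
    data = sys.stdin.buffer.read(block_size)
    if not data:              # empty read: no more input, we're done
        break
    sys.stdout.buffer.write(data)
    sys.stdout.buffer.flush()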
To use this file, call it like this:
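(Assuming the sketch above is saved as buffer.py, a 2-gigabyte buffer would be:)

$ cat large.input.file | python3 buffer.py 2G | process.py > large.output.file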
You can use 2K to specify 2 kilobytes, 2M for 2 megabytes, or 2G for 2 gigabytes; if you want to, you might even add 2T for 2 terabytes of buffer :3
I get this problem all the time when compressing a virtual machine image with pigz -1, because the compression becomes so incredibly fast that the disk starts reading and writing simultaneously, and the process slows to a grinding halt as the disk's head starts whizzing between the input and output files. So what I did was make this little program that reads a gargantuan block of data from standard input, writes it to standard output, and repeats. When the read returns a blank string, it's because no more standard input is available, and the script finishes.

It will buffer intelligently, but there's no good way to tweak how much that is.
You could write an intermediate program that would do the caching you want and have it read from the input.
Your OS will do all kinds of caching on files before they are written to the hard drive, and will also do caching on the file being read (generally reading ahead if possible). Let the OS do the buffering and caching.
Until you can prove, through testing and profiling, that the hard drive is the limiting factor in the equation, it is best to just leave it alone.
If you can change process.py, then instead of reading/writing through pipes you could read/write the files directly and use buffering and/or memory-mapped files, which would help remove some of the load from the system.
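A minimal sketch of that idea (the file names, buffer sizes, and the transform() placeholder are all assumptions standing in for whatever process.py really does):

# Open the files directly with large buffers instead of going through a pipe.
def transform(data):
    return data  # placeholder for process.py's real work

with open("large.input.file", "rb", buffering=8 * 1024 * 1024) as src, \
     open("large.output.file", "wb", buffering=8 * 1024 * 1024) as dst:
    while True:
        chunk = src.read(4 * 1024 * 1024)  # 4 MiB at a time
        if not chunk:
            break
        dst.write(transform(chunk))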