I need to use a command like the following:
$ cat large.input.file | process.py > large.output.file
The problem is, won't the hard disk have a hard time jumping between reading the input file and writing the output file?
Is there a way to tell bash to use a large memory buffer when doing this kind of pipe?
The OS will buffer the output to a certain amount, but there may still be a lot of head flipping if both the input and output files are on the same drive, unless your process.py does some buffering of its own.

You could replace cat in your example with pipe viewer (pv) (available in most standard repositories, and easily compiled if it isn't in your distribution's repo), which allows you to tell it to buffer more (with the -B/--buffer-bytes options) and displays a progress bar (unless you ask it not to), which could be very handy for a long operation if your process.py doesn't output its own progress information. For passing data from one place on a drive to another place on the same drive this can make quite a difference, unless the overall process is primarily CPU bound rather than I/O bound.

So for a 1 MB buffer you could do:
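(A sketch, assuming pv accepts the usual size suffixes for -B:)

$ pv -B 1m large.input.file | process.py > large.output.file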
I use pv all the time for this sort of thing, though mainly for the progress indicator more than the tweakable buffer size.

Another option is to use the more "standard" dd (standard in terms of being generally available by default; its command-line format is a little different from most common commands), though this does not have the progress bar facility:
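(A sketch using a 1 MiB block size; GNU dd accepts the M suffix for bs:)

$ dd if=large.input.file bs=1M | process.py > large.output.file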
Edit: P.S. Pedants may point out that cat is not needed in your example, as the following will work just as well and will be very slightly more efficient:
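(That is, redirecting the input file straight into process.py:)

$ process.py < large.input.file > large.output.file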
Some people refer to the removal of unnecessary calls to cat as "demogification", though these people should probably not be encouraged...

Isn't there an old Unix tool called "buffer"? Not that this would be needed with today's caching techniques, but it is there.
Don't worry. The OS will do the buffering for you, and it's usually very good at it.
That being said: if you can change process.py, you can implement your own buffering. If you can't change process.py, you could write your own buffer.py and use it like this:
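(A sketch of what such a pipeline might look like, with buffer.py being the hypothetical buffering script just mentioned:)

$ cat large.input.file | buffer.py | process.py | buffer.py > large.output.file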
Probably much easier would be to read and write from a RAM disk.
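(A sketch of the RAM-disk approach; the mount point and size here are assumptions:)

$ sudo mkdir -p /mnt/ramdisk
$ sudo mount -t tmpfs -o size=2G tmpfs /mnt/ramdisk
$ cat large.input.file | process.py > /mnt/ramdisk/large.output.file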
I believe the problem the user is alluding to is how input/output generally works in the UNIX/Linux world. Each UNIX/Linux process can basically have only one I/O operation pending at a time. Thus, in the case of the cat command in the example, cat first reads some data and waits for the read to complete, then writes the data and waits for the write to complete before continuing. There is no concurrent I/O within a process, so the buffer is used only to hold some data temporarily between the reads and the writes.
To speed things up, the input and output can be broken down over two different processes: one reader process and one writer process, and a lot of shared memory used as the buffer between the two processes. This results in the concurrent I/O that one desires and can speed up the file transfer process.
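To make the idea concrete, here is a minimal Python 3 sketch of that reader/writer split, using two threads and a bounded queue rather than the separate processes and shared memory the real buffer utility uses; the chunk size and queue depth are arbitrary assumptions:

#!/usr/bin/env python3
# Read ahead from stdin in one thread while another drains to stdout,
# so the read and the write can overlap instead of strictly alternating.
import sys
import threading
import queue

CHUNK = 1 << 20              # 1 MiB per read (assumption, tune as needed)
q = queue.Queue(maxsize=64)  # at most ~64 MiB held in memory at once

def reader():
    while True:
        data = sys.stdin.buffer.read(CHUNK)
        q.put(data)
        if not data:         # empty bytes means stdin is exhausted
            break

threading.Thread(target=reader, daemon=True).start()

while True:
    data = q.get()
    if not data:
        break
    sys.stdout.buffer.write(data)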
The utility program buffer, mentioned by another user, implements the concurrent method I described. I have used the buffer program with a fairly large shared-memory buffer when interfacing with a tape drive for backups. This resulted in about a 20% decrease in wall-clock transfer time.
Using the buffer program as a replacement for the 'cat' command might result in some definite improvements ... depending.
Enjoy!
Try using this little Python 2 program I just put together:
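(A minimal sketch of what such a script could look like, written here for Python 3; the size-suffix parsing and the script name buffer.py are assumptions, and the behaviour it implements is the one described further down:)

#!/usr/bin/env python3
# buffer.py: copy stdin to stdout in huge blocks so reads and writes happen
# in large sequential bursts instead of small alternating ones.
import sys

SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

def parse_size(text):
    # "2M" -> 2 * 1024**2; a bare number is taken as bytes
    if text[-1].upper() in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1].upper()]
    return int(text)

block_size = parse_size(sys.argv[1]) if len(sys.argv) > 1 else 1024 ** 2

while True:
    data = sys.stdin.buffer.read(block_size)
    if not data:              # empty read: no more input, we're done
        break
    sys.stdout.buffer.write(data)
    sys.stdout.buffer.flush()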
To use this file, call it like this:
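(Assuming the sketch above is saved as buffer.py, a 2-gigabyte buffer would be:)

$ cat large.input.file | python3 buffer.py 2G | process.py > large.output.file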
You can use 2K to specify 2 kilobytes, 2M for 2 megabytes, or 2G for 2 gigabytes; if you want to, you might even add 2T for 2 terabytes of buffer :3
I get this problem all the time when compressing a virtual machine image with pigz -1, because the compression becomes so incredibly fast that the disk starts reading and writing simultaneously, and the process slows to a grinding halt as the disk's head starts whizzing between the input and output files. So what I did was make this little program that reads a gargantuan block of data from standard input, writes it to standard output, and repeats. When the read returns a blank string, it's because no more standard input is available, and the script finishes.

It will buffer intelligently, but there's no good way to tweak how much that is.
You could write an intermediate program that would do the caching you want and have it read from the input.
Your OS will do all kinds of caching on files before they are written to the hard drive, and will also do caching on the file being read (generally reading ahead if possible). Let the OS do the buffering and caching.
Until you can prove, through testing and profiling, that the hard drive is the limiting factor in the equation, it is best to just leave it alone.
If you can change process.py, then instead of reading/writing through pipes you could read/write the files directly and use buffering and/or memory-mapped files, which would help remove some of the load from the system.
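A minimal sketch of that idea (the file names, buffer sizes, and the transform() placeholder are all assumptions standing in for whatever process.py really does):

# Open the files directly with large buffers instead of going through a pipe.
def transform(data):
    return data  # placeholder for process.py's real work

with open("large.input.file", "rb", buffering=8 * 1024 * 1024) as src, \
     open("large.output.file", "wb", buffering=8 * 1024 * 1024) as dst:
    while True:
        chunk = src.read(4 * 1024 * 1024)  # 4 MiB at a time
        if not chunk:
            break
        dst.write(transform(chunk))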