We have our log files gzipped to save space. Normally we keep them compressed and just do
gunzip -c file.gz | grep 'test'
to find important information but we're wondering if it's quicker to keep the files uncompressed and then do the grep.
cat file | grep 'test'
There has been some discussion about how gzip works: if it reads the file into memory and decompresses it there, the first command could be faster, but if it doesn't, the second one would be. Does anyone know how gzip decompresses data?
It's always going to be quicker to cat the uncompressed file as there's no overhead associated with that. Even if you're not writing a temporary file, you're going through the decompression motions, which munch CPU. If you're accessing these files often enough, it's probably better to keep them uncompressed if you have the space.
That said, dumping data to standard out (gunzip -c, zcat, etc.) won't trigger writing to a temporary file. The data is piped directly to the grep command, which treats the uncompressed stream as its own standard input.
The Wikipedia article on LZ* encoding is here: http://en.wikipedia.org/wiki/LZ77_and_LZ78.
As always, nothing beats actual measurement.
Your mileage may vary, but on my system, grepping an already uncompressed file took about a third of the time that piping zcat or gunzip into grep did. This isn't surprising.
Using compression could actually deliver faster throughput to disks, but that depends on a number of factors, including the compression algorithm used and the kind of data you're moving around. ZFS, for example, heavily relies on this assumption.
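If you want to repeat the measurement yourself, something along these lines should do as a starting point. It is only a rough sketch: the file name sample.log, the 200000-line synthetic log and the pattern 'test' are made up for illustration, and it assumes bash (so that time covers the whole pipeline) plus the usual gzip, awk and grep tools.

```shell
#!/usr/bin/env bash
# Rough benchmark sketch: grep a plain file vs. grep a gunzip stream.
# sample.log and its contents are hypothetical stand-ins for real logs.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Build a synthetic log; every 10th line contains the word "test".
awk 'BEGIN {
  for (i = 1; i <= 200000; i++)
    print "line", i, (i % 10 == 0 ? "test" : "info")
}' > "$workdir/sample.log"
gzip -c "$workdir/sample.log" > "$workdir/sample.log.gz"

# Sanity check first: both approaches must find the same number of lines.
plain_count=$(grep -c 'test' "$workdir/sample.log")
gz_count=$(gunzip -c "$workdir/sample.log.gz" | grep -c 'test')
echo "matches: plain=$plain_count gz=$gz_count"

echo "uncompressed:"
time grep -c 'test' "$workdir/sample.log" > /dev/null

echo "compressed:"
# In bash, the time keyword covers the whole pipeline, not just gunzip.
time gunzip -c "$workdir/sample.log.gz" | grep -c 'test' > /dev/null
```

Keep in mind that a file you have just written is almost certainly still in the page cache, so this mostly measures the CPU cost of decompression. To see the disk side of the trade-off, run it against real logs that are larger than RAM or have not been read recently.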
gzip will either decompress the whole file to a temporary one and rename it at the end (standard gzip -d myfile.gz) or use no temporary file at all, reading some blocks of compressed data at a time and writing uncompressed data to stdout (gzip -d -c...).
On a modern system I suspect a gunzip | grep could be faster than grepping an uncompressed file; on the other hand, gunzip | grep will always win over decompressing a file and then grepping the uncompressed one :)
You can also substitute gzip with lzo to improve performance.
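To see the difference between the two modes on disk, here is a small sketch. The file name myfile and its four lines are invented for the example; it only shows what ends up in the directory, not how gzip handles things internally.

```shell
#!/usr/bin/env bash
# Sketch: what gzip -dc and gzip -d leave behind on disk.
# myfile and its contents are hypothetical example data.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'alpha\ntest one\nbeta\ntest two\n' > "$workdir/myfile"
gzip "$workdir/myfile"              # replaces myfile with myfile.gz

# Streaming mode: data goes to stdout and straight into grep.
# The .gz stays where it is and no uncompressed copy is created.
stream_matches=$(gzip -dc "$workdir/myfile.gz" | grep -c 'test')
after_stream=$(ls "$workdir")

# In-place mode: myfile.gz is replaced by an uncompressed myfile.
gzip -d "$workdir/myfile.gz"
after_inplace=$(ls "$workdir")

echo "matches from stream: $stream_matches"
echo "after gzip -dc:      $after_stream"
echo "after gzip -d:       $after_inplace"
```

With the streaming form you pay for decompression on every search but never need room for the uncompressed copy; with the in-place form you pay once, then have to recompress (or keep the space used) afterwards.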
Using LZO can make things faster (less disk input/output and little compression CPU overhead).
gzip -dc | grep foo (or gunzip -c | grep foo) writes to a pipe. How the pipe is implemented is dependent on your operating system, but generally it will stay in memory. As others have pointed out, grepping an uncompressed file is always going to be faster due to the time it takes to decompress the compressed data. Using a different compression program may or may not improve performance; you can always measure it.
It depends on file size: when I/O dominates, the CPU time spent on decompression is smaller than the time spent transferring the file. Whether I/O will dominate depends heavily on the relative speeds of your CPU, your storage systems, and the bandwidth between them.
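Also, as an aside, grep -Z, aka zgrep, is handy (where grep supports -Z as a decompress option, as BSD grep does; in GNU grep -Z means --null, so use zgrep there).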
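For what it's worth, a quick sketch of zgrep next to the explicit pipeline. This assumes the zgrep wrapper script that normally ships with gzip is installed; file.gz and its contents are invented for the example.

```shell
#!/usr/bin/env bash
# Sketch: zgrep vs. an explicit gunzip | grep pipeline.
# file.gz is hypothetical example data; zgrep is assumed to be installed.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'alpha\ntest one\nbeta\ntest two\n' | gzip -c > "$workdir/file.gz"

# zgrep decompresses and greps in one command.
zgrep_count=$(zgrep -c 'test' "$workdir/file.gz")

# The explicit pipeline from the question, for comparison.
pipe_count=$(gunzip -c "$workdir/file.gz" | grep -c 'test')

echo "zgrep: $zgrep_count, pipeline: $pipe_count"
```

zgrep still has to decompress the data, so it is a convenience rather than a speed-up over the pipeline.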