Well, the keyword was parallel. After looking for compression tools that run in parallel, I found the following:
PXZ - Parallel XZ is a compression utility that takes advantage of running LZMA
compression of different parts of an input file on multiple cores and
processors simultaneously. Its primary goal is to utilize all resources to
speed up compression time with minimal possible influence on compression
ratio.
sudo apt-get install pxz
PLZIP - Lzip is a lossless data compressor based on the LZMA algorithm, with very safe
integrity checking and a user interface similar to the one of gzip or bzip2.
Lzip decompresses almost as fast as gzip and compresses better than bzip2,
which makes it well suited for software distribution and data archiving.
Plzip is a massively parallel (multi-threaded) version of lzip using the lzip
file format; the files produced by plzip are fully compatible with lzip.
Plzip is intended for faster compression/decompression of big files on
multiprocessor machines, which makes it specially well suited for distribution
of big software files and large scale data archiving. On files big enough,
plzip can use hundreds of processors.
sudo apt-get install plzip
PIGZ - pigz, which stands for Parallel Implementation of GZip, is a fully functional
replacement for gzip that takes advantage of multiple processors and multiple
cores when compressing data.
sudo apt-get install pigz
PBZIP2 - pbzip2 is a parallel implementation of the bzip2 block-sorting file
compressor that uses pthreads and achieves near-linear speedup on SMP
machines. The output of this version is fully compatible with bzip2
v1.0.2 (ie: anything compressed with pbzip2 can be decompressed with
bzip2).
sudo apt-get install pbzip2
LRZIP - A multithreaded compression program that can achieve very high compression
ratios and speed when used with large files. It uses the combined
compression algorithms of zpaq and lzma for maximum compression, lzo
for maximum speed, and the long range redundancy reduction of rzip.
It is designed to scale with increases in RAM size, improving
compression further. A choice of either size or speed optimization
allows for either better compression than even lzma can provide, or
better speed than gzip, but with bzip2-sized compression levels.
sudo apt-get install lrzip
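All of the tools above keep the command-line conventions of gzip/bzip2, so an existing script usually only needs the program name swapped. A minimal round-trip sketch, shown with plain gzip as a stand-in since the parallel tools may not be installed (file names are hypothetical):

```shell
# Create a compressible sample file (hypothetical name).
seq 1 100000 > sample.txt

# -k keeps the input file; the same flag works for pigz, pbzip2, plzip, etc.,
# e.g. "pigz -k sample.txt" or "pbzip2 -k sample.txt"
gzip -k sample.txt

# Decompress to a new file and verify the round trip is lossless.
gzip -dc sample.txt.gz > roundtrip.txt
cmp sample.txt roundtrip.txt
```

Because the parallel tools write format-compatible output, a file compressed with pigz can be decompressed with plain gzip, and vice versa.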
A small Compression Benchmark (Using the test Oli created):

ORIGINAL FILE SIZE - 100 MB
PBZIP2 - 101 MB (1% Bigger)
PXZ - 101 MB (1% Bigger)
PLZIP - 102 MB (1% Bigger)
LRZIP - 101 MB (1% Bigger)
PIGZ - 101 MB (1% Bigger)

A small Compression Benchmark (Using a Text file):

ORIGINAL FILE SIZE - 70 KB Text File
PBZIP2 - 16.1 KB (23%)
PXZ - 15.4 KB (22%)
PLZIP - 15.5 KB (22.1%)
LRZIP - 15.3 KB (21.8%)
PIGZ - 17.4 KB (24.8%)
There are two main tools, lbzip2 and pbzip2; they're essentially different implementations of bzip2 compressors. I've compared them (the output below is a tidied-up version, but you should be able to run the commands yourself):
cd /dev/shm # we do all of this in RAM!
dd if=/dev/urandom of=bigfile bs=1024 count=102400
$ lbzip2 -zk bigfile
Time: 0m3.596s
Size: 105335428
$ pbzip2 -zk bigfile
Time: 0m5.738s
Size: 10532460
lbzip2 appears to be the winner on random data. It's slightly less compressed but much quicker. YMMV.

Update: XZ Utils supports multi-threaded compression since v5.2.0 (it was originally, mistakenly, documented as supporting multi-threaded decompression). For example:

tar -cf - source | xz --threads=0 > destination.tar.xz
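xz's threaded mode can be exercised on a standalone file as well; a quick sketch (the sample file name is hypothetical):

```shell
# Generate a compressible sample file.
seq 1 100000 > sample.dat

# -T0 (the short form of --threads=0) spawns one worker per core; -k keeps
# the input. Note that xz only spreads work across threads when the input
# is large enough to form several blocks, so small files may still use one core.
xz -T0 -k sample.dat

# Integrity-check the resulting archive.
xz -t sample.dat.xz
```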
In addition to the nice summary above (thanks Luis), these days folks might also want to consider PIXZ, which according to its README (source: https://github.com/vasi/pixz -- I haven't verified the claims myself) has some advantages over PXZ.
[Compared to PIXZ, PXZ has these advantages and disadvantages:]
* Simpler code
* Uses OpenMP instead of pthreads
* Uses streams instead of blocks, not indexable
* Uses temp files and doesn't combine them until the whole file is compressed, high disk/memory usage
In other words, PIXZ is supposedly more memory and disk efficient, and has an optional indexing feature that speeds up decompression of individual components of compressed tar files.
Zstandard has supported multi-threading since v1.2.0. It is a very fast compressor and decompressor intended to replace gzip, and at its highest levels it can also compress as efficiently as (if not better than) LZMA2/XZ.
You have to use v1.2.0 or later, or compile the latest version from source, to get these benefits. Luckily it doesn't pull in a lot of dependencies.
There was also a third-party pzstd bundled with zstd v1.1.0.
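A quick sketch of the threaded mode, assuming zstd is installed (the file name is hypothetical):

```shell
# A compressible sample file.
seq 1 100000 > sample.log

# -T0 auto-detects the core count; -19 is near the top of the regular range
# (levels 20-22 additionally require --ultra); -k keeps the input file.
zstd -T0 -19 -k sample.log

# Decompress and check the round trip.
zstd -dc sample.log.zst > restored.log
cmp sample.log restored.log
```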
lzop may also be a viable option, although it's single-threaded.
It uses the very fast Lempel-Ziv-Oberhumer (LZO) compression algorithm, which in my observation is 5-6 times faster than gzip.
Note: Although it's not multi-threaded yet, it will probably outperform pigz on 1-4 core systems. That's why I decided to post this even though it doesn't directly answer your question. Try it; it may solve your CPU bottleneck problem while using only one CPU and compressing a little worse. I have often found it to be a better solution than, e.g., pigz.
It is not really an answer, but I think it is relevant enough to share my benchmarks comparing the speed of gzip and pigz on real hardware in a real-life scenario. As pigz is the multithreaded evolution, it is what I have personally chosen to use from now on.

Metadata:
Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz (4c/8t) + NVMe SSD
Xubuntu 17.10 (artful)
gzip version: 1.6
pigz version: 2.4

gzip quick
time gzip -1kN ./db_dump.sql
real 1m22,271s
user 1m17,738s
sys 0m3,330s
gzip best
time gzip -9kN ./db_dump.sql
real 10m6,709s
user 10m2,710s
sys 0m3,828s
pigz quick
time pigz -1kMN ./db_dump.sql
real 0m26,610s
user 1m55,389s
sys 0m6,175s
pigz best (no zopfli)
time pigz -9kMN ./db_dump.sql
real 1m54,383s
user 14m30,435s
sys 0m5,562s
pigz + zopfli algorithm
time pigz -11kMN ./db_dump.sql
real 171m33,501s
user 1321m36,144s
sys 0m29,780s
As a bottom line, I would not recommend the zopfli algorithm, since the compression took a tremendous amount of time for a not-that-significant amount of disk space saved.
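The level-1-versus-level-9 trade-off those timings show can be reproduced on any machine; a small sketch using plain gzip (substitute pigz to get the parallel timings; file names are hypothetical):

```shell
# A compressible stand-in for the SQL dump.
seq 1 200000 > db_dump.sample.sql

# Compress at the quick and best levels into separate files.
gzip -1 -c db_dump.sample.sql > quick.gz
gzip -9 -c db_dump.sample.sql > best.gz

# Level 9 yields the smaller file, but on real data it takes
# disproportionately longer.
ls -l db_dump.sample.sql quick.gz best.gz
```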
# lzma compression
xz --threads=0

# drop-in parallel gzip replacement
# the -p/--processes flag can be used to employ fewer cores
pigz

# drop-in parallel bzip2 replacement
# the -p# flag can be used to employ fewer cores
# (note: no space between -p and the number of cores)
pbzip2

# modern zstd compression
# used by default to build Arch packages since around 2020
zstd --threads=0
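Any of these compressors can also be wired into tar through -I/--use-compress-program. A sketch with gzip as the universally available stand-in (swap in pigz, pbzip2, or zstd --threads=0 where installed):

```shell
# Build a small directory to archive (hypothetical contents).
mkdir -p demo
seq 1 1000 > demo/data.txt

# -I hands the compression step to the named program; tar itself remains
# single-threaded, only the compressor is parallelized.
tar -I gzip -cf demo.tar.gz demo
#   tar -I pigz -cf demo.tar.gz demo              # parallel gzip
#   tar -I 'zstd --threads=0' -cf demo.tar.zst demo

# List the archive to confirm it is readable.
tar -tf demo.tar.gz
```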
The LZMA2 compressor of p7zip uses both cores on my system.
Relevant Arch Wiki entry: https://wiki.archlinux.org/index.php/Makepkg#Utilizing_multiple_cores_on_compression
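For context, that wiki page boils down to editing /etc/makepkg.conf; a sketch of the relevant lines (the exact defaults vary by release, so treat these values as assumptions rather than the shipped configuration):

```shell
# /etc/makepkg.conf (excerpt): compression commands makepkg invokes.
# --threads=0 lets xz and zstd use one worker per available core.
COMPRESSXZ=(xz -c -z --threads=0 -)
COMPRESSZST=(zstd -c -z -q --threads=0 -)
```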