I am running this command:
pg_dumpall | bzip2 > cluster-$(date --iso).sql.bz2
It takes too long. I looked at the processes with top: the bzip2 process takes about 95% and postgres about 5% of one core. The wa entry is low, so the disk is not the bottleneck.
What can I do to increase the performance?
Maybe let bzip2 use more cores? The server has 16 cores.
Or use an alternative to bzip2?
There are many compression algorithms around, and bzip2 is one of the slower ones. Plain gzip tends to be significantly faster, at usually not much worse compression. When speed is the most important, lzop is my favourite. Poor compression, but oh so fast.

I decided to have some fun and compare a few algorithms, including their parallel implementations. The input file is the output of the pg_dumpall command on my workstation, a 1913 MB SQL file. The hardware is an older quad-core i5. The times are wall-clock times of just the compression. Parallel implementations are set to use all 4 cores. Table sorted by compression speed.

If the 16 cores of your server are idle enough that all can be used for compression, pbzip2 will probably give you a very significant speed-up. But if you need more speed still and can tolerate ~20% larger files, gzip is probably your best bet.

Update: I added brotli (see TOOGAM's answer) results to the table. brotli's compression quality setting has a very large impact on compression ratio and speed, so I added three settings (q0, q1, and q11). The default is q11, but it is extremely slow, and still worse than xz. q1 looks very good though: the same compression ratio as gzip, but 4-5 times as fast!

Update: Added lbzip2 (see gmatht's comment) and zstd (Johnny's comment) to the table, and sorted it by compression speed. lbzip2 puts the bzip2 family back in the running by compressing three times as fast as pbzip2, with a great compression ratio! zstd also looks reasonable, but is beaten by brotli (q1) in both ratio and speed.

My original conclusion that plain gzip is the best bet is starting to look almost silly. Although for ubiquity, it still can't be beat ;)
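To try the faster single-threaded alternatives mentioned above, the compressor in the question's pipeline can simply be swapped out. A minimal sketch, assuming gzip and lzop are installed and, like bzip2, compress stdin to stdout when used in a pipe (the file extensions are just conventional choices):

pg_dumpall | gzip > cluster-$(date --iso).sql.gz    # much faster than bzip2, roughly 20% larger output
pg_dumpall | lzop > cluster-$(date --iso).sql.lzo   # weakest compression of the bunch, but extremely fast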
Use pbzip2.

The manual says:
It auto-detects the number of processors you have and creates threads accordingly.
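A sketch of the adjusted pipeline, assuming a pbzip2 version that reads stdin and writes to stdout; the output is still a valid .bz2 stream, so plain bzip2 can decompress it. The -p flag is only needed if you want to cap the thread count instead of using the auto-detected number of cores:

pg_dumpall | pbzip2 > cluster-$(date --iso).sql.bz2        # uses all detected cores
pg_dumpall | pbzip2 -p8 > cluster-$(date --iso).sql.bz2    # limit to 8 threads, for example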
Some data:

- Comparison of Brotli, Deflate, Zopfli, LZMA, LZHAM and Bzip2 Compression Algorithms
- CanIUse.com: feature: brotli shows support by Microsoft Edge, Mozilla Firefox, Google Chrome, Apple Safari, and Opera (but not Opera Mini or Microsoft Internet Explorer).
- Comparison: Brotli vs deflate vs zopfli vs lzma vs lzham vs bzip2
  - If compression speed is what you're after, look at which lines reach further right on this chart. (The entries toward the top of the chart show a tighter compression ratio: higher = tighter. But if speed is your priority, the horizontal reach of each line matters more.)
- Comparison: Compression Ratio vs Compression Speed for 7-Zip ZStandard Methods

You didn't mention an operating system. If Windows, 7-Zip with ZStandard (Releases) is a version of 7-Zip that has been modified to provide support for all of these algorithms.
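If you want to try brotli from the command line on Linux rather than through the 7-Zip build mentioned above, a minimal sketch (assuming the standard brotli CLI is installed; -q selects the quality level discussed in the answer above, and .br is just a conventional extension):

pg_dumpall | brotli -q 1 -c > cluster-$(date --iso).sql.br   # q1: roughly gzip-level ratio, but much faster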
Use zstd. If it's good enough for Facebook, it's probably good enough for you as well.
On a more serious note, it's actually pretty good. I use it for everything now because it just works, and it lets you trade speed for ratio on a large scale (most often, speed matters more than size anyway since storage is cheap, but speed is a bottleneck).
At compression levels that achieve comparable overall compression to bzip2, it's significantly faster, and if you are willing to pay some extra in CPU time, you can almost achieve results similar to LZMA (although it will then be slower than bzip2). At slightly worse compression ratios, it is much, much faster than bzip2 or any other mainstream alternative.

Now, you are compressing a SQL dump, which is just about as embarrassingly trivial to compress as it can be. Even the poorest compressors score well on that kind of data.
So you can run zstd with a lower compression level, which will run dozens of times faster and still achieve 95-99% of the same compression on that data.

As a bonus, if you will be doing this often and want to invest some extra time, you can "train" the zstd compressor ahead of time, which increases both compression ratio and speed. Note that for training to work well, you will need to feed it individual records, not the whole thing. The way the tool works, it expects many small and somewhat similar samples for training, not one huge blob.
It looks like adjusting (lowering) the block size can have a significant impact on the compression time.

Here are some results of the experiment I did on my machine. I used the time command to measure the execution time. input.txt is a ~250 MB text file containing arbitrary JSON records.

Using the default (biggest) block size (--best merely selects the default behaviour):
Using the smallest block size (the --fast argument):

This was a somewhat surprising discovery, considering that the documentation says:

"The --fast and --best aliases are primarily for GNU gzip compatibility. In particular, --fast doesn't make things significantly faster. And --best merely selects the default behaviour."
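For reference, a minimal way to reproduce this kind of comparison; the exact invocation and timings from the original experiment are not shown here, so these commands are only illustrative (bzip2's -1/--fast through -9/--best flags select block sizes from 100k to 900k):

time bzip2 --best -c input.txt > best.bz2   # default 900k block size
time bzip2 --fast -c input.txt > fast.bz2   # 100k block size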