I've always used TAR and ZIP for compression, but recently I have heard about the *.Z
compression algorithm. This brought up a question for me:
With all of these compression systems, which one is best for general use and compression?
Running a few tests, I discovered that tar does NOT really compress (unless explicitly told to). So what is it good for compared to other compression methods?
I am already aware that ZIP is the most widely-used compression system, but should I use it instead of *.Z, *.7z, .tar, or .tar.<insert ending here>?
Post Summary:
- Should I use *.tar, *.Z, *.7z, or .tar.<insert ending here> for the best compression?
- If plain *.tar doesn't compress, why do we use it?
EDIT: Not all algorithms allow storing of Linux permissions (from what I learned). Which do, and is there some sort of hack (or script) I could use to store permissions?
tar stands for tape archive. All it does is pack files, and their metadata (permissions, ownership, etc.), into a stream of bytes that can be stored on a tape drive (or a file) and restored later. Compression is an entirely separate matter: you used to have to pipe the output through an external utility if you wanted the archive compressed. GNU tar was nice enough to add switches that automatically filter the output through the appropriate utility as a shortcut.
Zip and 7z combine archiving and compression in their own container formats, and they are meant to pack files on a DOS/Windows system, so they do not store unix permissions and ownership. Thus, if you want to store permissions for proper backups, you need to stick with tar. If you plan on exchanging files with Windows users, then zip or 7z is good. The actual compression algorithms zip and 7zip use can be used with tar, by using gzip and lzma respectively.
lzma (aka *.xz) has one of the best compression ratios and is quite fast at decompression, making it a top choice these days. It does, however, require a ton of ram and cpu time to compress. The venerable gzip is quite a bit faster at compression, so it may be used if you don't want to dedicate that much cpu time. It also has an even faster variant called lzop. bzip2 is still fairly popular, as it largely replaced gzip for a time before 7zip/lzma came about since it got better compression ratios, but it is falling out of favor these days since 7z/lzma is faster at decompression and gets better compression ratios. The compress utility, which normally names files *.Z, is ancient and long forgotten.
One of the other important differences between zip and tar is that zip compresses the data in small chunks, whereas when you compress a tar file, you compress the whole thing at once. The latter gives better compression ratios, but in order to extract a single file at the end of the archive, you must decompress the whole thing to get to it. Thus the zip format is better at extracting a single file or two from a large archive. 7z and dar allow you to choose to compress the whole thing (called "solid" mode) or small chunks for easy piecemeal extraction.
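To make the pipe-versus-shortcut point concrete, here is a rough sketch; the directory and file names are just placeholders:

    # Old-school: tar only archives; compression is a separate program in the pipe
    tar -cf - mydir | gzip > mydir.tar.gz

    # GNU tar shortcuts that run the compressor for you
    tar -czf mydir.tar.gz  mydir    # gzip  -> .tar.gz
    tar -cjf mydir.tar.bz2 mydir    # bzip2 -> .tar.bz2
    tar -cJf mydir.tar.xz  mydir    # xz    -> .tar.xz

    # Extract, preserving the stored permissions (-p; on by default for root)
    tar -xpf mydir.tar.xz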
The details of the algorithms are off topic here1 since they are not in any way specific to Linux, let alone Ubuntu. You will, however, find some nice info here.
Now on to tar: as you said, tar is not and never has been a compression program. Instead, it is an archiver; its primary purpose is to make one big file out of a lot of small ones. Historically this was to facilitate storing on tape drives, hence the name: Tape ARchive.
Today, the primary reason to use tar is to decrease the number of files on your system. Each file on a Unix file system takes up an inode; the more files you have, the fewer inodes available, and when you run out of inodes, you can no longer create new files. To put it simply, the same amount of data stored as thousands of files will take up more of your hard drive than those same files in a single tar archive.
To illustrate, since this has been contested in the comments: my 68G / partition has a fixed number of inodes, some of which are already in use (bear in mind that the inode count depends on the file system type and the size of the partition). If I now attempt to create more files than there are free inodes, file creation fails with "No space left on device", even though the disk itself still has loads of free space.
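A rough sketch of the kind of commands involved; the path and the file count are placeholders, so use the free-inode figure reported for your own filesystem:

    # Total, used and free inodes on the root filesystem
    df -i /

    # Create a huge number of empty files in a scratch directory (this may take a while)
    mkdir /tmp/inode-test && cd /tmp/inode-test
    for i in $(seq 1 500000); do touch "file$i"; done
    # ...eventually touch starts failing with "No space left on device"

    # Yet df still reports plenty of free disk space
    df -h /

    # Clean up to get the inodes back
    cd / && rm -rf /tmp/inode-test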
Creating a few hundred thousand empty files rapidly depletes my inodes and I can no longer create new ones, even with plenty of disk space left. If I were to tar these files, I would be able to start creating files again.
Having fewer files also greatly speeds up file system I/O, especially on NFS-mounted filesystems. I always tar my old work directories when a project is finished, since the fewer files I have, the faster programs like find will work.
There is a great answer on Super User that goes into far more detail, but in addition to the above, the other basic reasons why tar is still popular today are:
- Efficiency: using tar to pipe through a compression program like gzip is more efficient, since it avoids the creation of intermediate files.
- tar comes with all sorts of bells and whistles, features that have been designed over its long history that make it particularly useful for *nix backups (think permissions, file ownership, the ability to pipe data straight to STDOUT and over an SSH link; see the sketch below).
- Inertia. We're used to tar. It's safe to assume it will be available on any *nix you might happen to use, which makes it very portable and handy for source code tarballs.
1 This is absolutely true and has nothing to do with the fact that I don't know enough about them to explain :)
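For instance, the "pipe straight over an SSH link" trick from the list above looks roughly like this; the hostname and paths are placeholders:

    # Stream a compressed archive of /home/me to a remote machine
    # without writing a temporary file locally
    tar -czf - /home/me | ssh backup-host 'cat > /backups/me.tar.gz'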
There are two distinct but related tasks. Packing a tree of files (including filenames, directory structure, filesystem permissions, ownership and any other metadata) into a byte stream is called archiving. Removing redundancy in a byte stream to produce a smaller byte stream is called compression.
On Unix, the two operations are separated, with distinct tools for each. On most other platforms (current and historical) combined tools perform both archiving and compression.
(gzip and other programs that mimic gzip's interface often have the option to store the original filename in the compressed output, but this, along with a CRC or other check to detect corruption, is the only metadata they can store.)
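For instance, you can ask gzip to list what it stored; the file name here is just a placeholder:

    # -l lists sizes and the stored name; -v adds the method and the CRC
    gzip -lv something.gz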
There are advantages to separating compression from archiving. Archiving is platform-specific (the filesystem metadata that needs preserving varies widely), but its implementation is straightforward, largely I/O-bound, and changes little over time. Compression is platform-independent, but implementations are CPU-bound and algorithms are constantly improving to take advantage of the increased resources that modern hardware can bring to bear on the problem.
The most popular Unix archiver is tar, although there exist others such as cpio and ar. (Debian packages are ar archives, while cpio is often used for initial ramdisks.) tar is or has often been combined with compression tools such as compress (.Z), gzip (.gz), bzip2 (.bz2) and xz (.xz), from oldest to youngest, and not coincidentally from worst to best compression.
Making a tar archive and compressing it are distinct steps: the compressor knows nothing about the tar file format. This means that extracting a single file from a compressed tar archive requires decompressing all of the preceding files. This is often called a "solid" archive.
Equally, since tar is a "streaming" format--required for it to be useful in a pipeline--there is no global index in a tar archive, and listing the contents of a tar archive is just as expensive as extracting it.
By contrast, Zip and RAR and 7-zip (the most popular archivers on modern Windows platforms) usually compress each file separately, and compress metadata lightly if at all. This allows for cheap listing of the files in an archive and extraction of individual files, but means that redundancy between multiple files in the same archive cannot be exploited to increase compression. While in general compressing an already-compressed file does not reduce file size further, occasionally you might see a zip file within a zip file: the first zipping turned lots of small files into one big file (probably with compression disabled), which the second zipping then compressed as a single entity.
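The practical consequence is easy to see on the command line; the archive names below are placeholders:

    # Listing a compressed tar archive decompresses the whole stream
    tar -tzf big.tar.gz > /dev/null

    # Listing a zip archive only reads its central directory, so it is cheap
    unzip -l big.zip > /dev/null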
There is cross-pollination between the differing platforms and philosophies: gzip is essentially zip's compressor without its archiver, and xz is essentially 7-zip's compressor without its archiver.
There are other, specialized compressors. PPM variants and their successor ZPAQ are optimized for maximum compression without regard to resource consumption. They can easily chew up as much CPU and RAM as you can throw at them, and decompression is just as taxing as compression (for contrast, most widely-used compression tools are asymmetric: decompressing is cheaper than compressing). On the other end of the spectrum, lzo, snappy and LZ4 are "light" compressors designed for maximum speed and minimum resource consumption, at the cost of compression. They're widely used within filesystems and other object stores, but less so as standalone tools.
So which should you pick?
Archiving:
Since you're on Ubuntu there's no real reason to use anything other than tar for archiving, unless you're trying to make files that are easily readable elsewhere. zip is hard to beat for ubiquity, but it's not Unix-centric and will not keep your filesystem permissions and ownership information, and its baked-in compression is antiquated. 7-zip and RAR (and ZPAQ) have more modern compression but are equally unsuited to archiving Unix filesystems (although there's nothing stopping you using them just as compressors); RAR is also proprietary.
Compression:
For maximum compression you can have a look at a benchmark, such as the enormous one at http://mattmahoney.net/dc/text.html. This should give you a better idea of the tradeoffs involved.
You probably don't want maximum compression, though. It's way too expensive.
xz is the most popular general-purpose compression tool on modern Unix systems. I believe 7-zip can read xz files too, as they are closely related.
Finally: if you're archiving data for anything other than short-term storage you should pick something open-source and preferably widespread, to minimize headaches later on.
lzo, gz, bz2, lzma (lzma2 = .xz) are "stream" compressors: they compress a stream of bytes and don't know and don't care about files, directories and metadata like permissions. You have to use an archiver like tar to bundle all that data into a stream of bytes (a tar file) and compress that with a compressor. If it is the data of a single file you care about, you could also feed that file alone to one of these compressors.
Tar, cpio and pax are archivers: they take a bunch of files and directories and encode the data and metadata in a single file. tar is the most popular and most compatible, though the technical merits between the three are minimal enough that there were religious wars about it during the dawn of time.
7z and zip are compressors AND archivers: they store all the data and metadata and compress it. However, AFAICT, neither of them saves unix permissions.
Zip uses the same algorithm as gzip, called DEFLATE. 7z uses the LZMA algorithm.
To read a single file from a tar.gz or the like, you need to decompress the gz stream until enough of the tar file is exposed for you to extract it. Zip allows you to compress and pull out each file individually. 7z can have either behavior.
Compression ratios and speeds: gzip and lzo have very, very fast compression and decompression speeds but low compression ratios. They also do not take much memory to compress. gzip is a little slower and gives a little better compression ratio than lzo.
They are so fast that it can be quicker to read a gz- or lzo-compressed file from the disk and decompress it on the fly than to read the uncompressed file directly from the disk.
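A common way to exploit this; the log file name is a placeholder:

    # Search a compressed log without ever materializing the uncompressed file
    zcat access.log.gz | grep 'some pattern'
    # equivalently
    gzip -dc access.log.gz | grep 'some pattern'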
LZMA (xz) gives excellent compression on general data but takes a very long time to compress and decompress, and needs significant amounts of memory to compress.
bz2 used to be the high-compression algorithm of choice but fell out of favour, since lzma is both faster (especially at decompression) and gets better compression ratios. However, for certain kinds of data (DNA sequences, files with very long runs of the same byte, etc.) bzip2 can beat everything else hands down. As an example, I once had to compress a 4GB file of 1's, and bz2 reduced it to a few tens of KB while lzma took some tens of MB, if I remember correctly.
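You can reproduce the flavour of that experiment on a smaller scale. The commands below make no claims about the resulting sizes, they just give you something to measure yourself; the file names are placeholders:

    # Make a ~100 MB file consisting of the same line repeated
    yes 1 | head -c 100M > ones.txt

    # Compress with bzip2 and xz, keeping the original for comparison
    bzip2 -k ones.txt
    xz -k ones.txt

    # Compare the resulting sizes
    ls -l ones.txt ones.txt.bz2 ones.txt.xz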
For especially large files, you can use rzip. It first looks for redundant data within large blocks of up to 900 MB, encodes these, and then hands the data over to bzip2 (not really, but the same algorithms are used).
Effect? Much faster than xz, lzma or bzip2, and in my experience its compression ratio rivals that of lzma. It is a RAM hog, though.
http://en.wikipedia.org/wiki/Rzip
gzip's compression algorithm has been the traditional, best-known, most-used compression algorithm for a long time. (zlib is a library that implements it.)
bzip2 was invented later and was suggested as an algorithm that frequently gives better compression ratios than gzip on typical data; however, it is slower (more computation-costly) than gzip. As an alternative to gzip, bzip2 has recently been mostly obsoleted by modern algorithms.
For example, xz -0 is stated in its manpage (man xz) to be "sometimes faster than gzip -9 while compressing much better".
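That is easy to check on your own data, with something along these lines; the file name is a placeholder:

    # Compare speed and resulting size; -k keeps the input file around
    time gzip -9 -k bigfile
    time xz   -0 -k bigfile
    ls -l bigfile bigfile.gz bigfile.xz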
There are also other modern algorithms (apart from gzip) that are well-suited for on-the-fly compression and decompression because of their high throughput (speed), and hence are popular for use in the kernel for filesystem, block-device, and memory compression (but also for fast compression of normal files). A comparison of lzo, lz4 and zstd is nicely presented at https://github.com/lz4/lz4.
So, as noted in https://en.wikipedia.org/wiki/LZ4_(compression_algorithm), lz4 "gives a slightly worse compression ratio than the LZO algorithm ... . However, compression speeds are similar to LZO ..., while decompression speeds can be significantly higher than LZO". Also, as seen from the table there, zstd -1 usually gives a higher compression ratio than lz4 and lzo, but lower speed; as for decompression, zstd -1 compressed data decompresses faster than lzo, but slower than lz4.
As seen from the diagram at https://facebook.github.io/zstd/, zstd -3 can be a reasonable choice (if I'm not mistaken, it's the default level when using btrfs with zstd): it compresses better than gzip (zlib) in any mode, and faster.
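If you want to try zstd on ordinary files or tarballs, a quick sketch (the names are placeholders; --zstd needs a reasonably recent GNU tar with zstd installed):

    # Compress a single file at level 3, keeping the original
    zstd -3 -k somefile

    # Create and extract a zstd-compressed tarball
    tar --zstd -cf mydir.tar.zst mydir
    tar --zstd -xf mydir.tar.zst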