I expect two commands that each produce the same output every time on their own to also produce the same output every time when combined in a pipeline. Apparently this is not the case for tar | gzip:
~/test$ ls
~/test$ dd if=/dev/urandom of=file bs=10000000 count=1
1+0 records in
1+0 records out
10000000 bytes (10 MB) copied, 0,877671 s, 11,4 MB/s // Creating a 10MB random file
~/test$ tar cf file.tar file // Archiving the file in a tarball
~/test$ tar cf file1.tar file // Archiving the file again in another tarball
~/test$ cmp file.tar file1.tar // Comparing the two output files
~/test$ gzip -c file > file.gz // Compressing the file with gzip
~/test$ gzip -c file > file1.gz // Compressing the file again with gzip
~/test$ cmp file.gz file1.gz // Comparing the two output files
~/test$ tar c file | gzip > file.tar.gz // Archiving and compressing the file
~/test$ tar c file | gzip > file1.tar.gz // Archiving and compressing the file again
~/test$ cmp file.tar.gz file1.tar.gz // Comparing the output files
file.tar.gz file1.tar.gz differ: byte 5, line 1 // File differs at byte 5
~/test$ cmp -i 5 file.tar.gz file1.tar.gz // Comparing the output files after byte 5
~/test$
On top of that, even tar cfz file.tar.gz file on its own produces different output every time:
~/test$ tar cfz file2.tar.gz file // Archiving and compressing the file
~/test$ tar cfz file3.tar.gz file // Archiving and compressing the file again
~/test$ cmp file2.tar.gz file3.tar.gz // Comparing the output files
file2.tar.gz file3.tar.gz differ: byte 5, line 1 // File differs at byte 5
~/test$ cmp -i 5 file2.tar.gz file3.tar.gz // Comparing the output files after byte 5
~/test$
Splitting the pipeline into separate steps, on the other hand, produces the output I expect:
~/test$ gzip -c file.tar > file4.tar.gz
~/test$ gzip -c file.tar > file5.tar.gz
~/test$ cmp file4.tar.gz file5.tar.gz
~/test$
It looks like whatever happens, happens only when tar's output is piped directly into gzip.
What is the explanation of this behavior?
The header of the resulting gzip file differs depending on how gzip is invoked.
gzip stores some information about its input in the header of the file it writes. When it compresses a regular file, this by default includes the original file name and a timestamp, both taken from that file.
When it compresses data piped into it, there is no file to take them from, so it stores no name and uses the current time as the timestamp. The timestamp is a four-byte field that starts at byte 5 of the file, which is exactly where cmp reports the first difference.
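The header fields described above can be inspected directly. The following is a minimal sketch, assuming GNU gzip and a POSIX od; the file names (named.gz, piped.gz, bare.gz) and variable names are made up for illustration. It reads the flag byte at offset 3, where the value 8 means "an original file name is stored", and the four timestamp bytes at offsets 4 to 7.

```shell
#!/bin/sh
# Sketch: dump the gzip header fields that depend on how gzip was invoked.
# Assumes GNU gzip; other implementations may fill the header differently.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
printf 'hello\n' > file

gzip -c file > named.gz        # regular file: name and mtime come from the file
cat file | gzip > piped.gz     # pipe: no name available, current time is used
gzip -nc file > bare.gz        # -n: neither name nor timestamp is stored

# FLG is the byte at offset 3; bit 0x08 (FNAME) marks a stored file name.
flag_named=$(od -A n -t u1 -j 3 -N 1 named.gz | tr -d '[:space:]')
flag_piped=$(od -A n -t u1 -j 3 -N 1 piped.gz | tr -d '[:space:]')
flag_bare=$(od -A n -t u1 -j 3 -N 1 bare.gz | tr -d '[:space:]')

# MTIME is the four bytes at offsets 4..7 (least significant byte first).
mtime_bare=$(od -A n -t u1 -j 4 -N 4 bare.gz | tr -d '[:space:]')

echo "flags: named=$flag_named piped=$flag_piped bare=$flag_bare"
echo "mtime bytes with -n: $mtime_bare"
```

Running the same od command against file.tar.gz and file1.tar.gz from the question should show them differing only in those timestamp bytes.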
To verify this, try adding the -n option to the offending lines in your example as...
Now the files are identical again...
From man gzip:
...
So the difference is indeed the original file name and timestamp information, which the -n option turns off.
Gzip files include a timestamp. If you create two gzip files at different times, they will differ in the creation time, not in content.