I am using tar
to archive a group of very large (multi-GB) bz2
files.
If I use tar -tf file.tar
to list the files within the archive, this takes a very long time to complete (~10-15 minutes).
Likewise, cpio -t < file.cpio
takes just as long to complete, plus or minus a few seconds.
Accordingly, retrieving a file from an archive (via tar -xf file.tar myFileOfInterest.bz2
for example) is just as slow.
Is there an archival method out there that keeps a readily available "catalog" with the archive, so that an individual file within the archive can be retrieved quickly?
For example, some kind of catalog that stores a pointer to a particular byte in the archive, as well as the size of the file to be retrieved (and any other filesystem-specific particulars).
Is there a tool (or argument to tar
or cpio
) that allows efficient retrieval of a file within the archive?
tar (and cpio, afio, pax and similar programs) use stream-oriented formats - they are intended to be streamed directly to a tape or piped into another process. While, in theory, it would be possible to add an index at the end of the file/stream, I don't know of any version that does (it would be a useful enhancement though).
It won't help with your existing tar or cpio archives, but there is another tool, dar ("disk archive"), that does create archive files containing such an index and can give you fast direct access to individual files within the archive.
If dar isn't included with your Unix/Linux distribution, you can find it at:
http://dar.linux.free.fr/
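For example, a rough sketch (the archive basename and paths are placeholders, and dar writes the archive as one or more slices named like mybackup.1.dar):

dar -c mybackup -R /data/bz2files          # create an archive with a built-in catalogue
dar -l mybackup                            # list contents by reading the catalogue only
dar -x mybackup -g myFileOfInterest.bz2    # restore just that one file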
You could use SquashFS for such archives. It is a compressed filesystem image that you mount and browse like any other filesystem, so individual files can be read directly without unpacking the rest.
The only drawback I know of is that it is read-only.
http://squashfs.sourceforge.net/ http://www.tldp.org/HOWTO/SquashFS-HOWTO/whatis.html
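A minimal sketch of the workflow (paths and names are placeholders; squashfs support is needed in the kernel or via FUSE):

mksquashfs /data/bz2files archive.sqsh            # build a compressed, indexed image
sudo mount -o loop archive.sqsh /mnt/archive      # mount it via loopback
cp /mnt/archive/myFileOfInterest.bz2 .            # read a single file directly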
While it doesn't store an index, star is purported to be faster than tar. Plus it supports longer filenames and has better support for file attributes.

As I'm sure you're aware, decompressing the file takes time and would likely be a factor in the speed of extraction even if there was an index.
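For reference, a typical star invocation would look something like this (names are placeholders; check star's man page for the exact options):

star -c f=archive.star /data/bz2files          # create
star -t f=archive.star                         # list (still a linear scan, no index)
star -x f=archive.star myFileOfInterest.bz2    # extract a single file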
Edit: You might also want to take a look at xar. It has an XML header that contains information about the files in the archive (see the referenced page for details).
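A quick sketch of how that would be used (names are placeholders):

xar -cf archive.xar /data/bz2files          # create; the XML table of contents sits at the front of the file
xar -tf archive.xar                         # list from the header without scanning the whole archive
xar -xf archive.xar myFileOfInterest.bz2    # extract just one entry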
The only archive format I know of that stores an index is ZIP, because I've had to reconstruct corrupted indexes more than once.
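Because the central directory sits at the end of a ZIP file, listing and pulling out one member is quick even on large archives, for example (names are placeholders):

unzip -l archive.zip                        # read only the index
unzip archive.zip myFileOfInterest.bz2      # extract a single member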
Thorbjørn Ravn Andersen is right. GNU tar creates "seekable" archives by default, but it does not use that information when reading them unless the -n option is given. With the -n option I just extracted a 7GB file from a 300GB archive in the time required to read/write 7GB. Without -n it took more than an hour and produced no result.
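For example (archive and file names are placeholders; -n is the short form of --seek in GNU tar):

tar -n -xf file.tar myFileOfInterest.bz2    # tells tar the archive is seekable so it can skip to the member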
I'm not sure how compression affects this. My archive was not compressed. Compressed archives are not "seekable" because current (1.26) GNU tar offloads compression to an external program.

It doesn't have an index that I know of, but I use dump & restore with large files, and navigating the restore tree in interactive mode to select random files is VERY fast.
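A minimal sketch with placeholder names (assumes the archive was created with dump):

restore -i -f /backups/home.dump    # open the dump in interactive mode
# inside the restore shell:
#   ls             - browse the tree
#   add some/file  - mark a file for extraction
#   extract        - pull out only the marked files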
You can use the 7z (7zip) archive/compression format if you have access to the
p7zip-full
package. On Ubuntu you can use this command to install it:
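sudo apt-get install p7zip-full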
To create an archive you can use
7z a <archive_name> <file_or_directory>
and if you do not want to compress the files and want to just "store" them as-is, you can use the -mx0 option like:
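7z a -mx0 archive.7z myFileOfInterest.bz2    # archive.7z and the file name are placeholders; -mx0 means no compression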
You can then extract the files using 7z e:
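7z e archive.7z myFileOfInterest.bz2    # extracts just that member (placeholder names again)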
Or you can list the index of the archive with 7z l, which is handy for searching with grep:
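7z l archive.7z | grep myFileOfInterest    # the listing is read from the archive's index, so it is fast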
There is also the t option to test integrity, u to add/update a file to the archive, and d to delete a file.

IMPORTANT NOTE
Do not use the 7zip format for linux filesystem backups as it does not store the owner and group of the files contained.
I believe GNU tar is capable of doing what you want, but I cannot locate a definitive resource saying so.
In any case you need an archiving format with an index (since that will allow you to do what you want). I do not believe ZIP files can grow that big, unfortunately.