I'm building raw disk images (i.e., dd, then chroot to install Linux). During the customization process I may delete files, use temporary files, etc.
What is the best way to delete these files to ensure the image is most compressible?
I'm assuming that if I simply rm the file, it just updates the file system's metadata to mark the blocks as available. This leaves the data in place, so when I gzip or bzip2 the image it still has to pack that data up. I assume things would be a lot tighter if I could tell the FS to write zeros to the blocks instead.
A bit of detail: these are CentOS 6.4 installs on ext4, but I would expect the answer applies to most Linux distros using most file systems. The base filesystem I generate is via a command like dd if=/dev/zero of=filesystem.image bs=1M count=10240. A typical 10GB disk image from a vanilla install will compress down to roughly 500MB. I bet if I did a more aggressive cleanup of temp files and such, I could get it a lot tighter.
Thanks!
Zoredache's comment got me on the right track (see How to zero fill a virtual disk's free space on windows for better compression?).
As far as I can tell, the zerofree homepage is http://intgat.tigress.co.uk/rmy/uml/index.html. Two things are hosted there: the zerofree tool and a kernel patch. The kernel patch adds a mount flag that zeroes out files on delete, but it does not work with ext4. The zerofree program does work on ext4.
There are no zerofree RPMs available for CentOS 6, but I was able to compile it on CentOS 6 using the CentOS 5 SRPM.
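For reference, a minimal sketch of running zerofree against an image file. It assumes the image holds a bare ext4 filesystem (no partition table) and is not mounted read-write; the file name is the one from the question, and the IMAGE variable is just a placeholder for illustration.

```shell
# Assumption: the image contains a bare ext4 filesystem (no partition table)
# and is currently unmounted or mounted read-only.
IMAGE=${IMAGE:-filesystem.image}

if command -v zerofree >/dev/null 2>&1 && [ -f "$IMAGE" ]; then
    # -v reports progress while unused blocks are checked and zeroed.
    zerofree -v "$IMAGE"
else
    echo "zerofree or $IMAGE not found; skipping" >&2
fi
```

If the image has a partition table, the filesystem would first have to be exposed as its own block device (for example through a loop device) before zerofree can be pointed at it.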
You can create one or more files containing only zeroes to fill up the filesystem's free space. The blocks marked as free will then be overwritten with zeroes. Afterwards, delete these files.
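A minimal sketch of that idea as a shell function. The name zero_fill, the file name zerofile, and the optional size cap are all made up for illustration; without the cap, dd runs until the filesystem is full and is expected to stop with a "No space left on device" error.

```shell
# zero_fill DIR [MAX_MIB]
# Hypothetical helper: writes zeros into DIR/zerofile until the filesystem
# holding DIR is full (or until MAX_MIB MiB are written, if given),
# then flushes the data to disk and removes the file again.
zero_fill() {
    dir=$1
    max_mib=${2:-}
    zerofile="$dir/zerofile"

    if [ -n "$max_mib" ]; then
        dd if=/dev/zero of="$zerofile" bs=1M count="$max_mib" 2>/dev/null
    else
        # dd is expected to fail here once the free space runs out.
        dd if=/dev/zero of="$zerofile" bs=1M 2>/dev/null || true
    fi

    sync                # make sure the zeros actually reach the image
    rm -f "$zerofile"   # hand the (now zeroed) blocks back to the filesystem
}
```

With the image mounted on, say, /mnt/image, this would be called as zero_fill /mnt/image before unmounting and compressing. The size cap is only there so the function can be tried safely on a normal directory.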
Update: Since this answer has been downvoted, I decided to test whether this is actually true or my knowledge is wrong. So far it looks like this is working as I expected.
I created two images with the same command as the OP, except that I made each one 1G in size, since this is for testing purposes.
I then proceeded to format the image files as ext4, mount them on separate directories and copy two directories containing pictures onto both images. One directory was 71M and the other 461M. Afterwards, I deleted the 461M directory from both mountpoints. So far all the actions were identical. I then left the first image alone and created a zerofile taking up all free space on the second one, deleting it afterwards.
df once again showed that both filesystems were filled identically (system partitions removed for clarity). After unmounting and compressing both images with bzip2, the output of ls speaks for itself.

Although this might take longer on bigger images than just zeroing out the space that was actually used for files, this method is simple and effective, and it uses only dd, so no additional tools are needed.
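The effect can also be approximated without mounting anything. The sketch below is not the test described above: it uses two small stand-in files with hypothetical names, one all zeros and one filled with random bytes (roughly what leftover picture data looks like to a compressor), and uses gzip rather than bzip2.

```shell
# Two 4 MiB stand-ins for disk images (hypothetical file names).
workdir=$(mktemp -d)
dd if=/dev/zero    of="$workdir/zeroed.img"   bs=1M count=4 2>/dev/null
dd if=/dev/urandom of="$workdir/leftover.img" bs=1M count=4 2>/dev/null

# Compressed size in bytes of each file.
zeroed_size=$(gzip -c "$workdir/zeroed.img" | wc -c)
leftover_size=$(gzip -c "$workdir/leftover.img" | wc -c)

echo "zeroed:   $zeroed_size bytes compressed"
echo "leftover: $leftover_size bytes compressed"

rm -rf "$workdir"
```

The zeroed file should shrink to a few kilobytes, while the random one stays close to its original size.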