I have directories with old incremental backups, and they are full of redundant copies of various files. My plan was to use ZFS, which checksums data blocks, to avoid storing the redundant copies more than once.
So a model situation:
cd /poolname/zalohy
zfs list -p poolname
NAME USED AVAIL REFER MOUNTPOINT
poolname 995328 374734901248 98304 /poolname
for i in {0..10}; do echo {1..99999} >file$i.txt; done # This creates eleven identical files of 588888 bytes each.
zfs list -p poolname
NAME USED AVAIL REFER MOUNTPOINT
poolname 5677056 374730219520 98304 /poolname
374734901248 - 374730219520 = 4681728, i.e. approximately 5 MB.
I expected that 11 identical files (with the same checksum) would take only a bit more than 588888 bytes, i.e. roughly a tenth of that.
Where is the problem? How can I handle this redundancy? Is there a better file system than ZFS for this purpose?
Thanks a lot for the help.
In general
This requires that your ZFS pool (or dataset) has been configured with deduplication enabled.
From the OpenZFS documentation:
Deduplication is disabled by default, because as stated above it can be very CPU and memory intensive.
As with all ZFS properties, the dedup property can be set at the ZFS pool or dataset (filesystem) level and is inherited by underlying filesystems. Before enabling dedup, you should consider the following.
To check whether your pool will benefit from dedup, you can run (where tank is the pool name):
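zdb -S tank   # simulates dedup statistics for the pool without enabling anything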
The -S flag simulates dedup statistics and is only usable on the entire pool. The output will be a simulated DDT (deduplication table), and it ends with some summary stats.
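The exact table depends on the ZFS version, but the final summary line looks roughly like this (illustrative numbers only):

dedup = 1.20, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.20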
As a rule of thumb, if the estimated dedup ratio is above 2, deduplication could be an option to save space. In the above example, since the dedup ratio is 1.2, it probably isn't worth it.
To check the dedup property of a pool, to enable deduplication for the whole pool, or to enable it only for a dataset (tank/home), use zfs get and zfs set as shown below.
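zfs get dedup tank           # check the current dedup setting
zfs set dedup=on tank        # enable deduplication for the whole pool
zfs set dedup=on tank/home   # enable it only for the tank/home dataset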
After dedup has been enabled on an existing pool, only newly created data will be deduplicated.
As mentioned in the documentation, it might be a better option to set the compression=lz4 property on your pool instead (lz4 compression has little to no performance impact on most systems).
For your situation
For your particular situation, I would create a specific dataset (filesystem) only for backup, and enable dedup on only this dataset.
For instance, you could create the ZFS dataset poolname/backup and then enable dedup on it:
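zfs create poolname/backup
zfs set dedup=on poolname/backup
zfs set compression=lz4 poolname/backup   # optional, if you also want compression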
In this way, you can test whether it works as expected. And if you run into problems, you can always transfer your backups to a normal ZFS dataset without dedup enabled (but maybe with compression instead).
NB: Turning the dedup property off again does not un-deduplicate data that has already been written; in practice the only way back is to back up the data, destroy the dataset, and move the data to another dataset without deduplication. This is why I would never recommend enabling deduplication on an entire zpool.
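A minimal sketch of that fallback, assuming the deduplicated data lives in poolname/backup and poolname/backup-plain is a new dataset created for the copy:

zfs create poolname/backup-plain
rsync -a /poolname/backup/ /poolname/backup-plain/
zfs destroy poolname/backup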
Another helpful user on Mastodon just posted a link to the hardlink command (https://manpages.debian.org/unstable/util-linux/hardlink.1.en.html), which sounds like a better solution to your problem than the program I wrote (mentioned in a comment to the longer and definitive answer WRT ZFS).
On Ubuntu 22.04, hardlink is installed by default (as part of the util-linux package), and in your case the default command to run would be (if the directory /poolname/zalohy contains the backup data):
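hardlink /poolname/zalohy

(If your version supports it, running with -n / --dry-run first shows what would be linked without changing anything.)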
Please refer to the hardlink man page for further information.