I have directories with old incremental backups, and they are full of redundant copies of various files. My plan was to use ZFS, which checksums data blocks, to avoid storing the redundant copies more than once.
So a model situation:
cd /poolname/zalohy
zfs list -p poolname
NAME USED AVAIL REFER MOUNTPOINT
poolname 995328 374734901248 98304 /poolname
for i in {0..10}; do echo {1..99999} >file$i.txt; done # This creates eleven identical files of 588888 bytes each.
zfs list -p poolname
NAME USED AVAIL REFER MOUNTPOINT
poolname 5677056 374730219520 98304 /poolname
374734901248 - 374730219520 = 4681728, i.e. approximately 5 MB.
I expected that 11 identical files (with the same checksum) would take only a bit more than 588888 bytes, i.e. roughly a tenth of that.
Where is the problem? How can I handle this redundancy? Is there a better file system than ZFS for this purpose?
Thanks a lot for the help.
In general
This requires that your ZFS pool (or dataset) has been configured with deduplication enabled.
From the OpenZFS documentation:
Deduplication is disabled by default, because as stated above it can be very CPU and memory intensive.
As with all ZFS properties, the dedup property can be set at the ZFS pool or dataset (filesystem) level and is inherited by underlying filesystems. Before enabling dedup, you should consider the following.
To check whether your pool will benefit from dedup, you can run (where tank is the pool name):
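zdb -S tank   # simulates dedup statistics for the pool without enabling anything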
The -S flag simulates dedup statistics and is only usable on the entire pool. The output will be a simulated DDT (deduplication table), and it ends with some summary stats.
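The exact table depends on the ZFS version, but the final summary line looks roughly like this (illustrative numbers only):

dedup = 1.20, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.20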
As a rule of thumb, if the estimated dedup ratio is above 2, deduplication could be an option to save space. In the above example, since the dedup ratio is 1.2, it probably isn't worth it.
To check the dedup property of a pool, to enable deduplication for the whole pool, or to enable it only for a dataset (tank/home), use zfs get and zfs set as shown below.
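zfs get dedup tank           # check the current dedup setting
zfs set dedup=on tank        # enable deduplication for the whole pool
zfs set dedup=on tank/home   # enable it only for the tank/home dataset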
After dedup has been enabled on an existing pool, only newly created data will be deduplicated.
As mentioned in the documentation, it might be a better option to set the compression=lz4 property on your pool instead (lz4 compression has little to no performance impact on most systems).
For your situation
For your particular situation, I would create a specific dataset (filesystem) only for backup, and enable dedup on only this dataset.
For instance, you could create the ZFS dataset poolname/backup and then enable dedup on it:
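zfs create poolname/backup
zfs set dedup=on poolname/backup
zfs set compression=lz4 poolname/backup   # optional, if you also want compression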
In this way, you can test whether it works as expected. And if you run into problems, you can always transfer your backups to a normal ZFS dataset without dedup enabled (but maybe with compression instead).
NB: Turning the dedup property off again does not un-deduplicate data that has already been written; in practice the only way back is to back up the data, destroy the dataset, and move the data to another dataset without deduplication. This is why I would never recommend enabling deduplication on an entire zpool.
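A minimal sketch of that fallback, assuming the deduplicated data lives in poolname/backup and poolname/backup-plain is a new dataset created for the copy:

zfs create poolname/backup-plain
rsync -a /poolname/backup/ /poolname/backup-plain/
zfs destroy poolname/backup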
Another helpful user on Mastodon just posted a link to the hardlink command (https://manpages.debian.org/unstable/util-linux/hardlink.1.en.html), which sounds like a better solution to your problem than the program I wrote (mentioned in a comment to the longer and definitive answer WRT ZFS).
On Ubuntu 22.04, hardlink is installed by default (as part of the util-linux package), and in your case the default command to run would be (if the directory /poolname/zalohy contains the backup data):
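hardlink /poolname/zalohy

(If your version supports it, running with -n / --dry-run first shows what would be linked without changing anything.)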
Please refer to the hardlink man page for further information.