Question

700 Software

Asked: 2011-08-19 06:51:15 +0800 CST

How does ZFS Block Level Deduplication fit with Variable Block Size?

According to The First Google Result for "ZFS Deduplication"

...
What to dedup: Files, blocks, or bytes?
...
Block-level dedup has somewhat higher overhead than file-level dedup when whole files are duplicated, but unlike file-level dedup, it handles block-level data such as virtual machine images extremely well.
...
ZFS provides block-level deduplication
...

According to Wikipedia's ZFS Article

ZFS uses variable-sized blocks of up to 128 kilobytes. The currently available code allows the administrator to tune the maximum block size used as certain workloads do not perform well with large blocks. If data compression (LZJB) is enabled, variable block sizes are used. If a block can be compressed to fit into a smaller block size, the smaller size is used on the disk to use less storage and improve IO throughput (though at the cost of increased CPU use for the compression and decompression operations).

I want to make sure I understand this correctly.

Assuming compression is off

If I write a randomly filled file of 1 GB, then write a second file that is the same except that, halfway through, I change one of the bytes, will that file be deduplicated (all except for the changed byte's block)?

If I write a single-byte file, will it take a whole 128 kilobytes? If not, will the blocks get larger if the file gets longer?

If a file takes two 64-kilobyte blocks (would this ever happen?), would an identical file get deduped after taking a single 128-kilobyte block?

If a file is shortened, then part of its block would be ignored, and perhaps the data would not be reset to 0x00 bytes. Would a half-used block get deduped?

2 Answers

Yavor Shahpasov · Answer 1 · 2011-08-19T13:35:48+08:00

ZFS deduplication works on blocks (the recordsize); it does not know or care about files. Each block is checksummed using SHA-256 (the default, which can be changed). If the checksum matches another block, ZFS just references the existing record and writes no new data. One problem with ZFS deduplication is that the checksums are kept in memory, so large pools require a lot of memory. You should therefore only apply deduplication when using a large record size.
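To make the mechanism concrete, here is a minimal sketch of checksum-keyed block deduplication. It is a toy in-memory model, not ZFS code: the `ToyDedupStore` class, its `table` layout, and the 128K `BLOCK_SIZE` are assumptions made up for illustration.

```python
import hashlib

BLOCK_SIZE = 128 * 1024  # assumed record size, matching the 128K used in this answer


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyDedupStore:
    """Hypothetical in-memory model: checksum -> [block, reference count]."""

    def __init__(self) -> None:
        self.table: dict[bytes, list] = {}

    def write(self, data: bytes) -> list[bytes]:
        """Store data block by block and return the list of block checksums."""
        refs = []
        for block in split_blocks(data):
            digest = hashlib.sha256(block).digest()
            entry = self.table.get(digest)
            if entry is None:
                self.table[digest] = [block, 1]  # new block: store it once
            else:
                entry[1] += 1                    # known block: only add a reference
            refs.append(digest)
        return refs

    def unique_blocks(self) -> int:
        """Number of distinct blocks actually stored."""
        return len(self.table)


if __name__ == "__main__":
    store = ToyDedupStore()
    file_data = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE  # two different blocks
    store.write(file_data)
    store.write(file_data)                             # identical second "file"
    print(store.unique_blocks())                       # 2, not 4
```

The table in this sketch grows by one entry per unique block, which is why a larger record size (fewer blocks for the same amount of data) keeps it smaller.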

Assuming a recordsize of 128K:

If I write a randomly filled file of 1 GB, then write a second file that is the same except that, halfway through, I change one of the bytes, will that file be deduplicated (all except for the changed byte's block)?

Yes, only the one block containing the changed byte will not be deduplicated.
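As a rough illustration of why only one block is affected, the sketch below checksums two buffers block by block and lists the blocks that differ. It assumes a 128K block size and uses an 8 MiB buffer as a stand-in for the 1 GB file; the helper names are hypothetical.

```python
import hashlib
import os

BLOCK_SIZE = 128 * 1024  # assumed 128K record size


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """SHA-256 digest of every fixed-size block in data."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def differing_blocks(a: bytes, b: bytes) -> list[int]:
    """Indices of blocks whose checksums differ (assumes equal-length inputs)."""
    pairs = zip(block_checksums(a), block_checksums(b))
    return [i for i, (x, y) in enumerate(pairs) if x != y]


original = os.urandom(8 * 1024 * 1024)  # 8 MiB stand-in for the 1 GB file
modified = bytearray(original)
middle = len(modified) // 2
modified[middle] ^= 0xFF                # flip every bit of one byte halfway through

changed = differing_blocks(original, bytes(modified))
print(changed)  # [32]: one block out of 64 has a new checksum
```

Because the byte is overwritten in place rather than inserted, every other block keeps its offset and contents, so its checksum still matches.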

If I write a single-byte file, will it take a whole 128 kilobytes? If not, will the blocks get larger if the file gets longer?

128K will be allocated; if the file grows beyond 128K, more blocks will be allocated as needed.

If a file takes two 64-kilobyte blocks (would this ever happen?), would an identical file get deduped after taking a single 128-kilobyte block?

The file will take a single 128K block, and an identical file will be deduplicated.

If a file is shortened, then part of its block would be ignored, and perhaps the data would not be reset to 0x00 bytes. Would a half-used block get deduped?

If the exact same block is found, yes.

dmeister · Answer 2 · 2011-10-03T02:55:10+08:00

The variable-sized blocks of ZFS, as Yavor already mentioned, should not be confused with randomized variable-sized chunking, also called content-defined chunking or Rabin fingerprinting. Here is a small talk describing the differences.

ZFS uses static but configurable block sizes.
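The practical difference shows up when data is shifted rather than overwritten. The sketch below compares fixed-size blocks with a toy content-defined chunker after one byte is inserted at the front of a buffer. It is only an illustration: the chunker uses CRC32 over a sliding window as a stand-in for a real Rabin fingerprint, and `BLOCK`, `WINDOW`, `MASK`, and all function names are assumptions chosen for the demo.

```python
import hashlib
import zlib

BLOCK = 4096   # small fixed block size so the demo runs quickly (assumption)
WINDOW = 16    # sliding-window width for the toy chunker (assumption)
MASK = 0xFF    # cut when the low 8 bits are zero, giving roughly 256-byte chunks


def pseudo_random(n: int) -> bytes:
    """Deterministic filler data built from SHA-256 digests of a counter."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def fixed_chunks(data: bytes) -> list[bytes]:
    """Static block boundaries at multiples of BLOCK."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def content_defined_chunks(data: bytes) -> list[bytes]:
    """Cut after any position whose trailing WINDOW bytes hash to a marker value."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # whatever remains after the last cut
    return chunks


def digests(chunks: list[bytes]) -> set[bytes]:
    return {hashlib.sha256(c).digest() for c in chunks}


def shared(a: list[bytes], b: list[bytes]) -> int:
    """Number of distinct chunk checksums present in both lists."""
    return len(digests(a) & digests(b))


base = pseudo_random(64 * 1024)
shifted = b"\x00" + base  # insert one byte at the very front

print("fixed-size blocks shared:",
      shared(fixed_chunks(base), fixed_chunks(shifted)))
print("content-defined chunks shared:",
      shared(content_defined_chunks(base), content_defined_chunks(shifted)))
```

With static boundaries the insertion moves every later byte into a different position within its block, so the block checksums no longer line up. The toy chunker places its cuts based on the data itself, so the boundaries move along with the content and most chunks still match.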
