While experimenting with Proxmox VE, we have encountered a strange performance problem:
VM disks can be stored (among other options) as individual raw ZFS zvols, or as qcow2 files on a single common dataset.
For some reason, sequential write performance to the zvols is massively worse than to the dataset, even though both reside on the same zpool.
This doesn't affect the VM's normal operation noticeably, but makes a massive difference when hibernating/RAM-snapshotting the VM (140 sec vs 44 sec for hibernating 32 GB RAM).
How can this occur when it's all the same data on the same zpool?
Here's what the write performance looks like on (1) the dataset, (2) a zvol created by Proxmox, and (3) a manually created zvol with a larger volblocksize. Strangely, the write throughput becomes noticeably faster when (4) creating an ext4 filesystem on that exact same zvol and writing to that instead.
test.bin contains 16 GiB of urandom data to defeat ZFS compression. I've run each test a few times and the numbers stay roughly in the same ballpark, so caching doesn't seem to be much of a factor.
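For reference, the ramdisk and test file were set up roughly along these lines (a sketch; the exact tmpfs size and options are not important):
# sketch: mount a ramdisk and fill test.bin with incompressible data
> mount -t tmpfs -o size=17G tmpfs /mnt/ramdisk
> dd if=/dev/urandom of=/mnt/ramdisk/test.bin bs=1M count=16384 status=progress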
# rpool/ROOT recordsize 128K default
> dd if=/mnt/ramdisk/test.bin of=/var/lib/vz/images/test.bin bs=1M status=progress
17179869184 bytes (17 GB, 16 GiB) copied, 20.9524 s, 820 MB/s
# with conv=fdatasync, this drops to about 529 MB/s
# rpool/data/vm-112-disk-0 volblocksize 8K default
> dd if=/mnt/ramdisk/test.bin of=/dev/zvol/rpool/data/vm-112-disk-0 bs=1M status=progress
17179869184 bytes (17 GB, 16 GiB) copied, 67.7121 s, 254 MB/s
# with conv=fdatasync, this drops to about 151 MB/s
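The 128K zvol used in the next test was created by hand, roughly like this (the volume size here is arbitrary; only the non-default volblocksize matters):
# sketch: manually create a zvol with a larger-than-default volblocksize
> zfs create -V 32G -o volblocksize=128K rpool/data/vm-112-disk-2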
# rpool/data/vm-112-disk-2 volblocksize 128K -
> dd if=/mnt/ramdisk/test.bin of=/dev/zvol/rpool/data/vm-112-disk-2 bs=1M status=progress
17179869184 bytes (17 GB, 16 GiB) copied, 35.2894 s, 487 MB/s
# with conv=fdatasync, this drops to about 106 MB/s
> mkfs.ext4 /dev/zvol/rpool/data/vm-112-disk-2
> mount /dev/zvol/rpool/data/vm-112-disk-2 /mnt/tmpext4
> dd if=/mnt/ramdisk/test.bin of=/mnt/tmpext4/test.bin bs=1M status=progress
17179869184 bytes (17 GB, 16 GiB) copied, 23.7413 s, 724 MB/s
# with conv=fdatasync, this drops to about 301 MB/s
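The recordsize/volblocksize values noted in the comments above can be verified with zfs get, for example:
# check the block size properties of the dataset and the two zvols
> zfs get recordsize rpool/ROOT
> zfs get volblocksize rpool/data/vm-112-disk-0 rpool/data/vm-112-disk-2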
The system and zpool setup looks like this:
> uname -r
5.4.78-2-pve
> zfs version
zfs-0.8.5-pve1
zfs-kmod-0.8.5-pve1
> zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:10:19 with 0 errors on Sun Mar 14 00:34:20 2021
config:

NAME                                                 STATE     READ WRITE CKSUM
rpool                                                ONLINE       0     0     0
  raidz2-0                                           ONLINE       0     0     0
    ata-ST4000NE001-xxxxxx_xxxxxxxV-part3            ONLINE       0     0     0
    ata-ST4000NE001-xxxxxx_xxxxxxx4-part3            ONLINE       0     0     0
    ata-ST4000NE001-xxxxxx_xxxxxxx3-part3            ONLINE       0     0     0
    ata-ST4000NE001-xxxxxx_xxxxxxx9-part3            ONLINE       0     0     0
    ata-ST4000NE001-xxxxxx_xxxxxxx6-part3            ONLINE       0     0     0
    ata-ST4000NE001-xxxxxx_xxxxxxxP-part3            ONLINE       0     0     0
logs
  mirror-1                                           ONLINE       0     0     0
    ata-INTEL_SSDSC2BB800G6_xxxxxxxxx5xxxxxxxx-part2 ONLINE       0     0     0
    ata-INTEL_SSDSC2BB800G6_xxxxxxxxx6xxxxxxxx-part2 ONLINE       0     0     0
cache
  ata-INTEL_SSDSC2BB800G6_xxxxxxxxx5xxxxxxxx-part1   ONLINE       0     0     0
  ata-INTEL_SSDSC2BB800G6_xxxxxxxxx6xxxxxxxx-part1   ONLINE       0     0     0

errors: No known data errors
You were (and likely still are) experiencing OpenZFS issue #11407, "[Performance] Extreme performance penalty, holdups and write amplification when writing to ZVOLs".
Quoting sempervictus from that issue:
I've personally done extensive benchmarking of mechanical drives (spinning rust) for this exact issue, and kvm disk images (my preference today is raw) stored on datasets beat zvols in nearly all test cases, and exhibited "normal" system load, whereas zvol tests were nearly always causing orders of magnitude more load. Some zvol test configs (fio with buffered=1) also reliably caused system instability, lock ups and crashes.
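If you want to reproduce the comparison with fio instead of dd, a buffered sequential-write job along these lines (the parameters are my own choice; the paths are taken from the question) should show the same gap between the dataset and the zvol:
# buffered 1M sequential writes, first against the raw zvol, then against a file on the dataset
> fio --name=zvol-seqwrite --filename=/dev/zvol/rpool/data/vm-112-disk-0 --rw=write --bs=1M --size=16G --ioengine=libaio --iodepth=4 --buffered=1 --end_fsync=1
> fio --name=dataset-seqwrite --filename=/var/lib/vz/images/fio-test.bin --rw=write --bs=1M --size=16G --ioengine=libaio --iodepth=4 --buffered=1 --end_fsync=1
buffered=1 is chosen deliberately, since that is the configuration sempervictus singles out as particularly problematic.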