Been running a couple of fio tests on a new server with the following setup:
- 1x Samsung PM981a 512GB M.2 NVMe drive.
- Proxmox installed with ZFS on root.
- 1x VM with 30GB space created and Debian 10 installed.
- 6x Intel P4510 2TB U.2 NVMe drives connected to 6x dedicated PCIe 4.0 x4 lanes with OCuLink.
- Directly attached to the single VM.
- Configured as RAID10 in the VM (3x mirrors striped); see the pool layout sketch after this list.
- Motherboard / CPU / memory: ASUS KRPA-U16 / EPYC 7302P / 8x32GB DDR4-3200
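For reference on the RAID10 item above, a pool of three striped mirrors would be created along these lines; the pool name and device names here are placeholders, not taken from the actual setup:
# ZFS "RAID10": three mirror vdevs, which the pool stripes across automatically.
# Pool and device names are hypothetical.
zpool create tank \
  mirror /dev/nvme0n1 /dev/nvme1n1 \
  mirror /dev/nvme2n1 /dev/nvme3n1 \
  mirror /dev/nvme4n1 /dev/nvme5n1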
The disks are rated at up to 3,200 MB/s sequential read, so in theory the six of them together should give a maximum bandwidth of 19.2 GB/s.
Running fio with numjobs=1 on the ZFS RAID I'm getting results in the range of ~2,000-3,000 MB/s (the disks are capable of the full 3,200 MB/s when tested without ZFS or any other overhead, for example with CrystalDiskMark in Windows installed directly on one of the disks):
fio --name=Test --size=100G --bs=1M --iodepth=8 --numjobs=1 --rw=read --filename=fio.test
=>
Run status group 0 (all jobs):
READ: bw=2939MiB/s (3082MB/s), 2939MiB/s-2939MiB/s (3082MB/s-3082MB/s), io=100GiB (107GB), run=34840-34840msec
Seems reasonable, everything considered. It might also be CPU-limited, as one of the cores sits at 100% load during the test (with some of that time spent on ZFS processes).
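To double-check the single-core suspicion, per-core utilization can be watched from a second terminal while fio runs; a minimal sketch, assuming the sysstat package is installed in the VM:
# Per-core CPU utilization, refreshed every second; one core pinned near 100%
# while the rest idle points at a single-threaded bottleneck:
mpstat -P ALL 1
# Thread-level view (fio workers vs. ZFS kernel threads):
top -H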
When I increase numjobs to 8-10, though, things get a bit weird:
fio --name=Test --size=100G --bs=1M --iodepth=8 --numjobs=10 --rw=read --filename=fio.test
=>
Run status group 0 (all jobs):
READ: bw=35.5GiB/s (38.1GB/s), 3631MiB/s-3631MiB/s (3808MB/s-3808MB/s), io=1000GiB (1074GB), run=28198-28199msec
38.1 GB/s - well above the theoretical maximum bandwidth.
What exactly is the explanation here?
Additions in response to comments:
VM configuration: (not shown)
iotop during the test: (not shown)
The first fio run (the one with --numjobs=1) executes its reads sequentially, getting little benefit from your striped mirrors beyond quick read-ahead/prefetch: iodepth only applies to async reads done via the libaio engine, which in turn requires true O_DIRECT support (which ZFS lacks). You can try increasing the prefetch window from the default 8M to something like 64M (echo 67108864 > /sys/module/zfs/parameters/zfetch_max_distance). Of course your mileage may vary, so be sure to check that this does not impair other workloads.
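If the larger prefetch window turns out to help, it can be made persistent as a ZFS module option; a minimal sketch, assuming standard ZFS-on-Linux module parameter handling on a Debian-based system:
# Runtime change, takes effect immediately but is lost on reboot:
echo 67108864 > /sys/module/zfs/parameters/zfetch_max_distance
# Persistent variant, applied whenever the zfs module is loaded:
echo "options zfs zfetch_max_distance=67108864" >> /etc/modprobe.d/zfs.conf
# On root-on-ZFS installs the module loads from the initramfs, so refresh it too:
update-initramfs -u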
The second fio run (the one with --numjobs=10) is probably skewed by ARC caching. To be sure, simply open another terminal running dstat -d -f: you will see the true transfer speed of each disk, and it will surely be in line with their theoretical max transfer rate. You can also retry the fio test on a freshly booted machine (i.e., with an empty ARC) to see if things change.
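Another way to confirm that the ARC is serving those reads is to compare its hit/miss counters before and after a run; a rough sketch, assuming ZFS on Linux with its kstats exposed under /proc/spl/kstat:
# Snapshot the ARC counters before the fio run:
grep -E '^(hits|misses|size) ' /proc/spl/kstat/zfs/arcstats
# ...run the fio test, then take a second snapshot and compare the deltas:
grep -E '^(hits|misses|size) ' /proc/spl/kstat/zfs/arcstats
# The arcstat utility shipped with recent OpenZFS releases shows the same data live:
arcstat 1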
For sequential I/O tests with multiple jobs, each job (i.e., thread) has a thread-specific file pointer (block address for raw devices) that starts at zero by default and advances independently of the other threads. That means fio will issue read requests to the filesystem with duplicate/overlapping file pointers/block addresses across the jobs. You can see this in action if you use the write_iolog option. The overlapping requests will skew the benchmark result, since they will likely be satisfied by a read cache, either in the filesystem (when testing against a file) or by the device (when running on a raw volume).
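To make the overlap visible, each job can log its own I/O and the recorded offsets can be compared afterwards; a small sketch with a hypothetical job file (fio's documentation asks for a separate write_iolog file per job so the logs don't get interleaved):
# overlap.fio - two sequential readers over the same file, each logging its I/O
[global]
rw=read
bs=1M
size=1G
filename=fio.test

[reader1]
write_iolog=reader1.log

[reader2]
write_iolog=reader2.log
Running fio overlap.fio and looking at the first read entries in reader1.log and reader2.log should show both jobs starting at offset 0 and walking the same block addresses.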
What you want instead is a single job, and then modify the iodepth parameter exclusively to control the I/O queue depth. This specifies the number of concurrent I/Os each job is allowed to have in flight.

The only downside is that total achievable IOPS may become single-core/thread limited. This shouldn't be a problem for large-block sequential workloads, since they're not IOPS-bound. For random I/O you definitely want to use multiple jobs, especially on NVMe drives that can handle upwards of a million IOPS.
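As a sketch of that approach (the queue depth here is just an illustrative value), the single-job variant of the test could look like:
# One job, deeper queue; ioengine=libaio makes iodepth meaningful for async reads,
# although, per the point above, ZFS's missing O_DIRECT support limits how
# asynchronous buffered reads can really be:
fio --name=Test --size=100G --bs=1M --rw=read --filename=fio.test \
    --ioengine=libaio --iodepth=32 --numjobs=1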