I have set up ZFS on a fresh Ubuntu 20.04 install, but the speed seems awfully slow.
Each disk can deliver around 140 MB/s when they are all read in parallel:
$ parallel -j0 dd if={} of=/dev/null bs=1M count=1000 ::: /dev/disk/by-id/ata-WDC_WD80E*2
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.23989 s, 145 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.34566 s, 143 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.39782 s, 142 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.43704 s, 141 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.90308 s, 133 MB/s
$ iostat -dkx 1
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz aqu-sz %util
loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 214.00 136112.00 11.00 4.89 5.97 636.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.84 89.60
sdb 216.00 145920.00 0.00 0.00 5.71 675.56 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.77 92.40
sdc 216.00 147456.00 0.00 0.00 5.55 682.67 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.73 92.80
sdd 199.00 135168.00 0.00 0.00 5.77 679.24 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.73 88.40
sde 198.00 133120.00 0.00 0.00 5.82 672.32 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.75 88.00
So the raw parallel read performance is around 700 MB/s, and that is limited by CPU, not by disk I/O: read one at a time, each disk can deliver 170 MB/s.
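For reference, the 170 MB/s single-disk figure comes from reading one disk at a time, along these lines (the device name is a placeholder for one of the actual disks):
$ # Sequential read from a single disk while the others are idle
$ dd if=/dev/disk/by-id/ata-WDC_WD80E_DISKNAME of=/dev/null bs=1M count=1000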
If I build a RAID5 with the disks:
$ mdadm --create --verbose /dev/md0 --level=5 --raid-devices=5 /dev/disk/by-id/ata-WDC_WD80E*2
# To stop the initial parity computation:
$ echo frozen > /sys/block/md0/md/sync_action
$ echo 0 > /proc/sys/dev/raid/speed_limit_max
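To make sure the resync really is paused before benchmarking, something like this can be checked (output omitted):
$ cat /proc/mdstat
$ cat /sys/block/md0/md/sync_action   # should print "frozen"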
$ dd if=/dev/md0 of=/dev/null bs=1M count=10k
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 40.3711 s, 266 MB/s
$ iostat -dkx 1
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz aqu-sz %util
loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 264.00 270336.00 0.00 0.00 0.00 1024.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 134.00 68608.00 16764.00 99.21 5.48 512.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.54 73.20
sdc 134.00 68608.00 16764.00 99.21 5.56 512.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.54 74.00
sdd 134.00 68608.00 16764.00 99.21 5.65 512.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.54 74.00
sde 134.00 68608.00 16764.00 99.21 5.84 512.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.54 74.80
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
This is worse than the 700 MB/s of raw parallel reads. top says the md0_raid5 process takes 80% of a core and dd takes 60%. I do not see where the bottleneck is here: the disks are not 100% busy and neither are the CPUs.
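To rule out queueing limits on the md side, the relevant knobs can be inspected like this (a sketch; assuming default settings):
$ blockdev --getra /dev/md0                # read-ahead of the md device, in 512-byte sectors
$ cat /sys/block/md0/md/stripe_cache_size  # RAID5/6 stripe cache size, in pages per device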
On the same 5 disks I instead create a ZFS pool:
zpool create -f -o ashift=12 -O acltype=posixacl -O canmount=off -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa rpool raidz /dev/disk/by-id/ata-WDC_WD80E*2
zfs create -o mountpoint=/data rpool/DATA
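The resulting layout and the dataset properties can be double-checked like this (output omitted):
$ zpool status rpool
$ zfs get recordsize,compression,relatime,xattr rpool/DATA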
Then I write some data:
$ seq 100000000000 | dd bs=1M count=10k iflag=fullblock > out
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 91.2438 s, 118 MB/s
$ iostat -dkx 1
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz aqu-sz %util
loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 79.00 27460.00 8.00 9.20 15.99 347.59 0.00 0.00 0.00 0.00 0.00 0.00 1.12 36.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 113.00 25740.00 0.00 0.00 9.73 227.79 0.00 0.00 0.00 0.00 0.00 0.00 0.88 38.00
sdc 2.00 0.00 0.00 0.00 2.50 0.00 319.00 25228.00 2.00 0.62 1.68 79.08 0.00 0.00 0.00 0.00 0.00 0.00 0.28 31.20
sdd 0.00 0.00 0.00 0.00 0.00 0.00 111.00 25872.00 0.00 0.00 10.87 233.08 0.00 0.00 0.00 0.00 0.00 0.00 0.96 40.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 109.00 26356.00 0.00 0.00 10.74 241.80 0.00 0.00 0.00 0.00 0.00 0.00 0.93 40.80
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The z_wr_iss thread uses about 30% CPU.
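While the write is running, ZFS's own per-vdev view can be compared against iostat with something like:
$ zpool iostat -v rpool 1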
When I read back the data:
$ pv out >/dev/null
10.0GiB 0:00:44 [ 228MiB/s]
$ iostat -dkx 1
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz aqu-sz %util
loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 1594.00 51120.00 5.00 0.31 0.25 32.07 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.40
sdb 1136.00 36312.00 0.00 0.00 0.20 31.96 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 96.40
sdc 1662.00 53184.00 0.00 0.00 0.20 32.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
sdd 1504.00 48088.00 0.00 0.00 0.19 31.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.60
sde 1135.00 36280.00 0.00 0.00 0.21 31.96 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.20
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
iostat -dkx 1 says the disks are 100% I/O utilized. top says pv uses around 60% CPU and zthr_procedure around 30% CPU. With 2 cores this leaves a full core idle.
This surprises me: if the raw parallel read performance is 700 MB/s, I would expect ZFS to be able to utilize that too and become CPU-constrained (probably giving on the order of 300 MB/s).
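In case prefetching is what limits the reads, it can be checked whether it is enabled and whether it is hitting (a sketch; the exact kstat field names vary between OpenZFS versions):
$ cat /sys/module/zfs/parameters/zfs_prefetch_disable   # 0 means prefetch is enabled
$ grep prefetch /proc/spl/kstat/zfs/arcstats            # prefetch hit/miss counters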
My only guess is that ZFS uses a tiny stripe or block size and forces the drives to flush their caches often, but that would hardly make sense for reads.
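One way to test that guess is to look at the record size and ashift actually in use (a sketch; zdb output format varies by version):
$ zfs get recordsize rpool/DATA
$ zdb -C rpool | grep ashift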
Why does iostat -dkx 1 say the disks are 100% I/O utilized at only around 40 MB/s per disk when using ZFS? Can I recreate the pool in a way that lets ZFS make better use of the disks' I/O capacity?