I have a related question about this problem, but it got too complicated and too big, so I decided to split it into an NFS issue and a local one. I have also tried asking about this on the zfs-discuss mailing list without much success.
Slow copying between NFS/CIFS directories on same server
Outline: how I'm set up and what I'm expecting
- I have a ZFS pool with 4 disks: 2TB REDs configured as 2 mirrors that are striped (RAID 10), on Linux with zfsonlinux. There are no cache or log devices.
- Data is balanced across mirrors (important for ZFS)
- Each disk can read (raw, with dd) at 147MB/sec in parallel, giving a combined throughput of 588MB/sec (see the sketch after this list).
- I expect about 115MB/sec write, 138MB/sec read and 50MB/sec rewrite of sequential data from each disk, based on benchmarks of a similar 4TB RED disk. I expect no less than 100MB/sec read or write, since any disk can do that these days.
- I thought I'd see 100% IO utilization on all 4 disks when reading or writing sequential data under load, and that the disks would be putting out over 100MB/sec while at 100% utilization.
- I thought the pool would give me around 2x write, 2x rewrite, and 4x read performance over a single disk - am I wrong?
- NEW: I thought an ext4 zvol on the same pool would be about the same speed as a ZFS dataset
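For reference, the raw parallel read mentioned above can be reproduced with something along these lines (the device names are examples; they happen to match the iostat output further down, but double-check yours before running anything against raw disks):

# Read 4GB from each raw pool member in parallel and note the per-disk rate dd reports
for d in sda sdb sdd sdf; do
    dd if=/dev/$d of=/dev/null bs=1M count=4096 iflag=direct &
done
wait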
What I actually get
I find the read performance of the pool is not nearly as high as I expected
bonnie++ benchmark on pool from a few days ago
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
igor            63G    99  99 232132  47 118787  27   336  97 257072  22  92.7   6
bonnie++ on a single 4TB RED drive on its own in a zpool
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
igor            63G   101  99 115288  30  49781  14   326  97 138250  13 111.6   8
According to this, the write and rewrite speeds are appropriate based on the results from a single 4TB RED drive (they are double). However, the read speed I was expecting would have been about 550MB/sec (4x the speed of the 4TB drive), and I would at least hope for around 400MB/sec. Instead I am seeing around 260MB/sec.
bonnie++ on the pool from just now, while gathering the information below. The results are not quite the same as before, even though nothing has changed.
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
igor            63G   103  99 207518  43 108810  24   342  98 302350  26 256.4  18
zpool iostat during write. Seems OK to me.
                                                capacity     operations    bandwidth
pool                                         alloc   free   read  write   read  write
-------------------------------------------- -----  -----  -----  -----  -----  -----
pool2                                        1.23T  2.39T      0  1.89K  1.60K   238M
  mirror                                      631G  1.20T      0    979  1.60K   120M
    ata-WDC_WD20EFRX-68AX9N0_WD-WMC300004469     -      -      0   1007  1.60K   124M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4MLK57MVX     -      -      0    975      0   120M
  mirror                                      631G  1.20T      0    953      0   117M
    ata-WDC_WD20EFRX-68AX9N0_WD-WCC1T0429536     -      -      0  1.01K      0   128M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M0VYKFCE     -      -      0    953      0   117M
zpool iostat during rewrite. Seems ok to me, I think.
                                                capacity     operations    bandwidth
pool                                         alloc   free   read  write   read  write
-------------------------------------------- -----  -----  -----  -----  -----  -----
pool2                                        1.27T  2.35T   1015    923   125M   101M
  mirror                                      651G  1.18T    505    465  62.2M  51.8M
    ata-WDC_WD20EFRX-68AX9N0_WD-WMC300004469     -      -    198    438  24.4M  51.7M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4MLK57MVX     -      -    306    384  37.8M  45.1M
  mirror                                      651G  1.18T    510    457  63.2M  49.6M
    ata-WDC_WD20EFRX-68AX9N0_WD-WCC1T0429536     -      -    304    371  37.8M  43.3M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M0VYKFCE     -      -    206    423  25.5M  49.6M
This is where I wonder what's going on
zpool iostat during read
                                                capacity     operations    bandwidth
pool                                         alloc   free   read  write   read  write
-------------------------------------------- -----  -----  -----  -----  -----  -----
pool2                                        1.27T  2.35T  2.68K     32   339M   141K
  mirror                                      651G  1.18T  1.34K     20   169M  90.0K
    ata-WDC_WD20EFRX-68AX9N0_WD-WMC300004469     -      -    748      9  92.5M  96.8K
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4MLK57MVX     -      -    623     10  76.8M  96.8K
  mirror                                      651G  1.18T  1.34K     11   170M  50.8K
    ata-WDC_WD20EFRX-68AX9N0_WD-WCC1T0429536     -      -    774      5  95.7M  56.0K
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M0VYKFCE     -      -    599      6  74.0M  56.0K
iostat -x during the same read operation. Note how %util is not at 100%.
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.60     0.00  661.30    6.00 83652.80    49.20   250.87     2.32    3.47    3.46    4.87   1.20  79.76
sdd               0.80     0.00  735.40    5.30 93273.20    49.20   251.98     2.60    3.51    3.51    4.15   1.20  89.04
sdf               0.50     0.00  656.70    3.80 83196.80    31.20   252.02     2.23    3.38    3.36    6.63   1.17  77.12
sda               0.70     0.00  738.30    3.30 93572.00    31.20   252.44     2.45    3.33    3.31    7.03   1.14  84.24
zpool and test dataset settings (the commands to check these are sketched after the list):
- atime is off
- compression is off
- ashift is 0 (autodetect; my understanding is that this is OK)
- zdb says disks are all ashift=12
- module options: zfs zvol_threads=32 zfs_arc_max=17179869184
- sync = standard
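For reference, these can be confirmed with something like the sketch below (pool name as in this setup):

# Dataset/pool properties
zfs get atime,compression,recordsize,sync pool2
# Per-vdev ashift as reported by zdb
zdb | grep ashift
# Module parameters currently in effect
cat /sys/module/zfs/parameters/zfs_arc_max /sys/module/zfs/parameters/zvol_threads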
Edit - Oct 30, 2015
I did some more testing (a rough sketch of the dd runs follows this list):
- dataset bonnie++ w/recordsize=1M = 226MB write, 392MB read (much better)
- dataset dd w/recordsize=1M = 260MB write, 392MB read (much better)
- zvol w/ext4 dd bs=1M = 128MB write, 107MB read (why so slow?)
- dataset, 2 processes in parallel = 227MB write, 396MB read
- dd with direct IO makes no difference on the dataset or on the zvol
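The dd runs were along the lines of the sketch below; the paths and sizes are illustrative rather than the exact commands used:

# Dataset test (dataset mounted at /pool2/test in this sketch)
dd if=/dev/zero of=/pool2/test/ddfile bs=1M count=16384
dd if=/pool2/test/ddfile of=/dev/null bs=1M
# zvol test (zvol formatted ext4 and mounted at /mnt/zvoltest in this sketch)
dd if=/dev/zero of=/mnt/zvoltest/ddfile bs=1M count=16384
dd if=/mnt/zvoltest/ddfile of=/dev/null bs=1M
# (export/import the pool or reboot between the write and the read so the read
#  is not simply served from the ARC)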
I am much happier with the performance now that I have increased the recordsize. Almost every file on the pool is well over 1MB, so I'll leave it like that. The disks are still not reaching 100% utilization, which makes me wonder whether it could still be much faster. And now I'm wondering why the zvol performance is so lousy, as that is something I (lightly) use.
I am happy to provide any information requested in the comments/answers. There is also a ton of information posted in my other question: Slow copying between NFS/CIFS directories on same server.
I am fully aware that I may just not understand something and that this may not be a problem at all. Thanks in advance.
To make it clear, the question is: why isn't the ZFS pool as fast as I expect? And is there perhaps anything else wrong?
I managed to get speeds very close to the numbers I was expecting.
I was looking for 400MB/sec and managed 392MB/sec, so I'd call that problem solved. With the later addition of a cache device, I managed 458MB/sec read (from cache, I believe).
1. This at first was achieved simply by increasing the ZFS dataset recordsize value to 1M.

I believe this change just results in less disk activity, and thus more efficient large synchronous reads and writes. Exactly what I was asking for.
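For anyone wanting to do the same, it is a one-line property change (the dataset name below is just an example; recordsize above 128K needs the large_blocks pool feature, and the change only affects files written afterwards):

# Allow 1M records on the dataset; existing files keep their old record size
zfs set recordsize=1M pool2/mydataset
# Verify
zfs get recordsize pool2/mydataset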
Results after the change
2. I managed even better when I added a cache device (a 120GB SSD). The write is a tad slower; I'm not sure why.

The trick with the cache device was to set l2arc_noprefetch=0 in /etc/modprobe.d/zfs.conf. It allows ZFS to cache streaming/sequential data. Only do this if your cache device is faster than your array, like mine.
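A sketch of the relevant commands on zfsonlinux (the SSD's by-id name below is a placeholder; use your own device):

# Make the setting persistent across reboots
echo "options zfs l2arc_noprefetch=0" >> /etc/modprobe.d/zfs.conf
# Apply it immediately without reloading the module
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
# Attach the SSD as an L2ARC cache device
zpool add pool2 cache /dev/disk/by-id/ata-SomeSSD_serial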
After benefiting from the recordsize change on my dataset, I figured there might be a similar way to deal with the poor zvol performance.

I came across several people mentioning that they had obtained good performance using volblocksize=64k, so I tried it. No luck.

But then I read that ext4 (the filesystem I was testing with) supports RAID options like stride and stripe-width, which I had never used before. So I used this site to calculate the settings needed: https://busybox.net/~aldot/mkfs_stride.html and formatted the zvol again.
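The format command ends up looking something like the following; the zvol path is a placeholder, and the stride/stripe-width values are just what the calculator gives for a 64k chunk on 2 data-bearing mirrors, so plug in your own numbers:

# Example only: 64k volblocksize / 4k ext4 blocks -> stride=16; 2 data mirrors -> stripe-width=32
mkfs.ext4 -b 4096 -E stride=16,stripe-width=32 /dev/zvol/pool2/myvol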
I ran bonnie++ to do a simple benchmark and the results were excellent. I don't have them with me unfortunately, but they were at least 5-6x faster for writes as I recall. I'll update this answer if I benchmark again.

Your results are perfectly reasonable, while your expectations are not: you overstate the read performance improvement given by RAID1 (and, by extension, by RAID10). The point is that a 2-way mirror gives at most 2x the read speed/IOPS of a single disk, but real-world performance can be anywhere between 1x and 2x.
Let's clarify with an example. Imagine a system with a 2-way mirror, each disk capable of 100 MB/s (sequential) and 200 IOPS. With a queue depth of 1 (at most one single outstanding request) this array has no advantage over a single disk: RAID1 splits IO requests across the two disks' queues, but it does not split a single request over two disks (at least, every implementation I have seen behaves in this manner). On the other hand, if your IO queue is bigger (e.g. you have 4/8 outstanding requests), total disk throughput will be significantly higher than that of a single disk.
A similar point can be made for RAID0, but in this case the average improvement is a function not only of queue size but also of IO request size: if your average IO size is smaller than the chunk size, then it will not be striped over two (or more) disks, but will be served by a single one. Your results with the increased Bonnie++ recordsize show this exact behavior: striping greatly benefits from bigger IO sizes.
Now it should be clear that combining the two RAID levels in a RAID10 array will not lead to linear performance scaling, but it does set an upper limit for it. I'm pretty sure that if you run multiple dd/bonnie++ instances (or use fio to directly manipulate the IO queue depth) you will get results more in line with your original expectations, simply because you will tax your IO array in a more complete manner (multiple outstanding sequential/random IO requests), rather than loading it with single, sequential IO requests alone.
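For example, something along these lines keeps several large sequential requests outstanding at once (the directory, size and queue depth are just placeholders):

# 4 concurrent sequential readers, each keeping 8 async 1M requests in flight
fio --name=seqread --directory=/pool2/test --rw=read --bs=1M \
    --size=4G --numjobs=4 --ioengine=libaio --iodepth=8 --group_reporting

With numjobs and iodepth both above 1, the array sees several outstanding requests per disk, which is exactly the condition described above.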
ZFS writes aren't really fast, but they're not bad. ZFS reads are extremely slow; take a look for yourself:

1) Preparation: cd /mytestpool/mytestzfs; for f in urf{0..9}; do dd if=/dev/urandom of=$f bs=1M count=102400; done
   Also pick a directory path with lots of subdirs and files (~50GB in total) and check its size with e.g.: du -sh /mytestpool/mytestzfs/appsdir

2) Reboot.

3) time cat /mytestpool/mytestzfs/urf0 >/dev/null
   date; for f in /mytestpool/mytestzfs/urf{1..9}; do cat $f >/dev/null & done; wait; date
   time tar cf - /mytestpool/mytestzfs/appsdir | cat - >/dev/null

4) Watch iostat, iotop or zpool iostat while the reads run and see for yourself what is going on there!

5) After the reads are done, take a calculator and divide the single file size by the seconds it took, divide 9x the file size by the seconds the parallel run took, and divide the directory size by the seconds the tar run took.

That's what you get out of your ZFS as the disks get more and more full of data, holding more than you have memory.