I know that the performance of ZFS heavily depends on the amount of free space:
> Keep pool space under 80% utilization to maintain pool performance. Currently, pool performance can degrade when a pool is very full and file systems are updated frequently, such as on a busy mail server. Full pools might cause a performance penalty, but no other issues. [...] Keep in mind that even with mostly static content in the 95-96% range, write, read, and resilvering performance might suffer.

ZFS Best Practices Guide, solarisinternals.com (via archive.org)
Now, suppose I have a raidz2 pool of 10T hosting a ZFS file system `volume`. I create a child file system `volume/test` and give it a reservation of 5T. Then I mount both file systems over NFS on some host and perform some work. I understand that I can't write more than 5T to `volume`, because the remaining 5T are reserved for `volume/test`.
My first question is: how will the performance drop if I fill my `volume` mount point with ~5T? Will it drop because there is no free space left in that file system for ZFS' copy-on-write and other metadata? Or will it remain the same, since ZFS can use the free space within the space reserved for `volume/test`?
Now the second question. Does it make a difference if I change the setup as follows? `volume` now has two child file systems, `volume/test1` and `volume/test2`, each given a 3T reservation (but no quotas). Assume I now write 7T to `test1`. Will the performance be the same for both file systems, or different for each? Will it drop, or remain the same?
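For concreteness, this is roughly how I set up the second scenario (dataset names are just the ones from my example above):

```shell
# Sketch of the second setup: two child file systems under the pool
# "volume", each with a 3T reservation and no quota.
zfs create volume/test1
zfs create volume/test2
zfs set reservation=3T volume/test1
zfs set reservation=3T volume/test2
zfs get reservation,quota volume/test1 volume/test2
```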
Thanks!
The performance degradation occurs when your zpool is either very full or very fragmented. The reason for this is the mechanism of free-block discovery employed by ZFS. Unlike other file systems such as NTFS or ext3, there is no block bitmap showing which blocks are occupied and which are free. Instead, ZFS divides each vdev of your pool into (usually 200) larger areas called "metaslabs" and stores AVL trees¹ of free-block information (the space map) for each metaslab. The balanced AVL tree allows for an efficient search for a block fitting the size of the request.
While this mechanism has been chosen for reasons of scale, unfortunately it also turned out to be a major pain when a high level of fragmentation and/or space utilization occurs. As soon as all metaslabs carry a significant amount of data, you get a large number of small areas of free blocks, as opposed to a small number of large areas when the pool is empty. If ZFS then needs to allocate 2 MB of space, it starts reading and evaluating all metaslabs' space maps to either find a suitable block or a way to break up the 2 MB into smaller blocks. This of course takes some time. What is worse is the fact that it costs a whole lot of I/O operations, as ZFS would indeed read all space maps off the physical disks. And it does this for every one of your writes.
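The effect of fragmentation on the search described above can be illustrated with a toy model (this is not ZFS code, just a sketch of the idea): each metaslab keeps a list of free segments, and an allocation has to examine segments until one is big enough. In an empty pool one huge segment satisfies the request immediately; in a fragmented pool every segment of every metaslab gets examined.

```python
# Toy model of free-block discovery across metaslabs (numbers invented):
# each metaslab is a list of free-segment sizes in bytes.

def segments_scanned(metaslabs, request):
    """Scan metaslabs' free segments until one of at least `request`
    bytes is found; return how many segments were examined."""
    scanned = 0
    for segments in metaslabs:
        for seg in segments:
            scanned += 1
            if seg >= request:
                return scanned
    return scanned  # no single segment fits; the request must be split

MB = 1024 * 1024
# Empty pool: one metaslab holds a single 1 GB free segment.
empty_pool = [[1024 * MB]] + [[] for _ in range(199)]
# Fragmented pool: 200 metaslabs, each with 50 free segments of 128 KB.
fragmented = [[128 * 1024] * 50 for _ in range(200)]

print(segments_scanned(empty_pool, 2 * MB))   # 1
print(segments_scanned(fragmented, 2 * MB))   # 10000 - every segment examined
```

The real allocator is smarter than a linear scan, but the shape of the problem is the same: the metadata that must be consulted per allocation grows with fragmentation.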
The drop in performance might be significant. If you fancy pretty pictures, take a look at the blog post over at Delphix which has some numbers taken from an (oversimplified but still valid) zfs pool. I am shamelessly stealing one of its graphs: the blue, red, yellow, and green lines represent (respectively) pools at 10%, 50%, 75%, and 93% capacity, plotted against write throughput in KB/s as the pool becomes fragmented over time:
A quick & dirty fix for this has traditionally been the metaslab debugging mode (issue

echo metaslab_debug/W1 | mdb -kw

at run time to change the setting instantly). In this case, all space maps are kept in OS RAM, removing the requirement for excessive and expensive I/O on each write operation. Ultimately, this also means you need more memory, especially for large pools, so it is a RAM-for-storage horse trade. Your 10 TB pool will probably cost you 2-4 GB of memory², but you will be able to drive it to 95% utilization without much hassle.

¹ it is a bit more complicated; if you are interested, look at Bonwick's post on space maps for details
² if you need a way to calculate an upper limit for the memory, use

zdb -mm <pool>

to retrieve the number of `segments` currently in use in each metaslab, divide it by two to model a worst-case scenario (each occupied segment would be followed by a free one), and multiply it by the record size of an AVL node (two memory pointers and a value, which given the 128-bit nature of zfs and 64-bit addressing would sum up to 32 bytes, although people seem to generally assume 64 bytes for some reason).

Reference: the basic outline is contained in this posting by Markus Kovero on the zfs-discuss mailing list, although I believe he made some mistakes in his calculation which I hope to have corrected in mine.
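The estimate from the footnote can be sketched as a few lines of arithmetic (the segment count is a made-up example; you would plug in the totals reported by `zdb -mm <pool>`, and the 64-byte node size is the commonly assumed figure rather than a guaranteed one):

```python
# Worst-case RAM estimate for keeping all space maps in memory:
# halve the total segment count (worst case: occupied and free segments
# alternate), then multiply by the assumed size of an AVL node.

def spacemap_ram_bytes(total_segments, avl_node_bytes=64):
    worst_case_free_segments = total_segments // 2
    return worst_case_free_segments * avl_node_bytes

# Hypothetical example: 100 million segments across all metaslabs.
ram = spacemap_ram_bytes(100_000_000)
print(f"{ram / 2**30:.1f} GiB")  # 3.0 GiB
```

That lands in the same 2-4 GB ballpark quoted above for a well-filled 10 TB pool; with the 32-byte node size the figure halves.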
Yes. You need to keep free space in your pool. It's mainly for copy-on-write actions and snapshots. Performance declines at about 85% utilization. You can go higher, but there's a definite impact.
Don't mess with reservations. Especially with NFS. It's not necessary. Maybe for a zvol, but not NFS.
I don't see the confusion, though. If you have 10T, don't use more than 85% of it. Size your shares appropriately, using quotas to cap their use. Or don't use any quotas and just monitor your overall pool usage.
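In practice that looks something like this (pool and share names are hypothetical):

```shell
# Cap each NFS share with a quota instead of using reservations,
# then keep an eye on overall pool utilization.
zfs set quota=2T tank/share1
zfs set quota=2T tank/share2
zpool list -H -o name,size,alloc,capacity tank
zfs get -r quota,used tank
```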