Today I did some tests on L2ARC using the latest ZFS on Linux 0.7.10. I saw that the L2ARC gets filled with data, but with the default module settings the data residing in the L2ARC is never touched. Instead, the data is read from the vdevs of the main pool. I have also seen this behaviour in 0.7.9, and I am not sure if it is expected.
Even if it is the expected behaviour, I think it is odd to spoil the L2ARC with data that is never read.
The test installation is a VM:
- CentOS 7.5 with latest patches
- ZFS on Linux 0.7.10
- 2GB RAM
To speed up the L2ARC population, I changed one ZFS module parameter:
l2arc_headroom=1024
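For reference, such module parameters can be changed at runtime via sysfs (just a sketch; a change made this way takes effect immediately but does not survive a reboot):
[root@host ~]# echo 1024 > /sys/module/zfs/parameters/l2arc_headroom
[root@host ~]# cat /sys/module/zfs/parameters/l2arc_headroom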
Here is how the pool was created, and its layout. I know it is rather odd for a real-world setup, but this was intended for L2ARC testing only.
[root@host ~]# zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc cache sdd -f
[root@host ~]# zpool list -v
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 2.95G 333K 2.95G - 0% 0% 1.00x ONLINE -
raidz2 2.95G 333K 2.95G - 0% 0%
sda - - - - - -
sdb - - - - - -
sdc - - - - - -
cache - - - - - -
sdd 1010M 512 1009M - 0% 0%
Now write some data to a file and look at the device usage.
[root@host ~]# dd if=/dev/urandom of=/tank/testfile bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 9.03607 s, 59.4 MB/s
[root@host ~]# zpool list -v
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 2.95G 1.50G 1.45G - 10% 50% 1.00x ONLINE -
raidz2 2.95G 1.50G 1.45G - 10% 50%
sda - - - - - -
sdb - - - - - -
sdc - - - - - -
cache - - - - - -
sdd 1010M 208M 801M - 0% 20%
Alright, some of the data has already been moved to the L2ARC, but not all of it. So, I read the file a few more times to get it into the L2ARC completely.
[root@host ~]# dd if=/tank/testfile of=/dev/null bs=512 # until L2ARC is populated with the 512MB testfile
[root@host ~]# zpool list -v
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 2.95G 1.50G 1.45G - 11% 50% 1.00x ONLINE -
raidz2 2.95G 1.50G 1.45G - 11% 50%
sda - - - - - -
sdb - - - - - -
sdc - - - - - -
cache - - - - - -
sdd 1010M 512M 498M - 0% 50%
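The fill level can also be cross-checked against the arcstats counters (a sketch; l2_size is the logical size and l2_asize the allocated size of the L2ARC contents):
[root@host ~]# grep -E '^l2_(size|asize) ' /proc/spl/kstat/zfs/arcstats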
Okay, the L2ARC is populated and ready to be read. But one needs to get rid of the L1ARC first. I did the following, which seemed to work.
[root@host ~]# echo $((64*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max; sleep 5s; echo $((1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max; sleep 5s; arc_summary.py -p1
------------------------------------------------------------------------
ZFS Subsystem Report Sun Sep 09 17:03:55 2018
ARC Summary: (HEALTHY)
Memory Throttle Count: 0
ARC Misc:
Deleted: 20
Mutex Misses: 0
Evict Skips: 1
ARC Size: 0.17% 1.75 MiB
Target Size: (Adaptive) 100.00% 1.00 GiB
Min Size (Hard Limit): 6.10% 62.48 MiB
Max Size (High Water): 16:1 1.00 GiB
ARC Size Breakdown:
Recently Used Cache Size: 96.06% 1.32 MiB
Frequently Used Cache Size: 3.94% 55.50 KiB
ARC Hash Breakdown:
Elements Max: 48
Elements Current: 100.00% 48
Collisions: 0
Chain Max: 0
Chains: 0
Alright, now we are ready to read from the L2ARC (sorry for the long preface, but I thought it was important).
So, running the dd if=/tank/testfile of=/dev/null bs=512 command again, I watched zpool iostat -v 5 in a second terminal.
To my surprise, the file was read from the normal vdevs instead of the L2ARC, although it sits completely in the L2ARC. This is the only file in the filesystem, and no other activity was going on during my tests.
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 1.50G 1.45G 736 55 91.9M 96.0K
raidz2 1.50G 1.45G 736 55 91.9M 96.0K
sda - - 247 18 30.9M 32.0K
sdb - - 238 18 29.8M 32.0K
sdc - - 250 18 31.2M 32.0K
cache - - - - - -
sdd 512M 498M 0 1 85.2K 1.10K
---------- ----- ----- ----- ----- ----- -----
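Instead of zpool iostat, the L2ARC hit/miss counters can also be used to check whether reads are served from the cache device (a sketch; take a snapshot of the counters before and after the dd run and compare):
[root@host ~]# grep -E '^l2_(hits|misses) ' /proc/spl/kstat/zfs/arcstats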
I then fiddled around with some settings like zfetch_array_rd_sz, zfetch_max_distance, zfetch_max_streams, l2arc_write_boost and l2arc_write_max, setting them to absurdly high values. But nothing changed.
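One way to inspect and bump these parameters, just as a sketch (1073741824 is simply an arbitrarily large value):
[root@host ~]# for p in zfetch_array_rd_sz zfetch_max_distance zfetch_max_streams l2arc_write_boost l2arc_write_max; do echo -n "$p: "; cat /sys/module/zfs/parameters/$p; done
[root@host ~]# echo 1073741824 > /sys/module/zfs/parameters/l2arc_write_max # same pattern for the other parameters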
After changing l2arc_noprefetch=0 (default is 1), or alternatively zfs_prefetch_disable=1 (default is 0), i.e. toggling either one away from its default, the reads are served from the L2ARC. Again I ran dd if=/tank/testfile of=/dev/null bs=512 while watching zpool iostat -v 5 in a second terminal, after first getting rid of the L1ARC as before.
[root@host ~]# echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
[root@host ~]# echo $((64*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max; sleep 5s; echo $((1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max; sleep 5s; arc_summary.py -p1
...
[root@host ~]# dd if=/tank/testfile of=/dev/null bs=512
And the result:
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 1.50G 1.45G 0 57 921 102K
raidz2 1.50G 1.45G 0 57 921 102K
sda - - 0 18 0 34.1K
sdb - - 0 18 0 34.1K
sdc - - 0 19 921 34.1K
cache - - - - - -
sdd 512M 497M 736 0 91.9M 1023
---------- ----- ----- ----- ----- ----- -----
Now data is read from L2ARC, but only after toggling the module parameters mentioned above.
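To keep this setting across reboots, the parameter can also be set via a modprobe options file instead of sysfs (a sketch, assuming /etc/modprobe.d/zfs.conf is used for ZFS module options):
[root@host ~]# echo "options zfs l2arc_noprefetch=0" >> /etc/modprobe.d/zfs.conf # takes effect on the next module load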
I have also read that an L2ARC can be sized too big, but the threads I found on that topic referred to performance problems or to the space map for the L2ARC spoiling the L1ARC.
Performance is not my problem here, and as far as I can tell the space map for the L2ARC is not that big either.
[root@host ~]# grep hdr /proc/spl/kstat/zfs/arcstats
hdr_size 4 279712
l2_hdr_size 4 319488
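To put that number into perspective, the header size can be related to the amount of data cached in the L2ARC (a sketch; l2_size is the logical size of the L2ARC contents):
[root@host ~]# awk '$1 == "l2_hdr_size" {h = $3} $1 == "l2_size" {s = $3} END {if (s > 0) printf "L2ARC header bytes per cached byte: %.6f\n", h / s}' /proc/spl/kstat/zfs/arcstats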
As already mentioned, I am not sure if that is the intended behavior or if I am missing something.
So after reading up on this topic, mainly this post, it seems this is the default behaviour of ZFS.
What happens is that the file makes its way into the L1ARC after being read, and since its blocks are being accessed they are considered for placement in the L2ARC.
On a second read of the file, ZFS does prefetching on it, which bypasses the L2ARC even though the blocks of the file are stored there.
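The prefetcher kicking in on the second read can also be observed through its own kstats (a sketch; counters from /proc/spl/kstat/zfs/zfetchstats):
[root@host ~]# grep -E '^(hits|misses) ' /proc/spl/kstat/zfs/zfetchstats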
By disabling prefetching completely with zfs_prefetch_disable=1, or by telling ZFS to also serve prefetch reads from the L2ARC with l2arc_noprefetch=0, reads will make use of the blocks of the file residing in the L2ARC. This might be desired if your L2ARC is large compared to the sizes of the files being read.
Alternatively, one might want to put only metadata into the L2ARC with zfs set secondarycache=metadata tank. This prevents big files from ending up in the L2ARC and never being read, which would spoil the L2ARC and might evict blocks of smaller, non-prefetched files as well as metadata that you want to keep in the L2ARC.
I haven't found a way to tell ZFS to put only small files into the L2ARC and not to merge the prefetch candidates into it. So for now, depending on the file sizes and the L2ARC size, one has to make that tradeoff.
A different approach seems to be available in the ZoL 0.8.0 release, where it is possible to use different Allocation Classes, which should make it possible to e.g. put your metadata on fast SSDs while leaving data blocks on slow rotating disks. This will still leave the contention of small files vs. big files for the L2ARC, but it will solve the issue of fast access to metadata.
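As an untested sketch of what that could look like with the 0.8.0 allocation classes (the device names are placeholders):
[root@host ~]# zpool add tank special mirror /dev/sde /dev/sdf # dedicated metadata vdev on fast SSDs
[root@host ~]# zfs set special_small_blocks=32K tank # optionally store small data blocks there as well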
What happens in this case is that ZFS is trying to preserve L2ARC bandwidth for random/non-streaming reads, where hitting the physical disks would wreak havoc on performance. Streaming reads are served quite well by mechanical HDDs, and any pool with 6/8+ disks will probably outperform any SATA L2ARC device for sequential reads. Any medium-sized zpool (i.e. 24/48+ disks) will give plenty of real sequential bandwidth.
As you found, you can alter the L2ARC to make it behave more like a victim cache (i.e. store anything evicted from the ARC; if a block is found in the L2ARC, do not even try to access the main pool). On some specific setups this can be a good thing; however, ZFS was (correctly) architected to preserve L2ARC wear/usage for where it is really advantageous: caching actually used blocks for better random-read performance.