I am running Ubuntu 24.04 using ZFS for my filesystems. This is on a laptop whose only storage device is a WD Black SN850X NVMe card. The default Ubuntu installation process configured two ZFS pools:
```
                                          capacity     operations     bandwidth
pool                                    alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
bpool                                    187M  1.69G      0      0    381    204
  86349523-abd9-7a45-ab84-60d7622c240f   187M  1.69G      0      0    381    204
--------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                    286G   634G     13     31  1.11M   796K
  cc31ec4d-1dd2-ed4f-9f90-fa99ec5aa3a2   286G   634G     13     31  1.11M   796K
--------------------------------------  -----  -----  -----  -----  -----  -----
```
`/tmp` is part of the root mount, which is in rpool. My `/tmp` folder briefly had over 2 million files in it due to a bug in some code. When there were that many files in it, performance took a nosedive -- even just listing files (without sorting) would pause for upwards of a second. I removed most of the files, and things are back down to a manageable level now. But operations on the list of files in `/tmp` are still slow.
When I time `ls --sort=none` on e.g. `/bin`, which has 2,842 entries in it, I get something like:
```
real    0m0.088s
user    0m0.001s
sys     0m0.075s
```
But the same command on `/tmp`, which currently has 4,444 entries:
```
real    0m0.472s
user    0m0.007s
sys     0m0.446s
```
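(For reproducibility, a measurement like the above can be taken as follows; redirecting to `/dev/null` is my addition, to keep terminal rendering out of the timing:)

```
# Time an unsorted listing without paying for terminal output.
time ls --sort=none /tmp > /dev/null
```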
It seems that briefly housing 2 million files has left a permanent impact on the structure of `/tmp`. Is there a way to fix this? Or do I just need to make a new `/tmp` and cut over to it?
Somewhere above millions of files in a directory, performance will be much worse. It does not really matter which file system you use or how many IOPS the block device can deliver: POSIX semantics impose significant overhead to maintain the files-in-a-directory concept. Diagnosing it then becomes an exercise in understanding file system internals.
From your flame graph, I am not surprised that most of the stacks originate in readdir calls. I am surprised that the top of the stack, where the time is actually spent, is mostly LZ4 decompression, which is a fast algorithm. Hundreds of milliseconds of CPU time doing that implies lots of metadata, lots of calls to getdents64, or something else being slow.
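One quick way to test the getdents64 theory (my suggestion, not something from the thread) is to count the syscalls directly:

```
# Count getdents64 calls made during an unsorted listing of /tmp.
# Far more calls than you'd expect for ~4,400 entries points at a
# bloated on-disk directory structure rather than at ls itself.
strace -c -e trace=getdents64 ls --sort=none /tmp > /dev/null
```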
From what little I understand about the ZFS on-disk format, datasets have their own sets of objects. So yes, you could make a new tmp dataset in the root pool and mount it over the existing /tmp. Copying data is not required, as these are temporary files.
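A minimal sketch of that approach, assuming the rpool from the question and a hypothetical dataset name `rpool/tmp` (best done when nothing is holding files in /tmp open):

```
# Create a fresh dataset and mount it over the existing /tmp.
# "rpool/tmp" is a placeholder; adjust it to your pool layout.
sudo zfs create -o mountpoint=/tmp -o setuid=off -o devices=off rpool/tmp
sudo chmod 1777 /tmp   # restore the sticky, world-writable /tmp permissions
```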
Or put a tmpfs on /tmp, and simplify things by taking both ZFS and the block device out of the picture.
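For example, a typical `/etc/fstab` line for that (the `size=2G` cap is an arbitrary illustration, not a value from this thread):

```
# RAM-backed /tmp; contents are discarded on every reboot.
tmpfs  /tmp  tmpfs  mode=1777,nosuid,nodev,size=2G  0  0
```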
Way too late to prevent this too-many-files problem, but OpenZFS does have object quotas: `groupobjquota@group` to set, and `zfs groupspace` to list. The same exists per user (`userobjquota@user`, `zfs userspace`) and per project. A sketch is below.
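A minimal sketch, using a hypothetical group name `devs`, the placeholder dataset `rpool/tmp` from above, and an arbitrary one-million-object cap:

```
# Limit how many objects (files, directories) the group may create
# on this dataset; the names and the limit are all placeholders.
sudo zfs set groupobjquota@devs=1000000 rpool/tmp

# Show per-group object usage and quotas for the dataset.
zfs groupspace -o name,objused,objquota rpool/tmp
```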
I now have the answer to this. So yes, it is a known issue. In the internal terminology of ZFS, "if ZAP records are deleted such that an entire leaf block of the ZAP object is emptied, the block is not reclaimed." But not only is it a known issue, it is a fixed issue. :-) The fix isn't yet in any shipping version, but it is expected to be soon.
This is the fix:
https://github.com/openzfs/zfs/pull/15888
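Until a release ships with that fix, you can inspect the bloated directory ZAP yourself with zdb (a sketch: the dataset name is a placeholder, and it relies on ZFS inode numbers equaling object numbers):

```
# Dump the ZAP object backing /tmp to inspect its leaf blocks.
# "rpool/ROOT/ubuntu" is a placeholder; use your real root dataset.
OBJ=$(stat -c %i /tmp)
sudo zdb -dddd rpool/ROOT/ubuntu "$OBJ"
```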