I have written a buggy program that has accidentally created about 30M files under /tmp. (The bug was introduced some weeks ago, and it was creating a couple of subdirectories per second.) I could rename /tmp to /tmp2, and now I need to delete the files. The system is FreeBSD 10, the root filesystem is zfs.
Meanwhile, one of the drives in the mirror failed, and I have replaced it. The mirror consists of two 120 GB SSDs.
Here is the question: replacing the hard drive and resilvering the whole array took less than an hour. Deleting the files in /tmp2 is another story. I have written another program to remove the files, and it can only delete 30-70 subdirectories per second. It will take 2-4 days to delete all of the files.
How is it possible that resilvering the whole array takes an hour, but deleting from the disk takes 4 days? Why is the performance so bad? 70 deletions per second seems like very, very bad performance.
I could delete the inode for /tmp2 manually, but that will not free up the space, right?
Could this be a problem with ZFS, with the hard drives, or something else?
Deletes in ZFS are expensive. Even more so if you have deduplication enabled on the filesystem (since dereferencing deduped files is expensive). Snapshots could complicate matters too.
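If you want to rule those out, something along these lines should show whether they apply (a sketch; the dataset name zroot/tmp is an assumption, so substitute whatever zfs list reports for your /tmp or /tmp2):

    zfs get dedup zroot/tmp             # is deduplication enabled on this dataset? (dataset name is a guess)
    zfs list -t snapshot -r zroot/tmp   # are snapshots still pinning the data you delete?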
You may be better off deleting the /tmp directory instead of the data contained within. If /tmp is a ZFS filesystem, delete it and create it again.
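A minimal sketch of that, assuming /tmp is its own dataset and that it is named zroot/tmp (check the real name with zfs list first):

    zfs destroy -r zroot/tmp                  # drops the dataset and all 30M files in one operation (name is an assumption)
    zfs create -o mountpoint=/tmp zroot/tmp   # recreate an empty dataset mounted at /tmp
    chmod 1777 /tmp                           # restore the usual sticky, world-writable /tmp permissions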
Consider an office building.
Removing all of the computers and furniture and fixings from all the offices on all the floors takes a long time, but leaves the offices immediately usable by another client.
Demolishing the whole building with RDX is a whole lot quicker, but the next client is quite likely to complain about how drafty the place is.
There are a number of things going on here.
First, all modern disk technologies are optimised for bulk transfers. If you need to move 100MB of data, the disks will do it much faster if it is in one contiguous block instead of scattered all over the place. SSDs help a lot here, but even they prefer data in contiguous blocks.
Second, resilvering is pretty optimal as far as disk operations go. You read a massive contiguous chunk of data from one disk, do some fast CPU ops on it, then rewrite it in another big contiguous chunk to another disk. If power fails partway through, no big deal - you'll just ignore any data with bad checksums and carry on as per normal.
Third, deleting a file is really slow. ZFS is particularly bad, but practically all filesystems are slow to delete. They must modify a large number of different chunks of data on the disk and time it correctly (i.e. wait) so the filesystem is not damaged if power fails.
Resilvering is something that disks are really fast at, and deletion is something that disks are slow at. Per megabyte of disk, you only have to do a little bit of resilvering. You might have a thousand files in that space which need to be deleted.
Is 70 deletions per second unexpectedly bad? It depends; I would not be surprised by it. You haven't mentioned what type of SSD you're using. Modern Intel and Samsung SSDs are pretty good at this sort of operation (read-modify-write) and will perform better. Cheaper/older SSDs (e.g. Corsair) will be slow. The number of I/O operations per second (IOPS) is the determining factor here.
ZFS is particularly slow to delete things. Normally, it will perform deletions in the background so you don't see the delay. If you're doing a huge number of them it can't hide it and must delay you.
Appendix: why are deletions slow?
Ian Howson gives a good answer on why it is slow.
If you delete files in parallel you may see an increase in speed, because separate deletions may touch the same metadata blocks, which saves rewriting the same block many times.
So try:
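For example, something along these lines (a sketch only: the worker count of 8 and the assumption that the junk sits in top-level subdirectories of /tmp2 are guesses to adjust):

    # feed batches of top-level entries to several rm processes running in parallel
    find /tmp2 -mindepth 1 -maxdepth 1 -print0 | xargs -0 -n 100 -P 8 rm -rf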
and see if that performs better than your 70 deletes per second.
It is possible because the two operations work on different layers of the file system stack. Resilvering can run low-level and does not actually need to look at individual files, copying large chunks of data at a time.
It does have to do a lot of bookkeeping...
I don't know for ZFS, but if it could automatically recover from that, it would likely, in the end, do the same operations you are already doing, in the background.
Does zfs scrub say anything?
Deleting lots of files is never really a fast operation.
In order to delete a file on any filesystem, you need to read the file index, remove (or mark as deleted) the file entry in the index, remove any other metadata associated with the file, and mark the space allocated for the file as unused. This has to be done individually for each file to be deleted, which means deleting lots of files requires lots of small I/Os. To do this in a manner which ensures data integrity in the event of power failure adds even more overhead.
Even without the peculiarities ZFS introduces, deleting 30 million files typically means over a hundred million separate I/O operations. This will take a long time even with a fast SSD. As others have mentioned, the design of ZFS further compounds this issue.
Very simple if you invert your thinking.
Get a second drive (you seem to have this already)
Copy everything from drive A to drive B with rsync, excluding the /tmp directory (a sketch of this step follows below). Rsync will be slower than a block copy.
Reboot, using drive B as the new boot volume
Reformat drive A.
This will also defragment your drive and give you a fresh directory (fine, defrag is not so important with an SSD but linearizing your files never hurt anything)
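A rough sketch of the copy step, assuming drive B's new root filesystem is already created and mounted at /mnt (the mount point and flags are assumptions, rsync comes from ports/packages on FreeBSD, and installing a boot loader on drive B is a separate step):

    # archive copy of the root filesystem only, skipping the runaway tmp directories
    rsync -avxHS --exclude=/tmp/ --exclude=/tmp2/ / /mnt/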
You have 30 million entries in an unsorted list. You scan the list for the entry you want to remove, and you remove it. Now you have only 29,999,999 entries in your unsorted list. If they are all in /tmp, why not just reboot?
Edited to reflect the information in the comments: Statement of problem: Removing most, but not all, of the 30M+ incorrectly created files in /tmp is taking a long time.
Problem 1) Best way to remove large numbers of unwanted files from /tmp.
Problem 2) Understanding why it is so slow to delete files.
Solution 1) - /tmp is reset to empty at boot by most *nix distributions. FreeBSD, however, does not do this by default.
Step 1 - copy interesting files somewhere else.
Step 2 - As root, set clear_tmp_enable="YES" in /etc/rc.conf (a sketch follows after these steps).
Step 3 - reboot.
Step 4 - change clear_tmp_enable back to "No".
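A minimal sketch of steps 2-4, using sysrc to edit /etc/rc.conf (sysrc ships with FreeBSD 10; editing the file by hand works just as well):

    sysrc clear_tmp_enable="YES"   # step 2: have the rc scripts empty /tmp on the next boot
    shutdown -r now                # step 3: reboot
    sysrc clear_tmp_enable="NO"    # step 4: afterwards, turn the option back off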
Unwanted files are now gone, because ZFS on FreeBSD has the feature that "Destroying a dataset is much quicker than deleting all of the files that reside on the dataset, as it does not involve scanning all of the files and updating all of the corresponding metadata," so all it has to do at boot time is reset the metadata for the /tmp dataset. This is very quick.
Solution 2) Why is it so slow? ZFS is a wonderful file system which includes such features as constant-time directory access. This works well if you know what you are doing, but the evidence suggests that the OP is not a ZFS expert. The OP has not indicated how they were attempting to remove the files, but at a guess, I would say they used a variation on "find regex -exec rm {} \;". This works well with small numbers of files, but it does not scale, because three serial operations are going on: 1) get the list of available files (which returns 30 million files in hash order), 2) use the regex to pick the next file to be deleted, and 3) tell the OS to find and remove that file from a list of 30 million. Even if ZFS returns the list from memory and even if 'find' caches it, the regex still has to identify the next file to be processed from the list, then tell the OS to update its metadata to reflect that change, and then update the list so it isn't processed again.
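For comparison, two variants that at least avoid starting one rm process per file (sketches only; substitute /tmp2 or wherever the files actually are, and whether they help much depends on the ZFS metadata work more than on process overhead):

    find /tmp2 -mindepth 1 -delete                        # let find unlink entries itself, depth-first
    find /tmp2 -mindepth 1 -maxdepth 1 -exec rm -rf {} +  # or batch many paths into each rm invocation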