Background: physical server, about two years old, 7200-RPM SATA drives connected to a 3Ware RAID card, ext3 FS mounted noatime and data=ordered, not under crazy load, kernel 2.6.18-92.1.22.el5, uptime 545 days. Directory doesn't contain any subdirectories, just millions of small (~100 byte) files, with some larger (a few KB) ones.
We have a server that has gone a bit cuckoo over the course of the last few months, but we only noticed it the other day when it started being unable to write to a directory due to it containing too many files. Specifically, it started throwing this error in /var/log/messages:
ext3_dx_add_entry: Directory index full!
The disk in question has plenty of inodes remaining:
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda3 60719104 3465660 57253444 6% /
So I'm guessing that means we hit the limit of how many entries can be in the directory file itself. No idea how many files that would be, but it can't be more, as you can see, than three million or so. Not that that's good, mind you! But that's part one of my question: exactly what is that upper limit? Is it tunable? Before I get yelled at—I want to tune it down; this enormous directory caused all sorts of issues.
Anyway, we tracked down the issue in the code that was generating all of those files, and we've corrected it. Now I'm stuck with deleting the directory.
A few options here:
1. rm -rf (dir)
I tried this first. I gave up and killed it after it had run for a day and a half without any discernible impact.
2. unlink(2) on the directory: Definitely worth consideration, but the question is whether it'd be faster to delete the files inside the directory via fsck than to delete them via unlink(2). That is, one way or another, I've got to mark those inodes as unused. This assumes, of course, that I can tell fsck not to drop the files' entries into /lost+found; otherwise, I've just moved my problem. In addition to all the other concerns, after reading about this a bit more, it turns out I'd probably have to call some internal FS functions, as none of the unlink(2) variants I can find would allow me to just blithely delete a directory with entries in it. Pooh.
3. while [ true ]; do ls -Uf | head -n 10000 | xargs rm -f 2>/dev/null; done
This is actually the shortened version; the real one I'm running, which just adds some progress-reporting and a clean stop when we run out of files to delete, is:
export i=0; time ( while [ true ]; do ls -Uf | head -n 3 | grep -qF '.png' || break; ls -Uf | head -n 10000 | xargs rm -f 2>/dev/null; export i=$(($i+10000)); echo "$i..."; done )
This seems to be working rather well. As I write this, it has deleted 260,000 files in the past thirty minutes or so.
- As mentioned above, is the per-directory entry limit tunable?
- Why did it take "real 7m9.561s / user 0m0.001s / sys 0m0.001s" to delete a single file, which was the first one in the list returned by ls -U, and it took perhaps ten minutes to delete the first 10,000 entries with the command in #3, but now it's hauling along quite happily? For that matter, it deleted 260,000 in about thirty minutes, but it's now taken another fifteen minutes to delete 60,000 more. Why the huge swings in speed?
- Is there a better way to do this sort of thing? Not store millions of files in a directory; I know that's silly, and it wouldn't have happened on my watch. Googling the problem and looking through SF and SO offers a lot of variations on find that are not going to be significantly faster than my approach for several self-evident reasons. But does the delete-via-fsck idea have any legs? Or something else entirely? I'm eager to hear out-of-the-box (or inside-the-not-well-known-box) thinking.
Final script output!:
2970000...
2980000...
2990000...
3000000...
3010000...
real 253m59.331s
user 0m6.061s
sys 5m4.019s
So, three million files deleted in a bit over four hours.
Update August 2021
This answer continues to attract a lot of attention, and I feel it's so woefully out of date that it's largely redundant now.
Doing a find ... -delete is most likely going to produce acceptable results in terms of performance. The one area I felt might yield higher performance is tackling the 'removing' part of the problem instead of the 'listing' part.
I tried it and it didn't work. But I felt it was useful to explain what I did and why.
In today's newer kernels, through the use of the io_uring subsystem in the kernel (see man 2 io_uring_setup), it is actually possible to attempt to perform unlinks asynchronously -- meaning we can submit unlink requests without waiting or blocking to see the result. This program basically reads a directory, submits hundreds of unlinks without waiting for the result, then reaps the results later once the system is done handling the requests. It tries to do what dentls did but uses io_uring. It can be compiled with gcc -o dentls2 dentls2.c -luring.
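The dentls2.c source isn't reproduced here, but a minimal sketch of the same io_uring idea looks roughly like the following. Treat it as an illustration, not the original program: the file name uring_unlink.c, the batch size, and the lack of per-completion error handling are my assumptions, and it needs a liburing with io_uring_prep_unlinkat support (kernel 5.11 or newer).

/* uring_unlink.c -- sketch of batched asynchronous unlinks via io_uring.
 * NOT the original dentls2.c; error handling is deliberately minimal.
 * Build: gcc -o uring_unlink uring_unlink.c -luring
 * Usage: ./uring_unlink /path/to/huge/dir
 */
#include <dirent.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BATCH 256                          /* unlink requests queued before reaping */

static void reap(struct io_uring *ring, unsigned *inflight)
{
    while (*inflight) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(ring, &cqe) < 0)
            break;
        free(io_uring_cqe_get_data(cqe));  /* the strdup'd file name */
        io_uring_cqe_seen(ring, cqe);      /* cqe->res errors are ignored here */
        (*inflight)--;
    }
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }
    DIR *dir = opendir(argv[1]);
    if (!dir) { perror("opendir"); return 1; }
    int dfd = dirfd(dir);                  /* unlink names relative to this fd */

    struct io_uring ring;
    if (io_uring_queue_init(BATCH, &ring, 0) < 0) {
        fprintf(stderr, "io_uring_queue_init failed\n");
        return 1;
    }

    unsigned inflight = 0;
    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
            continue;

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        if (!sqe) {                        /* submission queue full: flush it */
            io_uring_submit(&ring);
            reap(&ring, &inflight);
            sqe = io_uring_get_sqe(&ring);
            if (!sqe)
                break;                     /* shouldn't happen at this batch size */
        }
        char *name = strdup(de->d_name);   /* must stay valid until completion */
        io_uring_prep_unlinkat(sqe, dfd, name, 0);
        io_uring_sqe_set_data(sqe, name);
        inflight++;

        if (inflight >= BATCH) {           /* submit a batch, then reap results */
            io_uring_submit(&ring);
            reap(&ring, &inflight);
        }
    }

    io_uring_submit(&ring);                /* flush the final partial batch */
    reap(&ring, &inflight);

    io_uring_queue_exit(&ring);
    closedir(dir);
    return 0;
}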
The results were, ironically, the opposite of what I suspected. But why?
TMPFS with 4 million files
Using find:
BTRFS with 10 million files
Using find:
So it looks as if batched syscalls don't make an improvement in real time. The new dentls2 spends much more time working (four times as much) only to result in worse performance. So it's a net loss in overall efficiency and worse latency: dentls2 is worse.
The cause is that io_uring produces kernel dispatcher threads to do the unlink work internally, but the directory inode being worked on can only be modified by a single writer at a time.
Basically, using the uring we're creating lots of little threads, but only one thread is allowed to delete from the directory. We've just created a bunch of contention and eliminated the advantage of doing batched IO.
Using eBPF you can measure the unlink frequencies and watch what causes the delays.
In the case of BTRFS, it's the kernel function btrfs_commit_inode_delayed_inode that acquires the lock when unlink is called.
With dentls2:
Using find ... -delete:
You can see from the histogram that find spends 3258 nanoseconds on average in btrfs_commit_inode_delayed_inode, but dentls2 spends 8533 nanoseconds in the function. The histogram also shows that, overall, the io_uring threads spend at least twice as long waiting on the lock, with the majority of calls taking 4096-8091 nanoseconds versus the majority in find taking 2048-4095 nanoseconds.
find is single-threaded and isn't contending for the lock, whereas dentls2 is multi-threaded (due to the uring), which produces lock contention, and the delays that result are reflected in the analysis.
Conclusion
All in all, on modern systems (as of this writing) there is less and less you can do in software to make this go faster than it already goes.
It used to be that by reading a large buffer from the disk you could compound many expensive IO calls into one large sequential read, instead of the seeky IO that small getdents() buffers typically ended up producing.
Also, due to other improvements, there is less overhead to simply invoking system calls, and major improvements in sequential/random IO access times have eliminated the big IO bottlenecks we used to experience.
On my systems, this problem has become memory/CPU bound. There's a single-accessor problem on (at least) BTRFS which limits you to a single CPU's/program's worth of unlinks per directory at a time. Trying to batch the IOs yields at best minor improvements, even in the ideal circumstance of using tmpfs, and is typically worse on a real-world filesystem.
To top it off, we really don't have this problem anymore -- gone are the days of 10 million files taking 4 hours to remove.
Just do something simple like find ... -delete. No amount of optimization I tried seemed to yield major performance improvements worth the coding (or analysis) over a simple default setup.
Original Answer
Whilst a major cause of this problem is ext3 performance with millions of files, the actual root cause is different.
When a directory needs to be listed, readdir() is called on the directory, which yields a list of files. readdir is a POSIX call, but the real Linux system call being used here is getdents. getdents lists directory entries by filling a buffer with entries.
The problem is mainly down to the fact that readdir() uses a fixed buffer size of 32Kb to fetch files. As a directory gets larger and larger (the size increases as files are added), ext3 gets slower and slower to fetch entries, and readdir's 32Kb buffer size is only sufficient to include a fraction of the entries in the directory. This causes readdir to loop over and over and invoke the expensive system call again and again.
For example, on a test directory I created with over 2.6 million files inside, running "ls -1 | wc -l" shows a large strace output of many getdents system calls.
Additionally the time spent in this directory was significant.
The method to make this a more efficient process is to call getdents manually with a much larger buffer. This improves performance significantly.
Now, you're not supposed to call getdents yourself, so no interface exists to use it normally (check the man page for getdents to see!); however, you can call it manually and make your system call invocation way more efficient.
This drastically reduces the time it takes to fetch these files. I wrote a program that does this.
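The original dentls source isn't reproduced here; below is a stripped-down sketch of the big-buffer getdents64 technique. The file name, the 5 MB buffer size, the --delete flag, and the absence of per-file error handling are illustrative choices of mine, not the original program's interface.

/* bigls.c -- sketch of listing/deleting a huge directory with large getdents64 buffers.
 * NOT the original dentls; adapt before relying on it.
 * Build: gcc -o bigls bigls.c
 * Usage: ./bigls /path/to/dir [--delete]
 */
#define _GNU_SOURCE
#include <dirent.h>          /* struct dirent64 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define BUF_SIZE (5 * 1024 * 1024)   /* 5 MB: thousands of entries per syscall */

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <directory> [--delete]\n", argv[0]);
        return 1;
    }
    int do_delete = (argc > 2 && strcmp(argv[2], "--delete") == 0);

    int fd = open(argv[1], O_RDONLY | O_DIRECTORY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(BUF_SIZE);
    if (!buf) { perror("malloc"); return 1; }

    long deleted;
    do {
        deleted = 0;
        long nread;
        /* One getdents64 call pulls in as many entries as fit in the buffer. */
        while ((nread = syscall(SYS_getdents64, fd, buf, BUF_SIZE)) > 0) {
            for (long pos = 0; pos < nread; ) {
                struct dirent64 *d = (struct dirent64 *)(buf + pos);
                pos += d->d_reclen;
                if (!strcmp(d->d_name, ".") || !strcmp(d->d_name, ".."))
                    continue;
                if (do_delete) {
                    if (unlinkat(fd, d->d_name, 0) == 0)   /* errors ignored here */
                        deleted++;
                } else {
                    puts(d->d_name);
                }
            }
        }
        if (nread < 0) { perror("getdents64"); return 1; }
        /* Deleting while iterating may cause some filesystems to skip entries,
           so rewind and repeat until a full pass removes nothing. */
        lseek(fd, 0, SEEK_SET);
    } while (do_delete && deleted > 0);

    free(buf);
    close(fd);
    return 0;
}

The key design point is simply that each getdents64 call amortizes one syscall over thousands of entries instead of the few hundred a 32Kb buffer holds.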
Whilst this does not combat the underlying fundamental problem (lots of files in a filesystem that performs poorly at it), it's likely to be much, much faster than many of the alternatives being posted.
As a final thought, one should remove the affected directory and remake it afterwards. Directories only ever increase in size and can remain poorly performing even with a few files inside, due to the size of the directory.
Edit: I've cleaned this up quite a bit. I added an option to let you delete on the command line at runtime, and removed a bunch of the treewalk stuff which, honestly, looking back was questionable at best and was also shown to produce memory corruption.
You can now do
dentls --delete /my/path
New results, based on a directory with 1.82 million files:
Was kind of surprised this still works so well!
The data=writeback mount option deserves to be tried, in order to prevent journaling of the file system. This should be done only during the deletion; there is a risk, however, if the server is shut down or rebooted during the delete operation.
According to this page:
The option is set either in fstab or during the mount operation, replacing data=ordered with data=writeback. The file system containing the files to be deleted has to be remounted.
Would it be possible to back up all of the other files from this file system to a temporary storage location, reformat the partition, and then restore the files?
There is no per-directory file limit in ext3, just the filesystem inode limit (I think there is a limit on the number of subdirectories, though).
You may still have problems after removing the files.
When a directory has millions of files, the directory entry itself becomes very large. The directory entry has to be scanned for every remove operation, and that takes various amounts of time for each file, depending on where its entry is located. Unfortunately even after all the files have been removed the directory entry retains its size. So further operations that require scanning the directory entry will still take a long time even if the directory is now empty. The only way to solve that problem is to rename the directory, create a new one with the old name, and transfer any remaining files to the new one. Then delete the renamed one.
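A minimal sketch of that rename-and-recreate swap, with hypothetical paths (the shell equivalent is just a rename plus a directory creation):

/* swapdir.c -- sketch of swapping a bloated directory for a fresh one.
 * The /srv/spool paths are placeholders; adapt and harden as needed. */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    /* Move the huge (now mostly empty) directory aside... */
    if (rename("/srv/spool", "/srv/spool.old") != 0) {
        perror("rename");
        return 1;
    }
    /* ...and recreate a fresh, small directory under the old name. */
    if (mkdir("/srv/spool", 0755) != 0) {
        perror("mkdir");
        return 1;
    }
    /* Any remaining files can then be moved from /srv/spool.old into
       /srv/spool, and /srv/spool.old deleted at leisure. */
    return 0;
}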
I haven't benchmarked it, but this guy did:
TLDR: use rsync -a --delete emptyfolder/ x.
This question has 50k views, and quite a few answers, but nobody seems to have benchmarked all the different replies. There's one link to an external benchmark, but that one's over 7 years old and didn't look at the program provided in this answer: https://serverfault.com/a/328305/565293
Part of the difficulty here is that the time it takes to remove a file depends heavily on the disks in use and the file system. In my case, I tested with a consumer SSD running BTRFS on Arch Linux (updated as of 2020-03), but I got the same ordering of the results on a different distribution (Ubuntu 18.04), filesystem (ZFS), and drive type (HDD in a RAID10 configuration).
Test setup was identical for each run:
Test results:
- rm -rf x: 30.43s
- find x/ -type f -delete: 29.79s
- perl -e 'for(<*>){((stat)[9]<(unlink))}': 37.97s
- rsync -a --delete empty/ x: 25.11s
- ./dentls --delete x: 29.74s (the program from this answer, modified to not print anything or wait before it deletes files)
The rsync version proved to be the winner every time I repeated the test, although by a pretty low margin. The perl command was slower than any other option on my systems.
Somewhat shockingly, the program from the top answer to this question proved to be no faster on my systems than a simple rm -rf. Let's dig into why that is.
First of all, the answer claims that the problem is that rm is using readdir with a fixed buffer size of 32Kb with getdents. This proved not to be the case on my Ubuntu 18.04 system, which used a buffer four times larger. On the Arch Linux system, it was using getdents64.
In addition, the answer misleadingly provides statistics giving its speed at listing the files in a large directory, but not removing them (which is what the question was about). It compares dentls to ls -u1, but a simple strace reveals that getdents is not the reason why ls -u1 is slow, at least not on my system (Ubuntu 18.04 with 1000000 files in a directory):
This ls command makes a million calls to lstat, which slows the program way down. The getdents calls only add up to 0.455 seconds. How long do the getdents calls take in dentls on the same folder?
That's right! Even though dentls only makes 12 calls instead of 245, it actually takes the system longer to run these calls. So the explanation given in that answer is actually incorrect, at least for the two systems I've been able to test this on.
The same applies to rm and dentls --delete. Whereas rm takes 0.42s calling getdents, dentls takes 0.53s. In either case, the vast majority of the time is spent calling unlink!
So in short, don't expect to see massive speedups running dentls, unless your system is like the author's and has a lot of overhead on individual getdents calls. Maybe the glibc folks have considerably sped it up in the years since the answer was written, and it now takes a linear amount of time to respond for different buffer sizes. Or maybe the response time of getdents depends on the system architecture in some way that isn't obvious.
find simply did not work for me, even after changing the ext3 fs's parameters as suggested by the users above. It consumed way too much memory. This PHP script did the trick: fast, insignificant CPU usage, insignificant memory usage:
I posted a bug report regarding this trouble with find: http://savannah.gnu.org/bugs/?31961
I recently faced a similar issue and was unable to get ring0's data=writeback suggestion to work (possibly due to the fact that the files are on my main partition). While researching workarounds I stumbled upon this:
This will turn off journaling completely, regardless of the data option given to mount. I combined this with noatime, and the volume had dir_index set, and it seemed to work pretty well. The delete actually finished without me needing to kill it, my system remained responsive, and it's now back up and running (with journaling back on) with no issues.
Make sure you do:
which should speed things up a bit as well.
ls is a very slow command. Try: