I have an application writing to an ext3 directory which over time has grown to roughly three million files. Needless to say, reading the file listing of this directory is unbearably slow.
I don't blame ext3. The proper solution would have been to let the application code write to sub-directories such as `./a/b/c/abc.ext` rather than using only `./abc.ext`.
I'm changing to such a sub-directory structure and my question is simply: roughly how many files should I expect to store in one ext3 directory while still getting acceptable performance? What's your experience?
Or in other words: assuming that I need to store three million files in the structure, how many levels deep should the `./a/b/c/abc.ext` structure be?
Obviously this is a question that cannot be answered exactly, but I'm looking for a ball park estimate.
Provided you have a distro that supports the `dir_index` capability, you can easily have 200,000 files in a single directory. I'd keep it to about 25,000 though, just to be safe. Without `dir_index`, try to keep it to 5,000.

Be VERY careful how you select the directory split. "a/b/c" sounds like a recipe for disaster to me...
Do not just blindly make a structure several directories deep, say 100 entries at the first level, 100 at the second, and 100 at the third. I've been there, done that, got the jacket, and had to restructure it when performance went in the crapper with a few million files. :-)
We have a client that did the "multiple directories" layout, ended up putting just one to five files per directory, and this was killing them: 3 to 6 hours to do a "du" in this directory structure. The savior here was an SSD; they were unwilling to rewrite this part of their application, and an SSD took the du time down from hours to minutes.
The problem is that each level of directory lookup requires seeks, and seeks are extremely expensive. Directory size is also a factor, so keeping it smaller rather than larger is a big win.
To answer your question about how many files per directory: I've heard 1,000 cited as "optimum", but performance at 10,000 seems to be fine.
So, what I'd recommend is one level of directories, each named with 2 characters made up of upper- and lowercase letters and the digits, for 62 × 62 ≈ 3,800 directories at the top level. Those sub-directories can hold 14M files at 3,800 files each, or around 1,000 files per sub-directory for your 3M files.
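A minimal sketch of that two-character scheme, assuming Python; the `subdir_for` helper is hypothetical, and the MD5 hash is just my choice of a stable way to spread arbitrary filenames across the 3,844 directories:

```python
import hashlib
import string

# 62 characters -> 62 * 62 = 3,844 possible two-character directory names.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits

def subdir_for(filename: str) -> str:
    """Pick a stable two-character sub-directory for a given filename."""
    digest = hashlib.md5(filename.encode("utf-8")).digest()
    # Map the first two digest bytes onto the 62-character alphabet.
    return ALPHABET[digest[0] % 62] + ALPHABET[digest[1] % 62]

# A file "abc.ext" would then be stored as ./<subdir>/abc.ext
print(subdir_for("abc.ext"))
```

Note that mixing upper- and lowercase directory names assumes a case-sensitive filesystem; on a case-insensitive one you'd want a smaller alphabet.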
I have done a change like this for another client, and it made a huge difference.
I would suggest you try testing various directory sizes with a benchmarking tool such as postmark, because there are a lot of variables like cache size (both in the OS and in the disk subsystem) that depend on your particular environment.
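If you'd rather not set up postmark first, even a throwaway script gives a rough feel for listing cost at different directory sizes. This is only a sketch (the file counts are arbitrary, and cache effects will dominate on warm runs, which is exactly why a real benchmark like postmark is the better tool):

```python
import os
import tempfile
import time

def time_listing(n_files):
    """Create n_files empty files in a fresh directory and time one listing."""
    with tempfile.TemporaryDirectory() as d:
        for i in range(n_files):
            with open(os.path.join(d, f"f{i:07d}.ext"), "w"):
                pass
        start = time.perf_counter()
        names = os.listdir(d)
        elapsed = time.perf_counter() - start
        return len(names), elapsed

for n in (1_000, 10_000):
    count, secs = time_listing(n)
    print(f"{count:>6} files listed in {secs:.4f}s")
```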
My personal rule of thumb is to aim for a directory size of <= 20k files, although I've seen relatively decent performance with up to 100k files/directory.
I have all files go into folders like:
uploads/[date]/[hour]/yo.png
and don't have any performance problems.
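A sketch of that kind of time-bucketed layout in Python; `upload_path` is a made-up helper, and the exact date/hour formats are assumptions:

```python
from datetime import datetime, timezone
from pathlib import Path

def upload_path(filename, now=None):
    """Build uploads/[date]/[hour]/filename, so each directory only ever
    holds whatever arrives within a single hour."""
    now = now or datetime.now(timezone.utc)
    return Path("uploads") / now.strftime("%Y-%m-%d") / now.strftime("%H") / filename

p = upload_path("yo.png", datetime(2024, 5, 1, 13, 37, tzinfo=timezone.utc))
print(p.as_posix())  # uploads/2024-05-01/13/yo.png
```

The nice property is that directory size is bounded by upload rate rather than total file count.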
http://en.wikipedia.org/wiki/Ext3#Functionality - This mentions that a directory can only have approximately 32000 subdirectories, but makes no mention of files.
http://roopindersingh.com/2008/05/10/ext3-handling-large-number-of-files-in-a-directory/
Also, I hate Experts Exchange, but I read a comment on this question there saying it's ideal to keep it under 10,000-15,000 files per directory.
I can confirm that on a pretty powerful server with plenty of memory under a decent load, 70,000 files can cause all sorts of havoc. I went to remove a cache folder with 70k files in it, and it caused Apache to start spawning new instances until it maxed out at 255 and the system used all free memory (16 GB, although the virtual instance may have had less). Either way, keeping it under 25,000 is probably a very prudent move.
In my experience, the best approach is to not over-engineer the file structure in advance. As mentioned in at least one other answer, there are filesystem extensions that deal with the performance-issue end of things.
The problem that I have hit more frequently is usability on the administrative end. The least amount of work you can do to decrease the number of files in a directory is probably the approach you need right now.
sqrt(3_000_000) ≈ 1732
A couple thousand files in a single directory sounds reasonable to me. Be your own judge for your own situation. To achieve this, try splitting the files into a single level of hash directories so that the average number of files per directory is about the same as the number of directories.
Given your example, this would be `./a/abc.ext`, `./ab/abc.ext`, `./abc/abc.ext`, and so on. The spread of files will depend heavily upon the actual filenames. Imagine applying this technique to a directory of a million files each named `foobar???.txt`. There are ways to accomplish a more even spread, like hashing on a particular number of bits from the MD5 sum of each filename, but I'd dare guess that would be overkill for what you are trying to accomplish.

Hmm, I read this article recently. Essentially you leverage the distribution of your favorite hashing algorithm. I started playing with the numbers: a MySQL signed INT has a maximum value of 2,147,483,647. You can also vary the desired number of files per directory and the number of sub-directories to settle on the final number-of-sub-directories/files-per-directory split for a given data set, but it's hard to find empirical evidence on optimal directory/file organizations. The article does give some insight into performance differences across filesystems (some interesting metrics), but nothing about optimal organizations.
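One way to sketch that bit-sampling idea in Python; the MD5 choice, the 4-byte sample, and the directory count are my assumptions here, not anything from the article:

```python
import hashlib

NUM_DIRS = 1024  # tune files-per-directory vs. number of directories here

def dir_index(filename: str) -> int:
    """Take the first 4 bytes of the filename's MD5 sum as an integer and
    reduce it modulo the directory count for an even spread."""
    digest = hashlib.md5(filename.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_DIRS

# foobar001.txt .. foobar999.txt scatter across directories even though the
# names themselves are nearly identical.
print(dir_index("foobar001.txt"), dir_index("foobar002.txt"))
```

This is exactly where hashing beats splitting on filename prefixes: near-identical names like `foobar???.txt` still land in different buckets.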
I think you're putting too much thought into this. If you even chose a single additional level of directories and were able to balance things evenly, you'd have 1732* directories and 1732 files per directory.
Unless you plan on needing tens of billions of files, you could pretty much pick a number between 1000 and 100,000 and get good results.
* square root of 3 million.