This might be a very general question, but I would really like to find some detailed answers or clues.
I am discussing this with a friend, trying to convince him to split more than 300,000 files out of a single folder into multiple sub-directories (say, 1,000 files each). The files are images to be served for online web viewing, like:
www.example.com/folder/1.png
.
.
.
www.example.com/folder/300000.png
I simply remember that many years ago, when I worked at an online video-serving company (similar to YouTube), we put the screenshots in one folder and the server was always crashing. At the time there was a "rumor" that people should not put many files in one folder, but we never knew the detailed reason.
So how many files should I put in one folder? If there is a limitation, why? Any recommended ways to design this?
My server information:
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 7.8 (wheezy)
Release: 7.8
Codename: wheezy
Core Build version:
Linux linode 4.1.5-x86_64-linode61 #7 SMP Mon Aug 24 13:46:31 EDT 2015 x86_64 GNU/Linux
I guess this applies to many different kinds of server software.
This isn't really a very big deal with newer filesystems such as XFS and ext4, but on older or misconfigured filesystems it can be a serious problem.
With older Linux filesystems such as ext3, a directory is just an unordered list of files.
That it is unordered is important, because it means that the only way for the system to find a file in a directory is to search it from the beginning to the end.
If a directory contains 3,000 files, it will take an average of 1,500 comparisons to find a random file in the directory. But if the directory contains 300,000 files, it will take an average of 150,000 comparisons to find a random file in that directory.
In either case, if the directory entry is not already cached in RAM, it must be loaded from disk, which would add a significant amount of time to the file access, proportionate to the size of the directory. Obviously a small dentry can be loaded faster than a large one.
Thus, it is much faster to use a more hierarchical directory structure that separates large numbers of files into smaller directories.
XFS does not suffer from this problem, as it uses a hash table to look up directory entries. Thus it can handle a directory with hundreds of thousands of files nearly as easily as a directory with one file. But it still has the penalty of needing to load the larger data structure from disk. If you have enough RAM in the system, this isn't really a practical problem, though.
Ext4 also uses a hashed directory index.
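For the numbered images in the question, a minimal sketch of such a hierarchical layout (the image_path/store helper names and the 1,000-images-per-bucket figure are only assumptions taken from the question; any even split works):

    import os

    def image_path(root, image_id, per_dir=1000):
        # Map image number 1234 to root/001/1234.png (bucket = id // per_dir),
        # keeping each sub-directory at roughly 1,000 files.
        bucket = image_id // per_dir
        return os.path.join(root, f"{bucket:03d}", f"{image_id}.png")

    def store(root, image_id, data):
        path = image_path(root, image_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)  # create the bucket on demand
        with open(path, "wb") as f:
            f.write(data)

The URLs can simply mirror the layout (www.example.com/folder/001/1234.png), or the web server can rewrite the existing flat URLs onto the nested paths.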
Many file systems slow down when a single directory contains very many (tens or hundreds of thousands, or millions of) files or subdirectories, and there may even be a hard upper limit, but whether that happens and by how much depends on both the file system you choose and which I/O operations you perform. Check Wikipedia's comparison of file system features for specific limits.
Obviously, listing and sorting a directory with many files will be more costly, but even retrieving a single file by name can become more expensive as the directory grows.
A common solution is to create a multi-level sub-directory structure based on, or derived from, the file name.
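A minimal sketch of that idea, assuming the first characters of a hash of the file name are used as the directory levels (the two-level, two-hex-character layout is just one common choice, not the only one):

    import hashlib
    import os

    def hashed_path(root, filename):
        # Derive two directory levels from a hash of the file name,
        # e.g. 'cat.png' -> root/0d/61/cat.png
        digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
        return os.path.join(root, digest[:2], digest[2:4], filename)

Two hex characters per level gives 256 x 256 = 65,536 buckets, so even 300,000 files end up with only a handful per directory, and the path can be recomputed from the name alone without any lookup table.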
How important this is depends on the file system you use, and sometimes on other aspects of how your storage is implemented. It might also depend on the pattern of usage.
Performance of some older file systems used to degrade very badly when the number of files got over 1000 or so. This is less true of newer file systems, but not a complete non-issue.
With a large number of files in it, the directory node gets large, and it needs to be re-written every time it changes. This can be a performance concern.
If your storage is networked, then the locking associated with writing to the directory can become an issue. E.g. if you have a cluster of web servers sharing a large directory for storing session files which change on every web hit, that is likely to perform very badly, essentially serialising access as processes wait to lock the directory node. Hashing the session files out into smaller directories means that most session file accesses will not be blocked by a write to some other session that happens to require the same directory lock.
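A sketch of that hashing, assuming the session ID is already a random-looking string (as PHP-style IDs are), so its leading characters can serve directly as the bucket; the sess_ prefix and the two levels are only illustrative:

    import os

    def session_path(root, session_id, levels=2):
        # Spread session files over sub-directories keyed by the leading
        # characters of the session ID, e.g. 'ab12cd' -> root/a/b/sess_ab12cd.
        # Writes to different sessions then touch different directory nodes,
        # so they no longer serialise on one directory lock.
        parts = list(session_id[:levels])
        return os.path.join(root, *parts, "sess_" + session_id)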