We're building a product that is likely to generate very large XFS volumes, and I'm trying to discover the scaling bottlenecks we're likely to run into given the architecture.
As we manipulate files they get placed into directories on the XFS volumes. Due to the number of files we handle, the file count is definitely in the tens of millions and will likely reach the hundreds of millions not long after release. We know this because our current product behaves this way, so it is reasonable to expect the next one to do the same.
Therefore, correct early engineering is in order.
This week the files are based on the following rough layout:
$ProjectID/$SubProjectID/[md5sum chunked into groups of 4]/file
Which gives directories that look kind of like:
0123456/001/0e15/a644/8972/19ac/b4b5/97f6/51d6/9a4d/file
The reason for chunking the md5sum is to avoid the "big pile of files/directories in one directory" problem. Because of the chunking, a single file causes up to eight directories to be created. This has pretty clear inode impacts, but I'm unclear on what those impacts will be for XFS once we get up to scale.
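For concreteness, here's a rough sketch of what that scheme looks like in code. This is illustrative Python, not our actual implementation; build_path_current() and its arguments are hypothetical names:

```python
import hashlib
import os

def build_path_current(project_id: str, subproject_id: str, content: bytes) -> str:
    """Current scheme: split the 32-character md5 hex digest into eight
    4-character chunks, each of which becomes a directory level."""
    digest = hashlib.md5(content).hexdigest()             # 32 hex characters
    chunks = [digest[i:i + 4] for i in range(0, 32, 4)]   # 8 chunks of 4
    return os.path.join(project_id, subproject_id, *chunks, "file")

# One file can create up to eight new directory levels under its sub-project.
print(build_path_current("0123456", "001", b"example data"))
# prints a path like 0123456/001/xxxx/xxxx/xxxx/xxxx/xxxx/xxxx/xxxx/xxxx/file
```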
What are the impacts?
This is with kernel 2.6.32 by the way, CentOS 6.2 at the moment (this can change if needed).
In testing I've created the XFS volume with defaults and am not using any mount options, in order to smoke out problems early. noatime is something of a no-brainer, since we don't need access times. Overall XFS tuning is another problem I need to tackle, but right now I'm concerned about the metadata multiplier effect we've engineered in.
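To put a rough number on that multiplier, here's a back-of-envelope worst case (illustration only; shared md5 prefixes mean the upper levels get reused, so the real count is lower):

```python
# Worst case: every file creates a full chain of 8 new chunk directories.
files = 100_000_000      # the hundreds-of-millions target
dirs_per_file = 8        # 32 hex chars split into 4-char chunks
print(f"worst-case directory inodes: {files * dirs_per_file:,}")  # 800,000,000
# In practice the first chunk level saturates quickly (only 65,536 possible
# 4-char values), but the deeper levels stay close to one new directory per
# file, so the directory count still runs into the hundreds of millions.
```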
I already know what a better solution will be, I just don't know if I have a case to push for the change.
Since md5sums are effectively unique within their first few digits, and individual sub-projects rarely exceed 5 million files, it seems to me that we only need the first two chunks. That would yield layouts like:
0123456/001/0e15/a644/897219acb4b597f651d69a4d/file
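A sketch of that variant in the same illustrative style (again, hypothetical names, not our actual code):

```python
import hashlib
import os

def build_path_proposed(project_id: str, subproject_id: str, content: bytes) -> str:
    """Proposed scheme: only the first two 4-character chunks become
    directories; the remaining 24 hex characters form a single directory."""
    digest = hashlib.md5(content).hexdigest()
    return os.path.join(project_id, subproject_id,
                        digest[0:4], digest[4:8], digest[8:], "file")

# Three directory levels per file below the sub-project instead of eight.
print(build_path_proposed("0123456", "001", b"example data"))
```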
Completely full first and second levels would have 2^16 first-level directories, with 2^16 second-level directories in each, for a total of 2^32 directories on the volume.
The hypothetical 5-million-file sub-project would therefore have 2^16 first-level directories, roughly 76 (+/- 2) second-tier directories in each, and one or two third-tier directories in each second-tier directory.
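The 76 figure is just the file count divided by the number of first-level buckets, assuming md5 spreads the files uniformly:

```python
files = 5_000_000
first_level_buckets = 16 ** 4   # 65,536 possible 4-char chunks
print(f"files per first-level directory: ~{files / first_level_buckets:.1f}")  # ~76.3
# With ~76 files spread over 65,536 possible second-chunk values, collisions
# are rare, so each first-level directory ends up holding roughly 76
# second-tier directories, nearly all of which contain a single remainder
# directory with one file in it.
```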
This layout is a lot more metadata efficient. I just don't know if it is worth the effort to change how things are going right now.
No major recommendations other than that XFS should scale to this. I started using the filesystem in 2003 because I needed to work around an application that could easily have 800,000 files in a single directory. ext2 and ext3 would routinely fall over during operations on those filesystems.
A lot of this depends on your application and how it accesses files (directory traversal, etc.).
If this is all on one server, I would look at external SSD journals based on your expectation of a high number of metadata operations. But you know that part. I would still push for the restructuring using the second md5 example. I mean, this is a good time to refactor, right?