I'm doing the setup for a large-scale storage farm, and to avoid the need for month-long fscks, my plan is to split the storage into numerous smaller filesystems (this is fine, as I have a well-bucketed file tree, so I can easily have separate filesystems mounted on 1/, 2/, 3/, 4/, etc).
My difficulty is in finding any enumeration of what a "reasonable" size is for a filesystem, to keep fsck times similarly "reasonable". Whilst I'm fully aware that absolute time for a given size will depend largely on hardware, I can't seem to find any description of the shape of the curve for ext3 fsck times with varying filesystem sizes, and what the other variables are (does a filesystem full of files in a single directory take longer than one with 10 files in each of thousands of directories in a tree; large files vs small files; full filesystem vs empty filesystem; and so on).
Does anyone have references to any well-researched numbers about this? Failing that, any anecdotes about these issues should at least help to guide my own experimentation, should that be required.
EDIT: To clarify: regardless of the filesystem, if something goes wrong with the metadata, it will need to be checked. Whether time- or mount-based re-fscks are enabled or needed is not at issue, and the only reason I'm asking for numbers specifically regarding ext3 is that it's the most likely filesystem to be chosen. If you know of a filesystem that has a particularly fast fsck process, I'm open to suggestions, but it does need to be a robust option (claims that "filesystem X never needs fscking!" will be laughed at and derided at length). I am also aware of the need for backups, and fscking is not a substitute for them; however, just discarding a filesystem and restoring it from backup whenever it glitches, rather than fscking it, seems like a really, really dumb tradeoff.
According to a paper by Mathur et al. (p. 29), e2fsck time grows linearly with the number of inodes on a filesystem beyond a certain point. If the graph is anything to go by, you're better off keeping each filesystem under about 10 million inodes.
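If you want a rough idea of where your existing filesystems sit on that curve, you can check how many inodes are actually in use. This is just a sketch, with /dev/sdb1 and /srv/1 as placeholder names:

    # Used/free inode counts for a mounted filesystem
    df -i /srv/1

    # The same figures straight from the superblock
    tune2fs -l /dev/sdb1 | grep -i 'inode count\|free inodes'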
Switching to ext4 would help, provided your filesystem is not loaded to the brim; if nearly every inode is in use, the performance gain (due to not checking inodes marked unused) has no discernible effect.
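For what it's worth, the skip-unused-inodes behaviour comes from the uninit_bg filesystem feature, which recent mkfs.ext4 typically enables by default (check /etc/mke2fs.conf) and which can also be added to an existing ext3 filesystem as part of an in-place conversion. This is a rough sketch only, with a placeholder device name; take a backup first, and note that the conversion requires a full forced fsck afterwards anyway:

    # Fresh ext4 filesystem, asking for uninit_bg explicitly
    mkfs.ext4 -O uninit_bg /dev/sdb1

    # In-place ext3 -> ext4 conversion, followed by the required forced fsck
    tune2fs -O extents,uninit_bg,dir_index /dev/sdb1
    e2fsck -fDp /dev/sdb1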
I think you're going to have to do your own benchmarking. A quick search on Google didn't reveal anything, except that ext4 fscks a lot quicker than ext3.
So, create some ext3 partitions: 100GB, 200GB, etc., up to the disk size you'll be using. Then fill them with data. If you can use data that resembles your production data (files per directory, file size distribution, etc.), then that will be best. Note that simply copying files from another partition or a backup device will place them on the disk perfectly laid out and defragmented, so your tests will miss a lot of the disk-head seek time that comes from lots of writes, modifies and deletes.
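A minimal sketch of that kind of test; the device, mount point and data paths are made up, so adjust to taste:

    # Build a test filesystem and fill it with something resembling production data
    mkfs.ext3 /dev/sdb1
    mount /dev/sdb1 /mnt/test
    rsync -a /srv/representative-data/ /mnt/test/

    # Unmount, then time a forced, read-only check
    # (-f checks even a clean filesystem, -n answers "no" to any repair prompt)
    umount /mnt/test
    time e2fsck -f -n /dev/sdb1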
You'll also need to give some thought to parallel fscks. See the sixth field (fs_passno) in /etc/fstab. Partitions on the same physical disk should be checked in sequence; multiple disks on the same controller can be checked in parallel, but take care not to overload the controller and slow everything down.
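As an illustration only (the devices here are hypothetical): fsck -A at boot checks the root filesystem (pass 1) first, then checks filesystems that share a pass number in parallel, so one way to get the ordering you want is to give partitions on the same disk different pass numbers and let partitions on different disks share one:

    # <device>   <mount>   <type>  <options>  <dump>  <pass>
    /dev/sda1    /         ext3    defaults   1       1
    /dev/sdb1    /srv/1    ext3    defaults   0       2
    /dev/sdc1    /srv/2    ext3    defaults   0       2
    /dev/sdb2    /srv/3    ext3    defaults   0       3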
http://lmgtfy.com/?q=fsck+benchmark
It looks like fsck on ext4 filesystems is significantly faster than on ext3, with some reports of ext4's fsck being 10 or more times faster.
Two quite interesting articles from that search:
http://thunk.org/tytso/blog/2008/08/08/fast-ext4-fsck-times/ and http://thunk.org/tytso/blog/2009/02/26/fast-ext4-fsck-times-revisited/
Is there any reason you can't use a filesystem that doesn't force time- or mount-count-based fscks on you when you reboot?
(The time-based fscks really bug me; for a long-uptime server, they pretty much guarantee that you'll have to sit through a full fsck whenever you upgrade the kernel.)
Anyway, XFS is one of the journaling filesystems that don't force an fsck. Worth a look.
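And if you do stay on ext2/3/4 and it's only the forced periodic checks that annoy you, note that they can be turned off per filesystem (the device name here is just a placeholder):

    # Disable mount-count-based (-c 0) and time-based (-i 0) forced checks
    tune2fs -c 0 -i 0 /dev/sdb1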