According to this paper on Facebook's Haystack:
"Because of how the NAS appliances manage directory metadata, placing thousands of files in a directory was extremely inefficient as the directory’s blockmap was too large to be cached effectively by the appliance. Consequently it was common to incur more than 10 disk operations to retrieve a single image. After reducing directory sizes to hundreds of images per directory, the resulting system would still generally incur 3 disk operations to fetch an image: one to read the directory metadata into memory, a second to load the inode into memory, and a third to read the file contents."
I had assumed the filesystem directory metadata & inode would always be cached in RAM by the OS and a file read would usually require just 1 disk IO.
Is this "multiple disk IOs to read a single file" problem outlined in that paper unique to NAS appliances, or does Linux have the same problem too?
I'm planning to run a Linux server for serving images. Is there any way I can minimize the number of disk IOs, ideally making sure the OS caches all the directory and inode data in RAM so that each file read requires no more than 1 disk IO?
This depends on the filesystem being used. Some filesystems handle the large-directory problem better than others, and yes, caching does affect how many disk operations a read ends up needing.
Older versions of ext3 had a very bad problem handling directories with thousands of files in them, which was fixed when the dir_index feature was introduced. Without dir_index, retrieving a file from a directory with thousands of files can be quite expensive. Without knowing the details, I suspect that's what the NAS device in the article was running into.
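If you're on ext3/ext4 and unsure whether dir_index is actually in effect, here's a rough sketch of how you could check and enable it (the device name /dev/sdb1 is just a placeholder for your data volume):

```
# Check whether the dir_index feature is enabled on the filesystem
sudo tune2fs -l /dev/sdb1 | grep dir_index

# Enable hashed b-tree directory indexing on an existing ext3/ext4 filesystem
sudo tune2fs -O dir_index /dev/sdb1

# Existing directories only get indexed by an offline fsck:
# -f forces the check, -D optimizes (rebuilds) directory indexes
sudo umount /dev/sdb1
sudo e2fsck -fD /dev/sdb1
```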
Modern filesystems (the latest ext3, ext4, XFS) handle the large-directory problem a lot better than in olden days. Some of the inodes can get large, but the b-trees in common usage for indexing the directories make for very speedy fopen() times.

Yes, but you did not read the paragraph you quoted carefully enough. It says it clearly:
Appliances are low-end hardware. Too much metadata + too little RAM = NO WAY TO CACHE IT.
If you run a large file server, get a real server, not a low-end appliance.
If you can live without updated access times on files and directories, you can save a lot of I/O requests by mounting the filesystem with the 'noatime' option.
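As a minimal sketch (the mount point /srv/images and device /dev/sdb1 are placeholders for your own setup):

```
# Remount an already-mounted filesystem with noatime, effective immediately
sudo mount -o remount,noatime /srv/images

# To make it permanent, add noatime to the options column in /etc/fstab:
# /dev/sdb1  /srv/images  ext4  defaults,noatime  0  2
```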
Caching directory and inode data is done by default in Linux. If you have a good amount of RAM, you will get good caching.
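If you want to nudge the kernel toward holding on to its dentry and inode caches under memory pressure, one knob worth experimenting with is vm.vfs_cache_pressure (the value 50 below is only an illustrative starting point, not a tested recommendation):

```
# Show the current value (the kernel default is 100)
sysctl vm.vfs_cache_pressure

# Lower values make the kernel prefer to keep directory/inode cache entries
sudo sysctl vm.vfs_cache_pressure=50

# Persist across reboots by adding this line to /etc/sysctl.conf:
# vm.vfs_cache_pressure = 50
```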
It's about careful measurement. If your main purpose is serving images, then I'd think your network traffic would be dominated by them. Further, if you're doing no caching, the disk rates should approximate the network rates. Finally, if you're doing perfect caching, the network rates would stay the same and the disk rates would go to 0.
In other words, measure it all! I use collectl exclusively for this, as do many users of some of the largest clusters in the world.
Just download/install it and start it up. It will log a ton of stuff which you can play back or even plot. Then look at the numbers and figure out how efficiently your caching is working.
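As a rough sketch of that workflow (double-check the exact switches against the collectl man page; the subsystem letters below are the usual ones for CPU, disk, and network):

```
# Watch CPU, disk, and network rates interactively, one sample per second
collectl -scdn

# Record the same subsystems to files under /var/log/collectl for later analysis
collectl -scdn -f /var/log/collectl

# Play back a recorded file and compare disk vs. network throughput
collectl -scdn -p /var/log/collectl/<recorded-file>
```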
-mark