I need to serve around 70,000 static files (jpg) using nginx. Should I dump them all in a single directory, or is there a better (more efficient) way? Since the filenames are numeric, I considered having a directory structure like:
xxx/xxxx/xxx
The OS is CentOS 5.1
It really depends on the file system you're using to store the files.
Some filesystems (like ext2 and, to a lesser extent, ext3) are hideously slow when you have thousands of files in one directory, so using subdirectories is a very good idea.
Other filesystems, like XFS or reiserfs(*), don't slow down with thousands of files in one directory, so it doesn't matter whether you have one big directory or lots of smaller subdirectories.
(*) reiserfs has some nice features, but it's an experimental toy with a history of catastrophic failures. Don't use it on anything even remotely important.
Benchmark, benchmark, benchmark! You'll probably find no significant difference between the two options, meaning that your time is better spent on other problems. If you do benchmark and find no real difference, go with whichever scheme is easier -- what's easy to code if only programs have to access the files, or what's easy for humans to navigate if people need to work with the files frequently.
As to which one is faster: directory lookup time is, I believe, roughly proportional to the logarithm of the number of entries in the directory. So each of the three lookups for the nested structure will be faster than one big lookup, but the total of all three will probably be larger (splitting 70,000 files across three nested levels makes each individual directory much smaller, but the logarithms add back up to roughly the same total, plus the fixed overhead of opening three directories).
But don't trust me, I don't have a clue what I'm doing! Measure performance when it matters!
As others have said, directory hashing is very probably going to be the optimal approach.
What I would suggest, though, is making your URIs independent of whatever directory scheme you use, using nginx's rewrite module, e.g. map example.com/123456.jpg to /path/12/34/123456.jpg.
Then if your directory structure needs to change for performance reasons you can change that without changing your published URIs.
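A minimal sketch of that kind of mapping (the /path root and the two-digit split follow the example above; adjust them to whatever scheme you settle on):

    server {
        root /path;

        # a request for /123456.jpg is served from /path/12/34/123456.jpg
        rewrite "^/((\d\d)(\d\d)\d+\.jpg)$" /$2/$3/$1 last;
    }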
Doing some basic directory hashing is generally a good idea. Even if your file system deals well with 70k files, having, say, millions of files in one directory would become unmanageable. Also consider how well your backup software copes with huge directories, and so on.
That being said: to get replication (redundancy) and easier scalability, consider storing the files in MogileFS instead of just on the file system. If the files are small-ish and some are much more popular than others, consider using Varnish (varnish-cache.org) to serve them very quickly.
Another idea: use a CDN -- they are surprisingly cheap. We use one that costs basically the same as we pay for "regular bandwidth", even at low usage (10-20 Mbit/s).
You could put a Squid cache in front of your nginx server. Squid can either keep the popular images in memory or use its own file layout for fast lookups.
For Squid, the default is 16 level-one directories and 256 level-two directories. These are reasonable defaults for most file systems.
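In squid.conf those correspond to the L1/L2 values on the cache_dir line; a typical default looks something like this (the cache path and size here are just the stock values, not recommendations):

    # 100 MB ufs cache store with 16 first-level and 256 second-level directories
    cache_dir ufs /var/spool/squid 100 16 256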
If you don't use a product like Squid and create your own file structure, then you'll need to come up with a reasonable hashing scheme for your files. If the file names are randomly generated, this is easy: you can use the file name itself to divide files into buckets. If all your files look like IMG_xxxx, then you'll either need to use the least significant digits, or hash the file name and divide up based on that hash.
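As a sketch of the least-significant-digits idea on the nginx side (the /images prefix, /var/www root, and two-digit buckets are made-up assumptions), IMG_1234.jpg would live in a directory named after its last two digits:

    location ~ "^/images/(IMG_\d*(\d\d)\.jpg)$" {
        root /var/www;
        # a request for /images/IMG_1234.jpg is served from /var/www/images/34/IMG_1234.jpg
        try_files /images/$2/$1 =404;
    }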
As others have mentioned, you need to test to see what layout works best for your setup and usage pattern.
However, you may also want to look at the open_file_cache directive in nginx. See http://wiki.nginx.org/NginxHttpCoreModule#open_file_cache
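A minimal sketch of what that might look like (the numbers are illustrative, not tuned for any particular workload):

    # cache descriptors/metadata for up to 10,000 recently requested files
    open_file_cache          max=10000 inactive=60s;
    # re-validate cached entries every two minutes
    open_file_cache_valid    120s;
    # only cache files requested at least twice
    open_file_cache_min_uses 2;
    # also cache "not found" lookups
    open_file_cache_errors   on;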
By all means benchmark and use that information to help you make a decision, but if it were my system I would also give some consideration to long-term maintenance. Depending on what you need to do, it may be easier to manage things if there is a directory structure instead of everything in one directory.
Splitting them into directories sounds like a good idea. Basically (as you may know) the reason for this approach is that having too many files in one directory makes the directory index huge and causes the OS to take a long time to search through it; conversely, having too many levels of (in)direction (sorry, bad pun) means doing a lot of disk lookups for every file.
I would suggest splitting the files into one or two levels of directories -- run some trials to see what works best. If there are several images among the 70,000 that are significantly more popular than the others, try putting them all into one directory so that the OS can keep a cached directory index for them. Or, in fact, you could even put the popular images in the root directory, like this (the file names here are made up, of course):
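    /123456.jpg              (popular image, straight in the web root)
    /234567.jpg              (another popular image)
    /34/56/345678.jpg        (everything else, bucketed by its leading digits)
    /45/67/456789.jpg
    ...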
Hopefully you see the pattern. On Linux, you could use hard links for the popular images (but not symlinks; that decreases efficiency, AFAIK).
Also think about how people are going to be downloading the images. Is any individual client going to be requesting only a few images, or the whole set? Because in the latter case, it makes sense to create a TAR or ZIP archive file (or possibly several archive files) with the images in them, since transferring a few large files is more efficient than a lot of smaller ones.
P.S. I sort of got carried away with the theory, but kquinn is right: you really do need to run some experiments to see what works best for you, and it's very possible that the difference will be insignificant.
I think it's a good idea to break the files up into a hierarchy, if for no other reason than that, if you ever need to drop down and do an ls on the directory, it will take less time.
I don't know about ext4, but stock ext2 cannot handle that many files in one directory; reiserfs (reiser3) was designed to handle that well (an ls will still be ugly, though).