I need to serve around 70,000 static files (jpg) using nginx. Should I dump them all in a single directory, or is there a better (more efficient) way? Since the filenames are numeric, I considered having a directory structure like:
xxx/xxxx/xxx
The OS is CentOS 5.1
It really depends on the file system you're using to store the files.
Some filesystems (like ext2 and, to a lesser extent, ext3) are hideously slow when you have thousands of files in one directory, so using subdirectories is a very good idea.
Other filesystems, like XFS or reiserfs(*), don't slow down with thousands of files in one directory, so it doesn't matter whether you have one big directory or lots of smaller subdirectories.
(*) reiserfs has some nice features, but it's an experimental toy with a history of catastrophic failures. Don't use it on anything even remotely important.
Benchmark, benchmark, benchmark! You'll probably find no significant difference between the two options, meaning that your time is better spent on other problems. If you do benchmark and find no real difference, go with whichever scheme is easier -- what's easy to code if only programs have to access the files, or what's easy for humans to navigate if people need to work with the files frequently.
As to which one is faster: directory lookup time is, I believe, roughly proportional to the logarithm of the number of entries in the directory. So each of the three lookups for the nested structure will be faster than one big lookup, but the total of all three will probably be larger (splitting 70,000 files across three nested levels makes each individual directory much smaller, but the logarithms add back up to roughly the same total, plus the fixed overhead of opening three directories).
But don't trust me, I don't have a clue what I'm doing! Measure performance when it matters!
As others have said, directory hashing is very probably going to be the optimal approach.
What I would suggest, though, is making your URIs independent of whatever directory scheme you use, using nginx's rewrite module, e.g. map example.com/123456.jpg to /path/12/34/123456.jpg.
Then if your directory structure needs to change for performance reasons you can change that without changing your published URIs.
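A minimal sketch of that kind of mapping (the /path root and the two-digit split follow the example above; adjust them to whatever scheme you settle on):

    server {
        root /path;

        # a request for /123456.jpg is served from /path/12/34/123456.jpg
        rewrite "^/((\d\d)(\d\d)\d+\.jpg)$" /$2/$3/$1 last;
    }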
Doing some basic directory hashing is generally a good idea. Even if your file system deals well with 70k files, having, say, millions of files in one directory would become unmanageable. Also consider how well your backup software copes with huge directories, and so on.
That being said: to get replication (redundancy) and easier scalability, consider storing the files in MogileFS instead of just on the file system. If the files are small-ish and some are much more popular than others, consider using Varnish (varnish-cache.org) to serve them very quickly.
Another idea: use a CDN -- they are surprisingly cheap. We use one that costs basically the same as we pay for "regular bandwidth", even at low usage (10-20 Mbit/s).
You could put a Squid cache in front of your nginx server. Squid can either keep the popular images in memory or use its own file layout for fast lookups.
For Squid, the default is 16 level-one directories and 256 level-two directories. These are reasonable defaults for most file systems.
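In squid.conf those correspond to the L1/L2 values on the cache_dir line; a typical default looks something like this (the cache path and size here are just the stock values, not recommendations):

    # 100 MB ufs cache store with 16 first-level and 256 second-level directories
    cache_dir ufs /var/spool/squid 100 16 256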
If you don't use a product like Squid and create your own file structure, then you'll need to come up with a reasonable hashing scheme for your files. If the file names are randomly generated, this is easy: you can use the file name itself to divide files into buckets. If all your files look like IMG_xxxx, then you'll either need to use the least significant digits, or hash the file name and divide up based on that hash.
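As a sketch of the least-significant-digits idea on the nginx side (the /images prefix, /var/www root, and two-digit buckets are made-up assumptions), IMG_1234.jpg would live in a directory named after its last two digits:

    location ~ "^/images/(IMG_\d*(\d\d)\.jpg)$" {
        root /var/www;
        # a request for /images/IMG_1234.jpg is served from /var/www/images/34/IMG_1234.jpg
        try_files /images/$2/$1 =404;
    }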
As others have mentioned, you need to test to see what layout works best for your setup and usage pattern.
However, you may also want to look at the open_file_cache directive in nginx. See http://wiki.nginx.org/NginxHttpCoreModule#open_file_cache
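A minimal sketch of what that might look like (the numbers are illustrative, not tuned for any particular workload):

    # cache descriptors/metadata for up to 10,000 recently requested files
    open_file_cache          max=10000 inactive=60s;
    # re-validate cached entries every two minutes
    open_file_cache_valid    120s;
    # only cache files requested at least twice
    open_file_cache_min_uses 2;
    # also cache "not found" lookups
    open_file_cache_errors   on;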
By all means benchmark and use that information to help you make a decision, but if it were my system I would also give some consideration to long-term maintenance. Depending on what you need to do, it may be easier to manage things if there is a directory structure instead of everything in one directory.
Splitting them into directories sounds like a good idea. Basically (as you may know) the reason for this approach is that having too many files in one directory makes the directory index huge and causes the OS to take a long time to search through it; conversely, having too many levels of (in)direction (sorry, bad pun) means doing a lot of disk lookups for every file.
I would suggest splitting the files into one or two levels of directories -- run some trials to see what works best. If there are several images among the 70,000 that are significantly more popular than the others, try putting them all into one directory so that the OS can keep a cached directory index for them. Or, in fact, you could even put the popular images in the root directory, like this (the file names here are made up, of course):
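    /123456.jpg              (popular image, straight in the web root)
    /234567.jpg              (another popular image)
    /34/56/345678.jpg        (everything else, bucketed by its leading digits)
    /45/67/456789.jpg
    ...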
Hopefully you see the pattern. On Linux, you could use hard links for the popular images (but not symlinks; that decreases efficiency, AFAIK).
Also think about how people are going to be downloading the images. Is any individual client going to be requesting only a few images, or the whole set? Because in the latter case, it makes sense to create a TAR or ZIP archive file (or possibly several archive files) with the images in them, since transferring a few large files is more efficient than a lot of smaller ones.
P.S. I sort of got carried away with the theory, but kquinn is right: you really do need to run some experiments to see what works best for you, and it's very possible that the difference will be insignificant.
I think it's a good idea to break the files up into a hierarchy, if for no other reason than that, if you ever need to drop down and do an ls on the directory, it will take less time.
I don't know about ext4, but stock ext2 cannot handle that many files in one directory; reiserfs (reiser3) was designed to handle that well (an ls will still be ugly, though).