I have a project that will generate a huge number of images: around 1,000,000 to start. They are not large images, so I will store them all on one machine at first.
How do you recommend storing these images efficiently? (Currently on an NTFS file system.)
I am considering a naming scheme: to start, all the images will have an incremental name from 1 up. I hope this will help me sort them later if needed, and let me throw them into different folders.
Which would be a better naming scheme:
a/b/c/0 ... z/z/z/999
or
a/b/c/000 ... z/z/z/999
Any ideas on this?
I'd recommend using a regular file system instead of a database. Using the file system is easier than a database: you can use normal tools to access files, file systems are designed for this kind of usage, etc. NTFS should work just fine as a storage system.
Do not store the actual path in the database. It is better to store the image's sequence number in the database and have a function that can generate the path from the sequence number (a sketch of such a function follows the algorithm below).
That way it is easier to handle things if you need to change the directory structure somehow. Maybe you need to move the images to a different location, maybe you run out of space and start storing some of the images on disk A and some on disk B, etc. It is easier to change one function than to change the paths in the database.
I would use this kind of algorithm for generating the directory structure:

1. Pad your sequence number with leading zeroes until you have at least a 12-digit string, and add the file suffix: 12345 -> 000000012345.jpg
2. Split the first nine digits into 3-digit blocks and use those as directory names: 000000012345 -> 000/000/012
3. Store the file under the generated directory. The full path for sequence id 12345 is then 000/000/012/000000012345.jpg, and for sequence id 12345678901234 the path would be 123/456/789/12345678901234.jpg.
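A minimal sketch of such a path-generating function in Python (the function name, 12-digit padding, and .jpg suffix are just assumptions to illustrate the scheme):

    import os

    def generate_path_from_sequence_number(seq, root="images", suffix=".jpg"):
        """Map a numeric sequence id to a nested directory path.

        Pads the id to at least 12 digits, turns the first nine digits into
        three 3-digit directory names, and keeps the full padded id as the
        file name.
        """
        padded = str(seq).zfill(12)                        # 12345 -> '000000012345'
        dirs = [padded[i:i + 3] for i in range(0, 9, 3)]   # ['000', '000', '012']
        return os.path.join(root, *dirs, padded + suffix)

    # Example: images/000/000/012/000000012345.jpg
    print(generate_path_from_sequence_number(12345))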
Some things to consider about directory structures and file storage:
I'm going to put my 2 cents worth in on a piece of negative advice: Don't go with a database.
I've been working with image storing databases for years: large (1 MB to 1 GB) files, often changed, multiple versions of the file, accessed reasonably often. The database issues you run into with large files are extremely tedious to deal with; writing and transaction issues are knotty, and you run into locking problems that can cause major train wrecks. I have more practice writing dbcc scripts and restoring tables from backups than any normal person should ever have.
Most of the newer systems I've worked with have pushed the file storage to the file system, and relied on databases for nothing more than indexing. File systems are designed to take that sort of abuse, they're much easier to expand, and you seldom lose the whole file system if one entry gets corrupted.
I think most sites that have to deal with this use a hash of some sort to make sure that the files get evenly distributed in the folders.
So say you have a hash of a file that is something like this:
515d7eab9c29349e0cde90381ee8f810
You could store it in the following location, using however many levels deep you need to keep the number of files in each folder low.
\51\5d\7e\ab\9c\29\349e0cde90381ee8f810.jpg
I've seen this approach taken many times. You still need a database to map these file hashes to a human-readable name and whatever other metadata you need to store. But this approach scales pretty well because you can start to distribute the hash address space between multiple computers and/or storage pools, etc.
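A rough sketch of that layout in Python, assuming an MD5 hash of the file contents and six two-character directory levels to match the example above (the depth and the extension are up to you):

    import hashlib
    import os

    def hashed_path(data, root="store", levels=6, suffix=".jpg"):
        """Derive a storage path from the MD5 hash of the file contents.

        The first `levels` pairs of hex characters become nested directory
        names; the rest of the hash becomes the file name.
        """
        digest = hashlib.md5(data).hexdigest()             # 32 hex characters
        dirs = [digest[i:i + 2] for i in range(0, levels * 2, 2)]
        return os.path.join(root, *dirs, digest[levels * 2:] + suffix)

    # A hash of 515d7eab9c29349e0cde90381ee8f810 would land at
    # store/51/5d/7e/ab/9c/29/349e0cde90381ee8f810.jpg
    print(hashed_path(b"example image bytes"))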
Ideally, you should run some tests on random access times for various structures, as your specific hard drive setup, caching, available memory, etc. can change these results.
Assuming you have control over the filenames, I would partition them so you end up with on the order of 1000 files per directory. The more directory levels you add, the more inodes you burn, so there's a push-pull here.
E.g.,
/root/[0-99]/[0-99]/filename
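As a rough Python sketch of that two-level bucketing (assuming the file name is a numeric id; the modulo-100 buckets are just one way to derive the levels):

    def two_level_path(file_id, root="/root"):
        """Bucket files into /root/<0-99>/<0-99>/<id>.jpg so files spread
        evenly across up to 10,000 leaf directories.
        """
        top = (file_id // 100) % 100      # second-to-last two digits
        sub = file_id % 100               # last two digits
        return f"{root}/{top:02d}/{sub:02d}/{file_id}.jpg"

    # Example: /root/23/45/12345.jpg
    print(two_level_path(12345))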
Note, http://technet.microsoft.com/en-us/library/cc781134(WS.10).aspx has more details on NTFS setup. In particular, "If you use large numbers of files in an NTFS folder (300,000 or more), disable short-file name generation for better performance, and especially if the first six characters of the long file names are similar."
You should also look into disabling filesystem features you don't need (e.g., last access time). http://www.pctools.com/guides/registry/detail/50/
Whatever you do, don't store them all in one directory.
Depending on the distribution of the names of these images, you could create a directory structure with single-letter top-level folders, each containing another set of subfolders for the second letter of the image name, and so on.
So the folder
img\a\b\c\d\e\f\g\
would contain the images starting with 'abcdefg', and so on. You could introduce whatever depth is appropriate for you.
The great thing about this solution is that the directory structure effectively acts like a hashtable/dictionary. Given an image file name, you will know its directory, and given a directory, you will know a subset of the images that go there.
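A small sketch of that mapping in Python (the seven-letter depth and the img root are assumptions; the backslashes just follow the Windows-style path above):

    import os

    def letter_path(name, root="img", depth=7):
        r"""Spread files across folders named after the leading letters of
        the file name, e.g. img\a\b\c\d\e\f\g\abcdefg0001.jpg
        """
        stem = os.path.splitext(name)[0]
        dirs = list(stem[:depth])          # first `depth` characters
        return "\\".join([root, *dirs, name])

    # Example: img\a\b\c\d\e\f\g\abcdefg0001.jpg
    print(letter_path("abcdefg0001.jpg"))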
We have a photo store system with 4 million images. We use the database only for metadata, and all images are stored on the file system using a reversed naming scheme, where folder names are generated from the last digit of the file name, then the second-to-last, and so on. For example, 000001234.jpg is stored in a directory structure like 4\3\2\1\000001234.jpg.
This scheme works very well with an identity index in the database, because it fills the whole directory structure evenly.
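A quick sketch of that reversed-digit layout in Python (the four-level depth and nine-digit zero padding are assumptions taken from the example above):

    def reversed_digit_path(file_id, levels=4, width=9):
        r"""Build folder names from the last digits of the id, in reverse,
        so sequentially generated ids spread evenly: 1234 -> 4\3\2\1\000001234.jpg
        """
        digits = str(file_id).zfill(width)
        dirs = [digits[-(i + 1)] for i in range(levels)]   # last digit first
        return "\\".join(dirs + [digits + ".jpg"])

    # Example: 4\3\2\1\000001234.jpg
    print(reversed_digit_path(1234))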
I would store these on the file system, but it depends on how fast the number of files will grow. Are these files hosted on the web? How many users would access these files? These are questions that need to be answered before I could give you a better recommendation. I would also look at Haystack from Facebook; they have a very good solution for storing and serving up images.
Also, if you choose the file system, you will need to partition these files into directories. I have been looking at this issue and proposed a solution, but it's not a perfect one by any means. I am partitioning by hash table and by user; you can read more on my blog.
MS SQL Server 2008 has a new feature to handle such cases; it's called FILESTREAM. Take a look:
Microsoft TechNet FILESTREAM Overview
Quick point: you don't need to store a file path in your DB. You can just store a numeric value, if your files are named in the way you describe. Then, using one of the well-defined storage schemes already discussed, you can get the index as a number and very quickly find the file by traversing the directory structure.
Will your images need to be named uniquely? Can the process that generates these images produce the same filename more than once? Hard to say without knowing what device is creating the filenames, but consider what happens if that device is 'reset' and, upon restart, it begins naming the images as it did before the last 'reset', if that is a concern.
Also, you say that you will hit 1 million images in one month's time. How about after that? How fast will these images continue to fill the file system? Will they top off at some point and level out at about 1 million TOTAL images, or will the count continue to grow and grow, month after month?
I ask because you could begin designing your file system by month, then by image. I might be inclined to suggest that you store the images in a date-based directory structure, e.g. year\month\day\image (a rough sketch follows below).
Month, year, even day is good for security-type images. I'm not sure if this is what you are doing, but I did that with a home security camera that snapped a photo every 10 seconds... This way your application can drill down to a specific time, or even a range, where you might think the image was generated. Or, instead of year and month, is there some other "meaning" that can be derived from the image file itself? Some other descriptors, other than the date example I gave?
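A tiny sketch of such a date-based layout in Python (where the capture timestamp comes from, e.g. camera metadata or file time, is up to you; here it is simply passed in):

    from datetime import datetime

    def dated_path(taken_at, name, root="images"):
        """Place each image under year/month/day so the application can
        drill down to a specific time or range, e.g. images/2010/01/15/cam_0001.jpg
        """
        return f"{root}/{taken_at:%Y/%m/%d}/{name}"

    # Example: images/2010/01/15/cam_0001.jpg
    print(dated_path(datetime(2010, 1, 15, 12, 30), "cam_0001.jpg"))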
I would not store the binary data in the DB. I've never had good performance or luck with that sort of thing, and I can't imagine it working well with 1 million images. I would store the filename and that is it. If they are all going to be JPGs, then don't even store the extension. I would create a control table that stores a pointer to the file's server, drive, path, etc. That way you can move those images to another box and still locate them. Do you need to keyword-tag your images? If so, then you would want to build the appropriate tables that allow that sort of tagging.
You or others may have addressed these ideas while I was replying. Hope this helps.