I have a project that will generate a huge number of images: around 1,000,000 to start. They are not large images, so I will store them all on one machine at first.
How do you recommend storing these images efficiently? (Currently on an NTFS file system.)
I am considering a naming scheme: to start, all the images will have an incremental name from 1 up. I hope this will help me sort them later if needed, and let me throw them into different folders.
Which would be a better naming scheme:
a/b/c/0 ... z/z/z/999
or
a/b/c/000 ... z/z/z/999
Any ideas on this?
I'd recommend using a regular file system instead of a database. Using the file system is easier than a database: you can use normal tools to access files, file systems are designed for this kind of usage, etc. NTFS should work just fine as a storage system.
Do not store the actual path in the database. It is better to store the image's sequence number in the database and have a function that can generate the path from the sequence number (a sketch of such a function follows the algorithm below).
That way it is easier to handle things if you need to change the directory structure somehow. Maybe you need to move the images to a different location, maybe you run out of space and start storing some of the images on disk A and some on disk B, etc. It is easier to change one function than to change the paths in the database.
I would use this kind of algorithm for generating the directory structure:

1. Pad your sequence number with leading zeroes until you have at least a 12-digit string, and add the file suffix: 12345 -> 000000012345.jpg
2. Split the first nine digits into 3-digit blocks and use those as directory names: 000000012345 -> 000/000/012
3. Store the file under the generated directory. The full path for sequence id 12345 is then 000/000/012/000000012345.jpg, and for sequence id 12345678901234 the path would be 123/456/789/12345678901234.jpg.
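A minimal sketch of such a path-generating function in Python (the function name, 12-digit padding, and .jpg suffix are just assumptions to illustrate the scheme):

    import os

    def generate_path_from_sequence_number(seq, root="images", suffix=".jpg"):
        """Map a numeric sequence id to a nested directory path.

        Pads the id to at least 12 digits, turns the first nine digits into
        three 3-digit directory names, and keeps the full padded id as the
        file name.
        """
        padded = str(seq).zfill(12)                        # 12345 -> '000000012345'
        dirs = [padded[i:i + 3] for i in range(0, 9, 3)]   # ['000', '000', '012']
        return os.path.join(root, *dirs, padded + suffix)

    # Example: images/000/000/012/000000012345.jpg
    print(generate_path_from_sequence_number(12345))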
Some things to consider about directory structures and file storage:
I'm going to put my 2 cents worth in on a piece of negative advice: Don't go with a database.
I've been working with image storing databases for years: large (1 MB to 1 GB) files, often changed, multiple versions of the file, accessed reasonably often. The database issues you run into with large files are extremely tedious to deal with; writing and transaction issues are knotty, and you run into locking problems that can cause major train wrecks. I have more practice writing dbcc scripts and restoring tables from backups than any normal person should ever have.
Most of the newer systems I've worked with have pushed the file storage to the file system, and relied on databases for nothing more than indexing. File systems are designed to take that sort of abuse, they're much easier to expand, and you seldom lose the whole file system if one entry gets corrupted.
I think most sites that have to deal with this use a hash of some sort to make sure that the files get evenly distributed in the folders.
So say you have a hash of a file that is something like this:
515d7eab9c29349e0cde90381ee8f810
You could store it in the following location, using however many levels deep you need to keep the number of files in each folder low.
\51\5d\7e\ab\9c\29\349e0cde90381ee8f810.jpg
I've seen this approach taken many times. You still need a database to map these file hashes to a human-readable name and whatever other metadata you need to store. But this approach scales pretty well because you can start to distribute the hash address space between multiple computers and/or storage pools, etc.
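A rough sketch of that layout in Python, assuming an MD5 hash of the file contents and six two-character directory levels to match the example above (the depth and the extension are up to you):

    import hashlib
    import os

    def hashed_path(data, root="store", levels=6, suffix=".jpg"):
        """Derive a storage path from the MD5 hash of the file contents.

        The first `levels` pairs of hex characters become nested directory
        names; the rest of the hash becomes the file name.
        """
        digest = hashlib.md5(data).hexdigest()             # 32 hex characters
        dirs = [digest[i:i + 2] for i in range(0, levels * 2, 2)]
        return os.path.join(root, *dirs, digest[levels * 2:] + suffix)

    # A hash of 515d7eab9c29349e0cde90381ee8f810 would land at
    # store/51/5d/7e/ab/9c/29/349e0cde90381ee8f810.jpg
    print(hashed_path(b"example image bytes"))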
Ideally, you should run some tests on random access times for various structures, as your specific hard drive setup, caching, available memory, etc. can change these results.
Assuming you have control over the filenames, I would partition them so you end up with on the order of 1000 files per directory. The more directory levels you add, the more inodes you burn, so there's a push-pull here.
E.g.,
/root/[0-99]/[0-99]/filename
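As a rough Python sketch of that two-level bucketing (assuming the file name is a numeric id; the modulo-100 buckets are just one way to derive the levels):

    def two_level_path(file_id, root="/root"):
        """Bucket files into /root/<0-99>/<0-99>/<id>.jpg so files spread
        evenly across up to 10,000 leaf directories.
        """
        top = (file_id // 100) % 100      # second-to-last two digits
        sub = file_id % 100               # last two digits
        return f"{root}/{top:02d}/{sub:02d}/{file_id}.jpg"

    # Example: /root/23/45/12345.jpg
    print(two_level_path(12345))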
Note, http://technet.microsoft.com/en-us/library/cc781134(WS.10).aspx has more details on NTFS setup. In particular, "If you use large numbers of files in an NTFS folder (300,000 or more), disable short-file name generation for better performance, and especially if the first six characters of the long file names are similar."
You should also look into disabling filesystem features you don't need (e.g., last access time). http://www.pctools.com/guides/registry/detail/50/
Whatever you do, don't store them all in one directory.
Depending on the distribution of the names of these images, you could create a directory structure with single-letter top-level folders, each containing another set of subfolders for the second letter of the image name, and so on.
So the folder
img\a\b\c\d\e\f\g\
would contain the images starting with 'abcdefg', and so on. You could introduce whatever depth is appropriate for you.
The great thing about this solution is that the directory structure effectively acts like a hashtable/dictionary. Given an image file name, you will know its directory, and given a directory, you will know a subset of the images that go there.
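A small sketch of that mapping in Python (the seven-letter depth and the img root are assumptions; the backslashes just follow the Windows-style path above):

    import os

    def letter_path(name, root="img", depth=7):
        r"""Spread files across folders named after the leading letters of
        the file name, e.g. img\a\b\c\d\e\f\g\abcdefg0001.jpg
        """
        stem = os.path.splitext(name)[0]
        dirs = list(stem[:depth])          # first `depth` characters
        return "\\".join([root, *dirs, name])

    # Example: img\a\b\c\d\e\f\g\abcdefg0001.jpg
    print(letter_path("abcdefg0001.jpg"))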
We have a photo store system with 4 million images. We use the database only for metadata, and all images are stored on the file system using a reversed naming scheme, where folder names are generated from the last digit of the file name, then the second-to-last, and so on. For example, 000001234.jpg is stored in a directory structure like 4\3\2\1\000001234.jpg.
This scheme works very well with an identity index in the database, because it fills the whole directory structure evenly.
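A quick sketch of that reversed-digit layout in Python (the four-level depth and nine-digit zero padding are assumptions taken from the example above):

    def reversed_digit_path(file_id, levels=4, width=9):
        r"""Build folder names from the last digits of the id, in reverse,
        so sequentially generated ids spread evenly: 1234 -> 4\3\2\1\000001234.jpg
        """
        digits = str(file_id).zfill(width)
        dirs = [digits[-(i + 1)] for i in range(levels)]   # last digit first
        return "\\".join(dirs + [digits + ".jpg"])

    # Example: 4\3\2\1\000001234.jpg
    print(reversed_digit_path(1234))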
I would store these on the file system, but it depends on how fast the number of files will grow. Are these files hosted on the web? How many users would access these files? These are questions that need to be answered before I could give you a better recommendation. I would also look at Haystack from Facebook; they have a very good solution for storing and serving up images.
Also, if you choose the file system, you will need to partition these files into directories. I have been looking at this issue and proposed a solution, but it's not a perfect one by any means. I am partitioning by hash table and by user; you can read more on my blog.
MS SQL Server 2008 has a new feature to handle such cases; it's called FILESTREAM. Take a look:
Microsoft TechNet FILESTREAM Overview
Quick point: you don't need to store a file path in your DB. You can just store a numeric value, if your files are named in the way you describe. Then, using one of the well-defined storage schemes already discussed, you can get the index as a number and very quickly find the file by traversing the directory structure.
Will your images need to be named uniquely? Can the process that generates these images produce the same filename more than once? Hard to say without knowing what device is creating the filenames, but consider what happens if that device is 'reset' and, upon restart, it begins naming the images as it did before the last 'reset', if that is a concern.
Also, you say that you will hit 1 million images in one month's time. How about after that? How fast will these images continue to fill the file system? Will they top off at some point and level out at about 1 million TOTAL images, or will the count continue to grow and grow, month after month?
I ask because you could begin designing your file system by month, then by image. I might be inclined to suggest that you store the images in a date-based directory structure, e.g. year\month\day\image (a rough sketch follows below).
Month, year, even day is good for security-type images. I'm not sure if this is what you are doing, but I did that with a home security camera that snapped a photo every 10 seconds... This way your application can drill down to a specific time, or even a range, where you might think the image was generated. Or, instead of year and month, is there some other "meaning" that can be derived from the image file itself? Some other descriptors, other than the date example I gave?
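A tiny sketch of such a date-based layout in Python (where the capture timestamp comes from, e.g. camera metadata or file time, is up to you; here it is simply passed in):

    from datetime import datetime

    def dated_path(taken_at, name, root="images"):
        """Place each image under year/month/day so the application can
        drill down to a specific time or range, e.g. images/2010/01/15/cam_0001.jpg
        """
        return f"{root}/{taken_at:%Y/%m/%d}/{name}"

    # Example: images/2010/01/15/cam_0001.jpg
    print(dated_path(datetime(2010, 1, 15, 12, 30), "cam_0001.jpg"))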
I would not store the binary data in the DB. I've never had good performance or luck with that sort of thing, and I can't imagine it working well with 1 million images. I would store the filename and that is it. If they are all going to be JPGs, then don't even store the extension. I would create a control table that stores a pointer to the file's server, drive, path, etc. That way you can move those images to another box and still locate them. Do you need to keyword-tag your images? If so, then you would want to build the appropriate tables that allow that sort of tagging.
You or others may have addressed these ideas while I was replying. Hope this helps.