I need to store 100k files (around 40GB) on a USB drive. Each file has a unique int id (e.g. 45000).
Option one is to put all files in a single folder:
root/
root/1.pdf
root/2.pdf
root/3.pdf
...
root/567.pdf
root/568.pdf
root/569.pdf
...
root/10001.pdf
root/10002.pdf
root/10003.pdf
...
root/99998.pdf
root/99999.pdf
root/100000.pdf
Option two is to create a [1-9][0-9]* folder hierarchy based on that id:
root/
root/1/file.pdf
root/2/file.pdf
root/3/file.pdf
...
root/5/6/7/file.pdf
root/5/6/8/file.pdf
root/5/6/9/file.pdf
...
root/1/0/0/0/1/file.pdf
root/1/0/0/0/2/file.pdf
root/1/0/0/0/3/file.pdf
...
root/9/9/9/9/8/file.pdf
root/9/9/9/9/9/file.pdf
root/1/0/0/0/0/0/file.pdf
Which option will scale better? I can understand that the second option will require tons of folders but each folder will at most contain 10 folders and 1 file. Maintenance will not be an issue since everything will be controlled by an application.
Note that this is a USB drive on Linux and, based on the above, I'd also like to know whether I should go with FAT32 or NTFS.
I would recommend ext3/4 for use with Linux as my personal preference.
For the file structure I would recommend option number 3 (a balance of directory depth and files per directory). This is really just about choosing a tree data structure. To achieve this, I would compute an md5sum hash of each file and use the first x characters of each hash as directories. The characters will always be hexadecimal digits, so each branch will be 16 directories wide. The number of characters you choose will be the height of the tree structure.
For example, with x = 3, a file named y.tab.h whose hash begins with 03b would go in something like:
./0/3/b/y.tab.h
The question "How to pre-create directories on linux for file storage?" shows you how to pre-create the directories.
This is a generic solution that works pretty well for many use cases and should create a pretty good distribution of files.
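To make the hashing scheme concrete, here is a minimal Python sketch. The function names `hashed_path` and `precreate_dirs`, the default depth of 3, and the choice of Python itself are my own assumptions for illustration; the idea (first x hex digits of the MD5 become nested directories) is the one described above.

```python
import hashlib
import itertools
from pathlib import Path

HEX_DIGITS = "0123456789abcdef"


def hashed_path(root, file_path, depth=3):
    """Return root/<d1>/<d2>/.../<name>, where d1..dN are the first
    `depth` hex digits of the file's MD5 digest (hypothetical helper)."""
    digest = hashlib.md5()
    with open(file_path, "rb") as f:
        # Read in 1 MiB chunks so large files are not loaded into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    prefix = digest.hexdigest()[:depth]
    # Unpacking the prefix string yields one path component per hex digit.
    return Path(root).joinpath(*prefix, Path(file_path).name)


def precreate_dirs(root, depth=3):
    """Create all 16**depth leaf directories under root (hypothetical helper)."""
    for combo in itertools.product(HEX_DIGITS, repeat=depth):
        Path(root).joinpath(*combo).mkdir(parents=True, exist_ok=True)
```

With a depth of 3 there are 16^3 = 4096 leaf directories, so 100k files would average roughly 24 files per directory, assuming the hashes spread evenly.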
Some level of folder-tree system is definitely recommended; avoid putting more than about 10k files per folder. Don't use FAT if NTFS or EXT will work.
If you use a directory structure, I'd put 100-200 files in each directory. So, at the top level, perhaps this:
And below that...
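As one possible way to get roughly 100 files per directory directly from the integer ids, here is a minimal Python sketch. The two-level layout, the function name `bucketed_path`, the zero-padded directory names, and the `per_dir`/`fanout` values are all assumptions for illustration, not something specified in this thread.

```python
from pathlib import Path


def bucketed_path(root, file_id, per_dir=100, fanout=100):
    """Map an integer id to root/<top>/<bucket>/<id>.pdf (hypothetical layout).

    Each bucket directory holds at most `per_dir` files, and each top-level
    directory holds at most `fanout` bucket directories.
    """
    bucket = file_id // per_dir   # ids 45000..45099 share bucket 450
    top = bucket // fanout        # buckets 400..499 share top-level dir 4
    return Path(root) / f"{top:02d}" / f"{bucket:04d}" / f"{file_id}.pdf"
```

Under these assumed values, id 45000 maps to root/04/0450/45000.pdf, and 100k ids need 11 top-level directories with at most 100 subdirectories each.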
FAT slows down as the total number of files on the disk increases, not the number per directory, so you should not see any difference between the two layouts.
Read this Microsoft paper about FAT32.