I want to manage a huge number of files on my server (say millions). I need to store them in two or three levels of folders to keep the number of files in each folder low. On the other hand, I don't want too many folders, since each folder consumes an inode.
What is the optimal number of files per folder? Is there a theoretical way to determine this, or does it depend on the server specifications?
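For a rough sense of the numbers I am dealing with, here is a quick back-of-envelope calculation (the 3 million file count and the fan-out of 256 per level are just illustrative assumptions, not choices I have settled on):

```python
# Back-of-envelope: files per leaf directory for a given depth and fan-out.
# The numbers here (3 million files, fan-out of 256) are illustrative assumptions.
total_files = 3_000_000
fanout = 256  # subdirectories created per level

for levels in (1, 2, 3):
    leaf_dirs = fanout ** levels
    files_per_dir = total_files / leaf_dirs
    print(f"{levels} level(s): {leaf_dirs:>10,} leaf dirs, "
          f"~{files_per_dir:,.1f} files per dir")
```

With those assumptions, two levels already bring the count down to a few dozen files per leaf directory, while a third level leaves most directories empty and burns inodes for nothing.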
Server specifications are likely to be less of an issue than the file system you are using. Different file systems use different structures to store directory entries, and that affects how quickly a directory can be scanned as it grows.
Another important consideration is the lifecycle of the files. If files are frequently added and deleted, you may want the leaf directories to be smaller than they otherwise would be.
You may want to look at the cache directory structures used by the Apache web server and the Squid proxy. These are well-tested caches that handle relatively high rates of change and scale well.
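As a rough illustration of that style of layout, here is a minimal sketch of a two-level tree loosely in the spirit of Squid's default 16 × 256 cache directories. The cache root path and the mapping from file number to directory are my own simplification for illustration, not Squid's actual on-disk scheme:

```python
import os

# Two-level layout loosely in the spirit of Squid's 16 x 256 cache directories.
# The mapping below is a simplification, not Squid's actual scheme.
L1, L2 = 16, 256
CACHE_ROOT = "/var/cache/myapp"  # hypothetical path

def path_for_file_number(fn: int) -> str:
    """Spread sequentially numbered objects across L1 * L2 leaf directories."""
    d1 = (fn // L2) % L1
    d2 = fn % L2
    return os.path.join(CACHE_ROOT, f"{d1:02X}", f"{d2:02X}", f"{fn:08X}")

print(path_for_file_number(123456))
# e.g. /var/cache/myapp/02/40/0001E240
```

The point is that no single directory ever holds more than a bounded share of the objects, regardless of how many files the cache accumulates.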
EDIT: The answer to your question depends largely on the life-cycle and access patterns of the files. These factors drive the disk I/O and buffer memory requirements; the raw number of files is likely to matter less.
Besides the file system chosen, memory, disk interfaces, the number of disks, and the RAID setup all affect disk access performance. Performance needs to meet your requirements with some headroom to spare.
Disk setup becomes more important as writes and deletes increase, and as file access becomes more random; both factors increase the demand for disk throughput.
Increasing memory generally makes it more likely that files are served from the disk buffers rather than from disk, which improves file access performance on most systems. Access to many large files, however, may result in poorer disk caching.
For most systems I have worked with, the likelihood a file will be accessed is related to when it was last accessed. The more recently a file was accessed the more likely it will be accessed again. Hashing algorithms tend to be important in optimizing retrieval in these cases. If file access is truly random, this is less significant.
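For example, a name-hashed layout makes retrieval a pure computation rather than a scan of one huge directory. This is a minimal sketch under my own assumptions (the store root and the two-character prefix depth are arbitrary illustrations):

```python
import hashlib
import os

STORE_ROOT = "/srv/files"  # hypothetical root

def hashed_path(name: str) -> str:
    """Derive a two-level directory from a hash of the name, so lookups
    recompute the path instead of scanning one huge directory."""
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    return os.path.join(STORE_ROOT, digest[:2], digest[2:4], name)

# Storing and retrieving the same name always derive the same path:
print(hashed_path("report-2013-01.pdf"))
```

Two hex characters per level gives 256 × 256 = 65,536 leaf directories, which keeps each leaf small even with millions of files.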
The disk I/O required to delete a file may be significantly higher than that required to add one. Many systems have significant problems deleting large numbers of files from large directories. The higher the rate of file additions and deletions, the more significant this becomes, which makes file lifecycle an important consideration.
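One common mitigation is to spread the deletes out rather than removing everything in one burst. A minimal sketch, assuming you can tolerate the cleanup taking longer (the batch size and pause interval are arbitrary assumptions):

```python
import os
import time

def delete_in_batches(directory: str, batch_size: int = 1000, pause: float = 0.5) -> None:
    """Delete every regular file in `directory` in bounded batches, pausing
    between batches so the deletes don't monopolise disk I/O."""
    with os.scandir(directory) as it:
        paths = [e.path for e in it if e.is_file(follow_symlinks=False)]
    for i in range(0, len(paths), batch_size):
        for path in paths[i:i + batch_size]:
            os.unlink(path)
        time.sleep(pause)  # give the application's I/O a chance to run
```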
Backups are another concern and may need to be scheduled so they don't cause disk buffering problems. Newer systems allow I/O to be niced so that backups and other maintenance jobs have less impact on the application.
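On Linux, for instance, the ionice utility can run a backup in the idle I/O scheduling class. A minimal sketch of launching such a job from Python (the tar command and paths are illustrative assumptions):

```python
import subprocess

# Run a backup in the idle I/O scheduling class (ionice -c 3) and at low CPU
# priority (nice -n 19) so it yields to the application's disk and CPU usage.
# The tar command and the paths below are illustrative assumptions.
subprocess.run(
    ["ionice", "-c", "3", "nice", "-n", "19",
     "tar", "-czf", "/backup/files.tar.gz", "/srv/files"],
    check=True,
)
```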