What is the best way to back up millions of small files in a very short time window? We have less than 5 hours to back up a file system containing around 60 million files, most of them small.
We have tried several solutions such as richcopy, 7z, and rsync, and all of them seem to struggle. We are looking for the most efficient approach.
We are open to putting the files in an archive first, or to transferring them to another location over the network or by physically moving a hard disk.
thanks
I'd strongly suggest using a file system or storage system that allows you to snapshot the volume and back up from the snapshot. That way there's little impact on the actual server, and the backup system can take its time without concern for the main system. You don't mention an operating system, but something like ZFS or a NetApp filer would allow this, and both are used for this exact purpose all over the place. I'm sure there are other file systems that offer this, but I know these work.
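If you're on ZFS, a minimal sketch of the snapshot-and-send approach looks like this (pool, dataset, and host names are placeholders):

    # take a point-in-time snapshot; this is nearly instant and doesn't touch the live files
    zfs snapshot tank/files@nightly
    # stream the snapshot to the backup host; later runs can use an incremental send (-i)
    zfs send tank/files@nightly | ssh backuphost zfs receive backuppool/files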
Hope this helps.
I worked with a server that stored roughly 20 million files, 95% of them under 4 KB in size, with about 50% deleted every 90 days. They used a raw disk image for backup. They also created an index file of names, MD5 hashes, and creation dates via a script, and used that to track the contents.
The original backup solution was to load the files as blobs into a database, keyed by MD5 signature. This was phased out because creating millions of MD5 hashes took longer than just making a raw image backup.
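For illustration, an index script along those lines could be as simple as the following on Linux (paths are placeholders, GNU find and md5sum are assumed, and modification time stands in for creation date):

    # record MD5 hash and path for every file
    find /data -type f -exec md5sum {} + > /backup/index.md5
    # record modification time (epoch seconds) and path
    find /data -type f -printf '%T@\t%p\n' > /backup/index.dates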
Do you really need to back up all of them every time? If you make incremental / differential backups, then you only need to back up the changes each time, rather than all files involved.
As you've already looked at rsync, you could look at using rsnapshot, which creates a sort of incremental backup. Alternatively, back up the whole volume (partition) as a "raw" device.
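For illustration, this is roughly what rsnapshot does under the hood using rsync's --link-dest (directory names are placeholders): unchanged files are hard-linked against the previous snapshot, so only changed files are actually read and copied.

    # daily.1 is yesterday's snapshot; unchanged files become hard links, changed files are copied
    rsync -a --delete --link-dest=/backup/daily.1 /data/ /backup/daily.0/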
The bottlenecks here are going to be the file system and the HDD itself. With many small files, the FS is constantly reading metadata about the files, which may be stored separately from the file data, and the files you are reading may not sit in a nice contiguous clump on the disk. In either case, the drive head has to move around a lot.
The faster you get all those small files into bigger files, the faster your overall process will be.
Unfortunately, if all you are doing is copying those files once, then having them in a single large file like an archive will only make the process slower.
read all files > archive > backup location
VS
read all files > backup location
The optimal approaches would be either to copy all the files once to a secondary location, then use the modified dates and sizes, or the archive bit since you are using Windows (not content examination like hashes, which would still involve reading every file), to determine which files have changed, copy just those to the secondary location, and back up from there; or to use a system that bypasses the FS entirely, such as the raw copy poige suggested.
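A minimal sketch of the archive-bit approach using robocopy (paths and thread count are placeholders): /M copies only files whose Archive attribute is set and then clears it, so each run picks up only files changed since the previous one.

    :: copy only changed files (Archive attribute set), clearing the attribute afterwards
    robocopy D:\data \\backupserver\staging /S /M /R:1 /W:1 /MT:32 /LOG:C:\logs\backup.log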
Windows Server Backup in Windows 2008 and later does volume-level images, so it doesn't have to trawl through all the millions of pieces of file metadata. It just takes a snapshot (or Volume Shadow Copy in MS parlance), then backs up all used blocks in the file system in order. Reads are sequential, so it is very fast, and it writes the results to a big .vhd file on another volume or network share.
There are a couple of downsides: every backup is a full backup, there is no compression, and you can only store one "image" per target folder if you're going to a network share. You can work around the one-image limit with scripts, and the full-backup and compression issues with other tools like 7-zip, rsync, or any other backup/compression/deduplication tool that can handle raw files.
You'll probably end up using the command-line wbadmin interface for this; ignore the GUI, as it is just too simplistic for most use cases.
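A minimal wbadmin invocation along those lines might look like this (the share path and volume letter are placeholders):

    wbadmin start backup -backupTarget:\\backupserver\wsb -include:D: -quiet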
This is what we did:
We bought a NAS with Windows Storage Server 2008 R2 on it, created an iSCSI target, which is in fact one large file (a .vhd), then mounted the iSCSI target and moved all the files to the virtual disk.
Now we back up the .vhd with our backup software. Backing up one big file is much faster than backing up a lot of small files.
You can also install the backup software on the NAS and attach your tape drive to it. That way you don't have to use double the storage (mirroring the data and then backing up the mirrored data in order to buy time).
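For reference, connecting to the target from the server side can be done with the built-in iscsicli tool (the portal address and IQN below are placeholders); once the virtual disk is online, the backup job only ever sees the single .vhd file.

    :: register the NAS portal and log in to the target that exposes the .vhd
    iscsicli QAddTargetPortal 192.168.1.50
    iscsicli QLoginTarget iqn.1991-05.com.microsoft:nas-filestore-target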