For small files hashing is just ok, but with huge ones you can easily find md5sum
is CPU bound. Is there any hashing algorithm able to scale out on multiple cores? Any workarounds? Ideas? Anything? :)
For small files hashing is just ok, but with huge ones you can easily find md5sum
is CPU bound. Is there any hashing algorithm able to scale out on multiple cores? Any workarounds? Ideas? Anything? :)
Mine own best at the moment solution is:
parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend '' \ -k -j …NUMofProcessesSay4… md5sum | md5sum
— It should be noted that:
pipe
and not file as inputparallel
's--pipepart
as I found out doesn't support disk partitionsSo I'd love to hear other ways around as well.
Unfortunately, MD5 is a linear process where its state depend on all the previous input. In other words, you can't truly parallelize it. Moreover, I'm not aware of any real hash alg that does not operate in this manner.
What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.
If you can't/wan't do that, you had to use a faster hash function as xxHash, CityHash or SpookyHash
Other idea (maybe it is applicable to your intented usage): if you need something faster than MD5 (albeit single-threaded), you can use CRC32 (which is hardware-accelerated by recent CPUs) for a first fast pass, resorting to MD5/SHA1 for a second pass on seemly identical files.
There's pretty much no getting around processing the entire file. MD4 or CRC32 are probably your best bets for a widely deployed and fast algorithm (though CRC32 is going to be far less effective than MD4).
Testing various implementation of your algorithm of choice will help. If you can find a well-tested asm implementation, it will likely improve upon the performance of its C/C++ cousins.
If you don't really care about interoperability, hashing across multiple cores is easily doable by splitting the file into chunks (doesn't need to be done on-disk, you'd just start reading from specific offsets) and processing each chunk separately (this will result in serious disk thrashing though, degrading the performance, especially for mechanical disks). You'll end up with separate hashes for each chunk (though this has other advantages, like pointing you toward the broken chunk) but you could always hash them together for one final value.
This Gist could be a good start for something in Python.
Most of the answers here have addressed the linear nature of most hashing algorithms. Although I'm sure there exists some true scalable hashing algorithms, an easier solution is to simply split the data up into smaller pieces, and hash each individually.
Consider the BitTorrent approach: When a Torrent is created, all of the files are split into 'blocks', each block individually hashed, and each of those hashes recorded in the .torrent file. This is what allows a peer to incrementally verify incoming data, without having to wait for the entire file to finish downloading first. Errors can also be corrected on a per-block basis, instead of requiring re-transmission of the entire file. Aside from the logistical benefits, this approach also allows hashing to scale across multiple cores - if 8 cores are available, 8 blocks can be simultaneously hashed.
If you engineer your verification process to work on some subset of the data, e.g. blocks of some fixed size, you can hash each block on a separate core, thus eliminating a large amount of delay in the pipeline. Obviously, this approach has a small time/memory trade-off: each additional instance of hashing has some overhead associated with it, mostly in the form of memory, although this is minimal unless you're running hundreds of instances.
I'm working on a tree hashing project, which is designed for exactly this problem: off-the-shelf parallel hashing of large files. It works now, though it hasn't been reviewed, and there's a good chance that changes from review will result in changes to the final digest. That said, it's very fast: https://github.com/oconnor663/bao
You can use md5deep for this and hashdeep for other hashes. It supports multi threading with the
-j
flag. By default it will create a hashing thread for each core. It also has a flag to break files into pieces before hashing but will not use multiple threads on a singe file. I've used this for getting sha256 of half a million files and it worked great. It also has a recursive flash which makes handling large directory trees easier.Here is the manpage for it http://md5deep.sourceforge.net/md5deep.html and git repo https://github.com/jessek/hashdeep
The package name in ubuntu and debian is md5deep and includes hashdeep.
It's easy to design a hashing algorithm that is scalable over multiple cores, it's just that the best known hashing algorithms tend to be designed specifically to prevent that, in order that tasks like finding hash collisions are made as slow as possible.
Hashing functions that don't force serial processing may suit you, but that depends what properties you expect of your hashing function. As such, I don't think that you've given enough information for a good recommendation to be made.
As others have suggested, you can construct a hashing function as the hash of the concatenated hashes of each of the blocks of some given size in the original. So long as the block size is large enough to make it difficult to reverse the hashes of individual blocks, this is likely to work well enough for most purposes. How big that should be depends on how predictable the content of those blocks is. If you are able to estimate the entropy, and choose a block size such that you get 128+ bits of entropy per block, that should be sufficient for most purposes (and overkill for many where security is not the primary concern).
From a security point of view, you are concerned about the degree of entropy at the block level, because otherwise finding a collision for a single block is enough to enable a malicious actor to substitute part of the content and get the same final hash.
It's perhaps worth noting that having a fixed block size means that the main weakness of MD5s is irrelevant - the hacker cannot append extra data to the block.
If your needs are about preventing naturally occurring hash collisions as opposed to malicious ones, then you can no doubt afford to use a much faster checksum function. Cryptographically secure hashes are typically designed to be slow to calculate.
A function from the skein function group using the optional hash tree mode might suit you. Then again, CRC32 might be all you need.