I'm looking for a way to efficiently manage and leverage file-level checksums for all files in a filesystem over time.
Goals:
- Configurable, fast refresh: only re-checksum large files when other criteria indicate a likely change (file size, timestamp, first and last blocks changed, etc.). I say "configurable" because in some use cases timestamps can't be trusted as a change indicator. (A sketch of the kind of refresh I mean follows this list.)
- Fast query for a specific checksum across the whole filesystem (in other words, answering the question "Do I already have this file?")
- A way to compare the data across filesystems (either natively within the solution, or as a machine-readable export so that the comparison can be scripted)
- Support for multiple hash algorithms
- Duplicate-file reporting (I don't expect the solution to walk me through an interactive deduplication session; machine-readable report output would be fine)
- Nice-to-have: a way to optionally (re)generate traditional checksum files in each directory ("CHECKSUM", "MD5SUM", or similar) so that subdirectories exposed via FTP or the web can consume the checksums easily
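To make the refresh behaviour concrete, here is a minimal sketch of the kind of thing I mean, in Python against a SQLite cache. The schema, cache location, and the size/mtime "likely changed" heuristic are just assumptions for illustration, not any existing tool's format:

```python
#!/usr/bin/env python3
# Sketch: incrementally refresh a SQLite-backed checksum cache.
# Schema, cache location, and the size/mtime heuristic are assumptions.
import hashlib
import os
import sqlite3
import sys

DB = os.path.expanduser("~/.hashcache.sqlite")   # hypothetical cache location
ALGOS = ("md5", "sha256")                        # multiple hashes per file

def open_db():
    db = sqlite3.connect(DB)
    db.execute("""CREATE TABLE IF NOT EXISTS files (
                    path TEXT PRIMARY KEY,
                    size INTEGER, mtime REAL,
                    md5 TEXT, sha256 TEXT)""")
    return db

def hash_file(path):
    # Read the file once and feed every configured hash in the same pass.
    hashers = {a: hashlib.new(a) for a in ALGOS}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            for h in hashers.values():
                h.update(chunk)
    return {a: h.hexdigest() for a, h in hashers.items()}

def refresh(root, trust_mtime=True):
    db = open_db()
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue
            st = os.stat(path)
            row = db.execute("SELECT size, mtime FROM files WHERE path = ?",
                             (path,)).fetchone()
            # Skip re-hashing when size and mtime match the cached entry,
            # but only if this use case allows trusting timestamps.
            if row and trust_mtime and row == (st.st_size, st.st_mtime):
                continue
            digests = hash_file(path)
            db.execute("""INSERT OR REPLACE INTO files
                          (path, size, mtime, md5, sha256)
                          VALUES (?, ?, ?, ?, ?)""",
                       (path, st.st_size, st.st_mtime,
                        digests["md5"], digests["sha256"]))
    db.commit()

if __name__ == "__main__":
    refresh(sys.argv[1] if len(sys.argv) > 1 else ".")
```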
The key idea is for the hashes to be cached in a way that makes them both quick to update and quick to query.
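On the query side, the same assumed schema would make the lookups I care about trivial. A sketch of the kinds of queries I want (checksum lookup, duplicate report, TSV export for cross-filesystem comparison, and per-directory MD5SUM regeneration), with all names again purely illustrative:

```python
#!/usr/bin/env python3
# Sketch: the query side, against the same assumed cache as the refresh
# sketch above. Table/column names and output formats are illustrative.
import os
import sqlite3
import sys

DB = os.path.expanduser("~/.hashcache.sqlite")   # hypothetical cache location
db = sqlite3.connect(DB)

def have_file(sha256):
    """'Do I already have this file?': look up paths by checksum."""
    return [p for (p,) in db.execute(
        "SELECT path FROM files WHERE sha256 = ?", (sha256,))]

def duplicate_report(out=sys.stdout):
    """Machine-readable duplicate report: one line per duplicated hash."""
    rows = db.execute("""SELECT sha256, group_concat(path, char(10))
                         FROM files GROUP BY sha256 HAVING count(*) > 1""")
    for digest, paths in rows:
        out.write("\t".join([digest] + paths.split("\n")) + "\n")

def export_tsv(out=sys.stdout):
    """TSV export so the data from two filesystems can be diffed or joined."""
    for row in db.execute("SELECT path, size, md5, sha256 FROM files ORDER BY path"):
        out.write("\t".join(str(c) for c in row) + "\n")

def write_md5sum_files(root):
    """Regenerate a traditional MD5SUM file in each directory under root."""
    rows = db.execute("SELECT path, md5 FROM files WHERE path LIKE ? ORDER BY path",
                      (os.path.join(root, "%"),))
    per_dir = {}
    for path, md5 in rows:
        per_dir.setdefault(os.path.dirname(path), []).append(
            f"{md5}  {os.path.basename(path)}\n")
    for dirpath, lines in per_dir.items():
        with open(os.path.join(dirpath, "MD5SUM"), "w") as f:
            f.writelines(lines)

if __name__ == "__main__":
    if len(sys.argv) > 1:
        print("\n".join(have_file(sys.argv[1])))
    else:
        duplicate_report()
```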