EDIT: clarify context
I have several loosely synchronized filesystems on different machines (some content is redundant, some is not, and things get moved around by hand by the users). These are large scientific datasets (many tens of terabytes); they move across clusters depending on the kind of work we perform on them. They have no standard naming convention (files sometimes get renamed as the various experiments go on, or when subsets of files are selected or merged).
I'd like to find a tool that lets me efficiently find redundancy across remote filesystems, so that we can delete redundant data and copy non-redundant data when decommissioning storage bricks. (Side note: distributed filesystems like Ceph promise to handle these cases; that will be the future route, but for now we have to deal with the existing system as-is.)
Since many objects have been moved and renamed by hand, I cannot rely on their file names to compare with diff or rsync. I'd rather use a cryptographic checksum such as SHA-256 to identify my data files.
I don't want to checksum the whole dataset every time I run a comparison, either. The files, once created, are not likely to change often, so the checksums should be cached.
Is there an existing tool to do this? Maybe something that stores a checksum in a POSIX extended attribute (using the timestamp to check the checksum's freshness), plus a tool that can extract that information to efficiently diff the contents of the filesystems, without caring about the filenames?
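Something along these lines is what I have in mind; here is a rough Python sketch of the caching part (the attribute names user.checksum.sha256 and user.checksum.mtime are placeholders I made up, and it needs Linux's os.getxattr/os.setxattr plus a filesystem with user extended attributes enabled):

    import hashlib
    import os
    import sys

    HASH_ATTR = "user.checksum.sha256"   # placeholder attribute name
    MTIME_ATTR = "user.checksum.mtime"   # placeholder attribute name

    def cached_sha256(path):
        """Return the file's SHA-256, recomputing only if the mtime changed."""
        mtime = str(os.stat(path).st_mtime_ns)
        try:
            if os.getxattr(path, MTIME_ATTR).decode() == mtime:
                return os.getxattr(path, HASH_ATTR).decode()
        except OSError:
            pass  # attributes missing or stale: fall through and recompute

        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        digest = h.hexdigest()
        try:
            os.setxattr(path, HASH_ATTR, digest.encode())
            os.setxattr(path, MTIME_ATTR, mtime.encode())
        except OSError:
            pass  # read-only mount or xattrs unsupported: skip caching
        return digest

    if __name__ == "__main__":
        for p in sys.argv[1:]:
            print(f"{cached_sha256(p)}  {p}")

The comparison step would then only need to read the cached attributes on each side instead of re-hashing tens of terabytes.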
I'm unaware of filesystem-level checksumming; you could script (or hand-craft) something around md5sum and store the results in a text file for comparison, and there are ports of md5sum for multiple platforms.
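For example, here is a rough sketch (not a polished tool), assuming you have generated a manifest on each machine in md5sum/sha256sum output format, one "digest  path" line per file, and copied them to one place; it compares the two sides by hash alone and ignores the file names:

    import sys

    def load(manifest):
        """Map each digest to the list of paths that carry it."""
        hashes = {}
        with open(manifest) as f:
            for line in f:
                digest, path = line.rstrip("\n").split(None, 1)
                hashes.setdefault(digest, []).append(path)
        return hashes

    if __name__ == "__main__":
        local, remote = load(sys.argv[1]), load(sys.argv[2])
        for digest, paths in sorted(local.items()):
            status = "redundant" if digest in remote else "unique"
            for path in paths:
                print(status, path)

Anything flagged "unique" is data you would still need to copy before decommissioning the brick that holds it.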
If these are large files, you could consider setting up a system that lets users duplicate data using BitTorrent; it has a built-in way of checksumming data, and if you have several places that store the files, you gain the added benefit of not loading down one or two systems with transfers.
If you're managing the systems or the data, you might want to consider changing the policy on how data is duplicated and moved around; this would probably mean losing less hair if something goes wrong, and your users may thank you when something happens and "this time" the data wasn't backed up by Bob down the hall. You don't need anything too elaborate if you're working within an existing infrastructure; even a couple of servers running a periodic rsync over the network will keep the files synced (rsync is also relatively fast, since it transfers only the changed parts of large files when going over the network, though not when it thinks the destination is local).
I would caution that duplicating files like that and using checksums isn't technically a backup; it's a duplicate. A backup means that when your master file is corrupted you can "roll back" to a previous version (want to set up something similar to CVS to check out your large data files?...), while duplication, even with checksums, means that if your original is corrupted (accidental deletion, a bad sector on the drive, etc.) that corruption gets copied out, checksum and all, to your duplicates, rendering them useless. You'll want to plan for that scenario.
Since I did not find a tool that does what I want, I started rolling my own:
http://bitbucket.org/maugier/shatag
--EDIT--
After developing that tool I learned about git-annex, which is different from what I was aiming at, but is an ideal solution nonetheless.
Maybe you can use rsync with the --dry-run (-n) option. It will go through the motions of copying (without actually doing anything), and you will see the differences. There are plenty of options for filtering (timestamps, owner, and much more) to define exactly what you want.
Someone already mentioned "rsync".
If you can mount the second filesystem on the first machine, you can try running "diff -r /localfs /remotefs" to see the differences.
You could also try something like tripwire or AIDE to snapshot one tree and compare it to the other.
Depending on the size of the data set in question, you might consider using git or some other efficient version control program to take periodic "snapshots" (automatic, unattended adds and commits) for tracking changes. You can even sync specific changes from one machine to the other using this method if you set it up correctly.
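As a rough sketch of the unattended part (the repository path below is a placeholder; it assumes the dataset directory is already a git repository and that the script runs from cron or a systemd timer):

    import subprocess
    from datetime import datetime, timezone

    DATASET = "/data/experiments"   # placeholder path

    def snapshot(repo):
        """Stage everything and commit only if the tree actually changed."""
        subprocess.run(["git", "-C", repo, "add", "-A"], check=True)
        # `git diff --cached --quiet` exits non-zero when the index differs from HEAD.
        changed = subprocess.run(["git", "-C", repo, "diff", "--cached", "--quiet"]).returncode
        if changed:
            msg = "automatic snapshot " + datetime.now(timezone.utc).isoformat()
            subprocess.run(["git", "-C", repo, "commit", "-m", msg], check=True)

    if __name__ == "__main__":
        snapshot(DATASET)

Keep in mind that plain git struggles with multi-terabyte binaries, so as noted this really depends on the size of the dataset.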
For deduplication, the "fdupes" program works well.