I've been sent an HDD of new and updated files from an organisation that we are working with, but we already have most of the files sitting on our servers, and we would like to update our local versions to match theirs.
Normally, this would be a job for something like rsync, but our problem is that the directory structure they provide is very poorly organised and we've had to rearrange their files in the past to work best with our systems.
So, my question is:
How can I find out which files in the set they have provided are new or different to the versions that we have, when the directory structures are different?
Once that question is answered, we can update the changed files, and work out where to put the new files on our system, probably somewhat manually.
Ok, here is my first attempt at something. It seems to work moderately well for what I need, but I am open to better suggestions:
First, get md5sums of all the files in both our filesystem and the new data:
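Something along these lines produces the two checksum lists (the paths and output file names here are placeholders, not our real locations):

    # Placeholder paths: our server tree and the mounted HDD
    find /data/ourserver -type f -exec md5sum {} + > ours.md5
    find /mnt/their_hdd -type f -exec md5sum {} + > theirs.md5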
And I wrote a short python script called md5diff.py:
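A minimal sketch of what such a script might look like, assuming it takes the two md5sum listings as arguments and flags any checksum in the second listing that doesn't appear in the first (the real script may differ):

    #!/usr/bin/env python
    # md5diff.py - sketch: compare two md5sum listings and flag files in the
    # second listing whose content does not appear anywhere in the first.
    import sys

    def load_checksums(path):
        """Read an md5sum listing into a dict of {checksum: filename}."""
        checksums = {}
        with open(path) as listing:
            for line in listing:
                line = line.rstrip("\n")
                if not line:
                    continue
                checksum, filename = line.split(None, 1)
                checksums[checksum] = filename
        return checksums

    def main():
        if len(sys.argv) != 3:
            sys.exit("usage: md5diff.py ours.md5 theirs.md5")
        ours = load_checksums(sys.argv[1])
        theirs = load_checksums(sys.argv[2])
        for checksum, filename in sorted(theirs.items(), key=lambda item: item[1]):
            if checksum in ours:
                print("%s matches %s" % (filename, ours[checksum]))
            else:
                print("%s NOT IN %s" % (filename, sys.argv[1]))

    if __name__ == "__main__":
        main()

Keying on the checksum rather than the path is what makes this work even though the two directory structures don't line up.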
So now I can use the script to compare the two checksum lists.
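Assuming the placeholder file names from the sketches above, the invocation looks something like:

    python md5diff.py ours.md5 theirs.md5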
And if I add in a
    | grep "NOT IN"
it will only list the files on their media that we don't already have (or that are different from what we have). From there I can start to manually attack the known differences.

You don't have to use MD5 to pick up modification-time changes. With that said, you could probably (barring a huge data set) copy the new and updated files to local storage, use a tool like fslint to identify duplicates, then use modification times (not just MD5 sums) to reconcile everything else.
One important question: how do you know that a file has been updated if its path isn't the same on the new storage? If file names aren't unique ("Sales Report August 2012.xls" could apply to many departments, for example), how do you know when you are updating an existing file versus overwriting an existing file with unrelated content?
I would err on the side of caution and keep everything, file paths included. You can identify identical files and create symlinks to the originals for a poor man's deduplication system, but in reality your storage system should handle that for you. The worst-case scenario is trashing user data just to save space.
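If you do go the symlink route, a rough sketch of that poor man's deduplication might look like the following; the root path and the rule that the first copy seen counts as the "original" are assumptions, so test on a copy of the data first:

    #!/usr/bin/env python
    # dedup_symlink.py - rough sketch: replace duplicate files under a root
    # directory with symlinks to the first copy seen with the same MD5 sum.
    import hashlib
    import os
    import sys

    def md5_of(path, chunk_size=1 << 20):
        """Return the MD5 hex digest of a file, reading it in chunks."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def dedup(root):
        seen = {}  # MD5 digest -> path of the first ("original") copy
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue  # already a link, nothing to do
                digest = md5_of(path)
                if digest in seen:
                    # Duplicate content: drop this copy, point it at the original
                    os.remove(path)
                    os.symlink(os.path.abspath(seen[digest]), path)
                else:
                    seen[digest] = path

    if __name__ == "__main__":
        dedup(sys.argv[1])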