I've been sent an HDD of new and updated files from an organisation that we are working with, but we already have most of the files sitting on our servers, and we would like to update our local versions to match theirs.
Normally, this would be a job for something like rsync, but our problem is that the directory structure they provide is very poorly organised and we've had to rearrange their files in the past to work best with our systems.
So, my question is:
How can I find out which files in the set they have provided are new or different to the versions that we have, when the directory structures are different?
Once that question is answered, we can update the changed files, and work out where to put the new files on our system, probably somewhat manually.
Ok, here is my first attempt at something. It seems to work moderately well for what I need, but I am open to better suggestions:
First, get md5sums of all the files in both our filesystem and the new data:
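Something along these lines produces the two checksum lists (the paths and output file names here are placeholders, not our real locations):

    # Placeholder paths: our server tree and the mounted HDD
    find /data/ourserver -type f -exec md5sum {} + > ours.md5
    find /mnt/their_hdd -type f -exec md5sum {} + > theirs.md5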
And I wrote a short python script called md5diff.py:
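A minimal sketch of what such a script might look like, assuming it takes the two md5sum listings as arguments and flags any checksum in the second listing that doesn't appear in the first (the real script may differ):

    #!/usr/bin/env python
    # md5diff.py - sketch: compare two md5sum listings and flag files in the
    # second listing whose content does not appear anywhere in the first.
    import sys

    def load_checksums(path):
        """Read an md5sum listing into a dict of {checksum: filename}."""
        checksums = {}
        with open(path) as listing:
            for line in listing:
                line = line.rstrip("\n")
                if not line:
                    continue
                checksum, filename = line.split(None, 1)
                checksums[checksum] = filename
        return checksums

    def main():
        if len(sys.argv) != 3:
            sys.exit("usage: md5diff.py ours.md5 theirs.md5")
        ours = load_checksums(sys.argv[1])
        theirs = load_checksums(sys.argv[2])
        for checksum, filename in sorted(theirs.items(), key=lambda item: item[1]):
            if checksum in ours:
                print("%s matches %s" % (filename, ours[checksum]))
            else:
                print("%s NOT IN %s" % (filename, sys.argv[1]))

    if __name__ == "__main__":
        main()

Keying on the checksum rather than the path is what makes this work even though the two directory structures don't line up.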
So now I can use the script to compare the two checksum lists.
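Assuming the placeholder file names from the sketches above, the invocation looks something like:

    python md5diff.py ours.md5 theirs.md5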
And if I add in a
    | grep "NOT IN"
it will only list the files on their media that we don't already have (or that are different from what we have). From there I can start to manually attack the known differences.

You don't have to use MD5 to pick up modification-time changes. With that said, you could probably (barring a huge data set) copy the new and updated files to local storage, use a tool like fslint to identify duplicates, then use modification times (not just MD5 sums) to reconcile everything else.
One important question: how do you know that a file has been updated if its path isn't the same on the new storage? If file names aren't unique ("Sales Report August 2012.xls" could apply to many departments, for example), how do you know when you are updating an existing file versus overwriting an existing file with unrelated content?
I would err on the side of caution and keep everything, file paths included. You can identify identical files and create symlinks to the originals for a poor man's deduplication system, but in reality your storage system should handle that for you. The worst-case scenario is trashing user data just to save space.
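If you do go the symlink route, a rough sketch of that poor man's deduplication might look like the following; the root path and the rule that the first copy seen counts as the "original" are assumptions, so test on a copy of the data first:

    #!/usr/bin/env python
    # dedup_symlink.py - rough sketch: replace duplicate files under a root
    # directory with symlinks to the first copy seen with the same MD5 sum.
    import hashlib
    import os
    import sys

    def md5_of(path, chunk_size=1 << 20):
        """Return the MD5 hex digest of a file, reading it in chunks."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def dedup(root):
        seen = {}  # MD5 digest -> path of the first ("original") copy
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue  # already a link, nothing to do
                digest = md5_of(path)
                if digest in seen:
                    # Duplicate content: drop this copy, point it at the original
                    os.remove(path)
                    os.symlink(os.path.abspath(seen[digest]), path)
                else:
                    seen[digest] = path

    if __name__ == "__main__":
        dedup(sys.argv[1])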