I need to automate moving fairly large amounts of data (roughly 1 TB/day or more on average; 60 TB so far, and counting) to a separate, remote storage system.
So far I have done this with custom Python code, because of some special challenges in packing the data into suitable chunks for network transfer, and because of the protocols available (GridFTP in our case).
Is there any material (articles, blog posts, or books, if they are really good) that provides best practices and/or hints on how to write data management scripts in a robust and fault-tolerant way?
UPDATE/CLARIFICATION: I should add that we have more or less solved the file uploading problem (by comparing remote and local checksums). We need to handle a lot more than this though, since we are providing a "Dropbox"-like solution for all the projects at the HPC Center, so we basically manage everything from this dropbox interface until a user wants a particular file back from storage.
So, we need to handle projects, archiving events, and metadata for all files, including creating checksums and the like, all in a fully automated manner, where one archiving event can consist of 5 TB spread across thousands of medium to small files.
This calls for writing our automation software in a robust and fault-tolerant manner, since we cannot really manually keep track of every single file that is stored.
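To make the bookkeeping concrete, here is a rough sketch of the kind of per-event manifest I have in mind: one record per file with its relative path, size, and SHA-256 checksum, written atomically so a crash never leaves a half-written manifest behind. The function names (`build_manifest`, `write_manifest`) and the JSON layout are only assumptions for illustration, not what we actually run.

```python
import hashlib
import json
import os
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Collect path, size and checksum for every regular file under root."""
    root = Path(root)
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entries.append({
            "path": path.relative_to(root).as_posix(),
            "size": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return entries


def write_manifest(entries, manifest_path):
    """Write the manifest to a temporary file, then rename it into place."""
    manifest_path = Path(manifest_path)
    tmp_path = manifest_path.with_name(manifest_path.name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(entries, handle, indent=2)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes reach the disk
    os.replace(tmp_path, manifest_path)  # replaces the target in one step
```

The rename at the end means a reader sees either the old manifest or the complete new one, which is the property that matters when nobody is watching the job.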
Thus I'm looking more for general hints/best practices for writing software (and in particular for data handling/doing system calls), rather than just how to verify uploads.
Things I'd like to see best practices and hints for include exception handling, logging, verification, and automated tests, and how to combine all of these into a solution that you can rely on without continually watching it (you should get a mail when something needs your attention, etc.).
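As an illustration of how exception handling, logging, and notification might fit together, here is a minimal sketch: a step is retried with exponential backoff, every failure is logged, and a notification callback fires only when all attempts are exhausted. `TransferError`, `run_with_retries`, and the `notify` callback are hypothetical names; in practice `notify` could send a mail, for example through the standard library's `smtplib` or `logging.handlers.SMTPHandler`.

```python
import logging
import time

log = logging.getLogger("archiver")


class TransferError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def run_with_retries(step, attempts=3, base_delay=1.0, notify=None,
                     sleep=time.sleep):
    """Call step(); retry on TransferError with exponential backoff.

    notify is an optional callable taking one message string. It is called
    once, only after the final attempt has failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except TransferError as exc:
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                log.error("giving up after %d attempts", attempts)
                if notify is not None:
                    notify(f"transfer failed after {attempts} attempts: {exc}")
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Only the error type that is known to be transient is caught here; anything else propagates immediately, so a programming error is not silently retried. Passing `sleep` as a parameter also makes the backoff easy to check in an automated test without actually waiting.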
I would use a classic checksum method, with tar archives.
It is easy enough to script, protocol-independent, and definitely the most reliable method.
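A minimal sketch of that approach in Python, assuming SHA-256 as the checksum and a sidecar `.sha256` file next to the archive (both are my choices for illustration, and `pack_and_checksum` and `verify` are made-up names):

```python
import hashlib
import tarfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pack_and_checksum(source_dir, archive_path):
    """Pack source_dir into an uncompressed tar and write a sidecar checksum."""
    source_dir = Path(source_dir)
    archive_path = Path(archive_path)
    # The archive must live outside source_dir, or it would try to add itself.
    with tarfile.open(archive_path, "w") as archive:
        archive.add(source_dir, arcname=source_dir.name)
    checksum = sha256_of(archive_path)
    sidecar = archive_path.with_name(archive_path.name + ".sha256")
    sidecar.write_text(f"{checksum}  {archive_path.name}\n")
    return checksum


def verify(archive_path):
    """Recompute the archive checksum and compare it with the sidecar file."""
    archive_path = Path(archive_path)
    sidecar = archive_path.with_name(archive_path.name + ".sha256")
    expected = sidecar.read_text().split()[0]
    return sha256_of(archive_path) == expected
```

Run `pack_and_checksum` before the transfer, copy both files with whatever protocol you have, and run the equivalent of `verify` on the remote side. Thousands of small files become one large object, which also tends to suit bulk transfer tools better.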
For the record, I think I have now found at least something myself (but I am still looking for more, if there is any!):