I work in the research group of a large company. We do a lot of work on a grid processing system with many nodes (more than 200; I'm not sure exactly how many) and a large number of hard drives, holding more than 1,000 TB of data.
Most of this data can be reproduced, but that takes time. A lot of the data is code stored in separate RCS repos, which can have their own backups, but the working copies are, of course, on the normal user drives.
Can someone point me at a best-practices document, or describe how most companies go about protecting this much data?
Thanks
There's a lot that goes into designing an effective backup system for your business needs. You might snapshot the data to other disks and then mirror it off-site (if you have another site), send the snapshots to tape, or write to tape directly from your nodes. There may also be consistency issues if data is backed up at different times - perhaps your application needs to export or quiesce first? We don't know; you haven't told us. There are a lot of technical questions and issues here.
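To make the first option a bit more concrete, here's a minimal sketch (not a production tool) of the snapshot-then-mirror idea, using rsync hard-link snapshots driven from Python. The paths, the off-site host, and the retention count are all placeholders you'd replace with your own, and you'd still need scheduling, monitoring, and restore testing around it:

```python
#!/usr/bin/env python3
"""Sketch: rotate hard-link snapshots locally, then mirror the newest
one to a second site. All paths/hosts below are placeholders."""

import datetime
import subprocess
from pathlib import Path

DATA_DIR = Path("/grid/data")              # hypothetical source volume
SNAP_ROOT = Path("/backup/snapshots")      # hypothetical snapshot area
OFFSITE = "backup@dr-site:/backup/mirror"  # hypothetical off-site target
KEEP = 14                                  # snapshots to retain locally

def take_snapshot() -> Path:
    """Create a new snapshot, hard-linking unchanged files against the
    previous snapshot so only changed data consumes extra space."""
    snaps = sorted(p for p in SNAP_ROOT.iterdir() if p.is_dir())
    new = SNAP_ROOT / datetime.datetime.now().strftime("%Y-%m-%d_%H%M")
    cmd = ["rsync", "-a", "--delete"]
    if snaps:
        cmd.append(f"--link-dest={snaps[-1]}")
    cmd += [f"{DATA_DIR}/", str(new)]
    subprocess.run(cmd, check=True)
    return new

def mirror_offsite(snapshot: Path) -> None:
    """Push the latest snapshot to the second site over ssh."""
    subprocess.run(["rsync", "-a", "--delete", f"{snapshot}/", OFFSITE],
                   check=True)

def prune() -> None:
    """Drop the oldest snapshots beyond the retention window."""
    snaps = sorted(p for p in SNAP_ROOT.iterdir() if p.is_dir())
    for old in snaps[:-KEEP]:
        subprocess.run(["rm", "-rf", str(old)], check=True)

if __name__ == "__main__":
    snap = take_snapshot()
    mirror_offsite(snap)
    prune()
```

At 1 PB you'd more likely lean on filesystem- or array-level snapshots and a proper backup product rather than rsync, but the structure (local snapshot, off-site copy, retention policy) is the same.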
And the first thing that needs to be addressed is your actual business needs: what are your RTO (how long you can be down until your data is restored) and RPO (how much data you can afford to lose between backup runs)? Does this need to be part of a disaster recovery or business continuity plan, or if the building burns down, do you just not care about the data anymore?