I want to develop an automated process for checking that every machine on our domain is getting backed up. I'm wondering how other people do it.
We have a SAN (virtualized Windows and Linux servers and three SQL Servers) and a couple of NAS devices in our data center, a couple dozen physical DCs in the field (Windows Server 2016), and a few hundred workstations (soon to be all Windows 10). We keep Veeam snapshots for a month locally, then push them to AWS after that.
Recently we needed to restore an Excel file that was used to update a table on one of our SQL Servers. We failed. The share on the NAS where the file lived was not being backed up. When we created the backup process, the share was barely used, and I'm sure we deliberately chose not to back it up. But as we gradually started using that share for more important data, we never updated the process.
Next we tried to restore the data from the SQL Server itself. That server was added within the last year and, while it was being backed up locally, we missed the step of pushing its backups to AWS, so we only had backups going back one month.
We should have backed up that share from the beginning, important or not. And we should have been pushing the new SQL backups to AWS. My takeaway from all this is that there are too many places for human error in our process.
One idea we had was to get every machine from Active Directory, select a "random" file from each drive/share (excluding system files and executables), and see if we can find it in our backups. We could automate the selection with PowerShell. I'm not sure how to automate checking our backups for those files, but hopefully there's a way. Even if we had to check a few hundred files manually, it would be better than nothing.
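Here's a rough sketch of what I had in mind for the selection half, assuming the ActiveDirectory RSAT module, admin rights on the targets, and reachable C$ admin shares; the exclusion lists and output path are just placeholders we'd adjust:

    # Sketch only: pull enabled computers from AD, then sample one "random"
    # non-system, non-executable file from each machine's C$ admin share.
    Import-Module ActiveDirectory

    $excludeExt  = '\.(exe|dll|sys|msi|tmp)$'
    $excludePath = '\\(Windows|Program Files|Program Files \(x86\)|ProgramData)\\'

    $samples = foreach ($computer in (Get-ADComputer -Filter 'Enabled -eq $true')) {
        $root = "\\$($computer.Name)\C$"
        if (-not (Test-Path $root)) { continue }   # skip unreachable machines

        $file = Get-ChildItem -Path $root -Recurse -File -ErrorAction SilentlyContinue |
            Where-Object {
                $_.Extension -notmatch $excludeExt -and
                $_.FullName  -notmatch $excludePath
            } |
            Get-Random -Count 1                    # one sample file per machine

        if ($file) {
            [PSCustomObject]@{
                Computer   = $computer.Name
                SampleFile = $file.FullName
                LastWrite  = $file.LastWriteTime
            }
        }
    }

    # Export the sample list so it can be checked against the backup catalog
    $samples | Export-Csv -Path .\backup-sample.csv -NoTypeInformation

The checking half is the part I'm less sure about. Veeam does ship a PowerShell module (cmdlets along the lines of Get-VBRBackup / Get-VBRRestorePoint, though I'd have to verify the names against our version), but I don't know how cleanly we could map a sampled file path to a restore point, and the NAS shares would need a separate pass anyway.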
Is there a best practice for verifying backup completeness? Is there something better than the "humans being careful" method?