I want to develop an automated process for checking that every machine on our domain is getting backed up. I'm wondering how other people do it.
We have a SAN (virtualized Windows and Linux servers and three SQL Servers) and a couple of NAS devices in our data center, a couple dozen physical DCs in the field (Win Server 2016), and a few hundred workstations (soon to be all Win 10). We keep Veeam snapshots for a month locally, then push them to AWS after that.
Recently we needed to restore an Excel file that was used to update a table on one of our SQL Servers. We failed. The share on the NAS where the file lived was not being backed up. When we created the backup process, the share was barely used and I'm sure we chose not to back it up on purpose. But as that share gradually came to hold more important data, we never revisited the decision.
Next we tried to restore the data from the SQL Server itself. That server was added within the last year and, while it was being backed up locally, we never added it to the job that pushes backups to AWS, so we only had one month of history.
We should have backed up that share from the beginning, important or not. And we should have been pushing the new SQL backups to AWS. My takeaway from all this is that there are too many places for human error in our process.
One idea we had was to get every machine from Active Directory, select a "random" file from each drive/share (excluding system files and executables), and see if we can find it in our backups. We could automate the selection process with PowerShell, something like the sketch below. I'm not sure about automating the check against our backups, but hopefully there's a way. Even if we had to check a few hundred files manually, it would be better than nothing.
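Here is roughly what I have in mind for the selection side. It is only a sketch: the `C$\Users` path, the extension filter, and the CSV output are assumptions, and feeding the list into the backup software's restore/verify tooling is still the open question.

```powershell
# Rough sketch: pull computer names from AD, then grab one random data
# file from each machine's C$\Users tree as a spot-check candidate.
# The path, extension filter, and output file are placeholders.
Import-Module ActiveDirectory

$excluded = '.exe', '.dll', '.sys', '.msi'

$samples = foreach ($computer in (Get-ADComputer -Filter *)) {
    $root = "\\$($computer.Name)\C`$\Users"
    if (-not (Test-Path $root)) { continue }

    Get-ChildItem -Path $root -Recurse -Depth 3 -File -ErrorAction SilentlyContinue |
        Where-Object { $excluded -notcontains $_.Extension } |
        Get-Random -Count 1 |
        Select-Object @{n='Computer';e={$computer.Name}}, FullName, Length, LastWriteTime
}

# The spot-check list; verifying each entry against the backups is the
# part I haven't figured out how to automate yet.
$samples | Export-Csv -Path .\backup-spot-check.csv -NoTypeInformation
```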
Is there a best practice for backup completeness? Is there something better than the humans-being-careful method?
Identifying process and procedure failures like you are doing is important. Constructive criticism enables improvement.
Choose a backup strategy for every storage volume, even if that strategy is "no backup". Communicate to users what is permanent and what is temporary. Backing up everything is not required if the durability of every storage location is known.
Also have a process for reviewing backup coverage as business processes change. Whenever you hear about important projects, ask the questions "Where did you save that?" and "If the file was gone, what problems would that cause?"
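One way to keep that review from depending on memory is to hold a small manifest of each share's declared strategy and diff it against what actually exists on the servers. A rough sketch, assuming a hand-maintained `backup-policy.csv` with `Server`, `Share`, and `Policy` columns (the file and its columns are made up, not anything your backup product provides):

```powershell
# Sketch: flag any non-admin share that has no declared backup policy.
# backup-policy.csv (Server,Share,Policy) is a hand-maintained manifest;
# Policy can legitimately be "none" as long as it is written down.
Import-Module ActiveDirectory

$policy  = Import-Csv -Path .\backup-policy.csv
$servers = Get-ADComputer -Filter 'OperatingSystem -like "*Server*"' -Properties OperatingSystem

foreach ($server in $servers) {
    $shares = Get-SmbShare -CimSession $server.Name -ErrorAction SilentlyContinue |
        Where-Object { -not $_.Special }    # skip C$, ADMIN$, IPC$

    foreach ($share in $shares) {
        $declared = $policy | Where-Object {
            $_.Server -eq $server.Name -and $_.Share -eq $share.Name
        }
        if (-not $declared) {
            Write-Warning "$($server.Name)\$($share.Name) has no declared backup policy"
        }
    }
}
```

Anything the script flags is exactly the conversation above: where did you save it, and what breaks if it disappears.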
Backups are useless. Restores are what you care about.
Make restores a mandatory part of testing and business continuity planning.
Verification will be somewhat manual, because you want to confirm the restores produce something humans can actually use. But if users actually work with the restored system, they will definitely discover their important spreadsheet is missing.
Feel free to add automated integrity checks like file checksum verification and DBMS verify procedures. But verifying that data is suitable for use is difficult. You may have a completely valid file, but it is a month old and the organization cannot use it. Or a volume was intentionally not backed up, but users put important stuff on it anyway.
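For the checksum part, one workable pattern is to mount a test restore and hash it against the live copy. This is only a sketch; both paths are placeholders, and the SQL comment assumes the SqlServer module's Invoke-Sqlcmd is available:

```powershell
# Sketch: hash a live share and a test-restored copy of it, and warn on
# anything missing or different. Both paths are placeholders.
param(
    [string]$LivePath    = '\\nas01\projects',
    [string]$RestorePath = 'D:\restore-test\projects'
)

$liveHashes = Get-ChildItem -Path $LivePath -Recurse -File |
    Get-FileHash -Algorithm SHA256

foreach ($live in $liveHashes) {
    $relative     = $live.Path.Substring($LivePath.Length)
    $restoredPath = Join-Path -Path $RestorePath -ChildPath $relative

    if (-not (Test-Path $restoredPath)) {
        Write-Warning "Missing from restore: $relative"
        continue
    }

    if ((Get-FileHash -Path $restoredPath -Algorithm SHA256).Hash -ne $live.Hash) {
        Write-Warning "Differs from live copy (changed since backup, or corrupt): $relative"
    }
}

# For the SQL Servers, the equivalent automated check after a test restore
# would be something like:
#   Invoke-Sqlcmd -ServerInstance 'sql-restore-test' -Query 'DBCC CHECKDB (MyDb) WITH NO_INFOMSGS'
```

A clean hash comparison still does not tell you the data is useful, which is why the test restores above still need humans looking at the results.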