I have recently been tasked with improving the backup strategy of a legacy server supporting 150 users via a terminal-based interface. Currently the server has a single backup taken at 2am, and it needs to be unused during this time because of the nature of the application suite and languages involved. Each data file is discrete and there is no enforced referential integrity between data files, but a record can be spread across multiple data files. The application suite writes to each data file sequentially, so one file can be updated while another is not, creating an inconsistent record.
As such, we stand to lose a significant amount of work if the server were to fail in some fashion at the end of the working day, but before the backup is taken in the early morning.
As the server is running AIX 5.x, I have decided to implement JFS2 snapshots on the file systems that require backup, which means I can reduce the 'time off system' in the early morning backup to just that required to actually take the backup. This will be our 'guaranteed backup'.
However, I also wish to mitigate the risk of a full day's data loss by taking two 'non-guaranteed backups' during the day, without removing users from the system.
The justification here is that, if we were to encounter a full power loss on the server, significant portions of the data files would get corrupted. This occurred a month ago (the UPS blew the protected circuit, taking down the server - one of those things that is never supposed to happen). However, the act of taking a snapshot will not result in corrupted data files within the snapshot, just the potential for inconsistent records among those being worked on at the time. In other words, controllable, manageable corruption levels that can be checked for, provided everyone understands that they exist in the first place.
So, the question I need to ask is:
How well do JFS2 snapshots handle a complete power loss? In the incident we had last month, we lost approx 60% of our data through corruption, but how would a snapshot of that partition have fared? Would it also have suffered corruption, or would it have been OK?
For example, I have /mydata and I snapshot it to /mysnapshot at 6pm. At 7pm we encounter the 'worst case scenario' and /mydata is left significantly corrupt. Will the snapshot also be corrupt? How do AIX and JFS2 handle this in the background? Will the snapshot be usable?
I hasten to add that there are also tape and remote file copy backups being taken during the 2am window, so we are not relying on snapshots as the actual backup, just a means to an end of improving the backup. The extra snapshots during the day are a certain nicety rather than anything we would be reliant on.
In theory, as long as the snapshot has been committed to disk (and assuming that the part of the FS or LVM that manages snapshots isn't normally being written to), you should be fine.
But it sounds like your application uses fsync poorly and could be improved (although it would be slightly slower) with judicious use of proper POSIX file semantics.
See Stewart Smith's "Eat My Data" talk at linux.conf.au 2007: http://www.linux.org.au/conf/2007/talk/278.html
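As a rough sketch of what "proper POSIX file semantics" can look like, here is the usual write-to-temp, fsync, rename pattern. It is written in Python purely for illustration; your application suite will be in something else, and the function name `durable_write` is hypothetical. It assumes the OS and disk actually honour fsync, and that the platform allows fsync on a directory descriptor (I have not checked this on AIX). It also only protects a single file, so it does not by itself fix records that span several data files.

```python
import os
import tempfile


def durable_write(path, data):
    """Replace the file at `path` with `data` (bytes) so that a crash
    leaves either the old content or the new content, never a torn mix.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))

    # Create the temp file in the same directory so the rename below
    # stays within one file system and is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to push it to disk
        os.replace(tmp_path, path)   # atomic rename over the old file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise

    # fsync the directory so the rename itself is durable.
    # Assumption: the platform permits fsync on a directory descriptor.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The cost is the "slightly slower" part: each call waits for the disk, so you would want to apply it at the points where a record is logically complete rather than on every write.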