Some of my coworkers were surprised when I told them that I can back up an SQL Server database while it's still running and wondered how that's possible. I know that SQL Server is capable of backing up a database while it is still online but I'm not sure how to explain why it's possible. My question is what effect does this have on the database?
If data is modified (by an insert, update, or delete) while the backup is running, will the backup contain those changes or will it be added to the database afterwards?
I'm assuming that the log file plays an important role here but I'm not quite sure how.
edit: Just as a note, my case involves backing up the databases using SQL Server Agent and the effects of database modifications during this process.
Full backup contains both the data and log. For data, it simply copies each page of the database into the backup, as is at the moment it reads the page. It then appends into the backup media all the 'relevant' log. This includes, at the very least, all the log between the LSN at the start of the backup operation and the LSN at the end of the backup operation. In reality there is more log usually, as it has to include all active transactions at the start of backup and log needed by replication. See Debunking a couple of myths around full database backups.
When the database is restored, all the data pages are copied out into the database files, then all the log pages are copied out into the log file(s). The database is inconsistent at this moment, since it contains data page images that may be out of sync with one another. But now a normal recovery is run. Since the log contains all the log during the backup, at the end of the recovery the database is consistent.
You can't just copy it over since there can be alterations to the database mid-copy as you alluded to in the question.
It has to be done with agents that are aware of the database functionality and then take a "snapshot" via OS functions or can use a utility to dump the database in a safe state (like mysqldump, if using mysql).
Otherwise you get a backup that can be corrupted and you won't know it until you restore it. I think Joel and Jeff recently talked about it a little on a recent StackOverflow podcast.
And you're right in that the log file is important. If the journal/log file is out of sync with the actual data, restoring the files will result in corruption.
It boils down to a backup taken using a safe state of the database, either through a database-aware agent or snapshot application or application that is aware of how to properly hook the database into dropping data without interfering with updates during the data dump then backing up the resulting file.
During backup, snapshot will be created for the database and the data will be read for backup from that snapshot. The actual live DB operations won't affect the backup operation.
There are so many ways to do this (generally speaking, no idea how MSSQL normally does it) like simply dumping the database to file while appending any changes to a log file which is committed after the dump is completed - to utilizing file system specific snapshot features like VSS on Windows.
Starting from Remus Rusanu's answer, I will make clear and complete some informations..
" First before starting the backup process, a mark is done into the transaction log file to indicate that a backup will begin. Full backup contains both the data and log. For data, it simply copies all the pages of the database that have "records" inside, in any order, into the backup, as is at the moment it reads the page, and can use a parallel processing (a fuzzy logic algorithm is used for ordering the pages, taken from the memory when it is in cache or from the disk). It then appends into the backup media all the 'relevant' log, from the beginning mark, to the last moment of page copying. This includes, at the very least, all the log between the LSN at the start of the backup operation and the LSN at the end of the backup operation. In reality there is more log usually, as it has to include all active transactions at the start of backup and log needed by replication. See Debunking a couple of myths around full database backups. Last, a new mark is done inside the transaction log to indicate the end of the backup process.
When the database need to be restored, first the process read the very first pages of the backup that contains the physical database organization (files and filegroups ak "storages") and create a physical similar organisation of empty files. Then all the data pages are copied out into the database files, then all the log pages are copied out into the log file(s). The database is inconsistent at this moment, since it contains data page images that may be out of sync with one another. But now a normal recovery is run. Since the log contains all the log during the backup, at the end of the recovery the database is consistent.
You can take whats known as a copy-only backup. Wont affect the database while its online