UMBC CMSC 461 Spring '99

Lecture 25
Recovery System

Chapter 15, Database Systems Concepts, by Silberschatz, et al, 1997

Portions reproduced with permission

A computer system, like any other mechanical or electrical system, is subject to failure. There are a variety of causes, including disk crash, power failure, software errors, a fire in the machine room, or even sabotage. Whatever the cause, information may be lost. The database system must take actions in advance to ensure that the atomicity and durability properties of transactions are preserved. An integral part of a database system is a recovery scheme that is responsible for restoring the database to a consistent state that existed prior to the occurrence of the failure.

Failure Classification

The major types of failures involving data integrity (as opposed to data security) are:

Transaction Failure:

Logical error. The transaction cannot continue with its normal execution because of such things as bad input, data not found, or resource limit exceeded.

System error. The system has entered an undesirable state (for example, deadlock), as a result of which a transaction cannot continue with its normal execution. The transaction, however, can be reexecuted at a later time.

System Crash. There is a hardware malfunction, or a bug in the database software or the operating system, that causes the loss of the content of volatile storage, and brings transaction processing to a halt. The content of the nonvolatile storage remains intact, and is not corrupted.

Disk Failure. A disk block loses its contents as a result of either a head crash or failure during a data transfer. Copies of data on other disks, or archival backups on tertiary media, such as tapes, are used to recover from the failure.

Storage Structure

Volatile storage. Information here does not usually survive system crashes.

Nonvolatile storage. This information normally does survive system crashes, but can be lost (in a head crash, etc).

Stable storage. A system designed not to lose data.

Stable Storage Implementation

To implement stable storage, we need to replicate the needed information on several nonvolatile media with independent failure modes and to update the information in a controlled manner to ensure that failure during data transfer does not damage the needed information.

RAID systems guarantee that the failure of a single disk will not result in the loss of data. The simplest and fastest form of RAID is the mirrored disk, which keeps two copies of each block on separate disks. RAID systems cannot, however, guard against the failure of an entire site. The most secure systems keep a copy of each block of stable storage at a remote site, writing it out over a computer network, in addition to storing it on a local disk system.

Transfer of data between memory and disk storage can result in:

Successful completion. The transferred information arrived safely at its destination.

Partial failure. A failure occurred in the midst of a transfer and the destination block has incorrect information.

Total failure. The failure occurred sufficiently early during the transfer that the destination block remains intact.

If a data transfer failure occurs, the system must detect it and invoke a recovery procedure to restore the block to a consistent state. To do so, the system must maintain two physical blocks for each logical database block.

A transfer of a block of data would be:

Write the information onto the first physical block.
Write the same information onto the second physical block.

During recovery, each pair of physical blocks is examined. If both of them are the same and no detectable error exists, then no further actions are necessary.

If one block contains a detectable error, then we replace its content with the contents of the other block.

When the blocks contain different data without a detectable error in either, then the contents of the second block are written to the first block.

This recovery procedure ensures that a write to stable storage either succeeds completely (both copies are updated) or results in no change.
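The two-block write and the three recovery cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names make_block, stable_write and recover are hypothetical, a CRC32 checksum stands in for whatever mechanism the disk uses to detect a bad block, and the crash_after parameter merely simulates a failure between the two writes.

```python
import zlib

def make_block(data: bytes) -> dict:
    """A physical block: payload plus a checksum used to detect a bad write."""
    return {"data": data, "crc": zlib.crc32(data)}

def is_valid(block: dict) -> bool:
    """True if the block shows no detectable error."""
    return zlib.crc32(block["data"]) == block["crc"]

def stable_write(pair: list, data: bytes, crash_after: int = 2) -> None:
    """Write the first block, then the second.
    crash_after simulates a failure: 0 = before any write, 1 = after the first."""
    if crash_after >= 1:
        pair[0] = make_block(data)
    if crash_after >= 2:
        pair[1] = make_block(data)

def recover(pair: list) -> None:
    """Examine a pair of physical blocks and make them consistent again."""
    first_ok, second_ok = is_valid(pair[0]), is_valid(pair[1])
    if first_ok and not second_ok:
        pair[1] = dict(pair[0])      # repair the damaged second block
    elif second_ok and not first_ok:
        pair[0] = dict(pair[1])      # repair the damaged first block
    elif first_ok and second_ok and pair[0]["data"] != pair[1]["data"]:
        pair[0] = dict(pair[1])      # both readable but different: second wins
```

With this rule, a crash after only the first write leaves both blocks holding the old value once recover has run, so the interrupted write results in no change.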

Log-Based Recovery

The most widely used structure for recording database modifications is the log. The log is a sequence of log records and maintains a history of all update activities in the database. There are several types of log records.

An update log record describes a single database write:

Transaction identifier.
Data-item identifier.
Old value.
New value.

Whenever a transaction performs a write, it is essential that the log record for that write be created before the database is modified. Once a log record exists, we can output the modification to the database whenever that is desirable. We also have the ability to undo a modification that has already been output to the database, by using the old-value field in the log record.

For log records to be useful for recovery from system and disk failures, the log must reside on stable storage. However, since the log contains a complete record of all database activity, the volume of data stored in the log may become unreasonably large.
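As a rough illustration of the update log record and the rule that the record is created before the database is modified, consider the following Python sketch. The names UpdateRecord, logged_write and undo are hypothetical, and plain in-memory objects stand in for stable storage and the database.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UpdateRecord:
    txn_id: str      # transaction identifier
    item_id: str     # data-item identifier
    old_value: int   # value before the write
    new_value: int   # value after the write

def logged_write(log: list, database: dict, txn_id: str, item_id: str, new_value: int) -> None:
    """Append the log record first, then modify the database."""
    log.append(UpdateRecord(txn_id, item_id, database[item_id], new_value))
    database[item_id] = new_value

def undo(log: list, database: dict, txn_id: str) -> None:
    """Restore old values for one transaction, newest record first."""
    for rec in reversed(log):
        if rec.txn_id == txn_id:
            database[rec.item_id] = rec.old_value
```

Scanning the log backward in undo matters when a transaction writes the same item more than once: the oldest old-value is the one that must end up in the database.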

Deferred Database Modification

The deferred-modification technique ensures transaction atomicity by recording all database modifications in the log, but deferring all write operations of a transaction until the transaction partially commits (i.e., once the final action of the transaction has been executed). Then the information in the log is used to execute the deferred writes. If the system crashes before the transaction commits, or if the transaction aborts, then the information in the log is simply ignored.
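A minimal sketch of recovery under deferred modification, assuming a hypothetical log format of tuples ("start", T), ("write", T, item, new_value) and ("commit", T). Only new values are logged here, because under this scheme an uncommitted transaction has not yet touched the database and so nothing ever needs to be undone.

```python
def recover_deferred(log: list, database: dict) -> dict:
    """Redo the writes of committed transactions; ignore everything else."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:                      # forward scan preserves write order
        if rec[0] == "write" and rec[1] in committed:
            _, _txn, item, new_value = rec
            database[item] = new_value
    return database
```

For example, if T1 has a commit record in the log and T2 does not, T1's writes are reapplied and T2's log records have no effect on the database.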

Immediate Database Modification

The immediate-update technique allows database modifications to be output to the database while the transaction is still in the active state. These modifications are called uncommitted modifications. In the event of a crash or transaction failure, the system must use the old-value field of the log records to restore the modified data items.
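The corresponding recovery step for immediate modification might look like the sketch below. It assumes a hypothetical log format in which write records carry both values, ("write", T, item, old_value, new_value), and it also redoes committed transactions, since their changes may not all have reached the disk before the crash.

```python
def recover_immediate(log: list, database: dict) -> dict:
    """Undo uncommitted transactions, then redo committed ones."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Undo pass: scan backward, restoring old values of uncommitted writes.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] not in committed:
            database[rec[2]] = rec[3]
    # Redo pass: scan forward, reapplying new values of committed writes.
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            database[rec[2]] = rec[4]
    return database
```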

Checkpoints

When a system failure occurs, we must consult the log to determine those transactions that need to be redone and those that need to be undone. Searching the entire log is time-consuming, and most of the transactions it would redo have already written their updates to the database. To avoid this, the system periodically performs a checkpoint, which consists of the following steps:

Output onto stable storage all the log records currently residing in main memory.
Output to the disk all modified buffer blocks.
Output onto stable storage a log record, <checkpoint>.

Recovery then needs to process only the log records written since the most recent <checkpoint> record.
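One way to picture the effect of a checkpoint is the sketch below, which looks only at the part of the log after the last checkpoint record and sorts transactions into a redo list and an undo list. The tuple format and the function name are hypothetical, and the sketch assumes no transaction is active while the checkpoint is taken; otherwise the scan would have to reach back to that transaction's start record.

```python
def classify_after_checkpoint(log: list) -> tuple:
    """Return (redo, undo) transaction lists using only records after the last checkpoint."""
    last_cp = max((i for i, rec in enumerate(log) if rec[0] == "checkpoint"), default=-1)
    tail = log[last_cp + 1:]             # whole log if there is no checkpoint
    started = [rec[1] for rec in tail if rec[0] == "start"]
    committed = {rec[1] for rec in tail if rec[0] == "commit"}
    redo = [t for t in started if t in committed]
    undo = [t for t in started if t not in committed]
    return redo, undo
```

Transactions that committed before the checkpoint appear in neither list, because their updates were already output to disk when the checkpoint was taken.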

Shadow Paging

Shadow paging is an alternative to log-based recovery techniques, and it has both advantages and disadvantages. It may require fewer disk accesses, but it is hard to extend shadow paging to allow multiple concurrent transactions. The paging is very similar to the paging schemes used by the operating system for memory management.

The idea is to maintain two page tables during the life of a transaction: the current page table and the shadow page table. When the transaction starts, both tables are identical. The shadow page table is never changed during the life of the transaction. The current page table is updated with each write operation. Each table entry points to a page on the disk. When the transaction is committed, the shadow page table becomes a copy of the current page table and the disk blocks holding the old data are released. If the shadow page table is stored in nonvolatile storage and a system crash occurs, then the shadow page table is copied to the current page table. This guarantees that the shadow page table will point to the database pages corresponding to the state of the database prior to any transaction that was active at the time of the crash, making aborts automatic.
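The two-table idea can be sketched as a small in-memory model. The class ShadowPagedDB and its methods are hypothetical names, a dictionary stands in for the disk, and a single transaction at a time is assumed.

```python
class ShadowPagedDB:
    def __init__(self, pages: list):
        self.disk = dict(enumerate(pages))        # physical page id -> contents
        self.shadow = list(range(len(pages)))     # logical page -> physical page
        self.current = list(self.shadow)          # identical at transaction start
        self.next_free = len(pages)

    def read(self, logical: int):
        return self.disk[self.current[logical]]

    def write(self, logical: int, value) -> None:
        # First write to a page: allocate a new physical page, leave the old one alone.
        if self.current[logical] == self.shadow[logical]:
            self.current[logical] = self.next_free
            self.next_free += 1
        self.disk[self.current[logical]] = value

    def commit(self) -> None:
        # Release pages holding old data, then make the shadow a copy of the current table.
        for old, new in zip(self.shadow, self.current):
            if old != new:
                del self.disk[old]
        self.shadow = list(self.current)

    def recover(self) -> None:
        # After a crash or abort, fall back to the untouched shadow table.
        self.current = list(self.shadow)
```

Note that recover leaves behind any pages the failed transaction allocated; nothing points to them any more.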

There are drawbacks to the shadow-page technique:

Commit overhead. The commit of a single transaction using shadow paging requires multiple blocks to be output -- the current page table, the actual data and the disk address of the current page table. Log-based schemes need to output only the log records.

Data fragmentation. Shadow paging causes database pages to change location, so related pages are no longer contiguous on disk.

Garbage collection. Each time that a transaction commits, the database pages containing the old version of data changed by the transaction become inaccessible. Such pages are considered to be garbage, since they are not part of the free space and do not contain any usable information. Periodically it is necessary to find all of the garbage pages and add them to the list of free pages. This process is called garbage collection and imposes additional overhead and complexity on the system.
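The garbage-collection step amounts to finding pages that are neither referenced by the page table nor already on the free list. A hedged sketch, with hypothetical names and plain lists standing in for the on-disk structures:

```python
def collect_garbage(all_pages, page_table: list, free_list: list) -> list:
    """Add unreachable, non-free pages to free_list and return them."""
    reachable = set(page_table)
    already_free = set(free_list)
    garbage = [p for p in all_pages if p not in reachable and p not in already_free]
    free_list.extend(garbage)
    return garbage
```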

Recovery with Concurrent Transactions

Regardless of the number of concurrent transactions, the system has a single disk buffer and a single log, both shared by all transactions. We allow immediate updates, and permit a buffer block to have data items updated by one or more transactions.

Buffer Management

Log-Record Buffering

The cost of outputting a block to stable storage is high enough that it is desirable to output multiple log records at once, using a buffer. When the buffer is full, it is output with as few output operations as possible. However, a log record may then reside only in main memory for a considerable time before it is actually written to stable storage, and such log records are lost if the system crashes. It is necessary, therefore, to write all buffered log records related to a transaction to stable storage when it commits. There is no problem with writing the log records of other, uncommitted transactions at the same time.
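A small sketch of this buffering policy follows. The class LogBuffer is hypothetical, a Python list stands in for stable storage, and the outputs counter is there only to show how many output operations were needed.

```python
class LogBuffer:
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.buffer = []        # log records still only in main memory
        self.stable_log = []    # stand-in for the log on stable storage
        self.outputs = 0        # number of output operations performed

    def flush(self) -> None:
        """Output every buffered record in one operation."""
        if self.buffer:
            self.stable_log.extend(self.buffer)
            self.buffer.clear()
            self.outputs += 1

    def append(self, record: tuple) -> None:
        self.buffer.append(record)
        # Force the buffer out on commit, or when it is full.
        if record[0] == "commit" or len(self.buffer) >= self.capacity:
            self.flush()
```

When T1 commits, the records of an uncommitted T2 that happen to be in the buffer go out in the same output operation, which is harmless.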

Database Buffering

Database buffering resembles the standard operating system concept of virtual memory. Whenever a modified block of the database in memory must be replaced, the log records associated with that block must be written to stable storage before the block itself is written to the disk.

Operating System Role in Buffer Management

We can manage the database buffer using one of two approaches:

The database system reserves part of main memory to serve as a buffer that the DBMS manages instead of the operating system. This means that the buffer must be kept as small as possible (because of its impact on other processes active on the CPU) and it adds to the complexity of the DBMS.
The DBMS implements its buffer within the virtual memory of the operating system. The operating system would then have to coordinate the swapping of pages to ensure that the appropriate buffers were also written to disk. Unfortunately, almost all current-generation operating systems retain complete control of virtual memory. The operating system reserves space on disk, called swap space, for storing virtual memory pages that are not currently in main memory. This approach may result in extra output to the disk.

Failure with Loss of Nonvolatile Storage

The basic scheme is to dump the entire content of the database to stable storage periodically. No transaction can be active during the dump procedure.

To recover from the loss of nonvolatile storage, we restore the database from the most recent dump and then redo all the transactions that have committed since that dump.

This is also known as an archival dump. Dumps of the database and checkpointing are very similar.
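Restoring from a dump can be sketched as follows. The function names and the tuple log format ("write", T, item, new_value) are hypothetical, and the sketch assumes the log itself is on stable storage and therefore survives the failure.

```python
import copy

def take_dump(database: dict, log: list) -> dict:
    """Copy the whole database and remember how long the log was at dump time."""
    return {"snapshot": copy.deepcopy(database), "log_position": len(log)}

def restore_from_dump(dump: dict, log: list) -> dict:
    """Rebuild the database from the dump, then redo transactions committed since."""
    database = copy.deepcopy(dump["snapshot"])
    tail = log[dump["log_position"]:]
    committed = {rec[1] for rec in tail if rec[0] == "commit"}
    for rec in tail:
        if rec[0] == "write" and rec[1] in committed:
            _, _txn, item, new_value = rec
            database[item] = new_value
    return database
```

Because no transaction is active during the dump, every transaction in the tail of the log started after the snapshot was taken, so redoing the committed ones is sufficient.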
