UMBC CMSC 461 Spring '99

Lecture 23
Storage and File Structure

Chapter 10, Database Systems Concepts, by Silberschatz, et al, 1997

Portions reproduced with permission

So far we have studied the DBMS at level of the logical model. The logical model of a database system is the correct level for the database users to focus on. The goal of a database system is to simplify and facilitate access to data. As members of the development staff and as potential Database Administrators, we need to understand the physical level better than a typical user.

Overview of Physical Storage Media

Storage media are classified by speed of access, cost per unit of data to buy the media, and by the medium's reliability. Unfortunately, as speed and cost go up, the reliability does down.

Cache is the fastest and the most costly for of storage. The type of cache referred to here is the type that is typically built into the CPU chip and is 256KB, 512KB, or 1MB. Thus, cache is used by the operating system and has no application to database, per se.

Main memory is the volatile memory in the computer system that is used to hold programs and data. While prices have been dropping at a staggering rate, the increases in the demand for memory have been increasing faster. Today's 32-bit computers have a limitation of 4GB of memory. This may not be sufficient to hold the entire database and all the associated programs, but the more memory available will increase the response time of the DBMS. There are attempts underway to create a system with the most memory that is cost effective, and to reduce the functionality of the operating system so that only the DBMS is supported, so that system response can be increased. However, the contents of main memory are lost if a power failure or system crash occurs.

Flash memory is also referred to as electrically erasable programmable read-only memory (EEPROM). Since it is small (5 to 10MB) and expensive, it has little or no application to the DBMS.

Magnetic-disk storage is the primary medium for long-term on-line storage today. Prices have been dropping significantly with a corresponding increase in capacity. New disks today are in excess of 20GB. Unfortunately, the demands have been increasing and the volume of data has been increasing faster. The organizations using a DBMS are always trying to keep up with the demand for storage. This media is the most cost-effective for on-line storage for large databases.

Optical storage is very popular, especially CD-ROM systems. This is limited to data that is read-only. It can be reproduced at a very low-cost and it is expected to grow in popularity, especially for replacing written manuals.

Tape storage is used for backup and archival data. It is cheaper and slower than all of the other forms, but it does have the feature that there is no limit on the amount of data that can be stored, since more tapes can be purchased. As the tapes get increased capacity, however, restoration of data takes longer and longer, especially when only a small amount of data is to be restored. This is because the retrieval is sequential, the slowest possible method.

Magnetic Disks

A typical large commercial database may require hundreds of disks!

Physical Characteristics of Disks

Disks are actually relatively simple. There is normally a collection of platters on a spindle. Each platter is coated with a magnetic material on both sides and the data is stored on the surfaces. There is a read-write head for each surface that is on an arm assembly that moves back and forth. A motor spins the platters at a high constant speed, (60, 90, or 120 revolutions per seconds.)

The surface is divided into a set of tracks (circles). These tracks are divided into a set of sectors, which is the smallest unit of data that can be written or read at one time. Sectors can range in size from 31 bytes to 4096 bytes, with 512 bytes being the most common. A collection of a specific track from both surfaces and from all of the platters is called a cylinder.

Platters can range in size from 1.8 inches to 14 inches. Today, 5 1/4 inches and 3 1/2 inches are the most common, because they have the highest seek times and lowest cost.

A disk controller interfaces the computer system and the actual hardware of the disk drive. The controller accepts high-level command to read or write sectors. The controller then converts the commands in the necessary specific low-level commands. The controller will also attempt to protect the integrity of the data by computing and using checksums for each sector. When attempting to read the data back, the controller recalculates the checksum and makes several attempts to correctly read the data and get matching checksums. If the controller is unsuccessful, it will notify the operating system of the failure.

The controller can also handle the problem of eliminating bad sectors. Should a sector go bad, the controller logically remaps the sector to one of the extra unused sectors that disk vendors provide, so that the reliability of the disk system is higher. It is cheaper to produce disks with a greater amount of sectors than advertised and then map out bad sectors than it is to produce disks with no bad sectors or with extremely limited possibility of sectors going bad.

There are many different types of disk controllers, but the most common ones today are SCSI, IDE, and EIDE.

One other characteristic of disks that provides an interesting performance is the distance from the read-write head to the surface of the platter. The smaller this gap is means that data can be written in a smaller area on the disk, so that the tracks can be closer together and the disk has a greater capacity. Often the distance is measured in microns. However, this means that the possibility of the head touching the surface is increased. When the head touches the surface while the surface is spinning at a high speed, the result is called a "head crash", which scratches the surface and defaces the head. The bottom line to this is that someone must replace the disk.

Performance Measures of Disks

Seek time is the time to reposition the head and increases with the distance that the head must move. Seek times can range from 2 to 30 milliseconds. Average seek time is the average of all seek times and is normally one-third of the worst-case seek time.

Rotational latency time is the time from when the head is over the correct track until the data rotates around and is under the head and can be read. When the rotation is 120 rotations per second, the rotation time is 8.35 milliseconds. Normally, the average rotational latency time is one-half of the rotation time.

Access time is the time from when a read or write request is issued to when the data transfer begins. It is the sum of the seek time and latency time.

Data-transfer rate is the rate at which data can be retrieved from the disk and sent to the controller. This will be measured as megabytes per second.

Mean time to failure is the number of hours (on average) until a disk fails. Typical times today range from 30,000 to 800,000 hours (or 3.4 to 91 years).

Optimization of Disk-Block Access

Requests for disk I/O are generated by both the file system and by the virtual memory manager found in most systems. Each request specifies the address on the disk to be referenced; that address specifies is in the form of a block number. Each block is a contiguous sequence of sectors from a single track of one platter and ranges from 512 bytes to several kilobytes of data. The lower level file manager must convert block addresses into the hardware-level cylinder, surface, and sector number.

Since access to data on disk is several orders of magnitude slower is access to data in main memory; much attention has been paid to improving the speed of access to blocks on the disk. This is also where more main memory can speed up the response time, by making sure that the data needed is in memory when it is needed.

This is the same problem that is addressed in designing operating systems, to insure the best response time from the file system manager and the virtual memory manager.

Scheduling. Disk-arm scheduling algorithms attempt to order accesses in an attempt to increase the number of accesses that can be processed in a given amount of time. The might include First-Come/First-Serve, Shortest Seek First, and elevator.

File organization. To reduce block-access time, data could be arranged on the disk in the same order that it is expected to be retrieved. (This would be storing the data on the disk in order based on the primary key.) At best, this starts to produce less and less of a benefit, as there are more inserts and deletes. Also we have little control of where on the disk things get stored. The more the data gets fragmented on the disk, the more time it takes to locate it.

Nonvolatile write buffer. Using non-volatile memory (flash memory) can be used to protect the data in memory from crashes, but it does increase the cost. It is possible that the use of an UPS would be more effective and cheaper.

Log disk. You can use a disk for writing a sequential log.

Buffering. The more information you have in buffers in main memory, the more likely you are to not have to get the information from the disk. However it is more likely that more of the memory will be wasted with information not necessary.

RAID

RAIDs are Redundant Arrays of Inexpensive Disks. There are six levels of organizing these disks:

0 -- Non-redundant Striping
1 -- Mirrored Disks
2 -- Memory Style Error Correcting Codes
3 -- Bit Interleaved Parity
4 -- Block Interleaved Parity
5 -- Block Interleaved Distributed Parity
6 -- P + Q Redundancy

Tertiary Storage

This is commonly optical disks and magnetic tapes.

Storage Access

A database is mapped into a number of different files, which are maintained by the underlying operating system. Files are organized into block and a block may contain one or more data item.

A major goal of the DBMS is to minimize the number of block transfers between the disk and memory. Since it is not possible to keep all blocks in main memory, we need to manage the allocation of the space available for the storage of blocks. This is also similar to the problems encountered by the operating system, and can be in conflict with the operating system, since the OS is concerned with processes and the DBMS is concerned with only one family of processes.

Buffer Manager

Programs in a DBMS make requests (that is, calls) on the buffer manager when they need a block from a disk. If the block is already in the buffer, the requester is passed the address of the block in main memory. If the block in not in the buffer, the buffer manager first allocates space in the buffer for the block, through out some other block, if required, to make space for the new block. If the block that is to be thrown out has been modified, it must first be written back to the disk. The internal actions of the buffer manager are transparent to the programs that issue disk-block requests.

Replacement strategy. When there is no room left in the buffer, a block must be removed from the buffer before a new one can be read in. Typically, operating systems use a least recently use (LRU) scheme. There is also a Most Recent Used (MRU) that can be more optimal for DBMSs.

Pinned blocks. A block that is not allowed to be written back to disk is said to be pinned. This could be used to store data that has not been committed yet.

Forced output of blocks. There are situations in which it is necessary to write back to the block to the disk, even though the buffer space is not currently needed. This might be done during system lulls, so that when activity picks up, a write of a modified block can be avoided in peak periods.

File Organization

Fixed-Length Records

Suppose we have a table that has the following organization:

type deposit = record
    branch-name : char(22);
    account-number : char(10);
    balance : real;
end

If each character occupies 1 byte and a real occupies 8 bytes, then this record occupies 40 bytes. If the first record occupies the first 40 bytes and the second record occupies the second 40 bytes, etc. we have some problems.
It is difficult to delete a record, because there is no way to indicate that the record is deleted. (At least one system automatically adds one byte to each record as a flag to show if the record is deleted.) Unless the block size happens to be a multiple of 40 (which is extremely unlikely), some records will cross block boundaries. It would require two block access to read or write such a record.

One solution might be to compress the file after each deletion. This will incur a major amount of overhead processing, especially on larger files. Additionally, there is the same problem on inserts!

Another solution would be to have two sets of pointers. One that would link the current record to the next logical record (linked list) plus a free list (a list of free slots.) This increases the size the file.

Variable-Length Records

We can use variable length records:

Storage of multiple record types in one file.
Record types that allow variable lengths for one or more fields
Record types that allow repeating fields.

A simple method for implementing variable-length records is to attach a special end-of-record symbol at the end of each record. But this has problems:

To easy to reuse space occupied formerly by a deleted record.
There is no space in general for records to grow. If a variable-length record is updated and needs more space, it must be moved. This can be very costly.

It could be solved:

By making a variable-length into a fixed length.
By using pointers to point to fixed length records, chained together by pointers.

As you can see, there is not an easy answer.

Organization of Records in Files

Heap File Organization

Any record can be placed anywhere in the file. There is no ordering of records and there is a single file for each relation.

Sequential File Organization

Records are stored in sequential order based on the primary key.

Hashing File Organization

Any record can be placed anywhere in the file. A hashing function is computed on some attribute of each record. The function specifies in which block the record should be placed.

Clustering File Organization

Several different relations can be stored in the same file. Related records of the different relations can be stored in the same block.

Data Dictionary Storage

A RDBMS needs to maintain data about the relations, such as the schema. This is stored in a data dictionary (sometimes called a system catalog):

Names of the relations
Names of the attributes of each relation
Domains and lengths of attributes
Names of views, defined on the database, and definitions of those views
Integrity constraints
Names of authorized users
Accounting information about users
Number of tuples in each relation
Method of storage for each relation (clustered/non-clustered)
Name of the index
Name of the relation being indexed
Attributes on which the index in defined
Type of index formed

Lecture 23 Storage and File Structure