Oracle8(TM) Server Tuning
Release 8.0
A54638-01


19
Tuning Parallel Execution

Parallel execution can dramatically reduce response time for data-intensive operations on very large databases. The first half of this chapter outlines three basic parallel operation tuning steps to get you up and running. The second half provides detailed information to help you diagnose and solve tuning problems.

See Also: Oracle8 Server Concepts, for basic principles of parallel execution.

See your operating system-specific Oracle documentation for more information about tuning while using parallel execution.

Introduction to Parallel Execution Tuning

Parallel execution is useful for operations that access a large amount of data: large table scans and joins, the creation of large indexes, partitioned index scans, bulk inserts, updates, and deletes, and aggregation or copying. It benefits systems with all of the following characteristics:

If any one of these conditions is not true for your system, parallel execution may not significantly help performance. In fact, on over-utilized systems or systems with small I/O bandwidth, parallel execution can impede system performance.

The three basic steps for tuning parallel execution are outlined in the following sections.

Note: In this chapter the term "parallel server process" designates a process that performs part of a parallel operation, as distinguished from the product "Oracle Parallel Server".

Step 1: Tuning System Parameters for Parallel Execution

Many initialization parameters affect parallel execution performance. For best results, start with an initialization file that is appropriate for the intended application.

Before starting the Oracle Server, set the following initialization parameters. The recommended settings are guidelines for a large data warehouse (more than 100 gigabytes) on a typical high-end shared memory multiprocessor with one or two gigabytes of memory. Each section explains how to modify these settings for other configurations. Note that you can change some of these parameters dynamically with ALTER SYSTEM or ALTER SESSION statements. The parameters are grouped as follows:

Parameters Affecting Resource Consumption

This section discusses the consumption of resources, and in particular the consumption of memory. Tune the parameters discussed in this section to ensure that resource consumption is optimized for your parallel operation needs.

Attention: "Understanding Parallel Execution Performance Issues" on page 19-44 describes in detail how these parameters interrelate.

You must configure memory at two levels:

On some platforms you may need to set operating system parameters which control the total amount of virtual memory available, summed across all processes.

As a general guideline for memory sizing, note that each process needs address space big enough for its hash joins. A dominant factor in heavyweight decision support (DSS) operations is the relationship between memory, the number of processes, and the number of hash join operations. Since hash joins and large sorts are memory-hungry operations, you may want to configure fewer processes, each with a greater limit on the amount of memory it can use.

Memory available for DSS operations comes from process memory, which in turn comes from virtual memory. Total virtual memory should be somewhat more than available real memory, which is the physical memory minus the size of the SGA.

The SGA is static and of fixed size. Typically it comes out of real physical memory. To change the size of the SGA you must shut down the database, make the change, and restart the database. DSS memory is much more dynamic: it comes out of process memory, and both the size of a process's memory and the number of processes can vary greatly.

Virtual memory is typically more than physical memory, but should not generally exceed twice the size of the physical memory less the SGA size. If you make it many times more than real memory, the paging rate may go up when the machine is overloaded at peak times.

HASH_AREA_SIZE

Recommended Value: Hash area size should be approximately half of the square root of S, where S is the size (in MB) of the smaller of the inputs to the join operation. (The value should not be less than 1 MB.)

This relationship can be expressed as follows:

HASH_AREA_SIZE >= (square root of S) / 2, with a minimum of 1 MB

For example, if S equals 16 MB, a minimum appropriate value for the hash area might be 2 MB, summed over all the parallel processes. Thus if you have 2 parallel processes, a minimum appropriate hash area size per process might be 1 MB. A smaller hash area would not be advisable.

For a large data warehouse, HASH_AREA_SIZE may range from 8MB to 32MB or more.

This parameter provides for adequate memory for hash joins. Each process that performs a parallel hash join uses an amount of memory equal to HASH_AREA_SIZE.

Hash join performance is more sensitive to HASH_AREA_SIZE than sort performance is to SORT_AREA_SIZE. As with SORT_AREA_SIZE, too large a hash area may cause the system to run out of memory.

The hash area does not cache blocks in the buffer cache; even low values of HASH_AREA_SIZE will not cause this to occur. Too small a setting could, however, affect performance.

Note that HASH_AREA_SIZE is relevant to parallel query operations, and to the query portion of DML or DDL statements.
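
Because HASH_AREA_SIZE can be changed dynamically, you can raise it for a single memory-intensive session rather than for the whole instance. A minimal sketch (the 16 MB value is illustrative, not a recommendation):

ALTER SESSION SET HASH_AREA_SIZE = 16777216;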

See Also: "SORT_AREA_SIZE" on page 19-10

OPTIMIZER_PERCENT_PARALLEL

Recommended Value: 100/number_of_concurrent_users
This parameter determines how aggressively the optimizer will attempt to parallelize a given execution plan. OPTIMIZER_PERCENT_PARALLEL encourages the optimizer to use plans that have low response time because of parallel execution, even if total resource used is not minimized.

The default value of OPTIMIZER_PERCENT_PARALLEL is 0, which causes the optimizer to choose the plan that uses the least resources, parallelizing it if possible. With this setting, the execution time of the operation may be long because only a small amount of resource is used. A value of 100 causes the optimizer to choose a parallel plan unless a serial plan would complete faster.

Note: Selecting a single record from a table can be done very quickly with an appropriate index and does not require parallelism. A full scan to find the single row can be executed in parallel, but normally each parallel process examines many rows. In this case the response time of the parallel plan is higher, and total system resource use is much greater, than for a serial plan using an index. With a parallel plan, the delay is shortened relative to a serial full scan because more resources are used. The parallel plan can use up to D times more resources, where D is the degree of parallelism. A value between 0 and 100 sets an intermediate trade-off between throughput and response time. Low values favor indexes; high values favor table scans.

A FIRST_ROWS hint or optimizer mode will override a non-zero setting of OPTIMIZER_PERCENT_PARALLEL.
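
For example, applying the recommended formula on a system sized for four concurrent users gives the following initialization file setting (the value 25 is simply 100/4):

OPTIMIZER_PERCENT_PARALLEL = 25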

PARALLEL_MAX_SERVERS

Recommended Value: 2 * CPUs * number_of_concurrent_users

Most parallel operations need at most twice the number of parallel server processes as the maximum degree of parallelism attributed to any table in the operation. By default this is at most twice the number of CPUs. The following figure illustrates how the recommended value is derived.

Figure 19-1: PARALLEL_MAX_SERVERS = 2 * CPUs * Users

To support concurrent users, add more parallel server processes. You probably want to limit the number of CPU-bound processes to be a small multiple of the number of CPUs (perhaps 4 to 16 times the number of CPUs). This would limit the number of concurrent parallel execution statements to be in the range of 2 to 8.

Note that if a database's users start up too many concurrent operations, Oracle may run out of parallel server processes. Should this happen, Oracle will execute the operation sequentially, or give an error if PARALLEL_MIN_PERCENT is set.

When concurrent users use too many parallel server processes, memory contention (paging), I/O contention, or excessive context switching can occur. This contention could reduce system throughput to a level lower than if no parallel execution were used. Increase the PARALLEL_MAX_SERVERS value only if your system has sufficient memory and I/O bandwidth for the resulting load. Limiting the total number of parallel server processes can restrict the number of concurrent users that can execute parallel operations, but system throughput will tend to remain stable.

To increase the number of concurrent users, you could restrict the number of concurrent sessions that certain classes of user can have. For example:

You can limit the amount of parallelism available to a given user by setting up a resource profile associated with the user. In this way you can limit the number of sessions or concurrent logons, which limits the number of parallel processes the user can have. (Each parallel server process working on your parallel execution statement is logged on as you--it counts against your limit of concurrent sessions.) For example, to limit a user to 10 processes, the DBA would set the user's limit to 11: one process for the parallel coordinator, and ten more processes comprising two sets of parallel server processes. The user's maximum degree of parallelism would thus be 5.
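
As a sketch of this technique (the profile name limit_parallel is illustrative), the following would cap a user at 11 concurrent sessions. Note that session limits are enforced only when resource limits are enabled for the instance via the RESOURCE_LIMIT parameter:

CREATE PROFILE limit_parallel LIMIT SESSIONS_PER_USER 11;
ALTER USER scott PROFILE limit_parallel;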

See Also: "The Formula for Memory, Users, and Parallel Server Processes" on page 19-44 for further information on balancing concurrent users, degree of parallelism, and resources consumed.

Oracle8 Server Administrator's Guide for more information about managing resources with user profiles.

PARALLEL_MIN_SERVERS

Recommended Value: PARALLEL_MAX_SERVERS
The system parameter PARALLEL_MIN_SERVERS allows you to specify the number of processes to be started and reserved for parallel operations at startup in a single instance. The syntax is:

PARALLEL_MIN_SERVERS=n

where n is the number of processes you want to start and reserve for parallel operations. Sometimes you may not be able to increase the maximum number of parallel server processes for an instance because the maximum number depends on the capacity of the CPUs and the I/O bandwidth (platform-specific issues). However, if servers are continuously starting and shutting down, you should consider increasing the value of the parameter PARALLEL_MIN_SERVERS.

For example, if you have determined that the maximum number of concurrent parallel server processes that your machine can manage is 100, you should set PARALLEL_MAX_SERVERS to 100. Next determine how many parallel server processes the average operation needs, and how many operations are likely to be executed concurrently. For this example, assume you will have two concurrent operations with 20 as the average degree of parallelism. Because each operation can require two sets of parallel server processes, at any given time there could be 2 x 2 x 20 = 80 parallel server processes busy on an instance. You should therefore set the parameter PARALLEL_MIN_SERVERS to 80.

Consider decreasing PARALLEL_MIN_SERVERS if fewer parallel server processes than this value are typically busy at any given time. Idle parallel server processes constitute unnecessary system overhead.

Consider increasing PARALLEL_MIN_SERVERS if more parallel server processes than this value are typically active, and the "Servers Started" statistic of V$PQ_SYSSTAT is continuously growing.
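
You can check this statistic with a query such as the following:

SELECT statistic, value FROM V$PQ_SYSSTAT
        WHERE statistic = 'Servers Started';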

The advantage of starting these processes at startup is the reduction of process creation overhead. Note that Oracle reserves memory from the shared pool for these processes; you should therefore add additional memory using the initialization parameter SHARED_POOL_SIZE to compensate. Use the following formula to determine how much memory to add:

(CPUs + 2) * (PARALLEL_MIN_SERVERS) * 1.5 * (BLOCK_SIZE)
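
For example, with 10 CPUs, PARALLEL_MIN_SERVERS set to 80 (as in the example above), and an 8K block size, the additional shared pool memory would be approximately:

(10 + 2) * 80 * 1.5 * 8192 = 11,796,480 bytes (about 11 MB)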

SHARED_POOL_SIZE

Recommended Value: default plus
(3 * msgbuffer_size) * (CPUs + 2) * PARALLEL_MAX_SERVERS

Increase the initial value of this parameter to provide space for a pool of message buffers that parallel server processes can use to communicate with each other. As illustrated in the following figure, assuming 4 concurrent users and a 2K buffer size, you would increase SHARED_POOL_SIZE by 6K * (CPUs + 2) * PARALLEL_MAX_SERVERS for a pool of message buffers that parallel server processes use to communicate. This value grows quadratically with the degree of parallelism if you set PARALLEL_MAX_SERVERS to the recommended value. (This is because the recommended value of PARALLEL_MAX_SERVERS is itself proportional to the number of CPUs, so the product above grows with the square of the number of CPUs--a quadratic function.)

Figure 19-2: Increasing SHARED_POOL_SIZE with Degree of Parallelism

Note: The message buffer size might be 2K or 4K, depending on the platform. Check your platform vendor's documentation for details.

Parallel plans take up about twice as much space in the SQL area as serial plans, but additional space allocation is probably not necessary since they generally are not shared.

On Oracle Parallel Server, there can be multiple CPUs in a single node, and parallel operation can be performed across nodes. Whereas SMP systems use 3 buffers for connection, 4 buffers are used to connect between instances on Oracle Parallel Server. Thus you should normally have 4 buffers in shared memory: 2 in the local shared pool and 2 in the remote shared pool. The formula for increasing the value of SHARED_POOL_SIZE on OPS becomes:

(4 * msgbuffer_size) * ((CPUs_per_node * #nodes ) + 2) * (PARALLEL_MAX_SERVERS * #nodes)

Note that the degree of parallelism on OPS is expressed by the number of CPUs per node multiplied by the number of nodes.

SORT_AREA_SIZE

Sample Range: 256K to 4M
This parameter specifies the amount of memory to allocate per parallel server for sort operations. If memory is abundant on your system, you can benefit from setting SORT_AREA_SIZE to a large value. This can dramatically increase the performance of sort and hash operations since the entire operation is more likely to be performed in memory. However, if memory is a concern for your system, you may want to limit the amount of memory allocated for sorts and hashing operations and increase the size of the buffer cache so that data blocks from temporary sort segments can be cached in the buffer cache.

If the sort area is too small, an excessive amount of I/O will be required to merge a large number of runs. If the sort area size is smaller than the amount of data to sort, then the sort will spill to disk, creating sort runs. These must then be merged again using the sort area. If the sort area size is very small, there will be many runs to merge, and it may take multiple passes to merge them together. The amount of I/O increases as the sort area size decreases.

If the sort area is too large, the operating system paging rate will be excessive. The cumulative sort area adds up quickly because each parallel server process can allocate this amount of memory for each sort. Monitor the operating system paging rate to see if too much memory is being requested.

Note that SORT_AREA_SIZE is relevant to parallel query operations, and to the query portion of DML or DDL statements. In addition, all CREATE INDEX statements must do some sorting to generate the index.

See Also: "HASH_AREA_SIZE" on page 19-4

Parameters Affecting Resource Consumption for PDML and/or PDDL

Parallel INSERT, UPDATE, and DELETE require more resources than do serial DML operations. Likewise, PARALLEL CREATE TABLE AS SELECT and PARALLEL CREATE INDEX may require more resources. For this reason you may need to increase the value of several additional initialization parameters.

See Also: Oracle8 Server SQL Reference for complete information about parameters.

TRANSACTIONS

For PDML, each parallel server process starts a transaction. The parallel coordinator uses the two-phase commit protocol to commit them; therefore the number of transactions being processed increases by the degree of parallelism. You may thus need to increase the value of the TRANSACTIONS initialization parameter, which specifies the maximum number of concurrent transactions. For example, if you have degree 20 parallelism, you will have 20 new server transactions and 1 coordinator transaction; thus you should increase TRANSACTIONS by 21 if they are running in the same instance.

ROLLBACK_SEGMENTS

The increased number of transactions for PDML necessitates many rollback segments. For example, one command with degree 5 parallelism uses 5 server transactions, which should be distributed among different rollback segments. The rollback segments should belong to tablespaces that have free space, and the rollback segments should be unlimited, or should have a high value for the MAXEXTENTS parameter of the STORAGE clause. In this way they can extend and not run out of space.

LOG_BUFFER

Check the statistic "redo buffer allocation retries" in the V$SYSSTAT view. If this value is high, try to increase the LOG_BUFFER size. A common LOG_BUFFER size for a system generating a large amount of redo is 3 MB to 5 MB. If the number of retries is still high after increasing LOG_BUFFER size, there may be a problem with the disk where the log files reside. In that case, stripe the log files across multiple disks in order to increase the I/O bandwidth.
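
You can monitor this statistic with a query such as:

SELECT name, value FROM V$SYSSTAT
        WHERE name = 'redo buffer allocation retries';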

Note that PDML will generate a good deal more redo, especially when you are performing inserts, updates and deletes.

DML_LOCKS

A PDML operation's lock and enqueue resource requirement is very different from that of serial DML. Parallel DML holds many more locks, so you should increase the value of the ENQUEUE_RESOURCES and DML_LOCKS parameters by the same amount.

The following table shows the types of lock acquired by coordinator and server processes, for different types of PDML statements. Note that a server process can work on one or more partitions, but a partition can only be worked on by one server process (this is different from parallel query).

Table 19-1: Locks Acquired by Parallel DML Statements

Type of statement                     Coordinator acquires:        Each server process acquires:

Parallel UPDATE/DELETE into           1 table lock SX              1 table lock SX
partitioned table; WHERE clause       1 partition lock X           1 partition lock NULL per partition
specifies the partition involved        per partition              1 partition-wait lock X per partition

Parallel UPDATE/DELETE/INSERT         1 table lock SX              1 table lock SX
into partitioned table                partition locks X for        1 partition lock NULL per partition
                                        all partitions             1 partition-wait lock X per partition

Parallel INSERT into                  1 table lock X               none
non-partitioned table

Note: Table, partition, and partition-wait DML locks all appear as TM locks in the V$LOCK view.

Consider a table with 600 partitions, running with parallel degree 100, assuming all partitions are involved in the parallel UPDATE/DELETE statement.

The coordinator acquires: 1 table lock SX
600 partition locks X

Total server processes acquire: 100 table locks SX
600 partition locks NULL
600 partition-wait locks X
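
Adding these up gives 1 + 600 + 100 + 600 + 600 = 1901 DML locks for this one statement, so DML_LOCKS and ENQUEUE_RESOURCES should each be increased by at least that amount to accommodate it.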

ENQUEUE_RESOURCES

Since PDML operations require many more resources than serial DML, you should increase the value of the ENQUEUE_RESOURCES and DML_LOCKS parameters, by the same amount.

See Also: "DML_LOCKS" on page 19-12

Parameters Enabling New Features

Set the following parameters in order to use the latest available functionality.

Note: Partitioned tables should be used instead of partition views. Partition views will be made obsolete in a future release of Oracle Server.

ALWAYS_ANTI_JOIN

Recommended Value: HASH
When set to HASH, this parameter causes the NOT IN operator to be evaluated in parallel using a parallel hash anti-join. Without this parameter set to HASH, NOT IN is evaluated as a (sequential) correlated subquery.

Figure 19-3: Parallel Hash Anti-join

As illustrated above, the SQL IN predicate can be evaluated using a join to intersect two sets. Thus emp.deptno can be joined to dept.deptno to yield a list of employees in a set of departments.

Alternatively, the SQL NOT IN predicate can be evaluated using an anti-join to subtract two sets. Thus emp.deptno can be anti-joined to dept.deptno to select all employees who are not in a set of departments; for example, you can get a list of all employees who are not in the Shipping or Receiving departments.

For a specific query, place the MERGE_AJ or HASH_AJ hints into the NOT IN subquery. MERGE_AJ uses a sort-merge anti-join and HASH_AJ uses a hash anti-join. For example:

SELECT * FROM emp
        WHERE ename LIKE 'J%' AND
              deptno IS NOT NULL AND
              deptno NOT IN (SELECT /*+ HASH_AJ */ deptno FROM dept
                             WHERE deptno IS NOT NULL AND
                                   loc = 'DALLAS');

If you wish the anti-join transformation always to occur if the conditions in the previous section are met, set the ALWAYS_ANTI_JOIN initialization parameter to MERGE or HASH. The transformation to the corresponding anti-join type then takes place whenever possible.
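
In the initialization file, the instance-wide setting is simply:

ALWAYS_ANTI_JOIN = HASH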

COMPATIBLE

Sample Value: 8.0.0
This parameter enables new features that may prevent you from falling back to an earlier release. To be sure that you are getting the full benefit of the latest performance features, set this parameter equal to the current release.

Note: Make a full backup before you change the value of this parameter.

Parameters Related to I/O

Tune the following parameters to ensure that I/O operations are optimized for parallel execution.

DB_BLOCK_BUFFERS

When you perform parallel updates and deletes, the buffer cache behavior is very similar to any system running a high volume of updates. For more information see "Tuning the Buffer Cache" on page 14-28

DB_BLOCK_SIZE

Recommended Value: 8K or 16K
The database block size must be set when the database is created. If you are creating a new database, use a large block size.

DB_FILE_MULTIBLOCK_READ_COUNT

Recommended Value: 8, for 8K block size; 4, for 16K block size
This parameter determines how many database blocks are read with a single operating system READ call. Many platforms limit the number of bytes read to 64K, limiting the effective maximum for an 8K block size to 8. Other platforms have a higher limit. For most applications, 64K is acceptable.

In general, use the formula:
DB_FILE_MULTIBLOCK_READ_COUNT = 64K/DB_BLOCK_SIZE.

HASH_MULTIBLOCK_IO_COUNT

Recommended Value: 4
Increasing the value of HASH_MULTIBLOCK_IO_COUNT decreases the number of hash buckets. If a system is I/O bound, you can increase the efficiency of I/O by having larger transfers per I/O.

Because memory for I/O buffers comes from the HASH_AREA_SIZE, larger I/O buffers mean fewer hash buckets. There is a trade-off, however. For large tables (hundreds of gigabytes in size) it is better to have more hash buckets and slightly less efficient I/Os. If you find an I/O bound condition on temporary space during hash join, consider increasing this value.

SORT_DIRECT_WRITES

Recommended Value: AUTO
When this parameter is set to AUTO and SORT_AREA_SIZE is greater than 10 times the buffer size, the buffer cache is bypassed for the writing of sort runs.

Reading through the buffer cache may result in greater path length, excessive memory bus utilization, and LRU latch contention on SMPs. Avoiding the buffer cache can thus provide performance improvement by a factor of 3 or more. It also removes the need to tune buffer cache and DBWR parameters.

If paging is high, it is a symptom that the relationship of memory, users, and parallel server processes is out of balance. To rebalance it, you can reduce the sort or hash area size. You can limit the amount of memory for sorts if SORT_DIRECT_WRITES is set to AUTO but the SORT_AREA_SIZE is small. Then sort blocks will be cached in the buffer cache. Note that SORT_DIRECT_WRITES has no effect on hashing.

See Also: "HASH_AREA_SIZE" on page 19-4

SORT_READ_FAC

Recommended Value: depends on disk speed
This parameter sets the number of blocks read with a single operating system read call, from a temporary tablespace during a sort.

See Also: Oracle8 Server Reference Manual for more information about setting this parameter.

DISK_ASYNC_IO and TAPE_ASYNC_IO

Recommended Value: TRUE
These parameters turn on or off the use of the operating system's asynchronous I/O facility. They allow parallel server processes to overlap I/O requests with processing when performing table scans. If the operating system supports asynchronous I/O, these parameters should be left at the default value of TRUE.

Figure 19-4: Asynchronous Read

Asynchronous operations are currently supported with parallel table scans and hash joins only. They are not supported for sorts, or for serial table scans. In addition, this feature may require operating system specific configuration and may not be supported on all platforms. Check your Oracle platform-specific documentation.

Note: Asynchronous I/O behavior, if not natively available, can be simulated by deploying I/O server processes, using the parameters DBWR_IO_SLAVES, LGWR_IO_SLAVES, BACKUP_DISK_IO_SLAVES, and BACKUP_TAPE_IO_SLAVES. Deployment of these I/O servers is independent of the availability of asynchronous I/O from the platform. Although I/O server processes can be deployed even when asynchronous I/O is available, this is not a recommended practice.

Step 2: Tuning Physical Database Layout for Parallel Execution

This section describes how to tune the physical database layout for optimal performance of parallel execution. To this end, it presents a case study showing how to prepare a simple database for parallel execution.

Types of Parallelism

Different parallel operations use different types of parallelism. The physical database layout you choose can depend on what parallel operations are most prevalent in your application.

The basic unit of parallelism is called a granule. The operation being parallelized (a table scan, table update, or index creation, for example) is divided by Oracle into granules. Parallel server processes execute the operation one granule at a time. The number of granules and their size affect the degree of parallelism that can be used, and how well the work is balanced across parallel server processes.

Block Range Granules

Block range granules are used for most parallel query operations. Block range granules are ranges of physical blocks from the table. Because they are based on physical data addresses, block range granules can be sized to allow better load balancing. Block range granules permit dynamic parallelism that does not depend on static preallocation of tables or indexes. On many MPP systems block range granules are assigned to parallel server processes which have physical proximity to the disks storing the granules. Block range granules are used with global striping.

A slightly different type of block range granule is also used for the insert portion of parallel direct-load inserts into non-partitioned tables, the create table portion of parallel table creation for non-partitioned tables, the index creation portion of the creation of non-partitioned indexes, and the index creation portion of rebuilds of index partitions.

See Also: For MPP systems, see your platform-specific documentation

Partition Granules

Partition granules are used for parallel index range scans and for parallel operations that modify multiple partitions of a partitioned table or index. These operations include parallel update, parallel delete, parallel direct-load insert into partitioned tables, parallel creation of partitioned indexes, and parallel creation of partitioned tables. Partition granules can be used with either local or global striping.

When partition granules are used, a parallel server process works on an entire partition of a table or index. Because partition granules are statically determined when a table or index is created, partition granules do not allow as much flexibility in parallelizing an operation. This means that the degree of parallelism possible might be limited, and that load might not be well balanced across parallel server processes.

When partition granules are used for parallel access to a table or index, it is important that there be a relatively large number of partitions (at least three times the degree of parallelism), so that work can be balanced across the parallel server processes. It is also important that tables or indexes parallelized by partition have their partitions on independent disk drives so that concurrent accesses to different partitions do not interfere and cause contention.

When block range granules are predominantly used for parallel access to a table or index, administrative considerations such as recovery, or using partitions for deleting portions of data, may influence partition layout more than performance considerations. If parallel execution operations frequently take advantage of partition pruning, it is important that the set of partitions accessed by the query be striped over at least as many disks as the degree of parallelism.

See Also: Oracle8 Server Concepts for information on disk striping and partitioning.

A Case Study

The case study in this chapter illustrates how to create, load, index, and analyze a large fact table with partitions, in a typical star schema. The example 120 GB table is named "fact". This case study assumes a 10 CPU shared memory computer with more than 100 disk drives. Thirty 4GB disks will be used for base table data, 10 disks for index, and 30 disks for temporary space. Additional disks are needed for rollback segments, control files, log files, possible staging area for loader flat files, and so on. The fact table will be partitioned by month into 12 logical partitions. Each partition will be spread evenly over 10 disks, so that a scan which accesses few partitions, or a single partition, can still proceed with full parallelism.

Striping Data

To avoid I/O bottlenecks, you should stripe all tablespaces accessed in parallel over at least as many disks as the degree of parallelism. Stripe over at least as many devices as CPUs. This includes tablespaces for tables, tablespaces for indexes, and temporary tablespaces. You must also spread the devices over controllers, I/O channels, and/or internal busses.

Figure 19-5: Striping Objects Over at Least as Many Devices as CPUs

To stripe data during load, use the FILE= clause of parallel loader to load data from multiple load sessions into different files in the tablespace. For any striping to be effective, you must ensure that there are enough controllers and other I/O components to support the bandwidth of parallel data movement into and out of the striped tablespaces.

The operating system or volume manager can perform striping (OS striping), or you can manually perform striping for parallel operations.

Operating System Striping

Operating system striping is usually flexible and easy to manage. It supports multiple users running sequentially as well as single users running in parallel.

Good stripe sizes for table data are 128K, 256K, 1MB, and 5MB, depending on block size and size of the data object. For example, with a medium size database (perhaps 20G), 1MB is a good stripe size. On a very large database (over 100G), 5MB tends to be the best stripe size. These recommended sizes represent a compromise between the requirements of operation performance, backup and restore performance, and load balancing. Setting the stripe size too small will detract from performance, particularly for backup and restore operations.

See Also: Oracle8 Server Concepts for information on disk striping and partitioning.

For MPP systems, see your platform-specific Oracle documentation regarding the advisability of disabling disk affinity when using operating system striping.

Manual Striping

Manual striping can be done on all platforms. This requires more DBA planning and effort to set up, and may yield better performance if only a single operation is running. This may not be the case, however, if many operations are running.

For manual striping add multiple files, each on a separate disk, to each tablespace. However, if each partition is on a separate disk, this is already a form of striping.

Local and Global Striping

Local striping, which applies only to partitioned tables and indexes, is a form of non-overlapping disk-to-partition striping. Each partition has its own set of disks and files, as illustrated in Figure 19-6. There is no overlapping disk access, and no overlapping files.

Advantages of local striping are that if one disk fails it will not affect other partitions, and that you still have some striping even if you have data in only one partition.

A disadvantage of local striping is that you need many more disks to implement it--each partition requires a few disks of its own.

Figure 19-6: Local Striping

Global striping, illustrated in Figure 19-7, does involve overlapping disks and partitions.

Global striping is advantageous if you have partition pruning and only need to access data in one partition. Spreading the data across many disks improves performance for parallel query operations. A disadvantage of global striping is that if one disk fails, all partitions are affected.

Figure 19-7: Global Striping

How to Analyze Striping

You may wish to stripe an object across devices for the sake of one of three goals:

To attain both Goal 1 and Goal 2 (having the table reside on many devices, with the highest possible availability) you can maximize the number of partitions (p) and minimize the number of partitions per tablespace (s).

For highest availability, but least intra-partition parallelism, place each partition in its own tablespace, do not use striped files, and use one file per tablespace. That is, to minimize the number of devices per tablespace (Goal 2), set f and n equal to 1.

Notice that the solution that minimizes availability maximizes intra-partition parallelism. Notice that Goal 3 conflicts with Goal 2 because you cannot simultaneously maximize the formula for Goal 3 and minimize the formula for Goal 2. You must compromise if you are interested in both goals.

To analyze striping, consider the following relationships:

Figure 19-8: Cardinality of Relationships

Figure 19-8 shows the cardinality of the relationships between objects in the storage system. For every table there may be p partitions; there may be s partitions for every tablespace, f files for every tablespace, and m files to n devices (a many-to-many relationship).

Goal 1: To have a table on many devices. This goal is beneficial because full table scans are scalable.

Calculate the number of partitions multiplied by the number of files in the tablespace multiplied by the number of devices per file. Divide this product by the number of partitions that share the same tablespace, multiplied by the number of files that share the same device. The formula is as follows:

Number of devices per table = (p x f x n) / (s x m)

You can spread the table over t devices by having t partitions, with every partition in its own tablespace, every tablespace having one file, and these files not striped:

t x 1 x 1 / (1 x 1) = t devices

If the table is not partitioned, but is in one tablespace with one file, that file should be striped over n devices:

1 x 1 x n devices

At maximum, with t partitions, every partition in its own tablespace, f files in each tablespace, and each tablespace on a striped device:

t x f x n devices
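
As an illustrative check, the case study later in this chapter has p = 12 partitions, one tablespace per partition (s = 1), f = 10 files per tablespace, each file on a single device (n = 1), and 4 files per device (m = 4), giving:

12 x 10 x 1 / (1 x 4) = 30 devices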

Goal 2: To achieve high availability. This translates into having the tablespace on few devices.

Availability is maximized when:

f = n = m = 1

and p is much greater than 1

Goal 3: To achieve intra-partition parallelism. This is beneficial because partition scans are scalable. To achieve this, place each partition on many devices.

Partitions can reside in a tablespace that can have many files. There could be either

Striping and Media Recovery

Striping affects media recovery. Loss of a disk usually means loss of access to all objects that were stored on that disk. If all objects are striped over all disks, then loss of any disk takes down the entire database. Furthermore, all database files may have to be restored from backups, even if each file has only a small fraction actually stored on the failed disk.

Often, the same OS subsystem that provides striping also provides mirroring. With the declining price of disks, mirroring can provide an effective supplement to backups and log archival--but not a substitute for them. Mirroring can help your system recover from a device failure more quickly than with a backup, but is not as robust. Mirroring does not protect against software faults and other problems that an independent backup would protect your system against. Mirroring can be used effectively if you are able to reload read-only data from the original source tapes. If you do have a disk failure, restoring the data from the backup could involve lengthy downtime, whereas restoring it from a mirrored disk would enable your system to get back online quickly.

Even cheaper than mirroring is RAID technology, which avoids full duplication in favor of more expensive write operations. For read-mostly applications, this may suffice.

Note: RAID5 technology is particularly slow on write operations. This may affect your database restore time to a point that RAID5 performance is unacceptable.

See Also: For a discussion of manually striping tables across datafiles, refer to "Striping Disks" on page 15-23.

For a discussion of media recovery issues, see "Backup and Recovery of the Data Warehouse" on page 6-8.

For more information about automatic file striping and tools to use to determine I/O distribution among your devices, refer to your operating system documentation.

Partitioning Data

Partitioned tables and indexes can improve the performance of operations in a data warehouse. Partitioned tables and indexes allow at least the same parallelism as non-partitioned tables and indexes. In addition, partitions of a table can be pruned based on predicates and values in the partitioning column. Range scans on partitioned indexes can be parallelized, and insert, update and delete operations can be parallelized.

Partitioned tables and indexes facilitate administrative operations by allowing them to operate on subsets of data. For example, a new partition can be added, an existing partition can be reorganized, or an old partition can be dropped with less than a second of interruption to a read-only application.

When to Use Partitioned Tables in a Data Warehouse

Consider using partitioned tables in a data warehouse when:

Partitioned Table Trade-offs

The case study in this chapter illustrates how to use partitioned tables for maximum operation performance, not necessarily for minimum downtime in the event of disk failure.

Here, each partition is spread across one third of the disks in the tablespace so that loss of a single disk causes 4 out of 12 partitions to become unavailable. Alternatively, partitions may be assigned to disks such that a disk failure takes out a single partition, and surviving partitions remain available. The trade-off is that operations against a single partition may not scale due to limited I/O bandwidth.

To avoid I/O bottlenecks when not all partitions are being scanned (because some have been eliminated), each partition should be spread over a number of devices. On parallel server systems, those devices should be spread over multiple nodes.

See Also: Chapter 19, "Tuning Parallel Execution"

Your operating system specific Oracle documentation.

Determining the Degree of Parallelism

When dealing with degree of parallelism, you need to distinguish between two types of granularity: block range granules and partition granules.

If the data being accessed by a parallel operation (after partition pruning is applied) is spread over at least as many disks as the degree of parallelism, then most operations will be CPU-bound and a degree of parallelism equal to the number of CPUs is appropriate. A lower degree of parallelism may be appropriate if two or more concurrent parallel operations frequently access the same data. Operations that tend to be I/O bandwidth bound can benefit from a higher degree of parallelism, especially if the data is spread over more disks. Operations that tend to be I/O bandwidth bound are:

Parallel operations which perform random I/O access (such as index maintenance of parallel update and delete operations) can saturate the I/O subsystem with a high number of I/Os, very much like an OLTP system with high concurrency. To ease this I/O problem, the data should be spread among more devices and disk controllers. Increasing the degree of parallelism will not help.

Oracle automatically computes the default parallel degree of a table as the minimum of the number of disks storing the table and the number of CPUs available. If, as recommended, you have spread objects over at least as many disks as you have CPUs, the default parallelism will always be the number of CPUs. Warehouse operations are typically CPU bound; thus the default is a good choice, especially if you are using the asynchronous read-ahead feature. Consider explicitly setting the parallel degree to 2 * CPUs if you are performing synchronous reads, or if you have very fast CPUs and slower I/O. Consider reducing parallelism for objects that are frequently accessed by two or more concurrent parallel operations.

Should you find that some operations are I/O bound with the default parallelism, and you have more disks than CPUs, override the usual parallelism with a hint that increases parallelism up to the number of disks, or until the CPUs become saturated.

Using Parallel Load

This section describes how to load a large table, such as the fact table in a decision support database. This example uses SQL*Loader to explicitly stripe data over 30 disks.

The table is partitioned by date into twelve partitions. To facilitate backup and recovery each partition is stored in its own tablespace. To allow intra-partition parallelism when queries restrict data access by partition pruning, each tablespace is striped across 10 of the 30 disks. Each disk has been further subdivided using an OS utility into 4 OS files with names like /dev/D1.1, /dev/D1.2, ... , /dev/D30.4.

There are four tablespaces allocated on each group of 10 disks. To better balance I/O and parallelize table space creation (because Oracle writes each block in a datafile when it is added to a tablespace), it is best if each of the four tablespaces on each group of 10 disks has its first datafile on a different disk. Thus the first tablespace has /dev/D1.1 as its first datafile, the second tablespace has /dev/D4.2 as its first datafile, and so on.

Figure 19-9: Datafile Layout for Parallel Load Example

Step 1: Create the Tablespaces and Add Datafiles in Parallel

Below is the command to create a tablespace named "Tsfacts1". Other tablespaces are created with analogous commands. On a 10-CPU machine, it should be possible to run all 12 CREATE TABLESPACE commands together. Alternatively, it might be better to run them in two batches of 6 (two from each of the three groups of disks).

CREATE TABLESPACE Tsfacts1
DATAFILE '/dev/D1.1'  SIZE 1024MB REUSE
DATAFILE '/dev/D2.1'  SIZE 1024MB REUSE
DATAFILE '/dev/D3.1'  SIZE 1024MB REUSE
DATAFILE '/dev/D4.1'  SIZE 1024MB REUSE
DATAFILE '/dev/D5.1'  SIZE 1024MB REUSE
DATAFILE '/dev/D6.1'  SIZE 1024MB REUSE
DATAFILE '/dev/D7.1'  SIZE 1024MB REUSE
DATAFILE '/dev/D8.1'  SIZE 1024MB REUSE
DATAFILE '/dev/D9.1'  SIZE 1024MB REUSE
DATAFILE '/dev/D10.1' SIZE 1024MB REUSE
DEFAULT STORAGE (INITIAL 100MB NEXT 100MB PCTINCREASE 0);

CREATE TABLESPACE Tsfacts2
DATAFILE '/dev/D4.2'  SIZE 1024MB REUSE
DATAFILE '/dev/D5.2'  SIZE 1024MB REUSE
DATAFILE '/dev/D6.2'  SIZE 1024MB REUSE
DATAFILE '/dev/D7.2'  SIZE 1024MB REUSE
DATAFILE '/dev/D8.2'  SIZE 1024MB REUSE
DATAFILE '/dev/D9.2'  SIZE 1024MB REUSE
DATAFILE '/dev/D10.2' SIZE 1024MB REUSE
DATAFILE '/dev/D1.2'  SIZE 1024MB REUSE
DATAFILE '/dev/D2.2'  SIZE 1024MB REUSE
DATAFILE '/dev/D3.2'  SIZE 1024MB REUSE
DEFAULT STORAGE (INITIAL 100MB NEXT 100MB PCTINCREASE 0);

...

CREATE TABLESPACE Tsfacts4
DATAFILE '/dev/D10.4' SIZE 1024MB REUSE
DATAFILE '/dev/D1.4'  SIZE 1024MB REUSE
DATAFILE '/dev/D2.4'  SIZE 1024MB REUSE
DATAFILE '/dev/D3.4'  SIZE 1024MB REUSE
DATAFILE '/dev/D4.4'  SIZE 1024MB REUSE
DATAFILE '/dev/D5.4'  SIZE 1024MB REUSE
DATAFILE '/dev/D6.4'  SIZE 1024MB REUSE
DATAFILE '/dev/D7.4'  SIZE 1024MB REUSE
DATAFILE '/dev/D8.4'  SIZE 1024MB REUSE
DATAFILE '/dev/D9.4'  SIZE 1024MB REUSE
DEFAULT STORAGE (INITIAL 100MB NEXT 100MB PCTINCREASE 0);

...

CREATE TABLESPACE Tsfacts12
DATAFILE '/dev/D30.4' SIZE 1024MB REUSE
DATAFILE '/dev/D21.4' SIZE 1024MB REUSE
DATAFILE '/dev/D22.4' SIZE 1024MB REUSE
DATAFILE '/dev/D23.4' SIZE 1024MB REUSE
DATAFILE '/dev/D24.4' SIZE 1024MB REUSE
DATAFILE '/dev/D25.4' SIZE 1024MB REUSE
DATAFILE '/dev/D26.4' SIZE 1024MB REUSE
DATAFILE '/dev/D27.4' SIZE 1024MB REUSE
DATAFILE '/dev/D28.4' SIZE 1024MB REUSE
DATAFILE '/dev/D29.4' SIZE 1024MB REUSE
DEFAULT STORAGE (INITIAL 100MB NEXT 100MB PCTINCREASE 0);

Extent sizes should be multiples of the multiblock read size, where:

DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT = multiblock read size

Note that INITIAL and NEXT should normally be set to the same value. In the case of parallel load, make the extent size large enough to keep the number of extents reasonable, and to avoid excessive overhead and serialization due to data dictionary bottlenecks. When PARALLEL=TRUE is used for parallel loader, the INITIAL extent is not used. In this case you can override the INITIAL extent size specified in the tablespace default storage clause with the value that you specify in the loader control file (for example, 64K).

Tables or indexes can have an unlimited number of extents, provided you have set the COMPATIBLE initialization parameter and use the MAXEXTENTS UNLIMITED keyword on the CREATE or ALTER statement for the tablespace or object. In practice, however, a limit of 10,000 extents per object is reasonable. Because a table or index can have an unlimited number of extents, set the PCTINCREASE parameter to zero in order to have extents of equal size.

Note: It is not desirable to allocate extents faster than about 2 or 3 per minute. See "ST Enqueue" on page 19-54 for more information. Thus, each process should get an extent that lasts for 3 to 5 minutes; that is at least 50MB. The largest possible extent size for a 4GB disk evenly divided into 4 OS files is 1GB. 100MB extents should work nicely. Each partition will have 100 extents. The default storage parameters can be customized for each object created in the tablespace, if needed.

Step 2: Create the Partitioned Table

We create a partitioned table with 12 partitions, each in its own tablespace. The table contains multiple dimensions and multiple measures. The partitioning column is named "dim_2" and is a date. There are other columns as well.

CREATE TABLE fact (dim_1 NUMBER, dim_2 DATE, ...
        meas_1 NUMBER, meas_2 NUMBER, ... )
PARALLEL
PARTITION BY RANGE (dim_2)
        (PARTITION jan95 VALUES LESS THAN ('02-01-1995') TABLESPACE TSfacts1,
         PARTITION feb95 VALUES LESS THAN ('03-01-1995') TABLESPACE TSfacts2,
         ...
         PARTITION dec95 VALUES LESS THAN ('01-01-1996') TABLESPACE TSfacts12);

Step 3: Load the Partitions in Parallel

This section describes four alternative approaches to loading partitions in parallel.

The different approaches to loading help you manage the ramifications of the PARALLEL=TRUE keyword of SQL*Loader, which controls whether or not individual partitions are loaded in parallel. The PARALLEL keyword entails restrictions such as the following:

Note, however, that regardless of the setting of this keyword, if you have one loader process per partition, you are still effectively loading into the table in parallel.

Case 1

In this approach, assume 12 input files that are partitioned in the same way as your table. The DBA has 1 input file per partition of the table to be loaded. The DBA starts 12 SQL*Loader sessions in parallel, entering statements like these:

SQLLDR DATA=jan95.dat DIRECT=TRUE CONTROL=jan95.ctl

SQLLDR DATA=feb95.dat DIRECT=TRUE CONTROL=feb95.ctl

. . . 

SQLLDR DATA=dec95.dat DIRECT=TRUE CONTROL=dec95.ctl

Note that the keyword PARALLEL=TRUE is not set. A separate control file per partition is necessary because the control file must specify the partition into which the loading should be done. It contains a statement such as:

LOAD INTO fact partition(jan95)
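
For illustration, a complete control file for the January partition might look like the following sketch; the column list, field delimiter, and date mask are hypothetical, since they depend on your data:

LOAD DATA
INFILE 'jan95.dat'
APPEND
INTO TABLE fact PARTITION (jan95)
FIELDS TERMINATED BY ','
(dim_1, dim_2 DATE 'MM-DD-YYYY', meas_1, meas_2)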

An advantage of this approach is that local indexes are maintained by SQL*Loader. You still get parallel loading, but on a partition level--without the restrictions of the PARALLEL keyword.

A disadvantage is that you must partition the input manually.

Case 2

In another common approach, assume that there are an arbitrary number of input files, which are not partitioned in the same way as the table. The DBA can adopt a strategy of performing parallel load for each input file individually. Thus if there are 7 input files, the DBA can start 7 SQL*Loader sessions, using statements like the following:

SQLLDR DATA=file1.dat DIRECT=TRUE PARALLEL=TRUE

Oracle will partition the input data so that it goes into the correct partitions. In this case all the loader sessions can share the same control file, so there is no need to mention it in the statement.

The keyword PARALLEL=TRUE must be used because each of the 7 loader sessions can write into every partition. (In case 1, every loader session would write into only 1 partition because the data was already partitioned outside Oracle.) Hence all the PARALLEL keyword restrictions are in effect.

In this case Oracle will attempt to spread the data evenly across all the files in each of the 12 tablespaces--however an even spread of data is not guaranteed. Moreover, there could be I/O contention during the load when the loader processes are attempting to simultaneously write to the same device.

Case 3

Case 3 is illustrated in the example. In this case the DBA wants precise control of the load. To achieve this the DBA must partition the input data in the same way as the datafiles are partitioned in Oracle.

This example uses 10 processes loading into 30 disks. To accomplish this, the DBA must split the input into 120 files beforehand. The 10 processes will load the first partition in parallel on the first 10 disks, then the second partition in parallel on the second 10 disks, and so on through the twelfth partition. The DBA runs the following commands concurrently as background processes:

SQLLDR DATA=jan95.file1.dat DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D1.1 

...

SQLLDR DATA=jan95.file10.dat DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D10.1 

WAIT;

...

SQLLDR DATA=dec95.file1.dat DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D30.4

...

SQLLDR DATA=dec95.file10.dat DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D29.4 

For Oracle Parallel Server, divide the loader sessions evenly among the nodes. The data file being read should always reside on the same node as the loader session. NFS mounting the data file on a remote node is not an optimal approach.

The keyword PARALLEL=TRUE must be used because multiple loader sessions can write into the same partition. Hence all the restrictions entailed by the PARALLEL keyword are in effect. An advantage of this approach, however, is that it guarantees that all of the data will be precisely balanced, exactly reflecting your partitioning.

Note: Although this example shows parallel load used with partitioned tables, the two features can be used independent of one another.

Case 4

For you to use this approach, all of your partitions must be in the same tablespace. You need to have the same number of input files as datafiles in the tablespace, but you do not need to partition the input the same way in which the table is partitioned.

For example, if all 30 devices were in the same tablespace, then you would arbitrarily partition your input data into 30 files, then start 30 SQL*Loader sessions in parallel. The statement starting up the first session would be like the following:

SQLLDR DATA=file1.dat DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D1

. . .

SQLLDR DATA=file30.dat DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D30

The advantage of this approach is that, like Case 3, you have control over the exact placement of datafiles, since you use the FILE keyword. However, you are not required to partition the input data by value: Oracle does that.

A disadvantage is that it requires all the partitions to be in the same tablespace; this minimizes availability.

Setting Up Temporary Tablespaces for Parallel Sort and Hash Join

For optimal space management performance you can use dedicated temporary tablespaces. As with the TSfacts tablespace, we first add a single datafile and later add the remainder in parallel.

CREATE TABLESPACE TStemp DATAFILE '/dev/D31'
        SIZE 4096MB REUSE
DEFAULT STORAGE (INITIAL 10MB NEXT 10MB PCTINCREASE 0)
TEMPORARY;

Size of Temporary Extents

Temporary extents are all the same size, since the server ignores the PCTINCREASE and INITIAL settings and only uses the NEXT setting for temporary extents. This helps to avoid fragmentation. As a general rule, temporary extents should be smaller than permanent extents, and in the range of 1MB to 10MB.

Temporary extents should be smaller than permanent extents because there are more demands for temporary space, and parallel processes or other operations running concurrently must share the temporary tablespace. Once you allocate an extent it is yours for the duration of your operation. If you allocate a large extent but only need to use a small amount of space, the unused space in the extent is tied up.

At the same time, temporary extents should be large enough that processes do not have to spend all their time waiting for space. Although temporary tablespaces use less overhead than permanent tablespaces when allocating and freeing a new extent, obtaining a new temporary extent is not completely free of overhead. There is still the overhead of acquiring a latch and searching through the SGA structures, as well as SGA space consumption for the sort extent pool. Also, if extents are too small, SMON may take a long time dropping old sort segments when instances come up.

Operating System Striping of Temporary Tablespaces

Operating system striping is an alternative technique you can use with temporary tablespaces. Media recovery, however, offers subtle challenges for large temporary tablespaces. It does not make sense to mirror, use RAID, or back up a temporary tablespace. If you lose a disk in an OS striped temporary space, you will probably have to drop and recreate the tablespace. This could take several hours for our 120 GB example. With Oracle striping, simply remove the bad disk from the tablespace. For example, if /dev/D50 fails, enter:

ALTER DATABASE DATAFILE '/dev/D50' RESIZE 1K;

ALTER DATABASE DATAFILE '/dev/D50' OFFLINE;

Because the dictionary sees the size as 1K, which is less than the extent size, the bad file will never be accessed. Eventually, you may wish to recreate the tablespace.

Be sure to make your temporary tablespace available for use:

ALTER USER scott TEMPORARY TABLESPACE TStemp; 

See Also: For MPP systems, see your platform-specific documentation regarding the advisability of disabling disk affinity when using operating system striping.

Creating Indexes in Parallel

Indexes on the fact table can be partitioned or non-partitioned. Local partitioned indexes provide the simplest administration. The only disadvantage is that a search of a local non-prefixed index requires searching all index partitions.

The considerations for creating index tablespaces are similar to those for creating other tablespaces. Operating system striping with a small stripe width is often a good choice, but to simplify administration it is best to use a separate tablespace for each index. If it is a local index you may want to place it in the same tablespace as the partition to which it corresponds. If each partition is striped over a number of disks, the individual index partitions can be rebuilt in parallel for recovery. Alternatively, operating system mirroring can be used. For these reasons the NOLOGGING option of the index creation statement may be attractive for a data warehouse.

Tablespaces for partitioned indexes should be created in parallel in the same manner as tablespaces for partitioned tables.

Partitioned indexes are created in parallel using partition granules, so the maximum degree of parallelism possible is the number of granules. Local index creation has less inherent parallelism than global index creation, and so may run faster if a higher degree of parallelism is used. The following statement could be used to create a local index on the fact table.

CREATE INDEX I ON fact (dim_1, dim_2, dim_3) LOCAL
(PARTITION jan95 TABLESPACE Tsidx1,
 PARTITION feb95 TABLESPACE Tsidx2,
 ...)
PARALLEL (DEGREE 12) NOLOGGING;

To back up or restore January data, you need only manage tablespace Tsidx1.

See Also: Oracle8 Server Concepts for a discussion of partitioned indexes.

Additional Considerations for Parallel DML

When parallel insert, update, or delete operations are to be performed on a data warehouse, some additional considerations apply to the physical database design. These are covered in the following sections.

Using Local and Global Striping

Parallel DML works mostly on partitioned tables. It does not use asynchronous I/O and may generate a high number of random I/O requests during index maintenance of parallel update and delete operations. Local striping is most efficient in reducing I/O contention of local index maintenance, because one server process will only go to its own set of disks and disk controllers. Local striping also increases availability in the event of one disk failing.

For global indexes (partitioned or non-partitioned), the best solution is to globally stripe the index across many disks and disk controllers to distribute the number of I/Os.

Increasing INITRANS and MAXTRANS

If you have global indexes, a global index segment and global index blocks will be shared by server processes of the same PDML statement. Even if the operations are not performed against the same row, the server processes may share the same index blocks. Each server transaction needs one transaction entry in the index block header before it can make changes to a block. Therefore you should set INITRANS (the initial number of transactions allocated within each data block) to a large value, such as the maximum degree of parallelism against this index. Leave MAXTRANS, the maximum number of concurrent transactions that can update a data block, at its default value, which is the maximum your system can support (not to exceed 255).

If you run degree 10 parallelism against a table with a global index, all 10 server processes might attempt to change the same global index block. MAXTRANS must therefore be at least 10 so that all the server processes can change the block at the same time. If MAXTRANS is not large enough, the PDML operation fails.
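For example, a rough sketch of setting these attributes (the index name glob_idx and the degree of 12 are illustrative assumptions):

CREATE INDEX glob_idx ON facts (dim_1)
        INITRANS 12 MAXTRANS 255
        PARALLEL (DEGREE 12);

-- Or adjust the settings on an existing index:
ALTER INDEX glob_idx INITRANS 12 MAXTRANS 255;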

Limitation on Available Number of Transaction Free Lists

For UPDATE and DELETE operations, each server process may require its own transaction free list. The PDML degree of parallelism is thus effectively limited by the smallest number of transaction free lists available on any of the global indexes which the DML statement must maintain. Thus, if there are two global indexes, one with 50 transaction free lists and one with 30 transaction free lists, the degree of parallelism is limited to 30.

Once the segment has been created, the number of process and transaction free lists is fixed and cannot be altered. If you specify a large number of process free lists in the segment header, you may find that this limits the number of transaction free lists that are available. You can ease this limitation the next time you re-create the segment by specifying fewer process free lists, leaving more room for transaction free lists in the segment header.

Note that the FREELISTS parameter of the STORAGE clause is used to set the number of process free lists. By default, no process free lists are created.

The default number of transaction free lists depends on the block size. For example, if the number of process free lists is not set explicitly, a 4K block has about 80 transaction free lists by default. The minimum number of transaction free lists is 25.

Using Multiple Archivers

PDDL and PDML operations can generate a large amount of redo. A single ARCH process archiving these redo logs might not be able to keep up. To avoid this problem, you can spawn multiple archiver processes, either manually or by using a job queue.
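For example, an operator session or a job submitted to a job queue could periodically issue a manual archive command such as the following (the scheduling mechanism itself is left to you):

ALTER SYSTEM ARCHIVE LOG ALL;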

DBWR Workload

Parallel DML operations dirty a large number of data, index, and undo blocks in the buffer cache during a short period of time. If you see a high number of "free_buffer_waits" in the V$SYSSTAT view, you should tune the DBWR process.
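A minimal check, assuming the free-buffer statistic names on your release match this pattern:

SELECT name, value
FROM V$SYSSTAT
WHERE name LIKE '%free buffer%';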

See Also: "Tuning the Redo Log Buffer" on page 14-8

[NO]LOGGING Option

The [NO]LOGGING option applies to tables, partitions, tablespaces, and indexes. Virtually no log is generated for certain operations (such as direct-load insert) if the NOLOGGING option is used. The NOLOGGING attribute is not specified at the INSERT statement level, but is instead specified when using the ALTER or CREATE command for the table, partition, index, or tablespace.

For INSERT, when a table or index has NOLOGGING set, neither parallel nor serial direct-load insert operations generate undo or redo logs. Processes running with the NOLOGGING option set run faster because no redo is generated. However, after a NOLOGGING operation against a table, partition, or index, if there is a media failure before a backup is taken, then all tables, partitions, and indexes that have been modified may be subject to corruption.

Note: Direct-load insert operations (except for dictionary updates) never generate undo logs. The NOLOGGING attribute does not affect undo, but only redo. To be precise, NOLOGGING allows the direct-load insert operation to generate a negligible amount of redo (range-invalidation redo, as opposed to full image redo).
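As a sketch of the mechanics (the staging table facts_staging and the degree of 12 are assumptions): set the attribute on the object, enable parallel DML in the session, then run the direct-load insert.

ALTER TABLE facts NOLOGGING;
ALTER SESSION ENABLE PARALLEL DML;
INSERT /*+ APPEND PARALLEL(facts,12) */ INTO facts
SELECT * FROM facts_staging;
COMMIT;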

For backward compatibility, [UN]RECOVERABLE is still supported as an alternate keyword with the CREATE TABLE statement in Oracle8 Server, release 8.0. This alternate keyword may not be supported, however, in future releases.

The tablespace-level logging clause specifies the default logging attribute for all tables, indexes, and partitions created in the tablespace. When an existing tablespace's logging attribute is changed by the ALTER TABLESPACE statement, all tables, indexes, and partitions created after the ALTER statement have the new logging attribute; existing objects do not change their logging attributes. The tablespace-level logging attribute can be overridden by specifications at the level of the table, index, or partition.

The default logging attribute is LOGGING. However, if the database is in NOARCHIVELOG mode (by issuing ALTER DATABASE NOARCHIVELOG), then all operations that can be done without logging will not generate logs, regardless of the specified logging attribute.

See Also: Oracle8 Server SQL Reference

Step 3: Analyzing Data

After the data is loaded and indexed, analyze it. It is very important to analyze the data after any DDL changes or major DML changes. The ANALYZE command does not execute in parallel against a single table or partition. However, many different partitions of a partitioned table can be analyzed in parallel. The stored procedure DBMS_UTILITY.ANALYZE_PART_OBJECT can be used to submit jobs to a job queue in order to analyze a partitioned table in parallel.

Note: Cost-based optimization is always used with parallel execution and with partitioned tables. You must therefore perform ANALYZE at the partition level with partitioned tables and with parallel query.

Queries with many joins are quite sensitive to the accuracy of the statistics. Use the COMPUTE option of the ANALYZE command if possible (it may take quite some time and a large amount of temporary space). If you must use the ESTIMATE option, sample as large a percentage as possible (for example, 10%). Use histograms for data which is not uniformly distributed. Note that a great deal of data falls into this classification.
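For example (the partition name jan95 is from this chapter's index example; the sample percentage and histogram bucket count are illustrative):

ANALYZE TABLE facts PARTITION (jan95) COMPUTE STATISTICS;
ANALYZE TABLE facts ESTIMATE STATISTICS SAMPLE 10 PERCENT;
ANALYZE TABLE facts COMPUTE STATISTICS FOR COLUMNS dim_2 SIZE 75;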

When you analyze a table, the indexes that are defined on that table are also analyzed. All partitions of facts (including their indexes) can be analyzed in parallel; the two-pass example below shows one approach, using job queues.

It is worthwhile to compute or estimate statistics with a larger sample size for the indexed columns and the indexes themselves than for the measure data. The measure data is not used as much: most of the predicates and critical optimizer information come from the dimensions. A DBA or application designer should know which columns are most frequently used in predicates.

For example, you might analyze the data in two passes. In the first pass you could obtain some statistics by analyzing one percent of the data. Run the following command to submit analysis commands to the job queues:

EXECUTE DBMS_UTILITY.ANALYZE_PART_OBJECT(OBJECT_NAME=>'facts',
        COMMAND_TYPE=>'E',
        SAMPLE_CLAUSE=>'SAMPLE 1 PERCENT');

In a second pass, you could refine statistics for the indexed columns and the index (but not the non-indexed columns):

EXECUTE DBMS_UTILITY.ANALYZE_PART_OBJECT(OBJECT_NAME=>'facts',
        COMMAND_TYPE=>'C',
        COMMAND_OPT=>'FOR ALL INDEXED COLUMNS SIZE 1');

EXECUTE DBMS_UTILITY.ANALYZE_PART_OBJECT(OBJECT_NAME=>'facts',
        COMMAND_TYPE=>'C',
        COMMAND_OPT=>'FOR ALL INDEXES');

The result will be a faster plan because you have targeted more important information. You are spending more resources to get good statistics on high value columns (indexes and join columns), and getting baseline statistics for the rest of the data.

Understanding Parallel Execution Performance Issues

The Formula for Memory, Users, and Parallel Server Processes

Key to the tuning of parallel operations is an understanding of the relationship between memory requirements, number of users (processes) a system can support, and maximum number of parallel server processes. The goal is to obtain the dramatic performance enhancement made possible by parallelizing certain operations, and by using hash joins rather than sort merge joins. This performance goal must often be balanced with the need to support multiple users.

In considering the maximum number of processes a system can support, it is useful to divide the processes into three classes, based on their memory requirements. The maximum number of processes which can fit in memory can then be analyzed as follows:

Figure 19-10: Formula for Memory/Users/Server Relationship

sga_size
+ (# low_memory_processes * low_memory_required)
+ (# medium_memory_processes * medium_memory_required)
+ (# high_memory_processes * high_memory_required)
_________________________________
total memory required

Setting Buffer Pool Size for Parallel Operations

Table 19-2 defines high, medium, and low memory processes. The formula whereby you can calculate the maximum number of processes your system can support (referred to here as max_processes) is:


#low_memory_processes
+ #medium_memory_processes
+ # high_memory_processes
____________________
max_processes

In general, if max_processes is much bigger than the number of users, you can consider running parallel operations. If max_processes is considerably less than the number of users, you must consider other alternatives, such as those described in the following section.

Note that, with the exception of parallel update and delete, parallel operations rarely benefit from larger buffer pool sizes. One exception is when the buffer pool can be made large enough to hold the inner table or index of a nested loop join. Parallel update and delete do benefit from a larger buffer pool when they update indexes: index updates have a random access pattern, and I/O activity is reduced if an entire index or its interior nodes can be kept in the buffer pool.

Table 19-2: Memory Requirements for Three Classes of Process
Low Memory Processes: 100K to 1MB  

These processes include table scans; index lookups; index nested loop joins; single row aggregates (such as sum or average with no GROUP BYs, or very few groups); sorts that return only a few rows; and direct loading.

This class of DSS process is similar to OLTP processes in the amount of memory required. Process memory could be as low as a few hundred kilobytes of fixed overhead. You could potentially support thousands of users performing this kind of operation. You can take this requirement even lower by using the multi-threaded server, and support even more users.  

Medium Memory Processes: 1MB to 10MB  

This class of process includes large sorts; sort merge join; GROUP BYs or ORDER BYs returning a large number of rows; parallel insert operations which involve index maintenance; and index creation.

These processes require the fixed overhead needed by a low memory process, plus one or more sort areas, depending on the operation. For example, a typical sort merge join would sort both its inputs--resulting in two sort areas. Group by or order by with many groups or rows also requires sort areas.

Look at the EXPLAIN PLAN output for the operation to identify the number and type of joins, and the number and type of sorts. Optimizer statistics in the plan show the size of the operations. When planning joins, remember that you do have a number of choices.  

High Memory Processes: 10MB to 100MB  

High memory processes include one or more hash joins; or a combination of one or more hash joins with large sorts.

These processes require the fixed overhead needed by a low memory process, plus hash area. The hash area size required might range from 8MB to 32MB, and you might need two of them. If you are performing 2 or more serial hash joins, each process uses 2 hash areas. In a parallel operation, since each parallel server process does at most 1 hash join at a time, you would need 1 hash area size per server.

In summary, the amount of hash join memory for an operation equals parallel degree multiplied by hash area size, multiplied by the minimum of either 2, or the number of hash joins in the operation.  

Note: The process memory requirements of PDML and PDDL operations also depend upon the query portion of the statement.

See Also: Oracle8 Server Concepts for a comparison of hash joins and sort merge joins.

"Tuning the Buffer Cache" on page 14-28 for more discussion of setting buffer pool size.

How to Balance the Formula

You can use the following techniques to balance the memory/user/server formula given in Figure 19-10:

Oversubscribe, with Attention to Paging

You can permit the potential workload to exceed the limits recommended in the formula. Total memory required, minus the SGA size, can be multiplied by a factor of 1.2, to allow for 20% oversubscription. Thus, if you have 1G of memory, you might be able to support 1.2G of demand: the other 20% could be handled by the paging system.

Your system may be able to perform acceptably even if oversubscribed by 60%, if on average not all of the processes are performing hash joins concurrently. Users might then try to use more than the available memory, so you must monitor paging activity in such a situation. If paging goes up dramatically, consider another alternative.

On average, no more than 5% of the time should be spent simply waiting in the operating system on page faults. More wait time than this indicates that the paging subsystem is I/O bound. Use your operating system monitor to check wait time: the sum of time waiting and time running equals 100%. If you are running close to 100% CPU, then you are not waiting. If you are waiting, it should not be on account of paging.

If wait time for paging devices exceeds 5%, it is a strong indication that you must reduce memory requirements. This could mean reducing the memory required for each class of process, or reducing the number of processes in memory-intensive classes. It could mean adding memory. Or it could indicate an I/O bottleneck in the paging subsystem that you could resolve by striping.

Note: You must verify that a particular degree of oversubscription will be viable on your system by monitoring the paging rate and making sure you are not spending more than a very small percent of the time waiting for the paging subsystem.

Reduce the Number of Memory-intensive Processes

Adjusting the Degree of Parallelism. Not only can you adjust the number of operations that run in parallel, but you can also adjust the degree of parallelism with which operations run. To do this you would issue an ALTER TABLE statement with a PARALLEL clause, or use a hint. See the Oracle8 Server SQL Reference for more information.
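For example (the table name and degree are illustrative):

ALTER TABLE facts PARALLEL (DEGREE 4);

SELECT /*+ PARALLEL(facts, 4) */ COUNT(*) FROM facts;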

You can limit the parallel pool by reducing the value of PARALLEL_MAX_SERVERS. This places a system-level limit on the total amount of parallelism, and is easy to administer. More processes will then be forced to run in serial mode.

Scheduling Parallel Jobs. Rather than reducing parallelism for all operations, you may be able to schedule large parallel batch jobs to run with full parallelism one at a time, rather than concurrently. Queries at the head of the queue would have a fast response time, those at the end of the queue would have a slow response time. Queueing jobs is thus another way to reduce the number of processes but not reduce parallelism; its disadvantage, however, is a certain amount of administrative overhead.

Decrease DSS Memory per Process

Note: The following discussion focuses upon the relationship of HASH_AREA_SIZE to memory, but all the same considerations apply to SORT_AREA_SIZE. The lower bound of SORT_AREA_SIZE, however, is not as critical as the 8 MB recommended minimum HASH_AREA_SIZE.

If every operation performs a hash join and a sort, the high memory requirement limits the number of processes you can have. To allow more users to run concurrently you may need to reduce the DSS process memory.

Moving Processes from High to Medium Memory Requirements. You can move a process from the high-memory class to moderate-memory by changing from hash join to merge join. You can use initialization parameters to limit available memory and thus force the optimizer to stay within certain bounds.

To do this, you can reduce HASH_AREA_SIZE to well below the recommended minimum (for example, to 1-2MB). Then you can let the optimizer choose sort merge join more often (as opposed to telling the optimizer never to use hash joins). In this way, hash join can still be used for small tables: the optimizer has a memory budget within which it can make decisions about which join method to use.
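A minimal sketch of such a session-level limit (the 2MB value is illustrative):

ALTER SESSION SET HASH_AREA_SIZE = 2097152;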

Remember that the recommended parameter values provide the best response time. If you severely limit these values you may see a significant effect on response time.

Moving Processes from High or Medium Memory Requirements to Low Memory Requirements. If you need to support thousands of users, you must create access paths such that operations do not touch much data. Decrease the demand for index joins by creating indexes and/or summary tables. Decrease the demand for GROUP BY sorting by creating summary tables and encouraging users and applications to reference summaries rather than detailed data. Decrease the demand for ORDER BY sorts by creating indexes on frequently sorted columns.

Decrease Parallelism for Multiple Users

In general there is a trade-off between parallelism for fast single-user response time and efficient use of resources for multiple users. For example, a system with 2G of memory and a HASH_AREA_SIZE of 32MB can support about 60 parallel server processes; a 10 CPU machine can thus support up to 3 concurrent parallel operations (2 * 10 * 3 = 60). In order to support 12 concurrent parallel operations, the DBA could override the default parallelism (reduce it), decrease HASH_AREA_SIZE, buy more memory, or use some combination of these three strategies. For example, the DBA could ALTER TABLE t PARALLEL (DEGREE 5) for all parallel tables t, set HASH_AREA_SIZE to 16MB, and increase PARALLEL_MAX_SERVERS to 120. By reducing the memory of each parallel server by a factor of 2, and reducing the parallelism of a single operation by a factor of 2, the system can accommodate 2 * 2 = 4 times more concurrent operations.
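In SQL and initialization-parameter terms, the three changes from this example might look like the following sketch:

ALTER TABLE t PARALLEL (DEGREE 5);
ALTER SESSION SET HASH_AREA_SIZE = 16777216;
-- In the initialization file:
-- PARALLEL_MAX_SERVERS = 120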

The penalty for taking such an approach is that when a single operation happens to be running, the system will use just half the CPU resource of the 10 CPU machine. The other half will be idle until another operation is started.

To determine whether your system is being fully utilized, you can use one of the graphical system monitors which are available on most operating systems. These monitors often give you a better idea of CPU utilization and system performance than monitoring the execution time of an operation. Consult your operating system documentation to determine whether your system supports graphical system monitors.

Examples: Balancing Memory, Users, and Processes

The examples in this section show how to evaluate the relationship between memory, users, and parallel server processes, and balance the formula given in Figure 19-10. They show concretely how you might adjust your system workload so as to accommodate the necessary number of processes and users.

Example 1

Assume that your system has 1G of memory, the degree of parallelism is 10, and that your users perform 2 hash joins with 3 or more tables. If you need 300MB for the SGA, that leaves 700MB to accommodate processes. If you allow a generous hash area size (32MB) for best performance, then, for example, your system can support the following:

1 parallel operation (32MB * 10 * 2 = 640MB)
1 serial operation (32MB * 2 = 64MB)
for a total of 704MB. (Note that the memory is not significantly oversubscribed.)

Remember that every parallel, hash, or sort merge join operation takes a number of parallel server processes equal to twice the degree of parallelism, and often each individual process of a parallel operation uses a lot of memory. Thus you can support many more users by having them run serially, or by having them run with less parallelism.

To service more users, you can drastically reduce hash area size to 2MB. You may then find that the optimizer switches some operations to sort merge join. This configuration can support 17 parallel operations, or 170 serial operations, but response times may be significantly higher than if you were using hash joins.

Notice the trade-off above: by reducing memory per process by a factor of 16, you can increase the number of concurrent users by a factor of 16. Thus the amount of physical memory on the machine imposes another limit on total number of parallel operations you can run involving hash joins and sorts.

Example 2

In a mixed workload example, consider a user population consisting of an occasional large batch job, a group of analysts running complex sequential operations, and hundreds of users running low-memory operations, as described below.

In this situation, you would have to make some choices. You could not allow everyone to run hash joins, even though they outperform sort merge joins--because you do not have the memory to support this level of workload.

You might take 20 parallel server processes, and set HASH_AREA_SIZE to a mid-range value, perhaps 20MB, for a single powerful batch job in the high memory class. Twenty servers multiplied by 20MB equals 400MB of memory. (This might be a big GROUP BY with join to produce a summary of data.)

You might plan for 10 analysts running sequential operations that use complex hash joins accessing a large amount of data. (You would not allow them to do parallel operations because of memory requirements.) Ten such sequential processes at 40MB apiece equals 400MB of memory.

Finally, to support hundreds of users doing low memory processes at about 0.5MB apiece, you might reserve 200MB.

You might consider it safe to oversubscribe at 50% because of the infrequent batch jobs during the day. This would give you enough virtual memory for the workload described above (700MB * 1.5 = 1.05GB).

Example 3

Suppose there are 200 parallel server processes and 100 users doing heavy DSS involving hash joins. You decide to leave such tasks as index retrievals and small sorts out of the picture, concentrating on the high memory processes. You might have 300 processes, of which 200 must come from the parallel pool and 100 are single threaded. One quarter of the total 2G of memory might be used by the SGA, leaving 1.5 G of memory to handle all the processes. You could apply the formula considering only the high memory requirements, including a factor of 20% oversubscription:

Figure 19-11: Formula for Memory/User/Server Relationship: DSS Process Memory

Here, 1.8G/300 = 6MB: only about 6MB of hash area would be available for each process, whereas 8MB is the recommended minimum. If you must have 300 processes, you may need to force them to use other join methods in order to move them from the highly memory-intensive class to the moderately memory-intensive class. Then they may fit within your system's constraints.

Example 4

Consider a system with 2 gigabytes of memory and 10 users who want to run intensive DSS parallel operations concurrently and still have good performance. If you choose parallelism of degree 10, then the 10 users will require 200 processes. (Processes running big joins need twice the number of parallel server processes as the degree of parallelism, so you would set PARALLEL_MAX_SERVERS to 10 * 10 * 2.) In this example each process would get 1.8G/200, or about 9MB of hash area--which should be adequate.

With only 5 users doing large hash joins, each process would get over 16 MB of hash area, which would be fine. But if you want 32 MB available for lots of hash joins, the system could only support 2 or 3 users. By contrast, if users are just computing aggregates the system needs adequate sort area size--and can have many more users.

Example 5

If such a system needs to support 1000 users who must all run big operations you must evaluate the situation carefully. Here, the per user memory budget is only 1.8MB (that is, 1.8G divided by 1,000). Since this figure is at the low end of the medium weight class, you must rule out parallel operations, which use even more resources. You must also rule out big hash joins. Each sequential process could require up to 2 hash areas plus sort area, so you would have to set HASH_AREA_SIZE to the same value as SORT_AREA_SIZE, which would be 600K (1.8MB/3). Such a small hash area size is likely to be ineffective, so you may opt to disable hash joins altogether.

Given the organization's resources and business needs, is it reasonable for you to upgrade your system's memory? If a memory upgrade is not an option, then you must change your expectations and adjust the balance using the techniques described earlier in this section, such as reducing per-process memory or scheduling large jobs to run one at a time.

Parallel Execution Space Management Issues

This section describes space management issues that come into play when using parallel execution.

These issues become particularly important for parallel operations running on Oracle Parallel Server, where tuning becomes more critical as more nodes are involved.

ST Enqueue

Every space management transaction in the database (such as creation of temporary segments in PARALLEL CREATE TABLE, or parallel direct-load inserts of non-partitioned tables) is controlled by a single ST enqueue. A high transaction rate (more than 2 or 3 per minute) on the ST enqueue may result in poor scalability on Oracle Parallel Server systems with many nodes, or a timeout waiting for space management resources.

Sorts and Temporary Data

Try to minimize the number of space management transactions, in particular:

Use dedicated temporary tablespaces to optimize space management for sorts. This is particularly beneficial on a parallel server. You can monitor this using V$SORT_SEGMENT.

Set initial and next extent size to a value in the range of 1MB to 10MB. Processes may use temporary space at a rate of up to 1MB per second. Do not accept the default value of 40K for next extent size, because this will result in many requests for space per second.
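A sketch of a dedicated temporary tablespace with uniform extents, using the TStemp tablespace named earlier (the file name and sizes are illustrative):

CREATE TABLESPACE TStemp
DATAFILE '/dev/D31' SIZE 2000M REUSE
DEFAULT STORAGE (INITIAL 5M NEXT 5M PCTINCREASE 0 MAXEXTENTS UNLIMITED)
TEMPORARY;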

If you are unable to allocate extents for various reasons, you can coalesce the free space by using the ALTER TABLESPACE ... COALESCE statement. Do this on a regular basis for temporary tablespaces in particular.

See Also: "Setting Up Temporary Tablespaces for Parallel Sort and Hash Join" on page 19-37

External Fragmentation

External fragmentation is a concern for parallel load, direct-load insert, and PARALLEL CREATE TABLE AS SELECT. To reduce external fragmentation on partitioned tables, make all extents the same size: set NEXT equal to INITIAL, and set PCTINCREASE to zero. The system can handle a few thousand extents per object well, so you can set MAXEXTENTS to, for example, a few thousand. For tables that are not partitioned, the initial extent should be small.
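A sketch of equal-sized extents for a summary table created in parallel (the table name, degree, and sizes are illustrative):

CREATE TABLE summary2
STORAGE (INITIAL 5M NEXT 5M MINEXTENTS 1 PCTINCREASE 0 MAXEXTENTS 2000)
NOLOGGING PARALLEL (DEGREE 12)
AS SELECT dim_1, SUM(meas_1) total_meas FROM facts GROUP BY dim_1;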

Optimizing Parallel Execution on Oracle Parallel Server

This section describes several aspects of parallel execution on Oracle Parallel Server (OPS).

Lock Allocation

This section provides parallel execution tuning guidelines for optimal lock management on Oracle Parallel Server.

To optimize parallel execution on Oracle Parallel Server, you need to set GC_FILES_TO_LOCKS correctly. On Oracle Parallel Server a certain number of parallel cache management (PCM) locks are assigned to each data file. DBA locking in its default behavior assigns one lock to each block, so during a full table scan a PCM lock must be acquired for each block read into the scan. To speed up full table scans, you can reduce the number of lock acquisitions, for example by assigning each PCM lock to cover many blocks through GC_FILES_TO_LOCKS, or by using hashed locking as described below.

To speed up PDML operations, consider using hashed locking rather than DBA locking. A parallel server process works on non-overlapping partitions; it is recommended that partitions not share files. You can thus reduce the number of lock operations by having only 1 hashed lock per file. Since the parallel server process only works on non-overlapping files, there will be no lock pings.

These locking choices also impact memory usage, and thus indirectly affect performance.

For example, on a read-only database with a data warehousing application's query-only workload, you might create 500 PCM locks on the SYSTEM tablespace in file 1, then create 50 more locks to be shared for all the data in the other files. Space management work will then never interfere with the rest of the database.
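In initialization-parameter terms, that example might look like the following (the file numbers are illustrative; file 1 is assumed to be the SYSTEM tablespace file, and files 2 through 20 hold the remaining data):

GC_FILES_TO_LOCKS = "1=500:2-20=50"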

See Also: Oracle8 Parallel Server Concepts & Administration for a thorough discussion of PCM locks and locking parameters.

Allocation of Processes and Instances

Oracle computes a target degree of parallelism before runtime by examining the maximum of the degree for each table, along with other factors. At runtime, a parallel operation is executed sequentially if insufficient parallel server processes are available. PARALLEL_MIN_PERCENT sets the minimum percentage of the target number of parallel server processes that must be available for the operation to run in parallel: when PARALLEL_MIN_PERCENT is set to n, an error is returned if fewer than n percent of the target parallel server processes are available. If no parallel server processes are available, a parallel operation is executed sequentially.

Parallel execution assigns each instance a unique number, which is determined by the INSTANCE_NUMBER initialization parameter or, if that parameter is not set, by the order of instance startup.

Note: For Oracle Parallel Server, the PARALLEL_INSTANCE_GROUP parameter determines what instance group will be used for a particular operation. For more information, see Oracle8 Parallel Server Concepts & Administration.

Load Balancing for Multiple Concurrent Parallel Operations

Load balancing is an effort to distribute the parallel server processes to achieve even CPU and memory utilization, and to minimize remote I/O and communication between nodes.

When multiple concurrent operations are running on a single node, load balancing is done by the operating system. For example, if there are 10 CPUs and 5 parallel server processes, the operating system distributes the 5 processes among the CPUs. If a second user is added, the operating system still distributes the workload.

For a parallel server, however, no single operating system performs the load balancing: instead, parallel execution performs this function.

If an operation restricts itself to a single instance, as in the following statement, all of its parallel server processes are allocated on that one instance:

ALTER TABLE tablename PARALLEL (INSTANCES 1)

If an operation requests more than one instance, allocation priorities involve table caching and disk affinity:

ALTER TABLE tablename PARALLEL (INSTANCES 2)

In that case, if there are 5 parallel server processes, it is advantageous for them to run on as many nodes as possible.

In Oracle Server release 8.0, allocation of processes and instances is based on instance groups. With instance groups, a parallel server system is partitioned into disjoint logical subsystems. Parallel resources are allocated out of a particular instance group only if the parallel coordinator is part of that group. This approach supports application and data partitioning, and benefits load balancing.

See Also: Oracle8 Parallel Server Concepts & Administration for more information about instance groups.

Disk Affinity

Some Oracle Parallel Server platforms use disk affinity: processes are allocated on instances that are closest to the requested data. Without disk affinity, Oracle tries to balance the allocation evenly across instances. With disk affinity, Oracle tries to allocate parallel server processes for parallel table scans on the instances which are close to the data. Disk affinity minimizes data shipping and internode communication on a shared nothing architecture. It can significantly increase parallel operation throughput and decrease response time.

Disk affinity is used for parallel table scans, parallel temporary tablespace allocation, parallel DML, and parallel index scans, but not for parallel table creation or parallel index creation. Access to temporary tablespaces preferentially uses local datafiles, which guarantees optimal extent allocation for space management. Disks striped by the operating system are treated by disk affinity as a single unit.

In the following example of disk affinity, table T is distributed across 3 nodes, and a full table scan on table T is being performed.

Figure 19-12: Disk Affinity Example

Resource Timeout

A parallel DML transaction spanning Oracle Parallel Server instances may wait too long for a resource because of a potential deadlock between it and other PDML or non-PDML transactions. Set the PARALLEL_TRANSACTION_RESOURCE_TIMEOUT parameter to specify how long a PDML transaction should wait for a resource before aborting.

See Also: Oracle8 Server SQL Reference

Parallel Execution Performance Techniques

This section describes performance techniques for parallel operations.

Overriding the Default Degree of Parallelism

The default degree of parallelism is appropriate for reducing response time while guaranteeing use of CPU and I/O resources for any parallel operations. If an operation is I/O bound, you should consider increasing the default degree of parallelism. If it is memory bound, or there are several concurrent parallel operations, consider decreasing the default degree.

The default degree of parallelism is used for tables that have PARALLEL attributed to them in the data dictionary, or via the PARALLEL hint. If a table does not have parallelism attributed to it, or has NOPARALLEL (the default) attributed to it, then that table is never scanned in parallel--regardless of the default degree of parallelism that would be indicated by the number of CPUs, instances, and devices storing that table.

The following example illustrates these guidelines.

For example, assume a parallel indexed nested loop join is I/O bound performing the index lookups, with #CPUs=10 and #disks=36. The default degree of parallelism is 10, and this is I/O bound. You could first try parallel degree 12. If still I/O bound, you could try parallel degree 24; if still I/O bound, you could try 36.

To override the default degree of parallelism:

  1. Determine the maximum number of query servers your system can support.
  2. Divide the parallel server processes among the estimated number of concurrent queries.
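For example, if the pool allows 120 parallel server processes and you expect about 3 concurrent large queries, each using two server sets, the degree works out to roughly 120 / (2 * 3) = 20 (the table name is illustrative):

ALTER TABLE facts PARALLEL (DEGREE 20);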

Rewriting the SQL

The most important issue for parallel query execution is ensuring that all parts of the query plan that process a substantial amount of data execute in parallel. Use EXPLAIN PLAN and see that all plan steps have an OTHER_TAG of PARALLEL_TO_PARALLEL, PARALLEL_TO_SERIAL, PARALLEL_COMBINED_WITH_PARENT, or PARALLEL_COMBINED_WITH_CHILD. Any other keyword (or null) indicates serial execution, and a possible bottleneck.

By making certain changes you can increase the optimizer's ability to generate parallel plans. For example, convert

SELECT COUNT(DISTINCT C) FROM T

to

SELECT COUNT(*) FROM (SELECT DISTINCT C FROM T)

See Also: "Updating the Table" on page 19-65

Creating and Populating Tables in Parallel

Oracle cannot return results to a user process in parallel. If a query returns a large number of rows, execution of the query may indeed be faster; however, the user process can only receive the rows serially. To optimize parallel query performance with queries that retrieve large result sets, use PARALLEL CREATE TABLE AS SELECT or direct-load insert to store the result set in the database. At a later time, users can view the result set serially.

To avoid I/O bottlenecks, specify a tablespace with at least as many devices as CPUs. To avoid fragmentation in allocating space, the number of files in a tablespace should be a multiple of the number of CPUs.

Note: Parallelism of the SELECT does not influence the CREATE statement. If the CREATE is parallel, however, the optimizer tries to make the SELECT run in parallel also.

When combined with the NOLOGGING option, the parallel version of the CREATE TABLE AS SELECT statement provides a very efficient intermediate table facility. For example,

CREATE TABLE summary PARALLEL NOLOGGING  

        AS SELECT dim_1, dim_2 ..., SUM (meas_1) FROM facts

GROUP BY dim_1, dim_2;

These tables can also be incrementally loaded with parallel insert. You can take advantage of intermediate tables using techniques such as the following.

Consider a huge table of retail sales data that is joined to region and to department lookup tables. There are 5 regions and 25 departments. If the huge table is joined to regions using parallel hash partitioning, the maximum speedup is 5. Similarly, if the huge table is joined to departments, the maximum speedup is 25. But if a temporary table containing the Cartesian product of regions and departments is joined with the huge table, the maximum speedup is 125.

Note: Be sure to use the ANALYZE command on newly created tables. Also consider creating indexes.

Creating Indexes in Parallel

Multiple processes can work together simultaneously to create an index. By dividing the work necessary to create an index among multiple server processes, the Oracle Server can create the index more quickly than if a single server process created the index sequentially.

Parallel index creation works in much the same way as a table scan with an ORDER BY clause. The table is randomly sampled and a set of index keys is found that equally divides the index into the same number of pieces as the degree of parallelism. A first set of query processes scans the table, extracts (key, rowid) pairs, and sends each pair to a process in a second set of query processes, based on the key. Each process in the second set sorts the keys and builds an index in the usual fashion. After all index pieces are built, the coordinator simply concatenates the pieces (which are ordered) to form the final index.

Parallel local index creation uses a single server set. Each server process in the set is assigned a table partition to scan, and to build an index partition for. Because half as many server processes are used for a given degree of parallelism, parallel local index creation can be run with a higher degree of parallelism.

You can optionally specify that no redo and undo logging should occur during index creation. This can significantly improve performance, but temporarily renders the index unrecoverable. Recoverability is restored after the new index is backed up. If your application can tolerate this window where recovery of the index requires it to be recreated, then you should consider using the NOLOGGING option.

The PARALLEL clause in the CREATE INDEX command is the only way in which you can specify the degree of parallelism for creating the index. If the degree of parallelism is not specified in the parallel clause of CREATE INDEX, then the number of CPUs is used as the degree of parallelism. If there is no parallel clause, index creation will not be done in parallel.

Attention: When creating an index in parallel, the STORAGE clause refers to the storage of each of the subindexes created by the query server processes. Therefore, an index created with an INITIAL of 5MB and a PARALLEL DEGREE of 12 consumes at least 60MB of storage during index creation because each process starts with an extent of 5MB. When the query coordinator process combines the sorted subindexes, some of the extents may be trimmed, and the resulting index may be smaller than the requested 60MB.

When you add or enable a UNIQUE key or PRIMARY KEY constraint on a table, you cannot automatically create the required index in parallel. Instead, manually create an index on the desired columns using the CREATE INDEX command and an appropriate PARALLEL clause and then add or enable the constraint. Oracle then uses the existing index when enabling or adding the constraint.
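A sketch using the CUSTOMER table from the refresh example later in this chapter (the index and constraint names, and the degree, are hypothetical):

CREATE UNIQUE INDEX cust_pk_ind ON customer (c_key)
PARALLEL (DEGREE 8) NOLOGGING;

ALTER TABLE customer ADD CONSTRAINT cust_pk PRIMARY KEY (c_key);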

Multiple already-enforced constraints can be enabled against the same table simultaneously, and each enable operation can itself run in parallel. In the following example, the ALTER TABLE ENABLE CONSTRAINT statement performs the table scan that checks the constraint in parallel:

CREATE TABLE a (a1 NUMBER CONSTRAINT ach CHECK (a1 > 0) ENFORCE)

PARALLEL 5;

INSERT INTO a values (1);

COMMIT;

ALTER TABLE a ENABLE CONSTRAINT ach;

See Also: For more information on how extents are allocated when using the parallel query feature, see Oracle8 Server Concepts.

Refer to the Oracle8 Server SQL Reference for the complete syntax of the CREATE INDEX command.

Refreshing Tables of a Data Warehouse System in Parallel

Parallel DML combined with the updatable join views facility provides an efficient solution for refreshing the tables of a data warehouse system. To refresh tables is to update them with the differential data generated from the OLTP production system.

In the following example, assume that you want to refresh a table named CUSTOMER(c_key, c_name, c_addr). The differential data contains either new rows or rows that have been updated since the last refresh of the data warehouse. In this example, the updated data is shipped from the production system to the data warehouse system by means of ASCII files. These files must be loaded into a temporary table, named hereafter DIFF_CUSTOMER, before starting the refresh process. You can use SQL*Loader with both the parallel and direct options to efficiently perform this task.

Once DIFF_CUSTOMER is loaded, the refresh process can be started. It is performed in two phases: first the existing rows of CUSTOMER are updated from the differential data, then the new rows are inserted. The next two sections describe these phases.

Updating the Table

The first phase updates the existing rows. A straightforward SQL implementation uses subqueries:

UPDATE customer 

SET(c_name, c_addr) = ( SELECT c_name, c_addr

                FROM diff_customer

                WHERE diff_customer.c_key = customer.c_key)

WHERE c_key IN(SELECT c_key FROM diff_customer);

Unfortunately, the two subqueries in the preceding statement hurt performance.

An alternative is to rewrite this query using updatable join views. To do this you must first add a primary key constraint to the DIFF_CUSTOMER table to ensure that the modified columns map to a key-preserved table:

CREATE UNIQUE INDEX diff_pkey_ind on diff_customer(c_key) 

PARALLEL NOLOGGING;

ALTER TABLE diff_customer ADD PRIMARY KEY (c_key);

The CUSTOMER table can then be updated with the following SQL statement:

UPDATE /*+ PARALLEL(customer,12) */
        (SELECT customer.c_name AS c_name, customer.c_addr AS c_addr,
                diff_customer.c_name AS c_newname,
                diff_customer.c_addr AS c_newaddr
         FROM customer, diff_customer
         WHERE customer.c_key = diff_customer.c_key)
SET c_name = c_newname, c_addr = c_newaddr;

If the CUSTOMER table is partitioned, parallel DML can be used to further improve the response time. It could not be used with the previous version because of the subquery in the SET clause.

See Also: "Rewriting the SQL" on page 19-61

Oracle8 Server Application Developer's Guide for information about key-preserved tables.

Inserting the New Rows into the Table in Parallel

The last phase of the refresh process consists of inserting the new rows from DIFF_CUSTOMER into the CUSTOMER table. Unlike the update case, you cannot avoid having a subquery in the INSERT statement:

INSERT /*+ PARALLEL(customer,12) */ INTO customer
SELECT * FROM diff_customer
WHERE diff_customer.c_key NOT IN (SELECT /*+ HASH_AJ */ c_key FROM customer);

Here, however, the HASH_AJ hint can be specified to transform the subquery into a hash anti-join. (The hint is not required if the ALWAYS_ANTI_JOIN initialization parameter is set to HASH.) Doing so allows you to use parallel insert to execute the preceding statement very efficiently. Note that parallel insert is applicable even if the table is not partitioned.

Using Hints with Cost Based Optimization

Cost-based optimization is a highly sophisticated approach to finding the best execution plan for SQL statements. Cost-based optimization is automatically used with parallel execution.

Attention: You must use ANALYZE to gather current statistics for cost-based optimization. In particular, tables used in parallel should always be analyzed. Always keep your statistics current by running ANALYZE after DDL and DML operations.

Use discretion in employing hints. With the parallel aware optimizer cost-based optimization generates such good plans that hints should rarely be necessary. If used at all, hints should come as a final step in tuning, and only when they demonstrate a necessary and significant performance advantage. In such cases, begin with the execution plan recommended by cost-based optimization, and only go on to test the effect of hints after you have quantified your performance expectations.

Remember that hints are powerful; if you use them and the underlying data changes you may need to change the hints. Otherwise, the effectiveness of your execution plans may deteriorate.

Always use cost-based optimization unless you have an existing application that has been hand-tuned for rule-based optimization. If you must use rule-based optimization, rewriting a SQL statement can give orders of magnitude improvements.

Note: If any table in a query has a parallel degree greater than one (including the default degree), Oracle uses the cost-based optimizer for that query--even if OPTIMIZER_MODE = RULE, or if there is a RULE hint in the query itself.

See Also: "OPTIMIZER_PERCENT_PARALLEL" on page 19-5. This parameter controls parallel awareness.

Detecting Parallel Execution Performance Problems

This section summarizes common tools and techniques you can use to obtain performance feedback on parallel operations.

Diagnosing Problems

Use the following decision tree to diagnose parallel performance problems. The key questions, each discussed in the sections below, are: Is there regression? Is there a plan change? Is there a parallel plan? Is there parallel execution? Is there skew?

Is There Regression?

Does parallel execution's actual performance deviate from what you expected? If performance is as you expected, can you justify the notion that there is a performance problem? Perhaps you have a desired outcome in mind to which you are comparing the current outcome. Perhaps you have a justifiable performance expectation that the system is not achieving: you might have achieved this level of performance or a particular execution plan in the past, but now, with a similar environment and operation, it is not being met.

If performance is not as you expected, can you quantify the deviation? For decision support operations, the execution plan is key. For critical DSS operations, save the EXPLAIN PLAN results. Then, as you analyze the data, reanalyze, upgrade Oracle, and load in new data over the course of time, you can compare any new execution plan with the old plan. You can take this approach either proactively or reactively.

Alternatively, you may find that you get a plan that works better if you use hints. You may want to understand why hints were necessary, and figure out how to get the optimizer to generate the desired plan without the hints. Try increasing the statistical sample size: better statistics may give you a better plan. If you had to use a PARALLEL hint, look to see whether you had OPTIMIZER_PERCENT_PARALLEL set to 100%.

Is There a Plan Change?

If there has been a change in the execution plan, determine whether the plan is (or should be) parallel or serial.

Is There a Parallel Plan?

If the execution plan is (or should be) parallel, check the OTHER_TAG of each plan step to verify that every step processing substantial data is in fact parallelized.

See Also: Parallel EXPLAIN PLAN tags are defined in Table 21-1, "Values of OTHER_TAG Column of the PLAN_TABLE", on page 21-6.

If the execution plan is (or should be) serial, consider re-analyzing with a larger sample size, since better statistics can produce a better plan.

Note: Using different sample sizes can cause the plan to change. Generally, the higher the sample size, the better the plan.

See Also: Oracle8 Server Concepts regarding CREATE TABLE AS SELECT

Is There Parallel Execution?

If the cause of regression cannot be traced to problems in the plan, then the problem must be an execution issue.

For DSS operations, both serial and parallel, consider memory. Check the paging rate and make sure the system is using memory as effectively as possible. Check buffer, sort, and hash area sizing.

After you run a query or DML operation, you can look at the V$PQ_SESSTAT and V$PQ_SYSSTAT views to see the number of server processes used, and other information for the session and system.

See Also: "Querying the Dynamic Performance Views: Example" on page 19-79

Is There Skew?

If parallel execution is occurring, is there unevenness in workload distribution? For example, if there are 10 CPUs and a single user, you can see whether the workload is evenly distributed across CPUs. This may vary over time, with periods that are more or less I/O intensive, but in general each CPU should have roughly the same amount of activity.

The statistics in V$PQ_TQSTAT show rows produced and consumed per parallel server process. This is a good indication of skew, and does not require single user operation.

Operating system statistics show you the per-processor CPU utilization and per-disk I/O activity. Concurrently running tasks make it harder to see what is going on, however. It would be useful to run in single user mode and check operating system monitors which show system level CPU and I/O activity.

When workload distribution is unbalanced, a common culprit is the presence of skew in the data. For a hash join, this may be the case if the number of distinct values is less than the degree of parallelism. When joining two tables on a column with only 4 distinct values, you will not get scaling on more than 4. If you have 10 CPUs, 4 of them will be saturated but 6 will be idle. To avoid this problem, change the query: use temporary tables to change the join order such that all operations have more values in the join column than the number of CPUs.

If I/O problems occur you may need to reorganize your data, spreading it over more devices. If parallel execution problems occur, check to be sure you have followed the recommendation to spread data over at least as many devices as CPUs.

If there is no skew in workload distribution, check for other execution issues, such as the memory, paging, and buffer, sort, and hash area sizing considerations described above.

Executing Parallel SQL Statements

After analyzing your tables and indexes, you should be able to run operations and see speedup that scales linearly with the degree of parallelism used.

Start with simple parallel operations. Evaluate total I/O throughput with SELECT COUNT(*) FROM facts. Evaluate total CPU power by adding a complex WHERE clause. I/O imbalance may suggest a better physical database layout. After you understand how simple scans work, add aggregation, joins, and other operations that reflect individual aspects of the overall workload. Look for bottlenecks.
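For example (the WHERE clause is an illustrative stand-in for a more CPU-intensive predicate):

SELECT COUNT(*) FROM facts;

SELECT COUNT(*) FROM facts
WHERE dim_2 < '02-01-1995' AND meas_1 > 0;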

Query performance is not the only thing you must monitor. You should also monitor parallel load, parallel index creation, and parallel DML, and look for good utilization of I/O and CPU resources.

Using EXPLAIN PLAN to See How an Operation is Parallelized

Use an EXPLAIN PLAN statement to view the sequential and parallel operation plan. The OTHER_TAG column of the plan table summarizes how each plan step is parallelized, and describes the text of the operation that is used by parallel server processes for each operation. Optimizer cost information for selected plan steps is given in the cost, bytes, and cardinality columns. Table 21-1 on page 21-6 summarizes the meaning of the OTHER_TAG column.

EXPLAIN PLAN thus provides detailed information as to where specific operations are being performed. You can then change the execution plan for better performance. For example, if many steps are serial (where OTHER_TAG is blank, serial to parallel, or parallel to serial), then the parallel controller could be a bottleneck.

Consider the following example SQL statement, which summarizes sales data by region:

EXPLAIN PLAN SET STATEMENT_ID = 'Jan_Summary' FOR
SELECT dim_1, SUM(meas_1) FROM facts WHERE dim_2 < '02-01-1995'
GROUP BY dim_1;

The following SQL script extracts a compact hierarchical plan from the output of EXPLAIN PLAN:

SELECT 

SUBSTR( LPAD(' ',2*(level-1)) ||

DECODE(id, 0, statement_id, operation) ||

' '  || options || ' ' || object_name ||

' [' || partition_start || ',' || partition_stop ||

'] ' || other_tag, 

1, 79)  "step [start,stop] par"

FROM plan_table

START WITH id = 0

CONNECT BY PRIOR id = parent_id

AND PRIOR NVL(statement_id,' ') = 

NVL(statement_id);

Following is the query plan for "Jan_Summary":

Jan_Summary [,]

        SORT GROUP BY [,] PARALLEL_TO_SERIAL

        TABLE ACCESS FULL facts  [NUMBER(1),NUMBER(1)] PARALLEL_TO_PARALLEL

Each parallel server scans a portion of the first partition of the facts table. All other partitions are pruned, as shown by the stop and start partition number.

The following figure illustrates how, with the PARALLEL_TO_PARALLEL keyword, data from the partitioned table is redistributed to a second set of parallel server processes for parallel grouping. Query server set 1 executes the table scan from the preceding example. The rows coming out are repartitioned through the table queue to parallel server set 2, which executes the group by operation. Because the GROUP BY operation indicates PARALLEL_TO_SERIAL, another table queue collects its results and sends it to the parallel coordinator, and then to the user.

Figure 19-14: Data Redistribution among Parallel Server Processes

As a rule, if the PARALLEL_TO_PARALLEL keyword exists, there will be two sets of parallel server processes. This means that for grouping, sort merge joins, or hash joins, twice the number of parallel server processes are assigned to the operation, and rows must be redistributed from set 1 to set 2. If there is no PARALLEL_TO_PARALLEL keyword, the operation gets just one set of servers. Operations that need only one server set include simple aggregations and selections, such as SELECT COUNT(*) FROM facts or SELECT * FROM facts WHERE DATE = '7/1/94'.

For non-distributed operations, the OBJECT_NODE column gives the name of the table queue. If the PARALLEL_TO_PARALLEL keyword exists, then the EXPLAIN PLAN of the parent operation should have SQL that references the child table queue in its FROM clause. In this way it describes the order in which the output from operations is consumed.

See Also: Chapter 21, "The EXPLAIN PLAN Command"

Dynamic Performance Tables

Dynamic performance tables are views of internal Oracle8 data structures and statistics that you can query periodically to monitor progress of a long-running operation. When used in conjunction with data dictionary views, these tables provide a wealth of information. The challenge is visualizing the data and then acting upon it.

Note: On Oracle Parallel Server there are global versions of these views which aggregate the information over multiple instances. The global views have analogous names such as GV$FILESTAT for V$FILESTAT, and so on.

See Also: Oracle8 Parallel Server Concepts & Administration for more information about global dynamic performance views.

V$FILESTAT

This view sums read and write requests, number of blocks, and service times for every datafile in every tablespace. It can help you diagnose I/O problems and workload distribution problems.

The file numbers listed in V$FILESTAT can be joined to those in the DBA_DATA_FILES view to group I/O by tablespace or to find the file name for a given file number. By doing ratio analysis you can determine what percentage of the total tablespace activity each file in the tablespace accounts for. If you make a practice of putting just one large, heavily accessed object in a tablespace, you can use this technique to identify objects that have a poor physical layout.

You can further diagnose disk space allocation problems using the DBA_EXTENTS view. Ensure that space is allocated evenly from all files in the tablespace. Monitoring V$FILESTAT during a long-running operation and correlating I/O activity with the EXPLAIN PLAN output is a good way to follow progress.
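For example, a query such as the following (a sketch; the owner SCOTT and segment name FACTS are illustrative) shows how the extents of one segment are distributed across files:

SELECT file_id, COUNT(*) AS extents, SUM(bytes) AS total_bytes
FROM   dba_extents
WHERE  owner = 'SCOTT'
AND    segment_name = 'FACTS'
GROUP  BY file_id;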

V$PARAMETER

This view lists the name, current value, and default value of all system parameters. In addition, the view indicates whether the parameter may be modified online with an ALTER SYSTEM or ALTER SESSION command.
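For example, the following query (a minimal sketch; the 'parallel%' pattern is illustrative) lists the parallel execution parameters and shows whether each can be modified with ALTER SESSION or ALTER SYSTEM:

SELECT name, value, isdefault, isses_modifiable, issys_modifiable
FROM   v$parameter
WHERE  name LIKE 'parallel%'
ORDER  BY name;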

V$PQ_SESSTAT

This view is valid only when queried from a session that is executing parallel SQL statements; thus it cannot be used to monitor a long-running operation. It gives summary statistics about the parallel statements executed in the session, including the total number of messages exchanged between server processes and the actual number of parallel server processes used.

V$PQ_SLAVE

This view tallies the current and total CPU time and number of messages sent and received per parallel server process. It can be monitored during a long-running operation. Verify that there is little variance among processes in CPU usage and number of messages processed; a large variance may indicate a load balancing problem. Attempt to correlate the variance with a variance in the base data distribution. Extreme imbalance could indicate that the number of distinct values in a join column is much less than the degree of parallelism. See "Parallel CREATE TABLE AS SELECT" in Oracle8 Server Concepts for a possible workaround.
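To look for such variance, you might issue a query along these lines (a minimal sketch) and compare the totals across processes:

SELECT slave_name, status, cpu_secs_total,
       msgs_sent_total, msgs_rcvd_total
FROM   v$pq_slave
ORDER  BY slave_name;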

V$PQ_SYSSTAT

The V$PQ_SYSSTAT view aggregates session statistics from all parallel server processes. It sums the total parallel server message traffic, and gives the status of the pool of parallel server processes.

This view can help you determine the appropriate number of parallel server processes for an instance. The statistics that are particularly useful are "Servers Busy", "Servers Idle", "Servers Started", and "Servers Shutdown".

Periodically examine V$PQ_SYSSTAT to determine whether the instance's parallel server processes are actually busy, using the following query:

SELECT * FROM V$PQ_SYSSTAT 

        WHERE statistic = 'Servers Busy';


STATISTIC             VALUE

--------------------- -----------

Servers Busy          70

V$PQ_TQSTAT

This view provides a detailed report of message traffic at the table queue level. It is valid only when queried from a session that is executing parallel SQL statements. A table queue is the pipeline between parallel server groups, between the parallel coordinator and a parallel server group, or between a parallel server group and the coordinator. These table queues are represented in the plan by the tags PARALLEL_TO_PARALLEL, SERIAL_TO_PARALLEL, and PARALLEL_TO_SERIAL, respectively.

The view contains a row for each parallel server process that reads or writes each table queue. A table queue connecting 10 consumers to 10 producers will have 20 rows in the view. Sum the bytes column and group by TQ_ID for the total number of bytes sent through each table queue. Compare this with the optimizer estimates; large variations may indicate a need to analyze the data using a larger sample.

Compute the variance of bytes grouped by TQ_ID. Large variances indicate workload imbalance. Drill down on large variances to determine if the producers start out with unequal distributions of data, or whether the distribution itself is skewed. The latter may indicate a low number of distinct values.
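For example, a query such as the following (a sketch using the built-in VARIANCE aggregate) reports the total bytes and the variance in bytes produced for each table queue:

SELECT tq_id, SUM(bytes) AS total_bytes, VARIANCE(bytes) AS bytes_variance
FROM   v$pq_tqstat
WHERE  server_type = 'Producer'
GROUP  BY tq_id;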

For many of the dynamic performance tables, the system parameter TIMED_STATISTICS must be set to TRUE in order to get the most useful information. You can use ALTER SYSTEM to turn TIMED_STATISTICS on and off dynamically.
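For example:

ALTER SYSTEM SET TIMED_STATISTICS = TRUE;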

See Also: Chapter 20, "The Dynamic Performance Tables"

Querying the Dynamic Performance Views: Example

The following example illustrates output from two of these views:

SQLDBA> update /*+ parallel (iemp, 2) */ iemp set empno = empno +1; 

91 rows processed. 

SQLDBA> commit; 

Statement processed. 

SQLDBA> select * from v$pq_sesstat; 

STATISTIC                      LAST_QUERY SESSION_TOTAL 

------------------------------ ---------- ---------- 

Queries Parallelized                    0          0 

DML Parallelized                        1          2 

DFO Trees                               1          2 

Server Threads                          2          0 

Allocation Height                       2          0 

Allocation Width                        0          0 

Local Msgs Sent                        34         60 

Distr Msgs Sent                         0          0 

Local Msgs Recv'd                      34         60 

Distr Msgs Recv'd                       0          0 

11 rows selected. 

 

SQLDBA> select * from v$pq_sysstat; 

STATISTIC                      VALUE      

------------------------------ ---------- 

Servers Busy                            0 

Servers Idle                            2 

Servers Highwater                       2 

Server Sessions                         4 

Servers Started                         2 

Servers Shutdown                        0 

Servers Cleaned Up                      0 

Queries Initiated                       0 

DML Initiated                           2  

DFO Trees                               2 

Local Msgs Sent                        60 

Distr Msgs Sent                         0 

Local Msgs Recv'd                      60 

Distr Msgs Recv'd                       0 

15 rows selected. 

In V$PQ_SESSTAT, some of the statistics provide the following information:

DML Parallelized
    the number of INSERT, UPDATE, and DELETE statements that were parallelized (by the last operation and by this session)

Queries Parallelized
    the number of all other statements that were parallelized

DFO Trees
    the number of query plan fragments that were parallelized

Server Threads
    the total number of parallel server processes used (typically two times the degree of parallelism)

Allocation Height
    the requested number of servers on each instance

Allocation Width
    the requested number of instances

In V$PQ_SYSSTAT, the "DML Initiated" statistic indicates the number of DML operations done in the system.

Note that statements such as INSERT ... SELECT are treated as a single DML statement, not as one DML statement and one query.

See Also: Oracle8 Server SQL Reference for information about statistics.

Operating System Statistics

There is considerable overlap between information available in Oracle and information available through operating system utilities (such as sar and vmstat on UNIX-based systems). Operating systems provide performance statistics on I/O, communication, CPU, memory and paging, scheduling, and synchronization primitives. The Oracle V$SESSTAT view provides the major categories of OS statistics as well.
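For example, a query along these lines (a minimal sketch; session ID 7 and the statistic names shown are illustrative) retrieves a few OS-related statistics for one session:

SELECT n.name, s.value
FROM   v$sesstat s, v$statname n
WHERE  s.statistic# = n.statistic#
AND    s.sid = 7            -- substitute the session ID of interest
AND    n.name IN ('CPU used by this session',
                  'physical reads', 'physical writes');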

Typically it is harder to map OS information about I/O devices and semaphore operations back to database objects and operations; on the other hand, the OS may have better visualization tools and more efficient means of collecting the data.

OS information about CPU and memory usage is very important for assessing performance. Probably the most important statistic is CPU usage. The goal of low-level performance tuning is to become CPU bound on all CPUs. Once this is achieved, you can move up a level and work at the SQL level to find an alternative plan that is perhaps more I/O intensive but uses less CPU.

OS memory and paging information is valuable for fine tuning the many system parameters that control how memory is divided among memory-hungry warehouse subsystems like parallel communication, sort, and hash join.

Recovery Time

If the system requires ten minutes to run an operation, it will take at least ten minutes to roll back the operation; this is the best performance achievable. If an instance failure or a system failure occurs, recovery time will increase because not all of the parallel server processes are available to assist in the rollback.

PDML Restrictions

Parallel DML and direct-load insert restrictions are described in Oracle8 Server Concepts. If a parallel DML restriction is violated, the operation is simply performed serially. If a direct-load insert restriction is violated, the APPEND hint is ignored and a conventional insert is performed. No error message is returned in either case.

You must either commit or roll back directly after you issue a parallel INSERT, UPDATE, or DELETE statement, or a serial insert with the APPEND hint. You can perform no other SQL commands until this is done.

Tuning Parallel Insert Performance

This section provides an overview of parallel DML functionality and how to tune it.

See Also: Parallel Data Manipulation Language (PDML) and degree of parallelism are discussed at length in Oracle8 Server Concepts.

For a discussion of PDML affinity, please see Oracle8 Parallel Server Concepts & Administration.

INSERT

Oracle8 INSERT functionality can be summarized as follows:

Table 19-3: Summary of INSERT Features

Insert Type: Conventional
    Parallel:  No
    Serial:    Yes
    NOLOGGING: No

Insert Type: Direct Load Insert (Append)
    Parallel:  Yes; requires:
        * ALTER SESSION ENABLE PARALLEL DML
        * table PARALLEL attribute or PARALLEL hint
        * APPEND hint (optional)
    Serial:    Yes; requires:
        * APPEND hint
    NOLOGGING: Yes; requires:
        * NOLOGGING attribute set for the table or partition

If PDML is enabled and there is a PARALLEL hint or a PARALLEL attribute set for the table in the data dictionary, then inserts are performed in parallel, in append mode, unless a restriction applies. If both the PARALLEL hint and the PARALLEL attribute are missing, the insert is performed serially.

Direct-load INSERT

Append mode is the default when a parallel insert is performed; therefore the APPEND hint is optional. In append mode, data is always inserted into new blocks allocated to the table. Use append mode to increase the speed of insert operations; do not use it when space utilization needs to be optimized. You can use the NOAPPEND hint to override append mode.

Note that the APPEND hint applies to both serial and parallel insert: even serial insert will be faster if you use it. You can use NOLOGGING with APPEND to make the process even faster. APPEND, however, does require more space and locking overhead.

NOLOGGING means that no redo is generated for the operation. NOLOGGING is never the default; use it when you wish to optimize performance. It should not normally be used when recovery is needed for the table or partition. If recovery is needed, be sure to take a backup immediately after the operation. Use the ALTER TABLE [NO]LOGGING statement to set the appropriate value.
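For example, the following statements (a minimal sketch; the summary table dept_sal is illustrative and must already exist) combine NOLOGGING with a serial direct-load insert:

ALTER TABLE dept_sal NOLOGGING;

INSERT /*+ APPEND */ INTO dept_sal
    SELECT deptno, SUM(sal) FROM emp GROUP BY deptno;
COMMIT;

Remember to take a backup immediately after the operation if the data must be recoverable.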

See Also: Oracle8 Server Concepts

Parallelizing INSERT, UPDATE, and DELETE

When the table or partition has the PARALLEL attribute in the data dictionary, it is used to determine parallelism of UPDATE and DELETE statements as well as queries. An explicit PARALLEL hint for a table in a statement overrides the effect of the PARALLEL attribute in the data dictionary.

Insert, update, and delete operations consist of two parts: a select operation identifying the rows to be inserted, updated, or deleted; and the actual data modification operation. The PARALLEL hint of an INSERT, UPDATE, or DELETE statement applies to both portions of the statement.

Note: Within a given statement, the insert, update, or delete portion and the select portion are parallelized independently. If one portion cannot be performed in parallel, this does not prevent the other portion from being performed in parallel.

You can use the NOPARALLEL hint to override a PARALLEL attribute for the table in the data dictionary. Note, in general, that hints take precedence over attributes.

DML operations are only considered for parallelization if the session is in a PARALLEL DML enabled mode. (Use ALTER SESSION ENABLE PARALLEL DML to enter this mode.) The mode does not affect parallelization of queries or of the query portions of a DML statement.

See Also: Oracle8 Server Concepts for more information on parallel INSERT, UPDATE and DELETE.

Parallelizing INSERT ... SELECT

In the INSERT ... SELECT statement you can specify a PARALLEL hint after the INSERT keyword, in addition to the hint after the SELECT keyword. The PARALLEL hint after the INSERT keyword applies to the insert operation only, and the PARALLEL hint after the SELECT keyword applies to the select operation only. Thus the parallelism of the insert and select operations is independent.

The ability to parallelize INSERT causes a change in existing behavior if the user has explicitly enabled the session for parallel DML and if the table in question has a PARALLEL attribute set in its data dictionary entry. In that case, existing INSERT ... SELECT statements that have their query operation parallelized may also have their insert operation parallelized.

Note also that since you can query multiple tables, you can specify multiple SELECT PARALLEL hints and multiple PARALLEL attributes.

Example

Add the new employees who came on board after the acquisition of ACME.

INSERT /*+ PARALLEL(emp,4) */ INTO emp

                SELECT /*+ PARALLEL(acme_emp,4) */ * 

                FROM acme_emp;

The APPEND hint is not required in this example, because it is implied by the PARALLEL hint.

Parallelizing UPDATE and DELETE

The PARALLEL hint (placed immediately after the UPDATE or DELETE keyword) applies not only to the underlying select operation, but also to the update/delete operation. Alternatively, you can specify update/delete parallelism in the PARALLEL clause specified in the definition of the table to be modified.

If you have explicitly enabled the session or transaction for parallel DML, UPDATE/DELETE statements which have their query operation parallelized may also have their UPDATE/DELETE operation parallelized.

If the statement contains subqueries or updatable views, they may have their own separate parallel hints or clauses, but these parallel directives do not affect the decision to parallelize the update or delete.

Parallel UPDATE and DELETE can only be done on partitioned tables.

Example 1

Give a 10% salary raise to all clerks in Dallas.

UPDATE /*+ PARALLEL(emp,5) */ emp 

                SET sal=sal * 1.1

                WHERE job='CLERK' and 

                        deptno in 

           (SELECT deptno FROM dept WHERE location='DALLAS');

The PARALLEL hint is applied to the update operation as well as to the scan.

Example 2

Fire all employees in the accounting department, which will now be outsourced.

DELETE /*+ PARALLEL(emp,2) */ FROM emp

                WHERE deptno IN 

      (SELECT deptno FROM dept WHERE dname='ACCOUNTING');

Again, the parallelism is applied to the scan as well as to the delete operation on table EMP.

Additional PDML Examples

The following examples show the use of parallel DML.

Note: As these examples demonstrate, you must enable parallel DML before using the PARALLEL or APPEND hints. You must issue a COMMIT or ROLLBACK command immediately after executing parallel INSERT, UPDATE, or DELETE. You can issue no other SQL commands before committing or rolling back.

The following statement enables parallel DML:

ALTER SESSION ENABLE PARALLEL DML; 

Serial as well as parallel direct-load insert requires commit/rollback immediately afterwards.

INSERT /*+ APPEND NOPARALLEL(table1) */ INTO table1
    SELECT * FROM table2;    -- table2 is an illustrative source table

A SELECT statement issued at this point would fail with an error message, because no other SQL can be performed before a COMMIT or ROLLBACK is issued.

ROLLBACK; 

After this ROLLBACK, a SELECT statement will succeed:

SELECT * FROM V$PQ_SESSTAT; 

Parallel update likewise requires commit or rollback immediately afterwards:

UPDATE   /*+ PARALLEL(table1,2) */ table1 

  SET col1 = col1 + 1; 

COMMIT; 

SELECT * FROM V$PQ_SESSTAT; 

As does parallel delete:

DELETE   /*+ PARALLEL(table3,2) */ FROM table3 

  WHERE col2 < 5; 

COMMIT; 

SELECT * FROM V$PQ_SESSTAT; 




Copyright © 1997 Oracle Corporation.
All Rights Reserved.