5
Understanding the ConText Data Dictionary

This chapter introduces the concepts necessary for understanding the objects in the ConText data dictionary.

The following topics are discussed in this chapter:

The ConText Data Dictionary
Categories
Tiles
Preferences
Policies
Sources

The ConText Data Dictionary

ConText utilizes a data dictionary, separate from the Oracle data dictionary, for storing indexing options. The following objects are registered in the ConText data dictionary:

Tiles
preferences (default and user-created)
policies (template and user-created) and the table and column to which each policy is assigned
sources for automated batch-loading of text into database columns

The ConText data dictionary also stores resource limits and the status of all ConText servers that are currently running.

The ConText data dictionary is owned by the Oracle user CTXSYS. CTXSYS and the data dictionary tables and views are created during installation of ConText.

Categories

The indexing options that must be specified for ConText are divided into seven functional categories (classes):

Data Store
Filter
Lexer
Engine
Wordlist
Stoplist
Compressor (not currently supported)

Each category contains one or more Tiles for which you specify attributes when creating preferences.

See Also:

For more information, see "Tiles" and "Policies" in this chapter.

Data Store Category

The Tiles in the Data Store category are used to create preferences which specify how text (data) is stored in the database.

The Data Store category contains separate Tiles for each of the three types of storage supported by ConText:

direct
external
master/detail

For external data store, two Tiles are provided to support the two methods for storing text externally:

text stored as file names (accessed through the file system)
text stored as Web files (accessed through the file protocol or HTTP)

See Also:

For more information about text storage, see "Text Storage" in Chapter 4, "Text Concepts".

Filter Category

The Tiles in the Filter category are used to create preferences which determine how text is filtered for indexing and highlighting. Filters allow word processor and formatted documents, as well as ASCII and HTML text documents, to be indexed and highlighted by ConText.

For formatted documents, ConText stores documents in their native format and uses filters to build temporary ASCII versions of the documents. ConText indexes the temporary ASCII text of the formatted document. ConText also uses the ASCII version to highlight query terms.

The following internal filters are provided by ConText:

Filter	O/S	Version
Plain text (ASCII)	N/A	N/A
HTML	N/A	1, 2, 3
Autorecognize	N/A	N/A
AmiPro	Windows	1, 2, 3
Lotus 1-2-3	MS-DOS	4, 5
Lotus 1-2-3	Windows	2, 3, 4, 5
Microsoft Word	Windows	2, 6.x, 7.0
Microsoft Word	MS-DOS	5.0, 5.5
Microsoft Word	Macintosh	3, 4, 5.x
WordPerfect	Windows	5.x, 6.x
WordPerfect	MS-DOS	5.0, 5.1, 6.0
Xerox XIF	UNIX	5, 6

In addition, ConText allows users to specify external filters for filtering documents in formats not supported by the internal filters provided with ConText.

External filters can also be used to perform operations, such as cleaning up or converting text, before the text is filtered for indexing and highlighting.

See Also:

For more information about internal and external filters, see "Text Filtering" in Chapter 4, "Text Concepts".

For more information about creating Filter preferences, see "Creating a Filter Preference" in Chapter 6, "Setting Up and Managing Text".

Lexer Category

The Tiles in the Lexer category are used to create preferences which specify the lexer used to perform indexing.

The lexer is the component that parses text and breaks it up into tokens for indexing. English and most European languages can use the same lexer because tokens (words) in those languages are delimited by blank spaces and standard punctuation (comma, period, question mark, etc.).

Japanese, Chinese, and many other Asian languages are pictorial (multi-byte) languages that cannot be tokenized in the same manner as English (single-byte). One common retrieval method for these languages is a dictionary-based lexer. The picture symbols used in the text are matched against a dictionary of known words to determine the tokens.
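The dictionary-matching idea described above can be sketched as a greedy longest-match segmenter. This is only an illustration of the general technique, not the ConText implementation; the function name, the sample dictionary, and the rule that unknown characters become single-character tokens are all assumptions made for the example.

```python
def dictionary_tokenize(text, dictionary):
    """Split text into tokens by greedy longest match against known words.

    Characters that start no known word are emitted as one-character
    tokens (an assumption of this sketch).
    """
    max_len = max([len(word) for word in dictionary] + [1])
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                pos += length
                break
    return tokens


# Hypothetical dictionary of known words.
known_words = {"東京", "東京都", "に", "住む"}
print(dictionary_tokenize("東京都に住む", known_words))
```

Because the longest candidate is tried first, the text above is split into 東京都, に, and 住む rather than starting with the shorter word 東京.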

Single-Byte Languages

ConText includes a single Lexer Tile for all of the non-pictorial languages, such as English and the other European languages, supported by ConText. The basic lexer also works with languages such as Greek, which have different alphabets, but still utilize blank spaces to delimit words.
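As a rough illustration of tokenizing text that is delimited by blank spaces and punctuation, the following sketch uses a regular expression. It is not the ConText basic lexer; the function name is hypothetical, and details such as treating an apostrophe as a delimiter are simplifications of this example.

```python
import re


def basic_tokenize(text):
    """Return the runs of word characters found in text.

    In Python 3, \\w matches letters and digits in any alphabet, so the
    same rule works for Greek as well as English text.
    """
    return re.findall(r"\w+", text)


print(basic_tokenize("Where is the file? Here, I think."))
print(basic_tokenize("αλφα, βητα."))
```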

Multi-Byte Languages

ConText includes two Lexer Tiles for processing Japanese and Chinese text. The Japanese and Chinese V-Gram (Variable Grammar) lexers do not rely on finding token boundaries within text; instead, they use a list of terms to match and index patterns of characters at user-specified, variable points of length.

The Japanese and Chinese lexers also work with languages that use a 7-bit character set, such as English. As a result, ConText supports indexing and querying Japanese and Chinese text that also contains English text.

Note:

Languages that use an 8-bit character set, such as many of the European languages, are not supported by the Japanese and Chinese lexers.

ConText also includes a lexer for Korean text. The Korean lexer works similarly to the Japanese and Chinese lexers by finding character patterns in the text and matching the patterns to a dictionary of terms. However, due to the significant morphological transformations that Korean verbs undergo, the Korean lexer only indexes nouns and noun phrases.

Note:

The Chinese and Korean lexers are provided with a status of BETA in this release of ConText.

Theme Lexing

For English-language text, a lexer is provided for creating theme indexes. This lexer breaks text into tokens; however, the tokens are not stored in the theme index. The tokens are passed to the ConText linguistic core where they are analyzed within the context of the sentences and paragraphs in which they appeared to determine whether they are content-bearing words. The linguistic core then generates themes, which are stored in the theme index.

The themes generated by ConText are based on, but are not identical to, the content-bearing tokens in the text.

See Also:

For more information about the theme lexer and theme indexing, see "Theme Indexes" in Chapter 4, "Text Concepts".

Engine Category

The Tiles in the Engine category are used to create preferences which specify how indexes are created by the ConText engine and where in the database the indexes are stored.

The engine is the ConText component that creates a ConText index for a text column. A ConText index is required before text in a column can be queried.

See Also:

For more information about creating Engine preferences, see "Creating an Engine Preference" in Chapter 6, "Setting Up and Managing Text".

WordList Category

The Tiles in the Wordlist category are used to create preferences for enabling three different ConText query expansion methods:

Stemming
Fuzzy Matching
Soundex

See Also:

For more information about expanding queries and the query expansion operators provided by ConText, see Oracle8 ConText Cartridge Application Developer's Guide.

Stemming

Stemming expands a query by deriving variations (verb conjugation, noun, pronoun, and adjective inflections) of the search token(s) in the query.

For example, a stem search on the verb buy expands to include its alternate verb forms, such as buys, buying, and bought, but not the noun buyer. A search on the noun buyer would expand only to include its plural form buyers.

Since different languages have different stemming rules, stemming is language-dependent and uses wordlists that define the relationships between the words in a given language.
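The role of a wordlist in stem expansion can be sketched with a small lookup table. The wordlist below is a toy, hand-written example covering only the words mentioned above; it is not the Xerox stemmer data, and the names WORDLIST and stem_expand are hypothetical.

```python
# Toy wordlist: base form -> inflected forms (illustrative data only).
WORDLIST = {
    "buy": ["buy", "buys", "buying", "bought"],
    "buyer": ["buyer", "buyers"],
}

# Reverse lookup from any inflected form to its base form.
FORM_TO_BASE = {form: base for base, forms in WORDLIST.items() for form in forms}


def stem_expand(term):
    """Expand a query term to all forms that share its base form."""
    base = FORM_TO_BASE.get(term.lower())
    if base is None:
        return [term]  # unknown word: no expansion
    return list(WORDLIST[base])


print(stem_expand("buy"))
print(stem_expand("buyer"))
```

The verb and the noun are separate entries, which is why a search on buy does not expand to buyer.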

ConText provides a stemmer, licensed from Xerox Corporation, that utilizes Xerox Lexical Technology to support inflectional stemming in the following languages:

English (inflectional and derivational)
Dutch
French
German
Spanish
Italian

Fuzzy Matching

Fuzzy matching expands queries by including terms that are spelled similarly to the search token in the query. This type of expansion can be useful in queries for text that contains frequent misspellings or has been scanned using OCR software.

For example, a fuzzy matching query for the term cat expands to include cats, calc, and case.

The number of expansions generated by fuzzy matching depends on the tokens that ConText identified during indexing; results can vary significantly according to the tokens that were identified and indexed by ConText for the column. As such, fuzzy matching depends on how tokens are delimited in a given language.
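The dependence of fuzzy expansion on the indexed tokens can be illustrated with a simple edit-distance filter. This chapter does not specify the similarity measure ConText uses, so the Levenshtein distance and the threshold of 2 below are assumptions chosen for the sketch, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_expand(term, indexed_tokens, max_distance=2):
    """Return indexed tokens spelled similarly to term."""
    return [token for token in indexed_tokens
            if token != term and edit_distance(term, token) <= max_distance]


indexed = ["cats", "calc", "case", "dog", "category"]
print(fuzzy_expand("cat", indexed))
```

Only tokens that are present in the index can be returned, so the same query term yields different expansions for different columns.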

Note:

Fuzzy matching is designed primarily for English-language documents, but can be used, with varying degrees of success, with most of the Western European languages.

Soundex

During text indexing of a column, Soundex, if enabled, creates a list of all the words that sound alike and assigns one or more IDs to each word to identify the other words in the list that sound like the word.

Note:

Soundex is designed primarily to look for matches in phonetic spellings used in English, but can be used, with varying degrees of success, with most of the Western European languages.

The Soundex word list is stored in the DR_nnnnn_I1W ConText index table, where nnnnn is the identifier of the policy for the text index.

If Soundex is enabled for a text column, users can call Soundex in a query to expand the query. Soundex expands a query by searching the I1W table for terms that sound similar to the specified query term.

For example, a Soundex search on the name Smith would also find the names Smythe and Smit.

Note:

Soundex in ConText uses the same algorithm as the SOUNDEX function in SQL.

For more information about the SOUNDEX function in SQL, see Oracle8 Server SQL Reference.
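For readers unfamiliar with Soundex, the following is a minimal sketch of the classic algorithm as it is commonly described (first letter retained, remaining consonants mapped to digits, result padded to four characters). It is assumed, not verified here, to agree with the SQL function for ordinary English names, and it says nothing about how ConText stores the resulting word list. The function names are hypothetical.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code for word ('' if no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            encoded += code
            if len(encoded) == 4:
                break
        previous = code  # a vowel resets the previous code
    return encoded.ljust(4, "0")


def group_by_sound(words):
    """Group words that share a Soundex code."""
    groups = {}
    for word in words:
        groups.setdefault(soundex(word), []).append(word)
    return groups


print(group_by_sound(["Smith", "Smythe", "Smit", "Jones"]))
```

Smith, Smythe, and Smit all receive the code S530, which is why a Soundex search on one of them finds the others.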

Stoplist Category

The Tiles in the Stoplist category are used to create Stoplist preferences. A stoplist is a list of common terms that ConText does not include in the text index for a text column.

Each stoplist can contain a maximum of 4095 words.
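The effect of a stoplist, including the 4095-word limit, can be modeled with a small class. This is a conceptual sketch rather than ConText code; the class name is hypothetical, and case-insensitive matching is an assumption of the example.

```python
MAX_STOPWORDS = 4095  # limit stated in this chapter


class Stoplist:
    """A set of terms that are left out of the text index."""

    def __init__(self, words=()):
        self._words = set()
        for word in words:
            self.add(word)

    def add(self, word):
        word = word.lower()  # assumption: stop words match case-insensitively
        if word not in self._words and len(self._words) >= MAX_STOPWORDS:
            raise ValueError("a stoplist can contain at most 4095 words")
        self._words.add(word)

    def filter(self, tokens):
        """Return only the tokens that would be indexed."""
        return [t for t in tokens if t.lower() not in self._words]


stoplist = Stoplist(["the", "a", "of"])
print(stoplist.filter(["The", "status", "of", "a", "server"]))
```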

See Also:

For more information about creating Stoplist preferences, see "Creating a Stoplist Preference" in Chapter 6, "Setting Up and Managing Text".

Tiles

Tiles are the objects in the ConText data dictionary that provide ConText servers with information about how text is managed in the system, as well as indexing instructions. Each Tile specifies a distinct indexing option within the ConText framework.

A Tile is the main component of a preference. When you define a preference, you specify a Tile and attributes for the Tile, as well as a value for each attribute.

Tiles are grouped into categories which identify the action performed by the Tile. There are two types of categories:

Indexing Categories
Text Loading Categories

Indexing Categories

Tiles are grouped into categories which identify the type of indexing option for which the Tile is used.

The Data Store category contains the following Tiles:

DIRECT (documents stored in database as single rows)
MASTER DETAIL (documents stored in database as multiple rows, indexed as single rows)
OSFILE (documents stored in local operating system files, file names stored in database)
URL (documents stored in remote or local files, URLs for files stored in database)

The Filter category contains the following Tiles:

FILTER NOP (unformatted, plain text)
HTML FILTER (unformatted, plain text with HTML tags)
BLASTER FILTER (formatted text, plain text, and plain text with HTML tags)
USER FILTER (text processed through an external filter)

The Lexer category contains the following Tiles:

JAPANESE V-GRAM LEXER
KOREAN LEXER (BETA)
CHINESE V-GRAM LEXER (BETA)
BASIC LEXER (lexer for all other supported languages)
THEME LEXER (lexer for creating theme indexes)

The Engine category contains a single Tile, GENERIC ENGINE.

The Wordlist category contains a single Tile, GENERIC WORD LIST, which is used to enable Soundex and specify stemming and fuzzy matching methods (language-dependent).

The Stoplist category contains a single Tile, GENERIC STOP LIST, which is used to specify all the words that should not be indexed for the text column.

Text Loading Categories

The text loading categories identify the type of loading option for which the Tile is used.

The Reader category contains the DIRECTORY READER Tile, which is used to specify the directory where files to be loaded are stored.

The Translator category contains the following Tiles, which are used for translating files into the load file format required for text loading:

NULL TRANSLATOR
USER TRANSLATOR

The Engine category contains the GENERIC LOAD ENGINE Tile, which controls how the loading is performed.

Tile Attributes

Each Tile may have none, one, or many attributes that are specified to define a preference. The attributes identify which indexing options are active for the Tile in a preference.

Each Tile attribute has a value (either a number or a string) that you assign when you specify attributes in a preference.

Preferences

Preferences are created by users and are used to specify the options that ConText uses to index and load text (in batch mode). Each preference represents one (and only one) indexing/text loading option.

Preference Components

Each preference consists of:

a ConText Tile
one or more attributes (and their corresponding values) for the Tile

Each preference is grouped into a category, based on the indexing operation that the preference controls. While a category is not explicitly assigned to a preference, it is implied through the association of the Tile with the preference.
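The relationship between a preference, its Tile, and the implied category can be modeled as follows. This is a conceptual sketch only: the Tile names come from the lists in this chapter, but the Preference class, the preference names, and the attribute name are hypothetical and do not represent the ConText PL/SQL interface.

```python
from dataclasses import dataclass, field

# Indexing Tiles and their categories, as listed in this chapter.
TILE_CATEGORY = {
    "DIRECT": "Data Store", "MASTER DETAIL": "Data Store",
    "OSFILE": "Data Store", "URL": "Data Store",
    "FILTER NOP": "Filter", "HTML FILTER": "Filter",
    "BLASTER FILTER": "Filter", "USER FILTER": "Filter",
    "JAPANESE V-GRAM LEXER": "Lexer", "KOREAN LEXER": "Lexer",
    "CHINESE V-GRAM LEXER": "Lexer", "BASIC LEXER": "Lexer",
    "THEME LEXER": "Lexer",
    "GENERIC ENGINE": "Engine",
    "GENERIC WORD LIST": "Wordlist",
    "GENERIC STOP LIST": "Stoplist",
}


@dataclass
class Preference:
    name: str                      # hypothetical preference name
    tile: str                      # one of the Tiles above
    attributes: dict = field(default_factory=dict)

    @property
    def category(self):
        # The category is never stored; it follows from the Tile.
        return TILE_CATEGORY[self.tile]


pref = Preference("MY_LEXER", "BASIC LEXER", {"hypothetical_attribute": "value"})
print(pref.category)
```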

Predefined Preferences

During installation, ConText creates a number of preferences for each category. These predefined preferences can be used by any ConText user with the CTXAPP role to create template and column policies without first creating preferences.

See Also:

For a complete list of the predefined preferences provided by ConText, see "Predefined and Default Preferences: Indexing" in Chapter 10, "ConText Data Dictionary".

Default Preferences

For each category, ConText designates one of the predefined preferences as the default preference. If a preference for a category is not specified in a policy or source, ConText assigns the policy or source the default preference for the category.

Note:

To create a policy that uses all of the default preferences, define the policy without specifying any preferences.

See Also:

For a complete list of the predefined preferences provided by ConText, see "Predefined and Default Preferences: Indexing" in Chapter 10, "ConText Data Dictionary".

Policies

A policy is a logical grouping of six indexing preferences (one preference for each of the supported categories), assigned to a column in the database. A policy specifies the options used by ConText to create the index for the text in the column.

Note:

A policy must exist for a column before a ConText server can create an index for the column.

Policies can be created by any ConText user with the CTXAPP role. Policies are stored in the ConText data dictionary. In addition to the preferences for a policy, users specify a name for the policy and the text column for the policy, and a number of other policy attributes.

The policies created by a user must be unique for the user. As such, the same policy for a user cannot be assigned to more than one column.

Policy Examples

Consider a table with two text columns: one holds Microsoft Word documents and the other holds comments for the documents. The table structure is:

Table:    DOC_AND_COMMENT
Columns:  TEXTKEY  NUMBER (unique primary key)
          TEXTDATE DATE
          AUTHOR   VARCHAR2(50)
          COMMENTS VARCHAR2(2000) (text column storing ASCII text)
          DOC      LONG RAW (text column storing MS Word documents)

To create a text index for both the comments and doc columns in doc_and_comment, a policy must be defined for each column. The following examples illustrate two policies, named i_doc and i_comments, that could be created:

Policy 1 Name: I_DOC
Text Column:   DOC_AND_COMMENT.DOC
Engine:        General Purpose Engine
Filter:        MS-Word
Lexer:         General Purpose Lexer
Data Store:    Direct (text in column)
Word List:     Soundex and stemming

Policy 2 Name: I_COMMENTS
Text Column:   DOC_AND_COMMENT.COMMENTS
Engine:        General Purpose Engine
Filter:        None (ASCII text)
Lexer:         General Purpose Lexer
Data Store:    Direct (text in column)
Word List:     * none *

In addition, to create a theme index for the doc column, a theme indexing policy must be defined for the column. The following example illustrates a policy named i_theme that could be created for the table:

Policy 1 Name: I_THEME
Text Column:   DOC_AND_COMMENT.DOC
Engine:        General Purpose Engine
Filter:        MS-Word
Lexer:         Theme Lexer
Data Store:    Direct (text in column)
Word List:     Soundex and stemming

Multiple Policies on a Column

Multiple policies, as long as they are unique for the user, can be assigned to a column. As a result, a column can have more than one index.

When a query is performed, you can specify a policy name to indicate the index that is used to process the query.

This feature is particularly useful if you have English-language documents for which you want to enable both text and theme queries. To enable text and theme queries, you must create two separate policies on the column containing the documents and index the column once for each policy.

Policy Attributes

To define a policy, a user specifies a name for the policy and a number of optional attributes.

Policy Name

Because a policy is owned by the user who creates it, the policy name must be unique for a user; however, different users can have policies with the same name.
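The naming rule can be pictured as a registry keyed by owner and policy name. The sketch below is a conceptual model, not the data dictionary itself; the class, its methods, and the user name SCOTT_TEST are hypothetical, while the policy and column names are taken from the examples in this chapter.

```python
class PolicyRegistry:
    """Toy model: policy names are unique per owner, not globally."""

    def __init__(self):
        self._policies = {}  # (owner, policy name) -> column or None

    def create_policy(self, owner, name, column=None):
        key = (owner, name)
        if key in self._policies:
            raise ValueError("policy %s already exists for user %s" % (name, owner))
        self._policies[key] = column

    def policies_on(self, column):
        """Return (owner, name) pairs of all policies assigned to column."""
        return sorted(key for key, col in self._policies.items() if col == column)


registry = PolicyRegistry()
registry.create_policy("CTXDEMO", "I_DOC", "DOC_AND_COMMENT.DOC")
registry.create_policy("CTXDEMO", "I_THEME", "DOC_AND_COMMENT.DOC")
registry.create_policy("SCOTT_TEST", "I_DOC", "OTHER_TABLE.DOC")  # same name, other user
print(registry.policies_on("DOC_AND_COMMENT.DOC"))
```

Keying on the pair rather than the name alone is what allows two users to hold policies with the same name, and two differently named policies to share a column.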

Optional Attributes

The following policy attributes are optional:

Text Column

The column in a table to which a policy is assigned. It is the column used to store text in the table.

Note:

If the policy does not include a text column, the policy is a template policy, which can be used as a source policy in another policy.

Description

The description of the policy.

Textkey

The primary key column or columns (up to sixteen) for the table. This attribute is required if the policy is being assigned to a column.

Line Number

The column storing the unique identifier for the text column in a master-detail table. A master-detail table does not store a document as a single row, but rather breaks the document (identified by the textkey) into sections and stores each section in a separate row in the table. The collection of rows with the same textkey represents the whole document.

This attribute is used only for policies that include a preference for the MASTER DETAIL Tile.
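The way rows in a master-detail table combine into whole documents can be sketched as follows. The row layout (textkey, line number, text) and the choice to join sections with a newline are assumptions of the example, and the function name is hypothetical.

```python
from collections import defaultdict


def assemble_documents(rows):
    """Rebuild documents from (textkey, line_number, text) detail rows."""
    sections = defaultdict(list)
    for textkey, line_number, text in rows:
        sections[textkey].append((line_number, text))
    return {
        textkey: "\n".join(text for _, text in sorted(parts, key=lambda p: p[0]))
        for textkey, parts in sections.items()
    }


detail_rows = [
    (1, 2, "second section of document 1"),
    (2, 1, "only section of document 2"),
    (1, 1, "first section of document 1"),
]
print(assemble_documents(detail_rows))
```

Rows may arrive in any order; the line number restores the section order within each textkey.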

Source Policy

An existing template policy that you want to use as the basis for a new policy. When you specify a source policy in a policy, all of the preferences for the template (source) policy are copied into the new policy. The preferences from the source policy can be overwritten by explicitly specifying a preference for the category.

Note:

When specifying a source policy in a policy, a user can specify either their own template policies or CTXSYS-owned template policies.

Preferences in Policies

To define a policy, the user specifies a preference for each of the six supported categories. ConText does not require the user to specify a preference for the seventh category, Compressor, because data compression is not currently supported.

A preference can be used in more than one policy; however, two preferences from the same category cannot be used in the same policy.

Note:

If you want to use the same preferences for two text columns, you must create two separate policies. The policies will be identical (having all of the same preferences), but they must have unique names and be attached to different columns. This is true whether the columns are in the same table or in different tables.

Preference Defaults

In a policy, if a user does not specify a preference for one of the preference categories, ConText uses the default preference for the category.

[Figure not reproduced: how the default preferences and user-specified preferences work together to create a complete policy.]

Template Policies

A template policy is a policy for which a text column was not specified during creation. Template policies are stored in the ConText data dictionary and are owned by the user who created them.

A template policy can be used by the policy owner as a source policy when creating new policies. In addition, ConText provides a number of default policies owned by CTXSYS that can be used by all users.

When a template policy is used as a source policy in a new policy, all of the preferences for the template policy are copied to the new policy. Any preference from the template policy can be overridden by explicitly naming a preference (for the same category) during the creation of the new policy.
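The order in which preferences are resolved for a new policy (category defaults first, then the preferences copied from a source policy, then any explicitly named preferences) can be sketched as a dictionary merge. This is a conceptual model; the function name and every preference name in it are hypothetical placeholders rather than predefined ConText preferences.

```python
CATEGORIES = ("Data Store", "Filter", "Lexer", "Engine", "Wordlist", "Stoplist")


def resolve_policy(defaults, source=None, explicit=None):
    """Return one preference per category for a new policy.

    Later layers override earlier ones: defaults, then the source
    (template) policy, then explicitly specified preferences.
    """
    resolved = dict(defaults)
    resolved.update(source or {})
    resolved.update(explicit or {})
    unknown = set(resolved) - set(CATEGORIES)
    if unknown:
        raise ValueError("unknown categories: %s" % sorted(unknown))
    return resolved


# Hypothetical preference names, one default per category.
defaults = {category: "DEFAULT_" + category.upper().replace(" ", "_")
            for category in CATEGORIES}
template = {"Filter": "TEMPLATE_WORD_FILTER", "Wordlist": "TEMPLATE_WORDLIST"}
policy = resolve_policy(defaults, source=template,
                        explicit={"Wordlist": "MY_WORDLIST"})
print(policy)
```

In this example the Filter preference comes from the template policy, the Wordlist preference from the explicit override, and the remaining four categories fall back to their defaults.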

Sources

A source is a logical grouping of three text loader preferences (one preference for each of the supported categories), assigned to a column in the database. A source specifies the options used by ConText to load text automatically into a column using ctxload and ConText servers with the Loader personality.

Note:

A source must exist for a column before a ConText server with the Loader personality can load text from an operating system file into the column.

Sources can be created by any ConText user with the CTXAPP role. Sources are stored in the ConText data dictionary. In addition to the preferences for a source, users specify a name and text column for the source. Users can also specify a description and a refresh rate for directory scanning.

The sources created by a user must be unique for the user. As such, the same source for a user cannot be assigned to more than one column.

Source Attributes

To define a source, a user specifies a name and column for the source, and optionally a description and refresh rate for the source.

The column in the source indicates the column to which text is loaded by ConText servers.

Note:

The column must be a LONG or LONG RAW column, because load servers only support loading text columns with these datatypes.

Preferences in Sources

To define a source, the user specifies a preference for each of the three supported categories.

A preference can be used in more than one source; however, two preferences from the same category cannot be used in the same source.

If a preference for one of the categories is not specified when the source is created, the default, predefined preference for the category is used in the source.

Note:

All three of the loading categories have defaults; however, the default preference for the Reader category should not be used. This is because the directory specified in the default Reader preference is a generic directory specification and probably does not exist in your file system.
