Phase I

[ Overview - Phase I - Phase II - Phase III ]

Phase I of your project will read a set of documents, parse them into documents and terms, and produce an inverted index and associated data structures which will be stored on disk and used in Phase II.

This phase will have two major components: the lexical analyzer and the inverter. For the lexer, you might choose to use code you wrote for Homework 1. You will need to make explicit what assumptions you make about the structure and words in the documents. You might choose to have your lexer configurable at run-time; the configuration file would then specify how to segment terms, what tag indicates the start of a new document, how to treat numbers, etc.

For the part of your program produces the inverted file, you will want to think carefully through your choice of data structures and algorithms. Use the material from the textbooks and readings distributed in lecture for this. You may choose to use the common libraries available on the UNIX systems; if you do, make sure you say so in your documentation!

Your program will need to save several data structures to disk. At the minimum, these will include the lexicon (the table of words occurring in the text, appropriate metadata, and pointers into the inverted file), the document list (a table indicating where to find the files on disk at retrieval time), and the inverted file itself. You do not need to store the documents locally unless you wish to.

Phase I resources

Here are some resources you might find useful in Phase I:

Stop Lists

These are some commonly-used stoplists. You may find that you want to edit them somewhat (for example, if you want the word 'C').

Stemmer

Martin Porter's web page contains implementations of the Porter Stemmer in a number of languages. Another implementation of the Porter stemmer comes from Andrew McCallum's Bag Of Words Library (1996), a toolkit for statistical language modeling, text retrieval, classification, and clustering.

Porter's Snowball Project has several stemmers in other languages. Snowball is a language for writing stemmers.

Milestones for Phase I (with target dates):

(9/27) Parse the files containing the documents into individual documents and words.
(10/4) Invert the collection in memory.
(10/11) Store the inverted file and associated data structures to disk.

Benchmarks for Phase I:

0. Which collection?

Which collection are you using for your project?

umbc-crawl-2
Reuters Corpus v1 (600MB subset)

A. Corpus statistics

Report the following collection statistics. If you compute these separately from your indexing process, make sure that both use the same document segmentation, term selection, etc.

How many unique words do you index from the collection?
How many documents are there in the collection? How many do you index?
How many postings (inverted file entries) do you create for this collection?
What are the lengths of the shortest and longest inverted lists? What is the average inverted list length? (State your answers in terms of number of document entries per posting.)

B. Invert collection

Index the collection, and report:

Wall-clock time to index the collection. You can use the Unix time(1) command to get this information, if you like.
How much memory is needed to index the collection. Account for all memory used.
How much total disk space is required for your index and related data structures? This total should include everything created in your indexing process that is needed later to answer queries (lexicon, inverted file, document location table). If you create temporary files, report the maximum disk space needed at any point in your indexing process.