UMBC CMSC 491/691-I Fall 2002
Last updated: 26 November 2002

Phase II

For Phase II, you will write a program that accepts queries from the user and searches for documents using the data structures produced in Phase I. You will choose a retrieval model from those discussed in class (e.g., Boolean, vector space, probabilistic) and implement the inverted search algorithm, using the model to rank the documents.
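As a concrete point of reference, here is a minimal sketch of a term-at-a-time inverted search with tf-idf weighting under a vector space model. The index layout (a dictionary mapping each term to a list of (doc_id, tf) postings) and the variable names are illustrative assumptions, not requirements of the assignment:

    # Term-at-a-time search over a Phase I inverted index (sketch).
    # Assumes: index[term] -> list of (doc_id, tf); N = total document count.
    import math
    from collections import defaultdict

    def search(query_terms, index, N, k=100):
        scores = defaultdict(float)               # accumulators: doc_id -> score
        for term in query_terms:
            postings = index.get(term)
            if not postings:
                continue                          # term absent from collection
            idf = math.log(N / len(postings))     # document-frequency weight
            for doc_id, tf in postings:
                scores[doc_id] += (1 + math.log(tf)) * idf
        # Top-k (doc_id, score) pairs, highest score first. Document-length
        # normalization is omitted here; a real system should apply one.
        return sorted(scores.items(), key=lambda p: p[1], reverse=True)[:k]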

Your search interface must allow the user to:

  1. Input a search query. This query may be a Boolean expression or a free-text query, depending on your model.
  2. View a list of search results. Each result should display an internal document ID (sequence number), a title or URL, and possibly a snippet of text from the document.
  3. Choose a document from the list to view. This should fetch the document from where it is stored and display it to the user.
In other words, implement the basic, single-query-and-result-list search process. Beyond these central requirements, you are free to make design and interface decisions as you see fit.
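One way to tie the three requirements together is a simple read-rank-display loop. A sketch, where search() is the routine above and fetch_document() and the titles mapping are hypothetical stand-ins for your own storage code:

    # Single-query-and-result-list loop (sketch). fetch_document() and the
    # titles mapping are hypothetical placeholders for your own routines.
    def interactive_loop(index, N, titles):
        query = input("query> ").lower().split()
        results = search(query, index, N, k=100)
        for rank, (doc_id, score) in enumerate(results[:20], start=1):
            print(f"{rank:2d}. [{doc_id}] {titles.get(doc_id, '')} ({score:.3f})")
        choice = int(input("view which result? "))
        doc_id = results[choice - 1][0]
        print(fetch_document(doc_id))   # retrieve from wherever Phase I stored it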

You may implement any model you choose; I ask only that you plan your approach with an eye toward effective retrieval. If you take a Boolean approach, consider how you might rank the result set that satisfies the query expression. For a vector space model, choose your weighting function carefully. For probabilistic and some vector space models, there are tuning parameters that must be set for your collection. You will want to test several queries of your own to get a sense of how well your algorithm performs. Feel free to refer to the papers cited in class or in the reading for tips.
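For example, if you go the probabilistic route, the Okapi BM25 weighting from the IR literature exposes exactly this kind of tuning parameter. A sketch, using the commonly cited starting values k1 = 1.2 and b = 0.75, which you would still tune for your own collection:

    import math

    def bm25_weight(tf, df, N, doc_len, avg_doc_len, k1=1.2, b=0.75):
        # k1 and b are the tuning parameters mentioned above; the defaults
        # shown are common starting points, not values tuned for this data.
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
        return idf * tf * (k1 + 1) / norm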

Milestones for Phase II (with target dates):

  1. (10/25) Implement a model and inverted search algorithm
  2. (11/1) Take a query interactively and display ranked results. Allow the user to select documents to read from the list.
  3. (11/8) Take a file of queries in batch fashion, and output the ranked list of documents for each query (a harness sketch follows this list). Evaluate the rankings with an evaluation package.
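A sketch of the batch mode, assuming one query per line in the input file; the file layouts here are assumptions, so adapt them to however your queries are actually stored:

    # Batch mode (sketch): one query per line in, one ranked list per query out.
    def run_batch(query_path, output_path, index, N):
        with open(query_path) as qf, open(output_path, "w") as out:
            for qnum, line in enumerate(qf, start=1):
                results = search(line.lower().split(), index, N, k=100)
                for rank, (doc_id, score) in enumerate(results, start=1):
                    out.write(f"{qnum} {doc_id} {rank} {score:.4f}\n")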

Benchmarks for Phase II:

A. Process a user query

Take a query of at least five words interactively from the user, retrieve the top 100 documents from the collection, show the top 20 document identifiers to the user with scores, and let the user choose one to display. Report the time needed for the entire interaction, from the moment the user submits the query to the moment the full document appears on the screen (a timing sketch follows the questions below).

  1. What is the query that you used?
  2. How long (wall-clock time) did it take to search the collection, display the top 20, allow the user to select one, and display the selected document?
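If you time inside your code rather than with a stopwatch, one minimal way to capture the wall-clock measurement is Python's time module; show_top_20(), get_user_choice(), and fetch_document() are hypothetical display and storage helpers:

    import time

    start = time.perf_counter()            # just after the user submits the query
    results = search(query_terms, index, N, k=100)
    show_top_20(results)                   # hypothetical display helper
    doc_id = get_user_choice(results)      # hypothetical selection helper
    print(fetch_document(doc_id))
    elapsed = time.perf_counter() - start  # stop once the document is on screen
    print(f"total interaction time: {elapsed:.2f} s")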

B. Handle stock queries of different lengths

Choose a topic from the topics file for your chosen collection. From that topic, create one query for each of (a) the title section, (b) the description section, and (c) the narrative section of the topic. Your queries should be as complete as possible given the information in that topic section and your query language (Boolean operators, +/- clauses, range operators, etc.). Record the time required to rank 100 documents for each query, using either a stopwatch or timing functions within your code.
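TREC-style topics are SGML-like, opening each field with a tag such as <title>, <desc>, or <narr> without a matching close tag, so each field runs until the next tag. A rough extraction sketch; verify it against the actual topics file for your collection before relying on it:

    import re

    # Pull the three fields out of one SGML-style topic (sketch). Field
    # labels such as "Description:" may still need stripping afterwards.
    def parse_topic(text):
        fields = {}
        for tag in ("title", "desc", "narr"):
            m = re.search(rf"<{tag}>(.*?)(?=<|\Z)", text, re.S)
            if m:
                fields[tag] = " ".join(m.group(1).split())
        return fields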

  1. What topic did you choose (e.g. "101: Economic espionage")?
  2. What is your query based on the "title" section of the topic?
  3. How long does your system take to rank the top 100 documents for this query?
  4. What is your query based on the "description" section of the topic?
  5. How long does your system take to rank the top 100 documents for this query?
  6. What is your query based on the "narrative" section of the topic?
  7. How long does your system take to rank the top 100 documents for this query?

C. Evaluate ranking effectiveness on stock search topics

Use your program to construct queries automatically from the topics for your collection, or create queries by hand if you wish. Your system should rank the top 100 documents for each query and collect them into a "TREC top results file", whose format is described in the handout on trec_eval from class as well as in the file trec_eval.README in the data directory. You will then run the trec_eval program on your "top results" file and the qrels file for your collection to produce evaluation measurements on your results.
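The standard TREC top-results line is whitespace-separated: topic number, the literal Q0, document identifier, rank, score, and a run tag. A sketch of emitting it (check trec_eval.README in the data directory for the exact expectations of our version):

    # One TREC top-results line per retrieved document (sketch):
    #   topic_id  Q0  doc_id  rank  score  run_tag
    def write_trec_results(out, topic_id, results, run_tag="phase2"):
        for rank, (doc_id, score) in enumerate(results, start=1):
            out.write(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}\n")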

  1. Did you create your queries automatically or manually? Describe how your queries were built.

  2. Run your queries and collect the ranked lists produced by your system in the TREC "top results" format. The easiest way to do this is to write a script or harness for your program that runs the queries automatically and saves the output to a file.

    Submit this file with the name "phase2.results.ian", replacing "ian" with your username, using the Blackboard "digital drop box" facility.

  3. Run trec_eval on your results file. The command will be something like "./trec_eval umbc-qrels phase2.results.ian". The trec_eval binary (for Linux) is in the data directory. The first argument is the relevance judgments file for your collection, either "umbc-qrels" or "rcv1-600MB-qrels"; the second argument is your results file. What is the output of trec_eval?