UMBC CMSC 491/691-I Fall 2002
Last updated: 26 November 2002

Course Project


For the course project you will design and implement your own information retrieval system. The project will have three phases. In Phase I, you will build the indexing component, which will take a large collection of text and produce a searchable, persistent data structure. In Phase II, you will add the searching component, according to one of the models discussed in class. In Phase III, you will add some advanced functionality of your choice. The project is due at the end of week 15 (December 5th).
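To make "searchable, persistent data structure" concrete, here is a minimal sketch of one common choice, an inverted index that maps each term to the documents containing it. Everything in it is an assumption for illustration, not a required design: the helper names (`tokenize`, `build_index`, `save_index`, `load_index`, `lookup`), the three-document toy collection, and the use of Python's `pickle` for persistence. A real system for the course collections would likely need an on-disk structure that does not have to fit in memory.

```python
import pickle
import re
import tempfile
from collections import defaultdict
from pathlib import Path


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to a {doc_id: term_frequency} postings dictionary."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


def save_index(index, path):
    """Persist the index to disk (pickle is an illustrative assumption)."""
    with open(path, "wb") as f:
        pickle.dump(index, f)


def load_index(path):
    """Load a previously saved index from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


def lookup(index, term):
    """Return the sorted ids of the documents that contain the term."""
    return sorted(index.get(term.lower(), {}))


# Hypothetical toy collection; the real collections live on the CS servers.
docs = {
    "d1": "Information retrieval systems index text.",
    "d2": "An index maps terms to documents.",
    "d3": "Retrieval models rank documents.",
}

index = build_index(docs)

# Write the index out and read it back to show that it survives on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.pkl"
    save_index(index, path)
    reloaded = load_index(path)

print(lookup(reloaded, "index"))      # ['d1', 'd2']
print(lookup(reloaded, "documents"))  # ['d2', 'd3']
```

The sketch stores term frequencies alongside document ids because most retrieval models need them at search time, but which statistics to keep is a design decision that depends on the model you choose in Phase II.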

Phases I and II are required. They each consist of a small set of milestones which you should tackle in order. Each milestone has a target date; if you aim for the targets, you have a better chance of completing the entire project. To further encourage you not to put off the project until the last week of class (and who would do that?), there is a series of benchmarks, where you can test your system against everyone else's using some standard data sets.

Phase III is required of graduate students and encouraged, but not strictly required, of undergraduates. Phase III will be graded on what you complete. What does this mean? Suppose you submit your proposal for Phase III, only manage to get part of it working, and talk about your experience in your presentation at the end of class. You would not lose any credit for not getting it working completely. But if your proposal is too ambitious, you get nowhere on it, and you handwave your way through a presentation, you will likely lose credit -- choose your goal wisely!

The milestone target dates assume that you will need time to do Phase III.

The project may be done individually, or in small groups (no more than three, please). Groups must do Phase III. All members of a group are expected to contribute as equally as possible to all aspects of the project: design, implementation, documentation, and testing. Groups should hand me (email ok) a list of their group members by Thursday, September 12.

Deliverables

You will hand in four items, on which your grade will be based:

1. A design document (20%)

This should be written prior to writing code, although it will certainly change during the course of the project. The design document will clearly describe the overall design, all the components of your program, and how they interact with each other. Do not simply give a list of functions or classes and methods; the design document should be complete enough that one of your classmates could implement your program from it.

Any and all external resources that you use for your project, such as libraries (e.g., Berkeley DB), code (e.g., Porter's stemmer code), stop word lists, etc., must be documented in the design document.

2. Benchmark output (50%)

The output of your program for each phase benchmark. The specific output required is discussed in each benchmark. This output will be submitted via a BlackBoard "quiz".

3. Project diary (10%)

Write a careful summary of your experience: problems you encountered, solutions you discovered, and things you wish you had known. This should be in the form of a journal or log of events that occurred while working on the project, where you talk about what problems came up and how you solved them. Groups should especially include descriptions of their meetings: what was discussed and what was decided.

Parts 1, 2, and 3 will be handed in together as a single Project Report, in hard copy, in class on the project due date.

4. Code and accompanying documentation (20%)

Your code needs to be well documented with comments. You need to include a README file that describes how to compile your program and run the benchmarks with it. The README should indicate which CS system(s) your project is known to run on: Linux, Solaris, or IRIX. You will submit this as a .tar.gz file using BlackBoard's "digital dropbox" feature.

Implementation

You may code your project in the language or languages of your choosing, as long as I know them (I'm not learning a new programming language just to understand your project ;-). If you're not sure about your choice, see me. If you're not sure what to choose, I recommend C, C++, or Java.

Your project must run on one of the CS central servers, such as linuxserver1. This is essential because the test collections are mounted on those machines. You may not copy the collection off of the central computers; for some of the collections we have signed usage agreements under which we may not redistribute the documents.

Collections

Your project must work with either of two collections: