UMBC CMSC 201 Fall '00 CSEE | 201 | 201 F'00 | lectures | news | help 

CMSC 201
Programming Project Four

Concordance

Out: Monday 11/13/00
Due: Before Midnight, Sunday 11/26/00

The design document for this project, design4.txt
is due: Before Midnight, Sunday 11/19/00

The Objective

The objective of this assignment is to give you practice with project and function design. It will also give you an opportunity to work with reading information from a file, sorting an array of structures, passing structures by reference, manipulating strings, dealing with command line arguments and some formatted printing.

The Background

Analyzing text is one of the primary uses of computers. Text is analyzed to make searching faster, and for statistical analysis. This project will give you the opportunity to analyze some text and report your findings.

A concordance is an alphabetical list of the words from a passage of text, together with the number of times each word occurs in the text. A concordance often also lists the line numbers on which each word appears, but that is not required for this project.

The Task

Design and code a project that will allow you to read in the information from a text file, create a concordance and report on various statistics about the words in the text. To make your program easier, the text file will be entirely in lower case and there will be no punctuation marks in the file.
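
One way to hold the concordance, sketched here under stated assumptions, is an array of structures that pairs each distinct word with its count. The names WORD, AddWord and ReadWords and the limits MAX_WORDS and MAX_LEN are hypothetical; the assignment does not prescribe them. Because the file is all lower case with no punctuation, the %s conversion of fscanf yields one word per call.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORDS 1000   /* assumed capacity; not given in the assignment */
#define MAX_LEN   50     /* assumed longest word, including the '\0' */

/* One concordance entry: a distinct word and how often it occurs. */
typedef struct {
    char word[MAX_LEN];
    int  count;
} WORD;

/* Record one occurrence of text in list[0..*n-1]. The count *n is
 * passed by reference so a new word can extend the list.
 * Returns 1 on success, 0 if the array is already full. */
int AddWord(WORD list[], int *n, const char *text)
{
    int i;

    assert(list != NULL && n != NULL && text != NULL);

    /* Linear search: bump the count if the word is already present. */
    for (i = 0; i < *n; i++) {
        if (strcmp(list[i].word, text) == 0) {
            list[i].count++;
            return 1;
        }
    }
    if (*n >= MAX_WORDS) {
        return 0;
    }
    /* New word: copy it in and start its count at 1. */
    strncpy(list[*n].word, text, MAX_LEN - 1);
    list[*n].word[MAX_LEN - 1] = '\0';
    list[*n].count = 1;
    (*n)++;
    return 1;
}

/* Read whitespace-separated words from an open file until end of file.
 * Returns the total number of words read, counting repeats. */
int ReadWords(FILE *ifp, WORD list[], int *n)
{
    char buffer[MAX_LEN];
    int total = 0;

    /* %49s leaves room for the '\0' in the 50-character buffer;
     * a longer word would be split, which this sketch does not handle. */
    while (fscanf(ifp, "%49s", buffer) == 1) {
        if (!AddWord(list, n, buffer)) {
            break;
        }
        total++;
    }
    return total;
}
```

A linear search is the simplest choice and is adequate for files the size of the test data; the list is left unsorted here, so it would still need sorting before it is printed.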

Your program will provide the following information from your concordance.
See the sample output for a suggested format.

  1. The contents of the text file to be analyzed
  2. An alphabetical list of all words in the text, together with the number of times each occurs
  3. An alphabetical list of the words which occur most frequently and the number of times they occur.
  4. An alphabetical list of the longest word(s), their length and the number of times each occurs.
  5. An alphabetical list of the shortest word(s), their length and the number of times each occurs.
  6. The average word length reported with one decimal place of precision.
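
The reports listed above could be computed along the following lines. This is only a sketch: the WORD structure, the function names and MAX_LEN are hypothetical, and using the library qsort rather than a hand-written sort is a choice made here, not a requirement of the assignment. Note that in the sample run for test.dat the reported average of 3.5 matches an average over all 26 word occurrences (92 characters / 26), not over the 13 distinct words (51 / 13, about 3.9), so AverageLength weights each word by its count. The print format approximates the column layout of the sample output.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LEN 50   /* assumed longest word, including the '\0' */

/* Hypothetical concordance entry: a distinct word and its count. */
typedef struct {
    char word[MAX_LEN];
    int  count;
} WORD;

/* Comparison function for qsort: alphabetical order by word. */
static int CompareWords(const void *a, const void *b)
{
    return strcmp(((const WORD *)a)->word, ((const WORD *)b)->word);
}

/* Sort the n entries of list alphabetically, in place. */
void SortWords(WORD list[], int n)
{
    assert(n >= 0);
    qsort(list, (size_t)n, sizeof(WORD), CompareWords);
}

/* Largest count in the list (0 if the list is empty). */
int MaxCount(const WORD list[], int n)
{
    int i, max = 0;

    for (i = 0; i < n; i++) {
        if (list[i].count > max) {
            max = list[i].count;
        }
    }
    return max;
}

/* Length of the longest word (0 if the list is empty). */
int LongestLength(const WORD list[], int n)
{
    int i, longest = 0;

    for (i = 0; i < n; i++) {
        int len = (int)strlen(list[i].word);
        if (len > longest) {
            longest = len;
        }
    }
    return longest;
}

/* Length of the shortest word (0 if the list is empty). */
int ShortestLength(const WORD list[], int n)
{
    int i, shortest = 0;

    for (i = 0; i < n; i++) {
        int len = (int)strlen(list[i].word);
        if (i == 0 || len < shortest) {
            shortest = len;
        }
    }
    return shortest;
}

/* Average word length over every occurrence in the text. */
double AverageLength(const WORD list[], int n)
{
    long chars = 0, words = 0;
    int i;

    for (i = 0; i < n; i++) {
        chars += (long)strlen(list[i].word) * list[i].count;
        words += list[i].count;
    }
    return words == 0 ? 0.0 : (double)chars / (double)words;
}

/* Print each entry whose word has the given length, five per line. */
void PrintWordsOfLength(const WORD list[], int n, int length)
{
    int i, printed = 0;

    for (i = 0; i < n; i++) {
        if ((int)strlen(list[i].word) == length) {
            printf("%12s%3d ", list[i].word, list[i].count);
            printed++;
            if (printed % 5 == 0) {
                printf("\n");
            }
        }
    }
    if (printed % 5 != 0) {
        printf("\n");
    }
}
```

If the list is sorted first, every later report that walks it from front to back comes out in alphabetical order without further work. The average can be printed with one decimal place using the %.1f conversion of printf.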

Several test data files are available for you.  You can view these files and examine their content.


You should copy one or more of these files into your account by using the following commands:
(don't forget that there is a dot (.) at the end of the command)

cp /afs/umbc.edu/users/s/b/sbogar1/pub/cs201/P4Data/preamble.dat .
cp /afs/umbc.edu/users/s/b/sbogar1/pub/cs201/P4Data/declaration.dat .
cp /afs/umbc.edu/users/s/b/sbogar1/pub/cs201/P4Data/mary.dat .
cp /afs/umbc.edu/users/s/b/sbogar1/pub/cs201/P4Data/rose.dat .
cp /afs/umbc.edu/users/s/b/sbogar1/pub/cs201/P4Data/test.dat .
cp /afs/umbc.edu/users/s/b/sbogar1/pub/cs201/P4Data/imagine.dat .

You should of course make your own test files as well.

The Specifications

Sample Run

Although your output need not look exactly like the sample output below, all information detailed in the specification above must be present. Your program must also print a short greeting. Don't be concerned if your output scrolls off the top of the screen. It will be very difficult to keep all output on a single screen.

There are two ways to examine output that scrolls off the screen:

  1. Use the Unix script command to capture your output in a file (named typescript) in order to examine it.
    For more information on the script command, see the Unix man pages.
  2. Redirect the output of your program into a file using Unix redirection as discussed in class.

irix1[1]% a.out
 Usage: a.out <filename>

irix1[2]% a.out ppp.dat
 can't open ppp.dat

irix1[3]% a.out test.dat
The original text:
this is the test file for project four
this is the test
this is only the test
this is not real because it is the test

The concordance contains 13 words, listed below alphabetically
     because  1         file  1          for  1         four  1           is  5
          it  1          not  1         only  1      project  1         real  1
        test  4          the  4         this  4

The most frequent word(s) occurred 5 times:
          is

The longest word(s) had length 7 :
     because  1      project  1

The shortest word(s) had length 2 :
          is  5           it  1

Average word length is 3.5 characters
irix1[4]%
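
The first two exchanges in the sample run (no argument, then a file that cannot be opened) suggest checks like the following. This is a sketch rather than the required implementation: OpenInput is a hypothetical helper, and writing the messages to stderr is an assumption, since the sample run does not show which stream is used.

```c
#include <assert.h>
#include <stdio.h>

/* Check the command line and open the named file for reading.
 * Expects exactly one argument after the program name.
 * Returns the open file, or NULL after printing a message. */
FILE *OpenInput(int argc, char *argv[])
{
    FILE *ifp;

    assert(argc >= 1 && argv != NULL);

    if (argc != 2) {
        /* argv[0] is the name the program was run under, e.g. a.out */
        fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
        return NULL;
    }

    ifp = fopen(argv[1], "r");
    if (ifp == NULL) {
        fprintf(stderr, "can't open %s\n", argv[1]);
    }
    return ifp;
}
```

main could call OpenInput(argc, argv) before doing anything else and exit with a nonzero status when it returns NULL; a file that was opened successfully should be closed with fclose once it has been read.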
 

Submitting the Program

You are to use separate compilation for this project, so you will be submitting a minimum of three files.
Your C source code file that contains main() MUST be called proj4.c. I would expect that you would also have files called
concordance.c and concordance.h, but you may choose to have additional .c and .h files.
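
For separate compilation, the header usually carries the shared type and the function prototypes, while the function bodies live in the matching .c file. A possible concordance.h is sketched below; every name in it (WORD, MAX_WORDS, MAX_LEN and the four functions) is hypothetical and stands in for whatever your own design calls for.

```c
/* concordance.h: declarations shared by proj4.c and concordance.c */
#ifndef CONCORDANCE_H
#define CONCORDANCE_H

#include <stdio.h>   /* for FILE in the prototypes below */

#define MAX_WORDS 1000   /* assumed capacity of the concordance */
#define MAX_LEN   50     /* assumed longest word, including the '\0' */

/* One concordance entry: a distinct word and how often it occurs. */
typedef struct {
    char word[MAX_LEN];
    int  count;
} WORD;

/* Prototypes only; the definitions would go in concordance.c. */
int    AddWord(WORD list[], int *n, const char *text);
int    ReadWords(FILE *ifp, WORD list[], int *n);
void   SortWords(WORD list[], int n);
double AverageLength(const WORD list[], int n);

#endif /* CONCORDANCE_H */
```

Both proj4.c and concordance.c would then begin with #include "concordance.h". The #ifndef guard keeps the structure from being defined twice if the header ends up included more than once in the same file.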

To submit your project, type the following at the Unix prompt. Note that the project name starts with uppercase 'P'.

submit cs201 Proj4 proj4.c concordance.c concordance.h (and possibly other files, separated by spaces)

To verify that your project was submitted, you can execute the following command at the Unix prompt. It will show all files that you submitted in a format similar to the Unix 'ls' command.

submitls cs201 Proj4



Monday, 30-Oct-2000 14:53:34 EST