logoDiscussion 4 handout

Group members (names & NetIDs)

Objectives

Preparation: Demo code and example data

Please download dis04-release.zip, extract it to a known location on your computer, and open it as a project in IDEA.

Task: Text file index

Your goal is to create a reverse index of a text file. For each line in the file, you want to keep track of the distinct words that occur on that line. Then using this index, you can report which lines a given word occurs on.

This problem can be solved with fairly little code by leveraging Java’s built-in implementations of appropriate ADTs. By solving the problem in terms of ADT operations (rather than jumping straight to code), it is easier to verify that your solution will work, and the same algorithm can readily be implemented in multiple languages.

Identify ADTs

The primary operation of our index is to report which distinct words appear on a given line. What would be an appropriate return type for this query?

Its secondary operation is to report which line numbers a given word occurs on. What would be an appropriate return type for this query?

Select a combination of ADTs (from those supported by Java’s collections framework) that would be suitable for storing the information needed to create the index. Specify their generic types.

Construct the index

In “Index.java”, declare fields of the types you selected above. Initialize these fields by constructing instances of appropriate classes from the Java collections framework.

Implement the constructor for Index to populate these fields.

Query the index

Read the specification for wordsOneLine(), then implement it using your fields. Note the word “creates” in the spec. Depending on your choice of fields, it might be possible to return an existing collection from your index’s state instead of creating a new one; why do you think creating a new collection is important?

Read the specification for linesWithWord(), then implement it using your fields.

Look at main(), which constructs an index for the included file “Austen.txt”. Add additional code to answer the following questions about the resulting index and check your results with a neighboring group or with a consultant:

  1. How many dictinct words appear on line 18?

  2. Does the word “young” appear on line 13?

  3. How many different lines does the word “with” appear on?

Challenge problem

If your entire group finishes early and are looking for more practice with this material, try completing this additional task. To be clear, we are not expecting anyone to submit work on challenge problems to CMSX, nor will we have time to discuss them in section. But you are welcome to discuss them on Ed or to bring questions and ideas to office hours.

Write a program that determines which words are contained in any files listed in its command-line arguments. For each word in any of the files (in alphabetical order), list which files contain that word (also in alphabetical order), and for each such file, list every line number (in numerical order, no duplicates) that the word appears on. Ignore capitalization of letters in words.

Example (partial) output from running java Indexer Tolkien.txt Austen.txt:

assistance
* Austen.txt: 1
like
* Austen.txt: 16
* Tolkien.txt: 1, 2, 3, 4, 5, 6, 7
ring
* Tolkien.txt: 1, 3, 7

Bonus task: If a word appears on three or more consecutive lines, collapse them into a range in the output. That is, instead of writing 1, 2, 3, 4, 5, write 1-5.

(This is very similar to an actual job interview problem, except that in the original problem, you would also have to implement all data structures and algorithms from scratch. Good code style, documentation, and test coverage were also assessed as part of the interview.)