Discussion 5 handout
Group members (names & NetIDs)
Objectives
- Explore ADTs (interfaces) and Data Structures (classes) in the context of Java’s Collections library
- Select and compose ADTs
- Evaluate implementations of ADTs for efficiency in a particular use case
Preparation: Demo code and example data
Please download dis05-release.zip, extract it to a known location on your computer, and open it as a project in IDEA.
Task: Text file index
Your goal is to create a reverse index of a text file. For each line in the file, you want to keep track of the distinct words that occur on that line. Then using this index, you can report which lines a given word occurs on.
This problem can be solved with fairly little code by leveraging Java’s built-in implementations of appropriate ADTs. By solving the problem in terms of ADT operations (rather than jumping straight to code), it is easier to verify that your solution will work, and the same algorithm can readily be implemented in multiple languages.
Identify ADTs
The primary operation of our index is to report which distinct words appear on a given line. What would be an appropriate return type for this query?
Its secondary operation is to report which line numbers a given word occurs on. What would be an appropriate return type for this query?
Select a combination of ADTs (from those supported by Java’s collections framework) that would be suitable for storing the information needed to create the index. Specify their generic types.
Construct the index
In “Index.java”, declare fields of the types you selected above. Initialize these fields by constructing instances of appropriate classes from the Java collections framework.
Implement the constructor for Index
to populate these fields.
Query the index
Read the specification for wordsOneLine()
, then implement it using your fields. Note the word “creates” in the spec. Depending on your choice of fields, it might be possible to return an existing collection from your index’s state instead of creating a new one; why do you think creating a new collection is important?
Read the specification for linesWithWord()
, then implement it using your fields.
Look at main()
, which constructs an index for the included file “Austen.txt”. Add additional code to answer the following questions about the resulting index and check your results with a neighboring group or with a consultant:
-
How many dictinct words appear on line 18?
-
Does the word “young” appear on line 13?
-
How many different lines does the word “with” appear on?
Submission
- Open the assignment page for “Discussion activity 5” in CMSX
- [Recorder] Find the “Group Management” section and invite each group member
- [Others] Refresh the page and accept your invitation
- [Recorder] Take a picture of your work and save as either a JPEG or a PDF file named “discussion_responses” (you do not need to submit your test code). After all invitations have been accepted, upload your picture along with your code as your group’s submission.
- Recommended scanning apps: Microsoft Office Lens, Adobe Scan, Genius Scan, Evernote Scannable
Ensure that your group is formed and your work submitted before the Friday evening deadline.
Tips and reminders
- Discussion is not a race. Focus on the current activity being facilitated by your TA and engage your whole group to propose and explain ideas.
- Elect a recorder to maintain the “master” copy of your work (others are still encouraged to jot down ideas on scratch paper). Rotate this position each week.
- It is a violation of academic integrity to credit someone for work if they were not present when the work was done, and the whole group is accountable. Your CMS group must only include classmates who attended section with you on that day. Remember that our participation policies accommodate occasional absences without penalty.
- It is your individual responsibility to ensure that you are grouped on CMS and that your group’s work is submitted before the deadline. Log into CMS the day after section to ensure that everything is in order, and contact your groupmates if not. It would be prudent for multiple members to photograph the group’s work.
- Only one group member (typically the recorder) needs to upload a submission. But their submission must not be uploaded until after all group members have confirmed their membership in CMS (contact your TA if you have trouble grouping in CMS).
Challenge problem
If your entire group finishes early and are looking for more practice with this material, try completing this additional task. To be clear, we are not expecting anyone to submit work on challenge problems to CMSX, nor will your TA have time to discuss them in section. But you are welcome to discuss them on Ed or to bring questions and ideas to office hours.
Write a program that determines which words are contained in any files listed in its command-line arguments. For each word in any of the files (in alphabetical order), list which files contain that word (also in alphabetical order), and for each such file, list every line number (in numerical order, no duplicates) that the word appears on. Ignore capitalization of letters in words.
Example (partial) output from running java Indexer Tolkien.txt Austen.txt
:
assistance
* Austen.txt: 1
like
* Austen.txt: 16
* Tolkien.txt: 1, 2, 3, 4, 5, 6, 7
ring
* Tolkien.txt: 1, 3, 7
Bonus task: If a word appears on three or more consecutive lines, collapse them into a range in the output.
That is, instead of writing 1, 2, 3, 4, 5
, write 1-5
.
(This is very similar to an actual job interview problem, except that in the original problem, you would also have to implement all data structures and algorithms from scratch. Good code style, documentation, and test coverage were also assessed as part of the interview.)