
A3: Search

Deadline: Wednesday, 3/28/18, 11:59 pm

This assignment may be done as an individual or with one partner. The use of a partner is optional. If you partner with someone outside your recitation, please alert both TAs in advance, so that they can coordinate on who is grading your submission. Sharing of code is permitted only between you and your partner; sharing among larger groups is prohibited. Your recitation TA will devote some recitation time to helping connect you with potential partners.

Warning about academic integrity. This assignment involves implementing some data structures. In the vast expanses of the Internet, there might be an implementation of every known data structure in every known programming language. So it's conceivable you could just Google for the solutions to certain parts of this assignment. Well, we've already done that ourselves and put the results in MOSS, the plagiarism detection software we use. If you do steal code from somewhere and fail to cite it, that will be regarded as an especially heinous violation of academic integrity. If you do cite, the grade penalty will be severe. (Though of course not as severe as the penalty for violating the code of academic integrity.) In previous years, there were over a dozen academic integrity violations on this assignment, mostly because students Googled for balanced tree implementations in OCaml and re-used code they found in a functional programming course at another Ivy League university. Don't search for solutions. Just don't. Once you see someone else's solution, even just some types or signatures of helper functions, it's impossible to unsee it; you will almost certainly write code that is too similar to theirs even if you do not directly copy it.


Introduction

In this assignment, you will develop a search engine for text documents. The prototypical kind of document you should have in mind is a book from Project Gutenberg. Though that project offers books in many formats, the ones you should look at are the Plain Text versions, such as the one labeled "Plain Text UTF-8" of Alice's Adventures in Wonderland by Lewis Carroll.

Your engine will crawl through a directory on a local disk looking for documents. When it finds them, it will index the words that appear in those documents. It will then answer queries posed by users. A query will involve words that might appear in the documents, and the response to a query will be a list of documents that involve those words.

The index maintained by your search engine will make use of lists, trees, dictionaries, and sets. To get your engine working at first, you will implement a simple dictionary based on association lists. That implementation will be too inefficient to index large documents, so you will later implement a dictionary using AVL trees. You will likewise implement sets at first with an inefficient list representation, then later with an efficient AVL tree representation.

The queries that users pose will have one of two forms. Abstractly those two forms are as follows, in which the NOT clause is always optional:

  AND (w1, w2, ..., wn), NOT (u1, u2, ..., um)
  OR (w1, w2, ..., wn), NOT (u1, u2, ..., um)

For example, "AND (far, away)" would return all documents that contain both the words "far" and "away", whereas "AND (far, away), NOT (day, night)" would return all documents that do contain both "far" and "away" but do not contain "day" nor "night". Likewise, "OR (all, nothing)" would return any document that contains "all" or "nothing" (or both), and "OR (all, nothing), NOT (between)" would return any document that contains "all" or "nothing" (or both) but does not contain "between".

You will also build a test suite for your search engine, and as part of that test suite, you will build a test harness that could be used to test any dictionary, not just your own. The course staff will run your test harness against some faulty dictionary implementations to see whether your harness detects the faults in them.

All of the above tasks will require extensive use of the OCaml module system. The release code mostly just contains a few interface files; your job is to write implementation files for those interfaces.

Assignment information

Objectives.

Recommended reading.

Requirements.

  1. You must implement the document crawler, indexer, and querier.

  2. You must implement two dictionary data structures, one with association lists and the other with AVL trees.

  3. You must implement a functor that produces a set data structure from an arbitrary dictionary data structure.

  4. You must build a test harness for dictionaries.

  5. Your code must be written with good style and be well documented. All functions, except for nested helper functions, must have specification comments. Representation types for data structures must have a documented abstraction function, and any representation invariants must also be documented.

  6. Your code must avoid duplication of implementations. There is some duplication of type declarations that is unavoidable; see Part 1 below.

  7. You must submit an overview document.

  8. You must use Git as part of your development process.

What we provide.

In the release code on the course website you will find these files:

  - engine.mli and engine.ml: the search engine
  - data.mli and data.ml: the dictionary and set data structures
  - test_main.ml, test_data.mli, test_data.ml, test_engine.mli, and test_engine.ml: the test suite
  - a Makefile, a _tags file, and a .ocamlinit file to support compilation
  - test1/ and test2/: directories of documents for testing your engine

Software note.

The Makefile relies on the QuickCheck package. You can install it by issuing the following command:

opam install qcheck

from the terminal.

Grading issues.

Prohibited OCaml features. You may not use imperative data structures, including refs, arrays, mutable fields, and the Bytes and Hashtbl modules. Strings are allowed, but the deprecated (mutable) functions on them in the String module are not. Your solutions may not require linking any additional libraries/packages beyond those in the provided _tags file. You may, and in fact must, use I/O functions provided by the Pervasives module and the Unix module, even though they cause side effects, to implement your engine. All functions in the Str library (which is different from the String module) may be used, even though some of them do cause side effects. Your implementations of dictionaries and sets may not use the standard library Map or Set implementations. Your test suite is free to use them, if you wish, for the purpose of comparing the correctness of your implementations to a "known good" implementation.

Part -1: Git

You are required to use Git, a distributed version control system, whether you are working as an individual or with a partner. We recommend the following:

  1. Do this Git tutorial. Although that tutorial covers branches, they are an advanced feature that you do not need to use. For the most part, the only commands you need to know are pull, add, commit, and push. Atom has built-in support for these commands.

  2. Use what you have learned to create a private Git repo on the Cornell CIS GitHub. Throughout your development of A3, commit your changes to it. Those commits provide checkpoints, in case you need to restore your development to a previous point.

  3. Sync with a remote repo to share code between you and your partner, or simply to back up your development if you are working as an individual. A typical command cycle is sketched below.
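For reference, assuming a remote repo is already configured, a typical cycle looks like this:

  $ git pull
  $ git add .
  $ git commit -m "Describe your change here"
  $ git push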

You are strongly encouraged to use the Cornell CIS GitHub rather than the public GitHub (whose URL we deliberately omit here). The main reason for this is that the CIS GitHub allows unlimited private repositories.

Private repos are of the utmost importance. A public repo would share your code with the entire world, including your classmates, thus violating the course policy on academic integrity. Therefore we require that you keep all your CS 3110 related code in private repos. To create a private repository, make sure you select the "Private" radio button when creating a new repository.

Part 0: Makefile

As usual, we provide a Makefile.

Although testing interactively in utop is often quite helpful, you should be aware that when you #use a .ml file, that is only textual inclusion of the file as if you had typed it into the toplevel yourself. The file is not treated as part of a compilation unit in that case, so it is not checked against any .mli file. You must compile the file with make (or at least with ocamlbuild) in order for the implementation to be checked against the interface. That's why the provided .ocamlinit has #load directives rather than #use directives. You will therefore need to recompile your code with make compile each time before launching utop.
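For illustration, the .ocamlinit might contain directives along these lines (a sketch only; the provided file's exact contents may differ):

  #directory "_build";;   (* where ocamlbuild places compiled .cmo files *)
  #load "data.cmo";;
  #load "engine.cmo";;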

Part 1: Read the release code

Your next task is to familiarize yourself with the structure of the release code we have shipped to you. Read each of the files in detail, noting carefully the specifications that are provided. A good order to do that would be engine.mli, engine.ml, data.mli, data.ml, test_main.ml, test_data.mli, test_data.ml, test_engine.mli, test_engine.ml. There are close to 500 lines of code and comments in those files. Make sure to start reading all of them right away! Don't put this off until the last minute.

You may not change the names and types in the provided .mli files. They are the interface you must implement, and they are the interface against which the course staff will test your submission. You may add new declarations and definitions. If you're in doubt about whether a change is permitted or not, just run make check: it will tell you whether you have changed the interface in a prohibited way.

One thing you will notice is that there is a certain amount of copying of a particular kind of code between interfaces and implementations: any type declaration must be repeated in both files. This is because if a .mli file contains a type declaration, the .ml must contain that declaration as well to implement the interface—just like if the .mli file declares a value, the .ml must define it. Though this might seem weird, it enables separate compilation of interfaces from implementations, which is actually a good thing.
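For example, with a hypothetical pair of files (the names and types here are illustrative, not from the release code):

  (* size.mli *)
  type size = Small | Large
  val flip : size -> size

  (* size.ml must repeat the type declaration to implement the interface *)
  type size = Small | Large
  let flip = function Small -> Large | Large -> Small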

Part 2: List engine

As discussed above, the job of your search engine is to index a directory of files and answer queries about the words appearing in those files.

For purposes of this assignment, we define a word as follows: split the contents of a file on whitespace into maximal whitespace-free tokens; the word contained in a token, if any, is the longest substring of the token that begins and ends with an alphanumeric character (a letter or a digit). A token with no alphanumeric characters contains no word.

Yes, there will no doubt be some weird corner cases resulting from this definition of words. But we need a standard definition; this one is relatively simple for us all to use, and it gets many common cases right.

For example, given a file containing the following text:

  "I would found an institution where
   any person can find instruction in
   any study." ---E. Cornell (b. 1807)

The words in that file would be: "1807", "an", "any", "b", "can", "Cornell", "E", "find", "found", "I", "in", "institution", "instruction", "person", "study", "where", "would".

Searching. Searches must be case insensitive. For example, searching for either "Cornell" or "cornell" should return the file in the example immediately above.
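To illustrate both the word definition and case insensitivity, here is one possible approach using the Str library (permitted per the rules above); the helper names are our own, not part of the required interface:

  (* Matches a single alphanumeric character. *)
  let alphanumeric = Str.regexp "[A-Za-z0-9]"

  (* [word_of_token tok] is the longest substring of [tok] that begins
     and ends with an alphanumeric character, or [None] if [tok]
     contains no alphanumeric character. *)
  let word_of_token tok =
    try
      let first = Str.search_forward alphanumeric tok 0 in
      let last = Str.search_backward alphanumeric tok (String.length tok) in
      Some (String.sub tok first (last - first + 1))
    with Not_found -> None

  (* Lowercasing every word, both when indexing and when answering a
     query, makes search case insensitive. *)
  let normalize = String.lowercase_ascii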

Directories and files.

What to do. Implement Data.MakeListDictionary, Data.MakeSetOfDictionary, and Engine.ListEngine. Note that you might not need every one of the functions in Dictionary or Set to implement your engine, but they are required functions nonetheless. Also implement Test_data.tests to provide OUnit tests for your dictionary and set data structures, and Test_engine.tests to provide OUnit tests for your search engine. When you are done you will have a complete, working search engine. But its performance will be rather slow, because the data structures are not very efficient. That's okay for now.

Here are some implementation hints:
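To give the general shape, here is a minimal sketch of an association-list dictionary. It is only an illustration: the real Dictionary signature in data.mli is a functor over key and value modules, so your code will differ in its details.

  (* AF: [(k1,v1); ...; (kn,vn)] represents the dictionary mapping each
     ki to vi.  RI: no key appears more than once. *)
  type ('k, 'v) t = ('k * 'v) list

  let empty = []

  let insert k v d = (k, v) :: List.filter (fun (k', _) -> k' <> k) d

  let find k d = List.assoc_opt k d

  let remove k d = List.filter (fun (k', _) -> k' <> k) d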

Part 3: Test harness

As part of developing your data structures, you naturally will be constructing test cases that demonstrate the (in)correctness of your dictionaries. Let's take that one step further. The course staff will develop several faulty implementations of Dictionary. Your task is to construct a suite of test cases that finds all our faults (i.e., bugs). The faults we introduce will be in one or more of insert, find, and remove. Of course, your tests are free to exercise other functions (which will be implemented correctly) in an effort to find the faults. The faults we introduce will be related to correctness, not performance. They will cause incorrect values to be returned; they will not raise exceptions.

The test cases you write should be black-box tests against the Dictionary interface. You don't know what representation types we might use for our faulty implementations, so focus instead on the behavior specified in the comments of that interface.

The functor Test_data.DictTester is where you should implement your test harness. The course staff will instantiate that functor on our own incorrect Dictionary implementations, as well as some correct implementations. The functor will produce your list of OUnit test cases. The staff's correct implementations must pass all those cases, and the staff's incorrect implementations must each fail at least one test case.
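To make the idea concrete, here is a sketch over a simplified, hypothetical dictionary signature; the real signatures are in the release code and differ from this:

  open OUnit2

  module type SimpleDict = sig
    type t
    val empty : t
    val insert : int -> string -> t -> t
    val find : int -> t -> string option
    val remove : int -> t -> t
  end

  module MakeTester (D : SimpleDict) = struct
    let tests = [
      "insert then find" >:: (fun _ ->
        assert_equal (Some "a") D.(empty |> insert 1 "a" |> find 1));
      "insert overwrites" >:: (fun _ ->
        assert_equal (Some "b") D.(empty |> insert 1 "a" |> insert 1 "b" |> find 1));
      "remove deletes binding" >:: (fun _ ->
        assert_equal None D.(empty |> insert 1 "a" |> remove 1 |> find 1));
    ]
  end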

Your test harness should not rely on the presence of any directories of documents, including test1/ and test2/: those are for testing of your engine, not of your dictionaries per se.

To ease your mind, you need not worry about maliciously incorrect Dictionary implementations. By that we mean that our incorrect implementations will be of the sort that a programmer might create by accident or by misunderstanding, rather than being coded to fail only on negligibly-likely inputs (for example, only when the number 42 is inserted).

Part 4: Tree engine

Tackle this part of the assignment only after you have completed the previous parts.

The search engine you have implemented now works, but it is inefficient: you will notice that trying to index and search the provided test2/ directory, which is not overly large, takes a significant amount of time. The problem is that a list-based implementation of dictionaries and sets, while being relatively easy to implement (compared to what you're about to do next), offers poor performance: many operations are linear time.

Balanced search trees offer much better performance: their operations are generally logarithmic time. An AVL tree is a kind of balanced search tree in which, for every node, the heights of its left and right subtrees differ by at most one. Let's use them to implement an efficient dictionary.

To learn about AVL trees, read this handout by Prof. Lyn Turbak of Wellesley College. Read the handout. Seriously. Following that handout, implement Data.MakeTreeDictionary with an AVL tree data structure.

Representation type. In data.mli, we provide a type ('k,'v) avl to represent AVL trees. This must be your representation type for MakeTreeDictionary, as already coded in data.ml. We provide this type not out of a desire to restrict your creativity, but because in the past students have been led astray by inventing far more complicated representation types. Really, this type is all you need to represent an AVL tree.

The function Data.Dictionary.expose_tree exposes the representation of a dictionary as an AVL tree to clients of that data abstraction. This is not the best possible design choice for a library, but we made it so that the course staff could directly examine the trees your algorithms create. There are various places in the AVL tree algorithms where nondeterminism is possible: for example, there may be more than one valid place to insert a new binding in a tree. It is up to the course staff to build a test suite that allows any of those possible choices; you are free to choose any one of them.

Implementation.

The insert operation can be implemented by directly following the algorithm described in the handout, using rotations to rebalance the tree as needed after an insertion.
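For instance, a single left rotation is a small pattern match. The sketch below uses a simplified tree type of our own for illustration; your code must instead use the ('k,'v) avl type from data.mli.

  type ('k, 'v) tree =
    | Leaf
    | Node of ('k, 'v) tree * ('k * 'v) * ('k, 'v) tree

  (* Rotate left at the root: used when the right subtree is too tall.
     The symmetric right rotation and the double rotations handle the
     remaining imbalance cases. *)
  let rotate_left = function
    | Node (a, x, Node (b, y, c)) -> Node (Node (a, x, b), y, c)
    | t -> t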

The remove operation is the hardest part of implementing AVL trees, so you could make that the very last thing you finish. Even though this operation accounts for about 20% of the code you might need to write, we expect remove on AVL trees to end up being worth only about 5% of your score on the entire assignment. So even though it is the hardest part of the assignment, you don't have to do it at all to get a high score. You must implement the remove algorithm that is described in the AVL tree handout, which has logarithmic time performance. Linear-time algorithms (e.g., convert the tree to a list, remove the element from the list, then create a new tree out of that list) will receive zero points.

Efficiency. When your AVL tree implementation is complete, try running your search engine on many large documents. How efficient is it? As a point of comparison, one (not particularly optimized) staff solution is able to index test2/ in about 2 minutes with the list engine, and in about 2.5 seconds with the tree engine.

What to turn in

Submit files with these names on CMS:

The sizes of the above files are limited in CMS to 1 MB, which is driven in part by the fact that there are 300+ students in this course each submitting many files. Please stay within the size allotted. In particular, the purpose of a3src.zip is not for you to submit test cases with large files. Feel free to describe such test cases in your overview document, but do not submit them. Rest assured that the course staff will test your submission on our own large files!

To produce gitlog.txt for submission, run the following command in the directory containing your source code (e.g., test_main.ml):

$ git log --stat > gitlog.txt

Assessment

The course staff will assess the following aspects of your solution. The percentages shown are the approximate weight we expect each to receive.

Karma

Scale up:

How well does your search engine scale to large inputs? Make your implementation as efficient as possible while still being correct. We'll test on increasingly larger directories to see how quickly your engine indexes them. If there's a clear winner, we'll award up to 10% bonus points (beyond mere karma). Last year, the winner didn't attempt anything unusual: they just wrote really clean code that didn't do anything unnecessary.

Interface:

The search engine you built in this assignment doesn't have its own interface; instead, it relies on the toplevel. Build your own interface instead. It might be a simple textual interface or a graphical user interface, the latter being possible with LablGtk. It is fine for the purpose of implementing this functionality to provide additional instructions to the grader about packages to install, additional build commands to run, etc. But the code you submit must still pass make check and obey the original assignment specifications, so that your non-karma functionality can be graded with the rest of the submissions. (Sorry; in a 300+ student class, automation of grading is a requirement.)


Acknowledgement: Adapted from an assignment by Prof. Greg Morrisett, Dean of Cornell CIS.