# A3: Moogle
**Soft deadline:** Thursday, 10/08/15, 11:59 pm
**Hard deadline:** Saturday, 10/10/15, 11:59 pm
(Fall Break begins at 1:10 pm on 10/10/15. You are in no way required
to work on A3 over Fall Break. It is up to you whether you choose to
take the automatic extension until the hard deadline.)
*This assignment may be done as an individual or with one
partner. Sharing of code is permitted only between you and your partner;
sharing among larger groups is prohibited. You can use the "Search for
Teammates" feature on Piazza to find a partner.*
In this assignment, you will develop a web search engine.
Your search engine will be named **Moogle** in honor of our new dean,
Greg Morrisett, and [his love of cows][cows].
[cows]: https://www.quora.com/Why-is-Professor-Greg-Morrisett-so-fond-of-cows
## Overview
Moogle is designed with a client–server architecture.
As with most search engines, the client is simply a web
browser that connects to a server over HTTP. You therefore
won't have to implement the client or the connection
between client and server.
The Moogle server is implemented in OCaml. When it is first
loaded, it is pointed to a *root page*. From that root
page, it *crawls* looking for links to other pages. As it
proceeds, it *indexes* the pages that it finds. After it
finishes indexing, it is ready to process client requests.
The Moogle client (i.e., home page) allows the user to enter
a *query*. The query may be a word (e.g., "moo"), a
conjunctive query (e.g., "moo AND cow"), or a disjunctive
query (e.g., "moo OR cow"). By default, queries are treated
as conjunctive. So "moo cow" is treated as "moo AND cow",
hence returns pages that have both the word "moo" and the
word "cow". The query is sent to the Moogle server, which
computes the set of URLs that satisfy the query. For
example, given the query "moo OR cow", Moogle computes the
set of URLs of all indexed pages that contain either "moo" or "cow".
Moogle builds an HTML page as a response and sends the page back
to the client.
Some of the Moogle server is already implemented for you.
Your task is to implement the crawler and indexer. You will
also need to implement efficient data structures for sets
and dictionaries. For karma, you can implement a well-known
algorithm that prioritizes search results.
**Warning:** Because crawling the Internet demands careful attention
to protocols that well-behaved search engines are expected to obey
(in particular, `robots.txt`), this assignment is limited to crawling web
pages on your local disk. We strongly recommend that you **do not direct
your crawler to the Internet.**
## Objectives
* Write code that makes extensive use of the OCaml module system, including
structures, signatures, and functors.
* Work inside a large(ish) code base that you did not write yourself.
* Implement an efficient dictionary data structure using balanced search trees.
* Practice writing programs in the functional style using immutable data.
* Use Git, a distributed version control system.
## Recommended reading
* [Lectures and recitations 8–11][web]
* [RWO chapters 4 and 9][rwo]
* [Handout on 2-3 trees][23tree] (also available from [CMS][cms])
[web]: http://www.cs.cornell.edu/Courses/cs3110/2015fa/
[rwo]: https://realworldocaml.org/v1/en/html/index.html
## Requirements
The primary requirements are the following:
* You must implement the web crawler and indexer, a set data structure (implemented
with dictionaries), and a dictionary data structure (implemented with 2-3 trees),
as described below under Parts 1 and 2.
* You must complete the data-structure testing infrastructure in the release code, as
described below under Part 2.
* We must be able to compile and run your search engine as described below under
"Compiling and running Moogle" in Part 0.
* Your code must be written with good style and be well documented.
* You must use Git (or another version control system) as part of your development
process.
## What we provide
In the release code you will find these files:
* Many `.ml` files, which are described below under "Part 0: Understand the Moogle
codebase".
* A template file `a3.txt` for submitting your written feedback.
## What to turn in
Submit files with these names on [CMS][cms]:
* `a3src.zip`, containing your solution code.
* `vclog.txt`, containing your version control log.
* `a3.txt`, containing your written feedback.
[cms]: https://cms.csuglab.cornell.edu/
**To prepare `a3src.zip` for submission:**
From the directory that contains `moogle.ml`, bundle all your source code into
a zip file with this command:
```
$ zip a3src.zip *.ml
```
Do not include any compiled bytecode files; otherwise, your submission
might become too big to upload.
Double check that you got all your source files with this command:
```
$ zipinfo -1 a3src.zip
```
When run on the release code, the proper output would be:
```
crawl.ml
dict.ml
graph.ml
moogle.ml
myset.ml
nodescore.ml
order.ml
pagerank.ml
query.ml
util.ml
```
**To produce `vclog.txt` for submission:** Run the following
command in the directory containing `moogle.ml`:
```
$ git log --stat > vclog.txt
```
## Git
You are required to use [Git][git], a distributed version
control system, whether you are working as an individual or
with a partner. To get started...
1. Do this [Git tutorial][git-tutorial].
2. Use what you have learned to create your own local
Git repo. Throughout your development of A3, commit your
changes to it. Those commits provide checkpoints, in case
you need to restore your development to a previous state.
3. Sync with a remote repo to share code between you
and your partner (or simply to back up your development
if you are working as an individual). Both [GitHub][github]
and [BitBucket][bitbucket] provide free private
repos to students.
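For example, a typical local workflow uses the standard Git commands:
```
$ git init                            # once, in the directory containing moogle.ml
$ git add *.ml
$ git commit -m "implement balanced"  # repeat add/commit as you work
$ git push                            # if you have configured a remote repo
```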
**Private repos are of the utmost importance.
A public repo would share your code with the entire world, including your
classmates, thus violating the course policy on academic integrity.
Therefore we require that you keep all your CS 3110 related code in private repos.**
To create a private repository, simply make sure you select the "Private" radio
button when creating a new repository at GitHub, or check the
"This is a private repository" check box on BitBucket.
[git]: https://git-scm.com/
[git-tutorial]: https://try.github.io/
[github]: http://www.github.com
[bitbucket]: http://www.bitbucket.org
If you sign up for a GitHub or BitBucket account, make sure to sign up as a student
with your Cornell `.edu` email to take advantage of the free offers that are
available. BitBucket currently seems to account for this automatically
based on your email address. For GitHub, it seems you should currently
sign up at this link: [https://education.github.com/pack][github-pack].
[github-pack]: https://education.github.com/pack
## Grading issues
* **Compiling and running:** If we cannot compile and start the Moogle server exactly
according to the instructions given below in Part 0 under "Compiling and
running Moogle", your solution will receive minimal credit.
* **Code style:** Refer to the [CS 3110 style guide][style].
Ugly code that is functionally correct will nonetheless be penalized.
Take extra time to think and find elegant solutions.
* **Late submissions:** Carefully review the [course policies][syllabus] on
submission and late assignments. Verify before the deadline on CMS
that you have submitted the correct version.
* **Environment:** Your solution must function properly in the
[3110 virtual machine][vm], which is the official grading
environment for the course.
[style]: http://www.cs.cornell.edu/Courses/cs3110/2015fa/handouts/style.html
[syllabus]: http://www.cs.cornell.edu/Courses/cs3110/2015fa/syllabus.php
[vm]: http://www.cs.cornell.edu/Courses/cs3110/2015fa/vm.html
## Prohibited OCaml features
You may not use imperative data structures, including refs, arrays,
mutable fields, and the `Bytes` module. Strings are allowed, but the
deprecated mutable functions on them are not. You have not seen these
features in lecture or recitation, so we doubt you'll be tempted. Your
solutions may not require linking any additional libraries except `unix` and `str`.
(Note that for testing purposes the release code violates this prohibition
in one place, the `IntStringDictArg` module, but you'll never need to modify
or even read that code.)
## Part 0: Understand the Moogle codebase
Your preliminary task is to familiarize yourself with the structure of the
Moogle source code we have shipped to you. There are approximately 1400 LoC,
though you won't need to read all (or even most) of those. We provide the
following files in the release code:
- Files which you will work closely with:
* `order.ml`: definitions for an order datatype used to compare values.
* `myset.ml`: an interface and simple implementation of a set abstract datatype.
* `dict.ml`: an interface and simple implementation of a dictionary abstract datatype.
* `util.ml`: includes an interface and the implementation of crawler services needed to build the web index. This includes definitions of link and page datatypes, a function for fetching a page given a link, and the values of the command-line arguments (e.g., the initial link, the number of pages to search, and the server port).
* `crawl.ml`: includes a stub for the crawler that you will need to complete.
- Files which you won't work closely with:
* `query.ml`: a datatype for Moogle queries and a function for evaluating a query given a web index.
* `moogle.ml`: the main code for the Moogle server.
- Files primarily for karma:
* `graph.ml`: definitions for a graph abstract data type, including a graph signature, a node signature, and a functor for building graphs from a node module.
* `nodescore.ml`: definitions for node score maps.
* `pagerank.ml`: skeleton code for the PageRank algorithm,
including a dummy algorithm for computing page ranks.
**Exercise:** Skim each of those files now, and plan to come back and read them in
more detail later as necessary.
**Compiling and running Moogle:**
From the directory containing `moogle.ml`, run
```
$ cs3110 compile -p unix,str moogle.ml
```
to compile Moogle. Then run Moogle with the following command:
```
$ cs3110 run moogle -- 8080 7 simple-html/index.html
```
* The first command-line argument, `8080`, is the port to which your Moogle
server binds. If you already have a Moogle server running on port 8080
(perhaps because you accidentally left an old one running there), you might
need to try a different port, such as 8081 or 8082.
(You could also terminate the old Moogle server by using the Unix `ps` command.)
* The second command-line argument, `7`, is the upper bound on the number of pages
to index.
* The third command-line argument, `simple-html/index.html`, is
the root page on your local file system from which the crawler will start.
You should see that the Moogle server starts and prints some debugging
information, ending with these lines:
```
Starting Moogle on port 8080.
Press Ctrl-c to terminate Moogle.
```
Now connect to Moogle by opening this URL in a browser
inside the VM:
[http://localhost:8080](http://localhost:8080).
(Note: Moogle is not compatible with Safari.)
You should see a web page proclaiming "Welcome to Moogle!". Type in a
query; you'll get an empty response, because the implementation
is not yet finished.
## Part 1: Implement the crawler
Your first task is to implement the web crawler and build the search
index. To do this, replace the dummy `crawl` function in `crawl.ml` with
a function that builds a `WordDict.dict` mapping words to sets of links.
See the release code documentation of `crawl` for more details about
what to implement. For efficiency, the crawler should visit a page at most once.
And you'll need to solve the problem of avoiding an infinite loop when pages
link to one another.
Running the crawler in utop won't really be feasible. To test your
crawler, add code for testing purposes. For example, you could add
functions that print the index to a file, or to stdout, or you could
write unit tests that assert the index is correct. Note that the data
abstractions we provide (i.e., sets and dictionaries) contain operations
for converting values to strings. The value `Moogle.debug` in
`moogle.ml` controls whether some debug information is printed; you
might find it to be helpful.
As a small test case, the directory `simple-html/` contains seven small
HTML files that you can inspect yourself to determine the correct output.
You can compare that against the output of your crawler.
As larger test cases, you can try `html/index.html` and
`wiki/Teenage_Mutant_Ninja_Turtles` as roots. But note that, if you
attempt to run your crawler on these larger sets of pages, you are
likely to notice it takes a very long time to build an index. This is
because the implementations of dictionaries and sets that we provided in
the release code deliberately do not scale well.
*Hints:*
* The `CrawlerServices` module in `util.ml` is useful. The root URL
provided on the command line is bound to `CrawlerServices.initial_link`.
The upper bound on the number of pages to crawl is bound to
`CrawlerServices.num_pages_to_search`. Function
`CrawlerServices.get_page` will fetch a page given a link.
* The `WordDict` module in `crawl.ml` provides operations for building
and manipulating dictionaries mapping words to sets of links. Note how
it is defined: by applying a functor to a structure that defines the
`key` type as `string` and the `value` type as `LinkSet.set`. The
interface for `WordDict` can be found in `dict.ml`.
* The `LinkSet` module is defined in `pagerank.ml`. Like `WordDict`, it
is defined with a functor, where the element type is specified to be a
link. The interface for `LinkSet` can be found in the `myset.ml` file.
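Putting those hints together, here is a minimal sketch of one possible crawler structure. It is not the required design: the name `crawl_loop`, the assumption that `CrawlerServices.get_page` returns a `page option` whose record has `links` and `words` fields, and the argument orders used below are all illustrative; check `util.ml`, `myset.ml`, and `dict.ml` for the real signatures before borrowing anything.
```
(* A sketch only. Assumptions (verify against the release code): get_page
   returns a page option; a page has [links] and [words] fields; set insert
   takes the element first; dictionary insert takes the dictionary first. *)
let rec crawl_loop (n : int) (frontier : LinkSet.set)
    (visited : LinkSet.set) (index : WordDict.dict) : WordDict.dict =
  if n <= 0 then index
  else
    match LinkSet.choose frontier with
    | None -> index                              (* no more links to follow *)
    | Some (link, rest) ->
      if LinkSet.member visited link then
        crawl_loop n rest visited index          (* visit each page once *)
      else begin
        match CrawlerServices.get_page link with
        | None -> crawl_loop n rest visited index
        | Some page ->
          (* enqueue the page's outgoing links for later crawling *)
          let frontier' =
            List.fold_left (fun s l -> LinkSet.insert l s) rest page.links in
          (* map each word on the page to a set containing this link *)
          let index' =
            List.fold_left (fun d w ->
                let links = match WordDict.lookup d w with
                  | None -> LinkSet.singleton link
                  | Some s -> LinkSet.insert link s in
                WordDict.insert d w links)
              index page.words in
          crawl_loop (n - 1) frontier' (LinkSet.insert link visited) index'
      end

let crawl () : WordDict.dict =
  crawl_loop CrawlerServices.num_pages_to_search
    (LinkSet.singleton CrawlerServices.initial_link)
    LinkSet.empty WordDict.empty
```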
## Part 2: Implement efficient data structures
A *set* is a data structure that stores *elements*. A set differs from
a list in that the order of the elements is irrelevant, and there is no
notion of duplicate values. Moogle uses sets extensively. For example,
the web index maps words to sets of URLs, and the query evaluator
manipulates sets of URLs. The release code provides `ListSet` (in
`myset.ml`), which is a simple, inefficient, list-based implementation
of sets.
A *dictionary* is a data structure that maps *keys* to *values*.
Moogle also uses dictionaries extensively. For example, `WordDict` is a
dictionary that maps each word to the set of links to pages containing it.
The release code provides `AssocListDict` (in `dict.ml`), which
is a simple, inefficient, list-based implementation of dictionaries.
Your next task is to reimplement sets in terms of dictionaries,
then to implement an efficient dictionary data structure. You will
thus also obtain an efficient set data structure.
**Testing:**
Although the `Pa_ounit` and `cs3110 test` testing framework
you learned about in recitation is great, you will implement
a simpler testing infrastructure based on functors in this
assignment. There are four functors in the release code with
a `run_tests` function: `ListSet`, `DictSet`,
`AssocListDict`, and `BTDict`. You need to complete the
`run_tests` function for each of those functors. That
function should itself call many other unit test functions in the
functor. For an example of what to do, see the
`test_remove_*` functions called in `BTDict.run_tests`.
### 2.1. Reimplement sets as dictionaries
Given any implementation of a dictionary, it is possible to produce
an implementation of a set.
**Step 1:**
Figure out how to do that.
*Hint: One way to think about a dictionary is that it is,
abstractly, a set of (key,value) pairs.*
**Step 2:**
Uncomment the `DictSet` functor in `myset.ml`.
Finish its implementation. You'll need to build a suitable dictionary `D` by
calling `Dict.Make`. (*Hint: what can you pass in for the
type definitions of keys and values?*) You'll need to build the set
definitions in terms of the operations provided by `D`.
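To see the idea in isolation, here is a self-contained sketch that is independent of the release code. The signature `MINI_DICT` and functor `SetOfDict` are hypothetical names invented for this illustration; the release code's `Dict.Make` additionally requires comparison, string-conversion, and generator functions (see `DICT_ARG` in `dict.ml`).
```
(* The essence of sets-as-dictionaries: elements become keys, and values
   carry no information, so we fix them to unit. *)
module type MINI_DICT = sig
  type key
  type value
  type dict
  val empty : dict
  val insert : dict -> key -> value -> dict
  val member : dict -> key -> bool
  val remove : dict -> key -> dict
end

module SetOfDict (D : MINI_DICT with type value = unit) = struct
  type elt = D.key
  type set = D.dict
  let empty : set = D.empty
  let insert (x : elt) (s : set) : set = D.insert s x ()  (* bind x to () *)
  let member (s : set) (x : elt) : bool = D.member s x
  let remove (x : elt) (s : set) : set = D.remove s x
end
```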
**Step 3:**
Write test functions that exercise your implementation, following the testing
schema explained above.
**Step 4:**
Change the `Make` functor at the bottom of `myset.ml` so that it
uses `DictSet` instead of `ListSet`. Test your crawler.
Is it still correct? Is it more efficient?
### 2.2. Reimplement dictionaries as balanced trees
A *2-3 tree* is a kind of balanced tree data structure in which
nodes may have 2 or 3 children. 2-3 trees were invented by our very own
Prof. John Hopcroft in 1970. We'll use them to implement
an efficient dictionary.
To learn about 2-3 trees, read [this handout][23tree] from Prof. Lyn
Turbak at Wellesley College. You'll find a cached copy on [CMS][cms].
**Read every word of that handout. Seriously.**
**Have you read it? Good.**
[23tree]: http://cs.wellesley.edu/~cs230/fall04/2-3-trees.pdf
Here is the type definition for a dictionary implemented as a 2-3 tree:
```
type pair = key * value

type dict =
  | Leaf
  | Two of dict * pair * dict
  | Three of dict * pair * dict * pair * dict
```
In addition to 2-nodes (`Two`), which contain one (key,value) pair and
two subtrees, this type permits 3-nodes (`Three`), which contain two
(key,value) pairs (k1,v1) and (k2,v2) and three subtrees: left, middle,
and right. Below are the invariants as stated in the handout, with one
change that we discuss after stating the invariants.
**2-node invariants: Two(left,(k1,v1),right)**
* Every key k appearing in subtree left must satisfy k < k1.
* Every key k appearing in subtree right must satisfy k > k1.
* The length of the path from the 2-node to every leaf in its two subtrees must be the same.
**3-node invariants: Three(left,(k1,v1),middle,(k2,v2),right)**
* k1 < k2.
* Every key k appearing in subtree left must satisfy k < k1.
* Every key k appearing in subtree right must satisfy k > k2.
* Every key k appearing in subtree middle must satisfy k1 < k < k2.
* The length of the path from the 3-node to every leaf in its three subtrees must be the same.
All of the bounds in these invariants are strict (< and >, not <= and >=),
unlike in the handout. The handout permits storing multiple keys that are
equal to one another, but in our dictionary a key can be
stored only once.
The last invariant of each type of node implies that the
tree is balanced; that is, the length of the path from the root to
any `Leaf` node is the same.
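For instance, taking key = int and value = string (as in the release code's `IntStringBTDict` tests), the following tree satisfies all of the invariants: the keys are strictly ordered, and every path from the root to a `Leaf` has length 2.
```
let example : dict =
  Two (Three (Leaf, (1, "a"), Leaf, (2, "b"), Leaf),
       (4, "d"),
       Two (Leaf, (7, "g"), Leaf))
```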
Open up `dict.ml` and locate the commented-out `BTDict` functor,
which is intended to build a `DICT` module from a `DICT_ARG` module.
The rest of this section of the writeup will help you complete
that functor.
*Hint: anytime your code needs to compare keys, use
`DICT_ARG.compare`, not `Pervasives.compare`.*
#### 2.2.1. balanced
Locate this function in `BTDict`:
```
balanced : dict -> bool
```
Implement it. The function takes a 2-3 tree and returns `true` if and
only if the tree is balanced. Note that the function needs to check the
path length invariants, but should not check the key-ordering
invariants. In a comment above the function, document carefully in
English how your implementation works.
Think carefully about how to implement this function efficiently. (Our
solution runs in \\(O(n)\\), where \\(n\\) is the size of the
dictionary.)
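One possible shape for such a solution, as a sketch: a helper computes the height of a subtree only when that subtree is balanced, so a single traversal of the tree suffices. The helper name `height_opt` is illustrative, not part of the release code.
```
(* Returns Some h if d is balanced with height h, and None as soon as two
   sibling subtrees disagree, so the tree is traversed only once: O(n). *)
let rec height_opt (d : dict) : int option =
  match d with
  | Leaf -> Some 0
  | Two (l, _, r) ->
    (match height_opt l, height_opt r with
     | Some hl, Some hr when hl = hr -> Some (hl + 1)
     | _ -> None)
  | Three (l, _, m, _, r) ->
    (match height_opt l, height_opt m, height_opt r with
     | Some hl, Some hm, Some hr when hl = hm && hm = hr -> Some (hl + 1)
     | _ -> None)

let balanced (d : dict) : bool = height_opt d <> None
```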
After you have implemented the function:
* locate `run_tests` and uncomment the call to `test_balance`.
* uncomment the definition of `test_balance` itself.
* uncomment the definition of `IntStringBTDict`.
Check that all the tests now pass. You'll know a test has failed if an
assertion fails.
Now that you have confidence in your `balanced` function, you
are required to use it on all your tests involving `insert`, below.
#### 2.2.2. strings, fold, lookup, member
Implement these functions according to their specifications provided in `DICT`:
```
val fold : (key -> value -> 'a -> 'a) -> 'a -> dict -> 'a
val lookup : dict -> key -> value option
val member : dict -> key -> bool
val string_of_key : key -> string
val string_of_value : value -> string
val string_of_dict : dict -> string
```
You may change the order in which the functions appear, but you may not
change the signature (name, order of arguments, types) of any of these
functions. Note that the presence or absence of `rec` does not change
the signature of a function.
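As an example of what these operations enable, once `fold` is implemented, many derived operations become one-liners; this hypothetical helper counts the bindings in a dictionary.
```
let size (d : dict) : int =
  fold (fun _key _value acc -> acc + 1) 0 d
```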
#### 2.2.3. insert (upward phase)
To implement insertion, we'll use two phases—a *downward phase* to find the
right place to insert, and an *upward phase* to rebalance the tree, if
necessary. To distinguish when we need to rebalance and when we don't,
we'll use this type to represent a *kicked-up* configuration:
```
type kicked =
  | Up of dict * pair * dict  (* only 2-nodes require rebalancing *)
  | Done of dict              (* done rebalancing *)
```
The only kind of node that could potentially require rebalancing is a
2-node. (We can see this pictorially: the only subtrees that require
rebalancing are represented by an up arrow in the handout, and only 2-nodes
have up arrows.) Hence we make the `Up` constructor take the same
arguments as the `Two` constructor. We have provided stubs for functions
to perform the upward phase on a 2-node and a 3-node:
```
let insert_upward_two (w: pair) (w_left: dict) (w_right: dict)
    (x: pair) (x_other: dict) : kicked = ...

let insert_upward_three (w: pair) (w_left: dict) (w_right: dict)
    (x: pair) (y: pair) (other_left: dict) (other_right: dict) : kicked = ...
```
These functions return a `kicked` configuration containing the new
balanced tree (cf. page 5 of the handout). This new tree will be
wrapped with `Up` or `Done`, depending on whether this new balanced tree
requires further upstream rebalancing (indicated by an upward arrow).
Implement those functions, as described by the handout.
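For concreteness, here is a sketch of the 2-node case, assuming that comparisons (via `DICT_ARG.compare`, per the hint above) yield the `Less`/`Eq`/`Greater` constructors defined in `order.ml`. Because the kicked-up pair w came from one of x's subtrees, a single comparison decides the orientation, and absorbing w into a 2-node always yields a 3-node, so the result is wrapped in `Done`.
```
let insert_upward_two (w: pair) (w_left: dict) (w_right: dict)
    (x: pair) (x_other: dict) : kicked =
  match DICT_ARG.compare (fst w) (fst x) with
  | Less -> Done (Three (w_left, w, w_right, x, x_other))  (* w from left  *)
  | _    -> Done (Three (x_other, x, w_left, w, w_right))  (* w from right *)
```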
#### 2.2.4. insert (downward phase)
The release code provides three mutually recursive functions
to simplify the main insert function:
```
let rec insert_downward (d: dict) (k: key) (v: value) : kicked =
  match d with
  | Leaf -> ???                                 (* base case! see handout *)
  | Two (left, n, right) -> ???                 (* mutual recursion! *)
  | Three (left, n1, middle, n2, right) -> ???  (* mutual recursion! *)

and insert_downward_two ((k,v): pair) ((k1,v1): pair)
    (left: dict) (right: dict) : kicked = ...

and insert_downward_three ((k,v): pair) ((k1,v1): pair) ((k2,v2): pair)
    (left: dict) (middle: dict) (right: dict) : kicked = ...

(* the main insert function *)
let insert (d: dict) (k: key) (v: value) : dict =
  match insert_downward d k v with
  | Up (l, (k1,v1), r) -> Two (l, (k1,v1), r)
  | Done x -> x
```
Note that, like the upward phase, the downward phase also returns a
kicked-up configuration. The main insert function simply extracts the
tree from the kicked-up configuration returned by the downward phase
and returns that tree.
Implement these three downward-phase functions. You will need the
upward-phase functions that you wrote previously.
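As a sketch of how the phases connect, the 2-node case of the downward phase might look as follows (again assuming `Less`/`Eq`/`Greater` from `order.ml`): on an equal key, replace the value; otherwise recurse into the proper subtree and let the upward function absorb any kicked-up configuration.
```
and insert_downward_two ((k,v): pair) ((k1,v1): pair)
    (left: dict) (right: dict) : kicked =
  match DICT_ARG.compare k k1 with
  | Eq -> Done (Two (left, (k, v), right))      (* replace the old value *)
  | Less ->
    (match insert_downward left k v with
     | Up (l, w, r) -> insert_upward_two w l r (k1,v1) right
     | Done left' -> Done (Two (left', (k1,v1), right)))
  | Greater ->
    (match insert_downward right k v with
     | Up (l, w, r) -> insert_upward_two w l r (k1,v1) left
     | Done right' -> Done (Two (left, (k1,v1), right')))
```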
Next, add testing code for your implementations of `insert`, `lookup`,
and `member`. Use the test functions provided for `remove` as examples.
The release code provides functions to generate keys, values, and pairs:
see `DICT_ARG`. Remember to make `run_tests` call your newly created
test functions.
#### 2.2.5. remove (upward phase)
Like `insert`, the `remove` operation has an upward and downward phase.
We have written the downward phase for you.
Instead of passing kicked-up configurations, `remove` passes a *hole*:
```
type hole =
  | Hole of pair option * dict
  | Absorbed of pair option * dict
```
The `Hole` and `Absorbed` constructors correspond directly to the
terminology used in the handout. Each constructor carries a pair option
holding the removed (key,value) pair, along with the tree that remains.
The pair option is useful when the (key,value) pair being removed sits in
the middle of the tree and must therefore be replaced by the minimum
(key,value) pair of the current node's right subtree.
These two types indicate which subtree currently contains the hole:
```
type direction2 =
  | Left2
  | Right2

type direction3 =
  | Left3
  | Mid3
  | Right3
```
Implement the upward phase (cf. pages 9 and 10 of the handout) by
finishing `remove_upward_two` and `remove_upward_three`. Once you have
finished writing the upward-phase functions, uncomment the
`test_remove_*` functions in `run_tests ()`. All the tests should pass.
#### 2.2.6. choose
Finally, implement this function according to the specification in `DICT`:
```
choose : dict -> (key * value * dict) option
```
Write tests for your `choose` function.
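As an illustration of the specification, `choose` lets client code drain a dictionary one binding at a time; this hypothetical helper collects all the keys.
```
let rec keys (d : dict) : key list =
  match choose d with
  | None -> []
  | Some (k, _v, rest) -> k :: keys rest
```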
## Part 3: Integrate your crawler and your efficient data structures
Because the release code is well-designed, you can
easily swap in your 2-3 tree dictionaries in place of the list-based
dictionaries.
Just modify the `Make` functor at the end of `dict.ml`.
Now try out your improved search engine on the three sets of
webpages. Our startup code prints out the crawling time, so you can see
which implementation is faster.
If you are confident everything is working, try running a big test case:
```
$ cs3110 run moogle -- 8080 224 wiki/Teenage_Mutant_Ninja_Turtles
```
This could take up to several minutes to crawl and index.
You're done!
## Karma
**PageRank:**
For karma, implement a version of the [PageRank][pagerank] algorithm by following
the instructions below. Don't attempt this until you are sure
the rest of your assignment is working. There are two rankers
you can implement, described below.
[pagerank]: https://en.wikipedia.org/wiki/PageRank
We will apply our knowledge of ADTs and graphs to explore solutions to a
compelling problem: finding important nodes in graphs like the Internet,
or the set of pages that you're crawling with Moogle.
The concept of assigning a measure of importance to nodes is very useful
in designing search algorithms, such as those that many popular search
engines rely on. Early search engines often ranked the relevance of
pages based on the number of times that search terms appeared in the
pages. However, it was easy for spammers to game this system by
including popular search terms many times, propelling their results to
the top of the list.
When you enter a search query, you really want the important pages: the
ones with valuable information, a property often reflected in the
quantity and quality of other pages linking to them. Better algorithms
were eventually developed that took into account the relationships
between web pages, as determined by links. These relationships can be
represented nicely by a graph structure, which is what we'll be using
here.
Throughout the assignment, we'll want to maintain associations of graph
nodes to their importance, or NodeScore: a value between 0 (completely
unimportant) and 1 (the only important node in the graph).
In order to assign NodeScores to the nodes in a graph, we've provided
a module with an implementation of a data structure, `NODE_SCORE`, to hold such
associations. The module makes it easy to create, modify, normalize (to
sum to 1), and display NodeScores. You can find the module signature
and implementation in `nodescore.ml`.
To complete this karma, you'll implement a series of NodeScore algorithms
in different modules: that is, functions `rank` that take a graph and
return a `node_score_map` on it. As an example, we've implemented a
trivial NodeScore algorithm in the `IndegreeRanker` module that gives
each node a score equal to its number of incoming edges.
#### RandomWalkRanker
In this section, we will walk you through creating a robust ranking
algorithm based on random walks. The writeup goes through several steps,
and we encourage you to do your implementation in that order. However,
you will only submit the final version.
You may realize that we need a better way of saying that nodes are
popular or unpopular. In particular, we need a method that considers
global properties of the graph and not just edges adjacent to or near
the nodes being ranked. For example, there could be an extremely
relevant webpage that is several nodes removed from the node we start
at. That node might normally fare pretty low on our ranking system, but
perhaps it should be higher based on there being a high probability that
the node could be reached when browsing around on the internet.
So consider Sisyphus, doomed to crawl the Web for eternity; or, more
specifically, doomed to start at some arbitrary page and follow links
randomly. (We assume for the moment that the Web is strongly connected
and that every page has at least one outgoing link, unrealistic
assumptions that we will address shortly.)
Let's say Sisyphus can take \\(k\\) steps after starting from a random node.
We design a system that determines NodeScores based on how likely it is
that Sisyphus reaches a given page. In other words, we ask: where will
Sisyphus spend most of his time?
**Step 1.**
Implement `rank` in the `RandomWalkRanker` functor in `pagerank.ml`, which
takes a graph and returns a normalized NodeScore on the graph, where the
score for each node is proportional to the number of times Sisyphus
visits the node in `num_steps` steps, starting from a random node. For
now, your function may raise an exception if some node has no outgoing
edges. Note that `num_steps` is passed in as part of the functor parameter
`P`.
You might find the library function `Random.int` useful, and you might want to
write some helper functions to pick random elements from lists (don't
forget about the empty list case).
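One such helper, as a sketch (the name `random_element` is illustrative):
```
(* Pick a uniformly random element of a list, or None if the list is empty. *)
let random_element (lst : 'a list) : 'a option =
  match lst with
  | [] -> None
  | _ -> Some (List.nth lst (Random.int (List.length lst)))
```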
**Step 2.**
Our Sisyphean ranking algorithm does better at identifying important
nodes according to their global popularity, rather than being fooled by
local properties of the graph. But what about the error condition we
mentioned above: that a node might not have any outgoing edges? In this
case, Sisyphus has reached the end of the Internet. What should we do? A
good solution is to jump to some random page and surf on from there.
Modify your `rank` function so that, instead of raising an exception at
the end of the Internet, it has Sisyphus jump to a random node.
This should work better.
**Step 3.**
What if there are very few pages that have
no outgoing edges—or worse, what if there is a set of vertices with
no outgoing edges to the rest of the graph, but there are still edges
between vertices in the set? Sisyphus would be stuck there forever, and
that set would win the node popularity game hands down, just by having
no outside links. What we'd really like to model is the fact that
Sisyphus doesn't have an unlimited attention span, and starts to get
jumpy every now and then...
Modify your `rank` function so that if the module parameter
`P.do_random_jumps` is `Some alpha`, then with probability `alpha` at
every step, Sisyphus will jump to a random node in the graph, regardless
of whether the current node has outgoing edges. (If the current node has
no outgoing edges, Sisyphus should still jump to a random node in the
graph, as before.)
Don't forget to add some testing code. Possible approaches include
just running it by hand a lot of
times and verifying that the results seem reasonable, or writing an
`approx-equal` function to compare the result of a many-step run with what
you'd expect to find.
Submit your final version of `RandomWalkRanker`.
#### QuantumRanker
Our algorithm so far works pretty well, but on a huge graph it would
take a long time to find all of the hard-to-get-to nodes. (Especially
when new nodes are being added all the time...)
We'll need to adjust our algorithm somewhat. In particular, let's
suppose Sisyphus is bitten by a radioactive eigenvalue, giving him the
power to subdivide himself arbitrarily and send parts of himself off to
multiple different nodes at once. We have him start evenly spread out
among all the nodes. Then, from each of these nodes, the pieces of
Sisyphus that start there will propagate outwards along the graph,
dividing themselves evenly among all outgoing edges of that node.
So, let's say that at the start of a step we have some fraction \\(q\\) of
Sisyphus at a node, and that node has 3 outgoing edges. Then \\(q/3\\) of
Sisyphus will propagate outwards from that node along each edge. This
will mean that nodes with a lot of value will make their neighbors
significantly more important at each timestep, and also that in order to
be important, a node must have a large number of incoming edges
continually feeding it importance.
Thus, our basic algorithm takes an existing graph and NodeScore, and
updates the NodeScore by propagating all of the value at each node to
all of its neighbors. However, there's one wrinkle: we want to include
some mechanism to simulate random jumping. The way that we do this is to
use a parameter `alpha`, similar to the random jumps Sisyphus made in the
previous section. At each
step, each node propagates a fraction `1-alpha` of its value to its
neighbors as described above, but also a fraction `alpha` of its value to
all nodes in the graph. This will ensure that every node in the graph is
always getting some small amount of value, so that we never completely
abandon nodes that are hard to reach.
We can model this fairly simply. If each node distributes `alpha` times
its value evenly among all nodes at each timestep, then at each timestep
each node accrues `(alpha/n)` of the overall value in this manner, where
`n` is the number of nodes in the graph. Thus, we can model this effect
by having the base NodeScore at each timestep give `(alpha/n)` to every node.
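Written as a formula (assuming, as the normalization step guarantees, that the previous timestep's scores \\(s\\) sum to 1), each timestep computes a new score
\\(s'(v) = \alpha/n + (1-\alpha)\sum_{u \to v} s(u)/\mathrm{outdeg}(u)\\),
where the sum ranges over nodes \\(u\\) with an edge to \\(v\\), and a node with no outgoing edges is treated as having a single implicit edge to itself (see Step 1 below).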
That gives us our base case for each timestep of our quantized Sisyphus
algorithm. What else do we need to do at each timestep? Well, as
explained above, we need to distribute the value at each node to all of
its neighbors.
**Step 1.**
Write a function `propagate_weight` that takes a node, a graph, a
NodeScore that it's building up, and the NodeScore from the previous
timestep, and returns the new NodeScore resulting from distributing
the weight that the given node had in the old NodeScore to all of its
neighbors. Remember that only `(1-alpha)` of the NodeScore gets
distributed among the neighbors (`alpha` of the weight was distributed
evenly to all nodes in the base case). A value for `alpha` is passed in to
the functor as part of the `P` parameter. Note: To handle the case of
nodes that have no outgoing edges, assume that each node has an implicit
edge to itself.
**Step 2.**
Now we have all the components we need. Implement the `rank` function in
the `QuantumRanker` module. It should return the NodeScore resulting
from applying the updating procedures described above. The number of
timesteps to simulate is passed in to the functor as part of the `P`
parameter.
Don't forget to add some tests!
#### Using your page rankers
At the top of `crawl.ml` is a definition of a `MoogleRanker` module
that's used to order the search results. Replace the call to the
`IndegreeRanker` functor with calls to the other rankers you wrote, and
experiment with how it changes the results.
* * *
**Acknowledgement:** Adapted from Dean Greg Morrisett.