# A3: Moogle
**Soft deadline:** Thursday, 10/08/15, 11:59 pm
**Hard deadline:** Saturday, 10/10/15, 11:59 pm
(Fall Break begins at 1:10 pm on 10/10/15. You are in no way required
to work on A3 over Fall Break. It is up to you whether you choose to
take the automatic extension until the hard deadline.)
*This assignment may be done as an individual or with one
partner. Sharing of code is permitted only between you and your partner;
sharing among larger groups is prohibited. You can use the "Search for
Teammates" feature on Piazza to find a partner.*
In this assignment, you will develop a web search engine.
Your search engine will be named **Moogle** in honor of our new dean,
Greg Morrisett, and [his love of cows][cows].
[cows]: https://www.quora.com/Why-is-Professor-Greg-Morrisett-so-fond-of-cows
## Overview
Moogle is designed with a client–server architecture.
As with most search engines, the client is simply a web
browser that connects to a server over HTTP. You therefore
won't have to implement the client or the connection
between client and server.
The Moogle server is implemented in OCaml. When it is first
loaded, it is pointed to a *root page*. From that root
page, it *crawls* looking for links to other pages. As it
proceeds, it *indexes* the pages that it finds. After it
finishes indexing, it is ready to process client requests.
The Moogle client (i.e., home page) allows the user to enter
a *query*. The query may be a word (e.g., "moo"), a
conjunctive query (e.g., "moo AND cow"), or a disjunctive
query (e.g., "moo OR cow"). By default, queries are treated
as conjunctive. So "moo cow" is treated as "moo AND cow",
hence returns pages that have both the word "moo" and the
word "cow". The query is sent to the Moogle server, which
computes the set of URLs that satisfy the query. For
example, given the query "moo OR cow", Moogle computes the
set of URLs of all indexed pages that contain either "moo" or "cow".
Moogle builds an HTML page as a response and sends the page back
to the client.
Some of the Moogle server is already implemented for you.
Your task is to implement the crawler and indexer. You will
also need to implement efficient data structures for sets
and dictionaries. For karma, you can implement a well-known
algorithm that prioritizes search results.
**Warning:** Because crawling the Internet demands careful attention
to protocols that well-behaved search engines are expected to obey
(in particular, `robots.txt`), this assignment is limited to crawling web
pages on your local disk. We strongly recommend that you **do not direct
your crawler to the Internet.**
## Objectives
* Write code that makes extensive use of the OCaml module system, including
structures, signatures, and functors.
* Work inside a large(ish) code base that you did not write yourself.
* Implement an efficient dictionary data structure using balanced search trees.
* Practice writing programs in the functional style using immutable data.
* Use Git, a distributed version control system.
## Recommended reading
* [Lectures and recitations 8–11][web]
* [RWO chapters 4 and 9][rwo]
* [Handout on 2-3 trees][23tree] (also available from [CMS][cms])
[web]: http://www.cs.cornell.edu/Courses/cs3110/2015fa/
[rwo]: https://realworldocaml.org/v1/en/html/index.html
## Requirements
The primary requirements are the following:
* You must implement the web crawler and indexer, a set data structure (implemented
with dictionaries), and a dictionary data structure (implemented with 2-3 trees),
as described below under Parts 1 and 2.
* You must complete the data-structure testing infrastructure in the release code, as
described below under Part 2.
* We must be able to compile and run your search engine as described below under
"Compiling and running Moogle" in Part 0.
* Your code must be written with good style and be well documented.
* You must use Git (or another version control system) as part of your development
process.
## What we provide
In the release code you will find these files:
* Many `.ml` files, which are described below under "Part 0: Understand the Moogle
codebase".
* A template file `a3.txt` for submitting your written feedback.
## What to turn in
Submit files with these names on [CMS][cms]:
* `a3src.zip`, containing your solution code.
* `vclog.txt`, containing your version control log.
* `a3.txt`, containing your written feedback.
[cms]: https://cms.csuglab.cornell.edu/
**To prepare `a3src.zip` for submission:**
From the directory that contains `moogle.ml`, bundle all your source code into
a zip file with this command:
```
$ zip a3src.zip *.ml
```
Do not include any compiled bytecode files; otherwise, your submission
might become too big to upload.
Double check that you got all your source files with this command:
```
$ zipinfo -1 a3src.zip
```
When run on the release code, the proper output would be:
```
crawl.ml
dict.ml
graph.ml
moogle.ml
myset.ml
nodescore.ml
order.ml
pagerank.ml
query.ml
util.ml
```
**To produce `vclog.txt` for submission:** Run the following
command in the directory containing `moogle.ml`:
```
$ git log --stat > vclog.txt
```
## Git
You are required to use [Git][git], a distributed version
control system, whether you are working as an individual or
with a partner. To get started...
1. Do this [Git tutorial][git-tutorial].
2. Use what you have learned to create your own local
Git repo. Throughout your development of A3, commit your
changes to it. Those commits provide checkpoints, in case
you need to restore your development to a previous state.
3. Sync with a remote repo to share code between you
and your partner (or simply to back up your development
if you are working as an individual). Both [GitHub][github]
and [BitBucket][bitbucket] provide free private
repos to students.
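For example, a typical local workflow uses the standard Git commands:
```
$ git init                            # once, in the directory containing moogle.ml
$ git add *.ml
$ git commit -m "implement balanced"  # repeat add/commit as you work
$ git push                            # if you have configured a remote repo
```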
**Private repos are of the utmost importance.
A public repo would share your code with the entire world, including your
classmates, thus violating the course policy on academic integrity.
Therefore we require that you keep all your CS 3110 related code in private repos.**
To create a private repository, simply make sure you select the "Private" radio
button when creating a new repository at GitHub, or check the
"This is a private repository" check box on BitBucket.
[git]: https://git-scm.com/
[git-tutorial]: https://try.github.io/
[github]: http://www.github.com
[bitbucket]: http://www.bitbucket.org
If you sign up for a GitHub or BitBucket account, make sure to sign up as a student
with your Cornell `.edu` email to take advantage of the free offers that are
available. BitBucket currently seems to account for this automatically
based on your email address. For GitHub, it seems you should currently
sign up at this link: [https://education.github.com/pack][github-pack].
[github-pack]: https://education.github.com/pack
## Grading issues
* **Compiling and running:** If we cannot compile and start the Moogle server exactly
according to the instructions given below in Part 0 under "Compiling and
running Moogle", your solution will receive minimal credit.
* **Code style:** Refer to the [CS 3110 style guide][style].
Ugly code that is functionally correct will nonetheless be penalized.
Take extra time to think and find elegant solutions.
* **Late submissions:** Carefully review the [course policies][syllabus] on
submission and late assignments. Verify before the deadline on CMS
that you have submitted the correct version.
* **Environment:** Your solution must function properly in the
[3110 virtual machine][vm], which is the official grading
environment for the course.
[style]: http://www.cs.cornell.edu/Courses/cs3110/2015fa/handouts/style.html
[syllabus]: http://www.cs.cornell.edu/Courses/cs3110/2015fa/syllabus.php
[vm]: http://www.cs.cornell.edu/Courses/cs3110/2015fa/vm.html
## Prohibited OCaml features
You may not use imperative data structures, including refs, arrays,
mutable fields, and the `Bytes` module. Strings are allowed, but the
deprecated mutable functions on them are not. You have not seen these
features in lecture or recitation, so we doubt you'll be tempted. Your
solutions may not require linking any additional libraries except `unix` and `str`.
(Note that for testing purposes the release code violates this prohibition
in one place, the `IntStringDictArg` module, but you'll never need to modify
or even read that code.)
## Part 0: Understand the Moogle codebase
Your preliminary task is to familiarize yourself with the structure of the
Moogle source code we have shipped to you. There are approximately 1400 LoC,
though you won't need to read all (or even most) of those. We provide the
following files in the release code:
- Files which you will work closely with:
* `order.ml`: definitions for an order datatype used to compare values.
* `myset.ml`: an interface and simple implementation of a set abstract datatype.
* `dict.ml`: an interface and simple implementation of a dictionary abstract datatype.
* `util.ml`: includes an interface and the implementation of crawler services needed to build the web index. This includes definitions of link and page datatypes, a function for fetching a page given a link, and the values of the command-line arguments (e.g., the initial link, the number of pages to search, and the server port).
* `crawl.ml`: includes a stub for the crawler that you will need to complete.
- Files which you won't work closely with:
* `query.ml`: a datatype for Moogle queries and a function for evaluating a query given a web index.
* `moogle.ml`: the main code for the Moogle server.
- Files primarily for karma:
* `graph.ml`: definitions for a graph abstract data type, including a graph signature, a node signature, and a functor for building graphs from a node module.
* `nodescore.ml`: definitions for node score maps.
* `pagerank.ml`: skeleton code for the PageRank algorithm,
including a dummy algorithm for computing page ranks.
**Exercise:** Skim each of those files now, and plan to come back and read them in
more detail later as necessary.
**Compiling and running Moogle:**
From the directory containing `moogle.ml`, run
```
$ cs3110 compile -p unix,str moogle.ml
```
to compile Moogle. Then run Moogle with the following command:
```
$ cs3110 run moogle -- 8080 7 simple-html/index.html
```
* The first command-line argument, `8080`, is the port to which your Moogle
server binds. If you already have a Moogle server running on port 8080
(perhaps because you accidentally left an old one running there), you might
need to try a different port, such as 8081 or 8082.
(You could also terminate the old Moogle server by using the Unix `ps` command.)
* The second command-line argument, `7`, is the upper bound on the number of pages
to index.
* The third command-line argument, `simple-html/index.html`, is
the root page on your local file system from which the crawler will start.
You should see that the Moogle server starts and prints some debugging
information, ending with these lines:
```
Starting Moogle on port 8080.
Press Ctrl-c to terminate Moogle.
```
Now connect to Moogle by opening this URL in a browser
inside the VM:
[http://localhost:8080](http://localhost:8080).
(Note: Moogle is not compatible with Safari.)
You should see a web page proclaiming "Welcome to Moogle!". Type in a
query; you'll get an empty response, because the implementation
is not yet finished.
## Part 1: Implement the crawler
Your first task is to implement the web crawler and build the search
index. To do this, replace the dummy `crawl` function in `crawl.ml` with
a function that builds a `WordDict.dict` mapping words to sets of links.
See the release code documentation of `crawl` for more details about
what to implement. For efficiency, the crawler should visit a page at most once.
And you'll need to solve the problem of avoiding an infinite loop when pages
link to one another.
Running the crawler in utop won't really be feasible. To test your
crawler, add code for testing purposes. For example, you could add
functions that print the index to a file, or to stdout, or you could
write unit tests that assert the index is correct. Note that the data
abstractions we provide (i.e., sets and dictionaries) contain operations
for converting values to strings. The value `Moogle.debug` in
`moogle.ml` controls whether some debug information is printed; you
might find it to be helpful.
As a small test case, the directory `simple-html/` contains seven small
HTML files that you can inspect yourself to determine the correct output.
You can compare that against the output of your crawler.
As larger test cases, you can try `html/index.html` and
`wiki/Teenage_Mutant_Ninja_Turtles` as roots. But note that, if you
attempt to run your crawler on these larger sets of pages, you are
likely to notice it takes a very long time to build an index. This is
because the implementations of dictionaries and sets that we provided in
the release code deliberately do not scale well.
*Hints:*
* The `CrawlerServices` module in `util.ml` is useful. The root URL
provided on the command line is bound to `CrawlerServices.initial_link`.
The upper bound on the number of pages to crawl is bound to
`CrawlerServices.num_pages_to_search`. Function
`CrawlerServices.get_page` will fetch a page given a link.
* The `WordDict` module in `crawl.ml` provides operations for building
and manipulating dictionaries mapping words to sets of links. Note how
it is defined: by applying a functor to a structure that defines the
`key` type as `string` and the `value` type as `LinkSet.set`. The
interface for `WordDict` can be found in `dict.ml`.
* The `LinkSet` module is defined in `pagerank.ml`. Like `WordDict`, it
is defined with a functor, where the element type is specified to be a
link. The interface for `LinkSet` can be found in the `myset.ml` file.
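Putting those hints together, here is a minimal sketch of one possible crawler structure. It is not the required design: the name `crawl_loop`, the assumption that `CrawlerServices.get_page` returns a `page option` whose record has `links` and `words` fields, and the argument orders used below are all illustrative; check `util.ml`, `myset.ml`, and `dict.ml` for the real signatures before borrowing anything.
```
(* A sketch only. Assumptions (verify against the release code): get_page
   returns a page option; a page has [links] and [words] fields; set insert
   takes the element first; dictionary insert takes the dictionary first. *)
let rec crawl_loop (n : int) (frontier : LinkSet.set)
    (visited : LinkSet.set) (index : WordDict.dict) : WordDict.dict =
  if n <= 0 then index
  else
    match LinkSet.choose frontier with
    | None -> index                              (* no more links to follow *)
    | Some (link, rest) ->
      if LinkSet.member visited link then
        crawl_loop n rest visited index          (* visit each page once *)
      else begin
        match CrawlerServices.get_page link with
        | None -> crawl_loop n rest visited index
        | Some page ->
          (* enqueue the page's outgoing links for later crawling *)
          let frontier' =
            List.fold_left (fun s l -> LinkSet.insert l s) rest page.links in
          (* map each word on the page to a set containing this link *)
          let index' =
            List.fold_left (fun d w ->
                let links = match WordDict.lookup d w with
                  | None -> LinkSet.singleton link
                  | Some s -> LinkSet.insert link s in
                WordDict.insert d w links)
              index page.words in
          crawl_loop (n - 1) frontier' (LinkSet.insert link visited) index'
      end

let crawl () : WordDict.dict =
  crawl_loop CrawlerServices.num_pages_to_search
    (LinkSet.singleton CrawlerServices.initial_link)
    LinkSet.empty WordDict.empty
```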
## Part 2: Implement efficient data structures
A *set* is a data structure that stores *elements*. A set differs from
a list in that the order of the elements is irrelevant, and there is no
notion of duplicate values. Moogle uses sets extensively. For example,
the web index maps words to sets of URLs, and the query evaluator
manipulates sets of URLs. The release code provides `ListSet` (in
`myset.ml`), which is a simple, inefficient, list-based implementation
of sets.
A *dictionary* is a data structure that maps *keys* to *values*.
Moogle also uses dictionaries extensively. For example, `WordDict` is a
dictionary that maps each word to the set of links to pages containing it.
The release code provides `AssocListDict` (in `dict.ml`), which
is a simple, inefficient, list-based implementation of dictionaries.
Your next task is to reimplement sets in terms of dictionaries,
then to implement an efficient dictionary data structure. You will
thus also obtain an efficient set data structure.
**Testing:**
Although the `Pa_ounit` and `cs3110 test` testing framework
you learned about in recitation is great, you will implement
a simpler testing infrastructure based on functors in this
assignment. There are four functors in the release code with
a `run_tests` function: `ListSet`, `DictSet`,
`AssocListDict`, and `BTDict`. You need to complete the
`run_tests` function for each of those functors. That
function should itself call many other unit test functions in the
functor. For an example of what to do, see the
`test_remove_*` functions called in `BTDict.run_tests`.
### 2.1. Reimplement sets as dictionaries
Given any implementation of a dictionary, it is possible to produce
an implementation of a set.
**Step 1:**
Figure out how to do that.
*Hint: One way to think about a dictionary is that it is,
abstractly, a set of (key,value) pairs.*
**Step 2:**
Uncomment the `DictSet` functor in `myset.ml`.
Finish its implementation. You'll need to build a suitable dictionary `D` by
calling `Dict.Make`. (*Hint: what can you pass in for the
type definitions of keys and values?*) You'll need to build the set
definitions in terms of the operations provided by `D`.
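To see the idea in isolation, here is a self-contained sketch that is independent of the release code. The signature `MINI_DICT` and functor `SetOfDict` are hypothetical names invented for this illustration; the release code's `Dict.Make` additionally requires comparison, string-conversion, and generator functions (see `DICT_ARG` in `dict.ml`).
```
(* The essence of sets-as-dictionaries: elements become keys, and values
   carry no information, so we fix them to unit. *)
module type MINI_DICT = sig
  type key
  type value
  type dict
  val empty : dict
  val insert : dict -> key -> value -> dict
  val member : dict -> key -> bool
  val remove : dict -> key -> dict
end

module SetOfDict (D : MINI_DICT with type value = unit) = struct
  type elt = D.key
  type set = D.dict
  let empty : set = D.empty
  let insert (x : elt) (s : set) : set = D.insert s x ()  (* bind x to () *)
  let member (s : set) (x : elt) : bool = D.member s x
  let remove (x : elt) (s : set) : set = D.remove s x
end
```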
**Step 3:**
Write test functions that exercise your implementation, following the testing
schema explained above.
**Step 4:**
Change the `Make` functor at the bottom of `myset.ml` so that it
uses `DictSet` instead of `ListSet`. Test your crawler.
Is it still correct? Is it more efficient?
### 2.2. Reimplement dictionaries as balanced trees
A *2-3 tree* is a kind of balanced tree data structure in which
nodes may have 2 or 3 children. 2-3 trees were invented by our very own
Prof. John Hopcroft in 1970. We'll use them to implement
an efficient dictionary.
To learn about 2-3 trees, read [this handout][23tree] from Prof. Lyn
Turbak at Wellesley College. You'll find a cached copy on [CMS][cms].
**Read every word of that handout. Seriously.**
**Have you read it? Good.**
[23tree]: http://cs.wellesley.edu/~cs230/fall04/2-3-trees.pdf
Here is the type definition for a dictionary implemented as a 2-3 tree:
```
type pair = key * value

type dict =
  | Leaf
  | Two of dict * pair * dict
  | Three of dict * pair * dict * pair * dict
```
In addition to 2-nodes (`Two`), which contain one (key,value) pair and
two subtrees, this type permits 3-nodes (`Three`), which contain two
(key,value) pairs (k1,v1) and (k2,v2) and three subtrees: left, middle,
and right. Below are the invariants as stated in the handout, with one
change that we discuss after stating the invariants.
**2-node invariants: Two(left,(k1,v1),right)**
* Every key k appearing in subtree left must satisfy k < k1.
* Every key k appearing in subtree right must satisfy k > k1.
* The length of the path from the 2-node to every leaf in its two subtrees must be the same.
**3-node invariants: Three(left,(k1,v1),middle,(k2,v2),right)**
* k1 < k2.
* Every key k appearing in subtree left must satisfy k < k1.
* Every key k appearing in subtree right must satisfy k > k2.
* Every key k appearing in subtree middle must satisfy k1 < k < k2.
* The length of the path from the 3-node to every leaf in its three subtrees must be the same.
All of the bounds in these invariants are strict (< and >, not <= and >=),
unlike in the handout. The handout permits storing multiple keys that are
equal to one another, but in our dictionary a key can be
stored only once.
The last invariant of each type of node implies that the
tree is balanced; that is, the length of the path from the root to
any `Leaf` node is the same.
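For instance, taking key = int and value = string (as in the release code's `IntStringBTDict` tests), the following tree satisfies all of the invariants: the keys are strictly ordered, and every path from the root to a `Leaf` has length 2.
```
let example : dict =
  Two (Three (Leaf, (1, "a"), Leaf, (2, "b"), Leaf),
       (4, "d"),
       Two (Leaf, (7, "g"), Leaf))
```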
Open up `dict.ml` and locate the commented-out `BTDict` functor,
which is intended to build a `DICT` module from a `DICT_ARG` module.
The rest of this section of the writeup will help you complete
that functor.
*Hint: anytime your code needs to compare keys, use
`DICT_ARG.compare`, not `Pervasives.compare`.*
#### 2.2.1. balanced
Locate this function in `BTDict`:
```
balanced : dict -> bool
```
Implement it. The function takes a 2-3 tree and returns `true` if and
only if the tree is balanced. Note that the function needs to check the
path length invariants, but should not check the key-ordering
invariants. In a comment above the function, document carefully in
English how your implementation works.
Think carefully about how to implement this function efficiently. (Our
solution runs in \\(O(n)\\), where \\(n\\) is the size of the
dictionary.)
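One possible shape for such a solution, as a sketch: a helper computes the height of a subtree only when that subtree is balanced, so a single traversal of the tree suffices. The helper name `height_opt` is illustrative, not part of the release code.
```
(* Returns Some h if d is balanced with height h, and None as soon as two
   sibling subtrees disagree, so the tree is traversed only once: O(n). *)
let rec height_opt (d : dict) : int option =
  match d with
  | Leaf -> Some 0
  | Two (l, _, r) ->
    (match height_opt l, height_opt r with
     | Some hl, Some hr when hl = hr -> Some (hl + 1)
     | _ -> None)
  | Three (l, _, m, _, r) ->
    (match height_opt l, height_opt m, height_opt r with
     | Some hl, Some hm, Some hr when hl = hm && hm = hr -> Some (hl + 1)
     | _ -> None)

let balanced (d : dict) : bool = height_opt d <> None
```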
After you have implemented the function:
* locate `run_tests` and uncomment the call to `test_balance`.
* uncomment the definition of `test_balance` itself.
* uncomment the definition of `IntStringBTDict`.
Check that all the tests now pass. You'll know a test has failed if an
assertion fails.
Now that you have confidence in your `balanced` function, you
are required to use it on all your tests involving `insert`, below.
#### 2.2.2. strings, fold, lookup, member
Implement these functions according to their specifications provided in `DICT`:
```
val fold : (key -> value -> 'a -> 'a) -> 'a -> dict -> 'a
val lookup : dict -> key -> value option
val member : dict -> key -> bool
val string_of_key : key -> string
val string_of_value : value -> string
val string_of_dict : dict -> string
```
You may change the order in which the functions appear, but you may not
change the signature (name, order of arguments, types) of any of these
functions. Note that the presence or absence of `rec` does not change
the signature of a function.
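As an example of what these operations enable, once `fold` is implemented, many derived operations become one-liners; this hypothetical helper counts the bindings in a dictionary.
```
let size (d : dict) : int =
  fold (fun _key _value acc -> acc + 1) 0 d
```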
#### 2.2.3. insert (upward phase)
To implement insertion, we'll use two phases—a *downward phase* to find the
right place to insert, and an *upward phase* to rebalance the tree, if
necessary. To distinguish when we need to rebalance and when we don't,
we'll use this type to represent a *kicked-up* configuration:
```
type kicked =
  | Up of dict * pair * dict  (* only 2-nodes require rebalancing *)
  | Done of dict              (* done rebalancing *)
```
The only kind of node that could potentially require rebalancing is a
2-node. (We can see this pictorially: the only subtrees that require
rebalancing are represented by an up arrow in the handout, and only 2-nodes
have up arrows.) Hence we make the `Up` constructor take the same
arguments as the `Two` constructor. We have provided stubs for functions
to perform the upward phase on a 2-node and a 3-node:
```
let insert_upward_two (w: pair) (w_left: dict) (w_right: dict)
    (x: pair) (x_other: dict) : kicked = ...

let insert_upward_three (w: pair) (w_left: dict) (w_right: dict)
    (x: pair) (y: pair) (other_left: dict) (other_right: dict) : kicked = ...
```
These functions return a `kicked` configuration containing the new
balanced tree (cf. page 5 of the handout). This new tree will be
wrapped with `Up` or `Done`, depending on whether this new balanced tree
requires further upstream rebalancing (indicated by an upward arrow).
Implement those functions, as described by the handout.
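For concreteness, here is a sketch of the 2-node case, assuming that comparisons (via `DICT_ARG.compare`, per the hint above) yield the `Less`/`Eq`/`Greater` constructors defined in `order.ml`. Because the kicked-up pair w came from one of x's subtrees, a single comparison decides the orientation, and absorbing w into a 2-node always yields a 3-node, so the result is wrapped in `Done`.
```
let insert_upward_two (w: pair) (w_left: dict) (w_right: dict)
    (x: pair) (x_other: dict) : kicked =
  match DICT_ARG.compare (fst w) (fst x) with
  | Less -> Done (Three (w_left, w, w_right, x, x_other))  (* w from left  *)
  | _    -> Done (Three (x_other, x, w_left, w, w_right))  (* w from right *)
```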
#### 2.2.4. insert (downward phase)
The release code provides three mutually recursive functions
to simplify the main insert function:
```
let rec insert_downward (d: dict) (k: key) (v: value) : kicked =
  match d with
  | Leaf -> ???                                 (* base case! see handout *)
  | Two (left, n, right) -> ???                 (* mutual recursion! *)
  | Three (left, n1, middle, n2, right) -> ???  (* mutual recursion! *)

and insert_downward_two ((k,v): pair) ((k1,v1): pair)
    (left: dict) (right: dict) : kicked = ...

and insert_downward_three ((k,v): pair) ((k1,v1): pair) ((k2,v2): pair)
    (left: dict) (middle: dict) (right: dict) : kicked = ...

(* the main insert function *)
let insert (d: dict) (k: key) (v: value) : dict =
  match insert_downward d k v with
  | Up (l, (k1,v1), r) -> Two (l, (k1,v1), r)
  | Done x -> x
```
Note that, like the upward phase, the downward phase also returns a
kicked-up configuration. The main insert function simply extracts the
tree from the kicked-up configuration returned by the downward phase
and returns that tree.
Implement these three downward-phase functions. You will need the
upward-phase functions that you wrote previously.
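As a sketch of how the phases connect, the 2-node case of the downward phase might look as follows (again assuming `Less`/`Eq`/`Greater` from `order.ml`): on an equal key, replace the value; otherwise recurse into the proper subtree and let the upward function absorb any kicked-up configuration.
```
and insert_downward_two ((k,v): pair) ((k1,v1): pair)
    (left: dict) (right: dict) : kicked =
  match DICT_ARG.compare k k1 with
  | Eq -> Done (Two (left, (k, v), right))      (* replace the old value *)
  | Less ->
    (match insert_downward left k v with
     | Up (l, w, r) -> insert_upward_two w l r (k1,v1) right
     | Done left' -> Done (Two (left', (k1,v1), right)))
  | Greater ->
    (match insert_downward right k v with
     | Up (l, w, r) -> insert_upward_two w l r (k1,v1) left
     | Done right' -> Done (Two (left, (k1,v1), right')))
```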
Next, add testing code for your implementations of `insert`, `lookup`,
and `member`. Use the test functions provided for `remove` as examples.
The release code provides functions to generate keys, values, and pairs:
see `DICT_ARG`. Remember to make `run_tests` call your newly created
test functions.
#### 2.2.5. remove (upward phase)
Like `insert`, the `remove` operation has an upward and downward phase.
We have written the downward phase for you.
Instead of passing kicked-up configurations, `remove` passes a *hole*:
```
type hole =
  | Hole of pair option * dict
  | Absorbed of pair option * dict
```
The `Hole` and `Absorbed` constructors correspond directly to the
terminology used in the handout. Each constructor carries a pair option
holding the removed (key,value) pair, along with the tree that remains.
The pair option is useful when the (key,value) pair being removed sits in
the middle of the tree and must therefore be replaced by the minimum
(key,value) pair of the current node's right subtree.
These two types indicate which subtree currently contains the hole:
```
type direction2 =
  | Left2
  | Right2

type direction3 =
  | Left3
  | Mid3
  | Right3
```
Implement the upward phase (cf. pages 9 and 10 of the handout) by
finishing `remove_upward_two` and `remove_upward_three`. Once you have
finished writing the upward-phase functions, uncomment the
`test_remove_*` functions in `run_tests ()`. All the tests should pass.
#### 2.2.6. choose
Finally, implement this function according to the specification in `DICT`:
```
choose : dict -> (key * value * dict) option
```
Write tests for your `choose` function.
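As an illustration of the specification, `choose` lets client code drain a dictionary one binding at a time; this hypothetical helper collects all the keys.
```
let rec keys (d : dict) : key list =
  match choose d with
  | None -> []
  | Some (k, _v, rest) -> k :: keys rest
```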
## Part 3: Integrate your crawler and your efficient data structures
Because the release code is well-designed, you can
easily swap in your 2-3 tree dictionaries in place of the list-based
dictionaries.
Just modify the `Make` functor at the end of `dict.ml`.
Now try out your improved search engine on the three sets of
webpages. Our startup code prints out the crawling time, so you can see
which implementation is faster.
If you are confident everything is working, try running a big test case:
```
$ cs3110 run moogle -- 8080 224 wiki/Teenage_Mutant_Ninja_Turtles
```
This could take up to several minutes to crawl and index.
You're done!
## Karma
**PageRank:**
For karma, implement a version of the [PageRank][pagerank] algorithm by following
the instructions below. Don't attempt this until you are sure
the rest of your assignment is working. There are two rankers
you can implement, described below.
[pagerank]: https://en.wikipedia.org/wiki/PageRank
We will apply our knowledge of ADTs and graphs to explore solutions to a
compelling problem: finding important nodes in graphs like the Internet,
or the set of pages that you're crawling with Moogle.
The concept of assigning a measure of importance to nodes is very useful
in designing search algorithms, such as those that many popular search
engines rely on. Early search engines often ranked the relevance of
pages based on the number of times that search terms appeared in the
pages. However, it was easy for spammers to game this system by
including popular search terms many times, propelling their results to
the top of the list.
When you enter a search query, you really want the important pages: the
ones with valuable information, a property often reflected in the
quantity and quality of other pages linking to them. Better algorithms
were eventually developed that took into account the relationships
between web pages, as determined by links. These relationships can be
represented nicely by a graph structure, which is what we'll be using
here.
Throughout the assignment, we'll want to maintain associations of graph
nodes to their importance, or NodeScore: a value between 0 (completely
unimportant) and 1 (the only important node in the graph).
In order to assign NodeScores to the nodes in a graph, we've provided
a module with an implementation of a data structure, `NODE_SCORE`, to hold such
associations. The module makes it easy to create, modify, normalize (to
sum to 1), and display NodeScores. You can find the module signature
and implementation in `nodescore.ml`.
To complete this karma, you'll implement a series of NodeScore algorithms
in different modules: that is, functions `rank` that take a graph and
return a `node_score_map` on it. As an example, we've implemented a
trivial NodeScore algorithm in the `IndegreeRanker` module that gives
each node a score equal to its number of incoming edges.
#### RandomWalkRanker
In this section, we will walk you through creating a robust ranking
algorithm based on random walks. The writeup goes through several steps,
and we encourage you to do your implementation in that order. However,
you will only submit the final version.
You may realize that we need a better way of saying that nodes are
popular or unpopular. In particular, we need a method that considers
global properties of the graph and not just edges adjacent to or near
the nodes being ranked. For example, there could be an extremely
relevant webpage that is several nodes removed from the node we start
at. That node might normally fare pretty low on our ranking system, but
perhaps it should be higher based on there being a high probability that
the node could be reached when browsing around on the internet.
So consider Sisyphus, doomed to crawl the Web for eternity; or, more
specifically, doomed to start at some arbitrary page and follow links
randomly. (We assume for the moment that the Web is strongly connected
and that every page has at least one outgoing link, unrealistic
assumptions that we will address shortly.)
Let's say Sisyphus can take \\(k\\) steps after starting from a random node.
We design a system that determines NodeScores based on how likely it is
that Sisyphus reaches a given page. In other words, we ask: where will
Sisyphus spend most of his time?
**Step 1.**
Implement `rank` in the `RandomWalkRanker` functor in `pagerank.ml`, which
takes a graph and returns a normalized NodeScore on the graph, where the
score for each node is proportional to the number of times Sisyphus
visits the node in `num_steps` steps, starting from a random node. For
now, your function may raise an exception if some node has no outgoing
edges. Note that `num_steps` is passed in as part of the functor parameter
`P`.
You might find the library function `Random.int` useful, and you might want to
write some helper functions to pick random elements from lists (don't
forget about the empty list case).
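One such helper, as a sketch (the name `random_element` is illustrative):
```
(* Pick a uniformly random element of a list, or None if the list is empty. *)
let random_element (lst : 'a list) : 'a option =
  match lst with
  | [] -> None
  | _ -> Some (List.nth lst (Random.int (List.length lst)))
```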
**Step 2.**
Our Sisyphean ranking algorithm does better at identifying important
nodes according to their global popularity, rather than being fooled by
local properties of the graph. But what about the error condition we
mentioned above: that a node might not have any outgoing edges? In this
case, Sisyphus has reached the end of the Internet. What should we do? A
good solution is to jump to some random page and surf on from there.
Modify your `rank` function so that, instead of raising an exception at
the end of the Internet, it has Sisyphus jump to a random node.
This should work better.
**Step 3.**
What if there are very few pages that have
no outgoing edges—or worse, what if there is a set of vertices with
no outgoing edges to the rest of the graph, but there are still edges
between vertices in the set? Sisyphus would be stuck there forever, and
that set would win the node popularity game hands down, just by having
no outside links. What we'd really like to model is the fact that
Sisyphus doesn't have an unlimited attention span, and starts to get
jumpy every now and then...
Modify your `rank` function so that if the module parameter
`P.do_random_jumps` is `Some alpha`, then with probability `alpha` at
every step, Sisyphus will jump to a random node in the graph, regardless
of whether the current node has outgoing edges. (If the current node has
no outgoing edges, Sisyphus should still jump to a random node in the
graph, as before.)
Don't forget to add some testing code. Possible approaches include
just running it by hand a lot of
times and verifying that the results seem reasonable, or writing an
`approx-equal` function to compare the result of a many-step run with what
you'd expect to find.
Submit your final version of `RandomWalkRanker`.
#### QuantumRanker
Our algorithm so far works pretty well, but on a huge graph it would
take a long time to find all of the hard-to-get-to nodes. (Especially
when new nodes are being added all the time...)
We'll need to adjust our algorithm somewhat. In particular, let's
suppose Sisyphus is bitten by a radioactive eigenvalue, giving him the
power to subdivide himself arbitrarily and send parts of himself off to
multiple different nodes at once. We have him start evenly spread out
among all the nodes. Then, from each of these nodes, the pieces of
Sisyphus that start there will propagate outwards along the graph,
dividing themselves evenly among all outgoing edges of that node.
So, let's say that at the start of a step we have some fraction \\(q\\) of
Sisyphus at a node, and that node has 3 outgoing edges. Then \\(q/3\\) of
Sisyphus will propagate outwards from that node along each edge. This
will mean that nodes with a lot of value will make their neighbors
significantly more important at each timestep, and also that in order to
be important, a node must have a large number of incoming edges
continually feeding it importance.
Thus, our basic algorithm takes an existing graph and NodeScore, and
updates the NodeScore by propagating all of the value at each node to
all of its neighbors. However, there's one wrinkle: we want to include
some mechanism to simulate random jumping. The way that we do this is to
use a parameter `alpha`, similar to the random jumps Sisyphus made in the
previous section. At each
step, each node propagates a fraction `1-alpha` of its value to its
neighbors as described above, but also a fraction `alpha` of its value to
all nodes in the graph. This will ensure that every node in the graph is
always getting some small amount of value, so that we never completely
abandon nodes that are hard to reach.
We can model this fairly simply. If each node distributes `alpha` times
its value evenly among all nodes at each timestep, then at each timestep
each node accrues `(alpha/n)` of the overall value in this manner, where
`n` is the number of nodes in the graph. Thus, we can model this effect
by having the base NodeScore at each timestep give `(alpha/n)` to every node.
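Written as a formula (assuming, as the normalization step guarantees, that the previous timestep's scores \\(s\\) sum to 1), each timestep computes a new score
\\(s'(v) = \alpha/n + (1-\alpha)\sum_{u \to v} s(u)/\mathrm{outdeg}(u)\\),
where the sum ranges over nodes \\(u\\) with an edge to \\(v\\), and a node with no outgoing edges is treated as having a single implicit edge to itself (see Step 1 below).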
That gives us our base case for each timestep of our quantized Sisyphus
algorithm. What else do we need to do at each timestep? Well, as
explained above, we need to distribute the value at each node to all of
its neighbors.
**Step 1.**
Write a function `propagate_weight` that takes a node, a graph, a
NodeScore that it's building up, and the NodeScore from the previous
timestep, and returns the new NodeScore resulting from distributing
the weight that the given node had in the old NodeScore to all of its
neighbors. Remember that only `(1-alpha)` of the NodeScore gets
distributed among the neighbors (`alpha` of the weight was distributed
evenly to all nodes in the base case). A value for `alpha` is passed in to
the functor as part of the `P` parameter. Note: To handle the case of
nodes that have no outgoing edges, assume that each node has an implicit
edge to itself.
**Step 2.**
Now we have all the components we need. Implement the `rank` function in
the `QuantumRanker` module. It should return the NodeScore resulting
from applying the updating procedures described above. The number of
timesteps to simulate is passed in to the functor as part of the `P`
parameter.
Don't forget to add some tests!
#### Using your page rankers
At the top of `crawl.ml` is a definition of a `MoogleRanker` module
that's used to order the search results. Replace the call to the
`IndegreeRanker` functor with calls to the other rankers you wrote, and
experiment with how it changes the results.
* * *
**Acknowledgement:** Adapted from Dean Greg Morrisett.