CS312 Lecture 10:
Graphs. Trees. Binary Search Trees

Trees are pervasive in Computer Science, and they represent one of the most important data structures.

Graphs

We start with the more general concept of a graph. A graph consists of a set of nodes (also called vertices) and a set of edges that connect these nodes together. In an undirected graph, any two distinct nodes are either connected by an edge or not (typically, an undirected graph is defined so as to exclude edges that connect a node to itself, i.e. such a graph has no loops). In an undirected (simple) graph, there can be at most one edge between any two nodes of the graph. If more than one edge is allowed between two nodes, then we talk about multigraphs.

A subgraph is a graph whose sets of vertices and/or edges are subsets of the respective sets of another (given) graph. A subgraph of a graph is called a connected component of the original graph if any node of the respective subgraph can be reached from any other node of the subgraph. A connected component to which we cannot add further nodes is called a maximal connected component. If the maximal connected component of a graph is the graph itself, then we say that the graph is connected; otherwise the graph is disconnected. A graph with n nodes will have at most n maximal connected components.
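As a minimal sketch of these definitions, the code below counts the maximal connected components of a small undirected graph. The representation (nodes numbered 0..n-1 plus a list of edges) and the names neighbors, reach and components are our own assumptions for illustration; they are not used elsewhere in these notes.

```sml
(* Assumption: nodes are the integers 0 .. n-1; edges are unordered pairs. *)
type graph = int * (int * int) list

fun member (x: int) (xs: int list): bool = List.exists (fn y => y = x) xs

(* All nodes adjacent to node v. *)
fun neighbors ((_, edges): graph) (v: int): int list =
  List.mapPartial
    (fn (a, b) => if a = v then SOME b else if b = v then SOME a else NONE)
    edges

(* Depth-first search: extends seen with every node reachable
   from the nodes on the work list. *)
fun reach (g: graph) (work: int list) (seen: int list): int list =
  case work of
    [] => seen
  | v :: rest =>
      if member v seen then reach g rest seen
      else reach g (neighbors g v @ rest) (v :: seen)

(* Number of maximal connected components. *)
fun components (g: graph): int =
  let
    val (n, _) = g
    fun loop (v, seen, count) =
      if v >= n then count
      else if member v seen then loop (v + 1, seen, count)
      else loop (v + 1, reach g [v] seen, count + 1)
  in
    loop (0, [], 0)
  end

val g1: graph = (5, [(0, 1), (1, 2), (3, 4)])   (* two components *)
val g2: graph = (3, [])                         (* three isolated nodes *)
val c1 = components g1
val c2 = components g2
```

The graph g2 illustrates the upper bound: with no edges at all, each of the n nodes forms its own component.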

A cycle is a sequence of nodes in which an edge goes from each node in the sequence to the next, and an edge goes from the last node in the sequence to the first one. It is possible for a vertex to appear more than once in a cycle. If a graph has no cycles, it is said to be acyclic.

It is possible to associate a direction with each edge in a graph - such graphs are called directed graphs. A directed graph allows for at most two edges of opposite orientation between any pair of vertices. We will not discuss directed graphs further in this lecture.

Trees

Trees are a particularly important kind of graph. A graph T with n vertices is a tree if any of the following conditions is satisfied:

1. T is an undirected, connected, acyclic graph.
2. T is connected and has exactly n-1 edges.
3. T is maximal without cycles (i.e. adding any one edge to the graph creates at least one cycle).
4. T is minimally connected (i.e. removing any one edge makes the graph disconnected).

These conditions are all equivalent; one can show, for example, that (1) => (2) => (3) => (4) => (1).
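The second condition (connected, with exactly n-1 edges) is the easiest one to test mechanically. The following is only a sketch under stated assumptions: nodes are numbered 0..n-1, the graph is simple, and the name isTree is ours.

```sml
(* Assumption: nodes are 0 .. n-1 and edges is a list of unordered pairs,
   with no loops and no repeated edges (a simple graph). *)
fun isTree (n: int, edges: (int * int) list): bool =
  let
    fun member x xs = List.exists (fn y => y = x) xs
    fun neighbors v =
      List.mapPartial
        (fn (a, b) => if a = v then SOME b else if b = v then SOME a else NONE)
        edges
    (* Collect every node reachable from the nodes on the work list. *)
    fun reach ([], seen) = seen
      | reach (v :: rest, seen) =
          if member v seen then reach (rest, seen)
          else reach (neighbors v @ rest, v :: seen)
  in
    n > 0
    andalso length edges = n - 1            (* exactly n-1 edges ... *)
    andalso length (reach ([0], [])) = n    (* ... and connected *)
  end

val star     = isTree (4, [(0, 1), (1, 2), (1, 3)])   (* true *)
val triangle = isTree (4, [(0, 1), (1, 2), (2, 0)])   (* false: node 3 unreachable *)
```

The second example has the right number of edges but contains a cycle, so one node is left disconnected and the test fails.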

Despite their apparent simplicity, trees have a lot of other interesting properties. Trees represent one of the most important data structures you will encounter. The interest in trees is partly explained by their extremal properties (see the definitions above), but also by the fact that the structure of binary trees maps naturally to a sequence of if-then-else decisions.

In many applications, but not always, one node of the tree is distinguished from the rest and is called the root node. Because a tree is a connected graph, every node is reachable from the root. Because the tree is acyclic, there is only one way to get from the root to any given node. It is a convention in Computer Science to draw trees upside down: the root node is drawn at the top and every other node is drawn below it, with any given node drawn below the other nodes on the path from the root to that node. The depth of a node is the number of edges that must be traversed to get from the root node to that node. The depth of the root is zero. The height of a tree is the largest depth of any node in the tree.

Every node N (except the root) is connected by an edge to exactly one node whose depth is one less. This node is called the parent node of N. The other nodes to which N is connected, if any, have depth one greater than N, and are called children of N. The nodes along the path from the root to a node N (other than N itself) are called the ancestors of N, and the nodes whose paths to the root include N (other than N itself) are called the descendants of N. In general a node may have any number of children. The number of children of a node is called the degree of the node. If a node has degree zero (no children), it is called a leaf (or external) node. Other nodes are known as internal nodes. The subtree rooted at a node N consists of N, as the root of the subtree, together with all the descendants of N.

Binary Trees

A binary tree is a tree in which every node has degree at most two. Conventionally, a child of an internal node in a binary tree is called the left child or the right child of that node (the names are obvious if you think of the graphical representation of a tree). A node of degree two must have one of each. In most cases, a child node will be identified as a left child or right child even if it is the only child of its parent.

Trees are very easy to define in SML. A binary tree either contains no nodes at all, or it contains a root node with a left subtree and a right subtree. Here is how we might declare a tree that stores integers at every node:

datatype inttree = Empty | Node of inttree * int * inttree

Here is an example:

     2    
   /   \   
  1     4        
       / \ 
      3   5

Node(Node(Empty, 1, Empty), 2, Node(Node(Empty, 3, Empty), 4, Node(Empty, 5, Empty)))

If we felt that the tuple didn't document things enough, we could define the type using a record:

datatype inttree = Empty | Node of {left: inttree, value: int, right: inttree}

Binary trees can be generalized to trees whose nodes have degree up to k. Such trees are called k-ary trees. Each node can have up to k children, and each child has a distinct index in the range 1..k (or 0..k-1). Thus, a binary tree is just a k-ary tree with k=2.

There was nothing about the datatypes above that required them to contain integers. We can define a parameterized tree type just as easily:

datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree
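As a small illustration of how recursive functions follow the shape of this datatype, here is a sketch of two functions that compute the number of nodes and the height of a tree. The names size and height are ours, and giving the empty tree height ~1 is an assumption chosen so that a single node has height 0, in line with the definition of height given earlier.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* Number of nodes in the tree. *)
fun size (t: 'a tree): int =
  case t of
    Empty => 0
  | Node(l, _, r) => size l + 1 + size r

(* Largest depth of any node, counted in edges.
   Assumption: the empty tree has height ~1, so a single node has height 0. *)
fun height (t: 'a tree): int =
  case t of
    Empty => ~1
  | Node(l, _, r) => 1 + Int.max (height l, height r)

val example = Node(Node(Empty, 1, Empty), 2,
                   Node(Node(Empty, 3, Empty), 4, Node(Empty, 5, Empty)))
val n = size example     (* 5 *)
val h = height example   (* 2 *)
```

Both functions work for any element type, since they never inspect the values stored at the nodes.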

A k-ary tree is full if every internal node has degree k and every leaf node has the same depth. Suppose that every internal node of a tree has degree k, and that all leaf nodes have depth h. A tree of height 0 has 1 node, a tree of height 1 has k + 1 nodes, a tree of height 2 has k² + k + 1 nodes, etc. That is, a full k-ary tree of height h has at least k^h nodes; in fact, since there are k^i nodes at depth i, it has exactly Σ_{i=0}^{h} k^i nodes. Summing this geometric series, we see this is equal to (k^{h+1} - 1)/(k - 1) for k ≥ 2. Because this expression is exponential in h, even a relatively short path through a k-ary tree can get us to a huge amount of data. Using this formula, we see that a full binary tree of height h contains 2^{h+1} - 1 nodes.
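The node count for the binary case can be checked experimentally. The sketch below builds full binary trees of small heights and compares their sizes with 2^(h+1) - 1; the helper names full, size and pow2 are assumptions made for this example only.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* Build a full binary tree of height h (h >= 0); the values are irrelevant. *)
fun full (h: int): unit tree =
  if h = 0 then Node(Empty, (), Empty)
  else Node(full (h - 1), (), full (h - 1))

(* Number of nodes in the tree. *)
fun size (t: 'a tree): int =
  case t of
    Empty => 0
  | Node(l, _, r) => size l + 1 + size r

(* 2 to the power n, for n >= 0. *)
fun pow2 (n: int): int = if n = 0 then 1 else 2 * pow2 (n - 1)

(* Does the size of a full tree match the formula for h = 0 .. 10? *)
val formulaHolds =
  List.all (fn h => size (full h) = pow2 (h + 1) - 1)
           (List.tabulate (11, fn h => h))
```

A check over a few heights is of course not a proof, but it is a cheap way to catch an off-by-one error in the exponent.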

Traversals

It is very easy to write a recursive function to traverse a binary tree; that is, to visit all of its nodes. There are three obvious ways to write the traversal code. In each case the recursive function visits the subtrees of the current node recursively, and also inspects the value of the current node. However, there is a choice about when to inspect the value of the current node.

Note: you have discussed these traversal orders informally in section, but we include them here for completeness. We illustrate the various traversals by writing a fold-type function for each of them.

In a pre-order traversal, the value at the node is considered before the two subtrees.

fun pre_fold (f: 'a * 'b -> 'b) (b0: 'b) (t:'a tree) = 
  case t of
    Empty => b0
  | Node(l, k, r) => let
                       val b1:'b = f(k, b0)
                       val b2:'b = pre_fold f b1 l
                     in
                       pre_fold f b2 r
                     end

Note that we used a let expression to express the sequence of evaluations that characterizes the pre-order traversal: process the node first, then the left subtree followed by the right subtree. This is a very readable representation of the steps involved in the traversal; a more compact solution is given below:

fun pre_fold (f: 'a * 'b -> 'b) (b0: 'b) (t:'a tree) = 
  case t of
    Empty => b0
  | Node(l, k, r) => pre_fold f (pre_fold f (f(k, b0)) l) r

In a post-order traversal, the value at the node is considered after the two subtrees:

fun post_fold (f: 'a * 'b -> 'b) (b0: 'b) (t:'a tree) = 
  case t of
    Empty => b0
  | Node(l, k, r) => let
                       val b1:'b = post_fold f b0 l
                       val b2:'b = post_fold f b1 r
                     in
                       f(k, b2)
                     end

In an in-order traversal, the value at the node is considered between the two subtrees:

fun in_fold (f: 'a * 'b -> 'b) (b0: 'b) (t:'a tree) = 
  case t of
    Empty => b0
  | Node(l, k, r) => let
                       val b1:'b = in_fold f b0 l
                       val b2:'b = f(k, b1)
                     in
                       in_fold f b2 r
                     end
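To see the three orders side by side, we can fold with the list constructor and record the order in which values are visited. This is a sketch: the folds are restated in compact clausal form so that the example is self-contained, and visitOrder is a helper name of our own.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

fun pre_fold f b0 Empty = b0
  | pre_fold f b0 (Node(l, k, r)) = pre_fold f (pre_fold f (f(k, b0)) l) r

fun in_fold f b0 Empty = b0
  | in_fold f b0 (Node(l, k, r)) = in_fold f (f(k, in_fold f b0 l)) r

fun post_fold f b0 Empty = b0
  | post_fold f b0 (Node(l, k, r)) = f(k, post_fold f (post_fold f b0 l) r)

val t = Node(Node(Empty, 1, Empty), 2,
             Node(Node(Empty, 3, Empty), 4, Node(Empty, 5, Empty)))

(* Consing onto the accumulator builds the visit order backwards, hence rev. *)
fun visitOrder fold = rev (fold (op ::) [] t)

val pre  = visitOrder pre_fold    (* [2, 1, 4, 3, 5] *)
val ino  = visitOrder in_fold     (* [1, 2, 3, 4, 5] *)
val post = visitOrder post_fold   (* [1, 3, 5, 4, 2] *)
```

The tree t is the example tree drawn earlier; note that only the in-order traversal lists its values in ascending order.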

One might argue that other systematic traversal orders exist: what about the order in which the right subtree is processed first, then the root of the current subtree, followed by the left subtree? It is easy to see that this traversal becomes a simple in-order traversal if we start at the root and recursively flip the positions of the left and right subtrees. As these "alternative" traversal orders can be trivially reduced to one of the three traversal orders given above, they are not studied separately.

There are problems whose solutions map naturally to one of the three traversals given above. For example, if we need to produce an increasingly ordered list of the values stored in a tree, then an in-order traversal is most natural. If the tree represents the structure of an arithmetic expression with nodes representing operators and subtrees representing operands, then a post-order traversal is best adapted to the problem (operands must be evaluated before their associated operator).
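The arithmetic-expression case can be sketched as follows. The datatype expr is a hypothetical representation chosen for this example (leaves hold numbers, internal nodes hold a binary operator); it is not the tree type used in the rest of the lecture.

```sml
(* Hypothetical expression tree: leaves are numbers,
   internal nodes carry a binary operator and two operands. *)
datatype expr = Num of int
              | Op of expr * (int * int -> int) * expr

(* Post-order evaluation: both operands first, then the operator at the node. *)
fun eval (e: expr): int =
  case e of
    Num n => n
  | Op(l, f, r) =>
      let
        val a = eval l
        val b = eval r
      in
        f (a, b)
      end

(* The expression (1 + 2) times (7 - 3). *)
val e = Op(Op(Num 1, op +, Num 2), op *, Op(Num 7, op -, Num 3))
val result = eval e   (* 12 *)
```

The let expression makes the post-order structure explicit: the operator f is applied only after both subtrees have been evaluated.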

Binary Search Trees

Of course, we don't really want to traverse the whole tree to find a data element. Suppose that we want to find a value in the tree, and assume that the base type of these values has an ordering which allows us to compare any two values of the respective type. A binary search tree lets us exploit this ordering to find elements efficiently. A binary search tree is a binary tree that satisfies the following invariant:

For each node in the tree, the elements stored in its left subtree are all strictly less than the element of the node, and the elements stored in its right subtree are all strictly greater than the element of the node.

Note that for empty trees, trees with one node, and leaf nodes this property holds vacuously. Also, the definition above precludes the existence of duplicate values in the tree. Should duplicate values be necessary, their existence is typically represented in the information carried by the node, and not by having several nodes identified by the same value (key).

When a tree satisfies the data structure invariant, an in-order traversal inspects the value of each node in ascending order.
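This observation also gives a simple way to test the invariant: a binary tree is a binary search tree exactly when its in-order listing is strictly increasing. The sketch below uses names of our own (toList, isBST) and favors clarity over efficiency.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* In-order list of the elements. *)
fun toList (t: 'a tree): 'a list =
  case t of
    Empty => []
  | Node(l, k, r) => toList l @ (k :: toList r)

(* The invariant holds exactly when the in-order listing
   is strictly increasing with respect to cmp. *)
fun isBST (t: 'a tree, cmp: 'a * 'a -> order): bool =
  let
    fun increasing (x :: (rest as y :: _)) =
          cmp (x, y) = LESS andalso increasing rest
      | increasing _ = true
  in
    increasing (toList t)
  end

val good = Node(Node(Empty, 1, Empty), 2,
                Node(Node(Empty, 3, Empty), 4, Node(Empty, 5, Empty)))
val bad  = Node(Node(Empty, 3, Empty), 2, Empty)   (* 3 sits left of 2 *)

val goodOk = isBST (good, Int.compare)   (* true *)
val badOk  = isBST (bad, Int.compare)    (* false *)
```

A check like this is handy in tests for functions that are supposed to preserve the invariant.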

Finding Elements in a BST

The ordering invariant allows for efficient navigation of the tree to find an element e if it is present, or to determine that the respective element is not in the tree. Arriving at a given node, we can compare e to the value k stored at the node. If e is equal to k, then we have found its location in the tree. Otherwise, e is either less than or greater than k, in which case we know that the element, if present, must be found in the left subtree or the right subtree respectively. If e is not present in the tree, then sooner or later we will reach an Empty node.

fun contains (t:'a tree, e:'a, cmp: 'a * 'a -> order) =
  case t of
    Empty => false
  | Node(l, k, r) => case cmp(e, k) of
                       LESS => contains (l, e, cmp)
                     | EQUAL => true
                     | GREATER => contains(r, e, cmp)

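Given a binary tree of height h, this function will make at most h + 1 recursive calls as it walks down a path from the root towards the leaves of the tree: at most h calls to reach the deepest node on the path, plus possibly one more call on an Empty subtree.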
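To make the walk down the tree visible, here is a sketch of a variant that returns the keys compared against e along the way. The function searchPath is a hypothetical helper written for this illustration; it follows the same case analysis as contains.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* Keys compared against e, in order, while searching for e. *)
fun searchPath (t: 'a tree, e: 'a, cmp: 'a * 'a -> order): 'a list =
  case t of
    Empty => []
  | Node(l, k, r) =>
      k :: (case cmp(e, k) of
              LESS => searchPath (l, e, cmp)
            | EQUAL => []
            | GREATER => searchPath (r, e, cmp))

val t = Node(Node(Empty, 1, Empty), 2,
             Node(Node(Empty, 3, Empty), 4, Node(Empty, 5, Empty)))

val hit  = searchPath (t, 3, Int.compare)   (* [2, 4, 3] *)
val miss = searchPath (t, 6, Int.compare)   (* [2, 4, 5] *)
```

In both cases the search touches a single root-to-leaf path, which is why its cost is bounded by the height of the tree and not by the number of nodes.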

Inserting Elements into a BST

Suppose that we have a binary search tree, and we would like to create a new binary search tree that contains one additional element. We can write this recursively too:

fun add(t:'a tree, e:'a, cmp: 'a * 'a ->order): 'a tree =
  case t of
    Empty => Node(Empty, e, Empty)
  | Node(l, k, r) => case cmp(e, k) of
                         LESS => Node(add (l, e, cmp), k, r)
                       | EQUAL => t
                       | GREATER => Node(l, k, add(r, e, cmp))

This code is simple and will make at most h + 1 recursive calls when inserting into a tree of height h. However, there is a lurking performance problem. Suppose that we insert a series of n elements that are always increasing in value. In this case the code will always follow the GREATER arm of the case expression and will build a tree that looks just like a linked list of length n. Therefore looking up an element might require looking at the entire tree. We will see later how to do a better job.

 1
  \
   2              A degenerate tree (essentially a list) results when 
    \             we insert an ordered sequence (inserted sequence: 1, 2, 3, 4, 5).
     3
      \    
       4
        \
         5
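The effect of insertion order can be reproduced with a short experiment. In this sketch, fromList and height are helper names of our own, and the empty tree is assumed to have height ~1 so that a single node has height 0.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

fun add (t: 'a tree, e: 'a, cmp: 'a * 'a -> order): 'a tree =
  case t of
    Empty => Node(Empty, e, Empty)
  | Node(l, k, r) =>
      (case cmp(e, k) of
         LESS => Node(add (l, e, cmp), k, r)
       | EQUAL => t
       | GREATER => Node(l, k, add (r, e, cmp)))

(* Largest depth of any node; assumption: the empty tree has height ~1. *)
fun height (t: 'a tree): int =
  case t of
    Empty => ~1
  | Node(l, _, r) => 1 + Int.max (height l, height r)

(* Insert the elements of a list from left to right. *)
fun fromList (xs: int list): int tree =
  foldl (fn (x, t) => add (t, x, Int.compare)) Empty xs

val sorted   = fromList [1, 2, 3, 4, 5]   (* the degenerate tree above *)
val shuffled = fromList [2, 1, 4, 3, 5]   (* the example tree from earlier *)

val hSorted   = height sorted     (* 4 *)
val hShuffled = height shuffled   (* 2 *)
```

Both trees hold the same five elements, yet the sorted insertion order doubles the height; for n elements inserted in increasing order the height is n - 1.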

Finding Ranges in a BST

Recall that an in-order traversal visits nodes in ascending order of their elements. We can use this fact to efficiently find all the elements in a tree in a range (interval) between two elements a0 and a1 such that a0 < a1 (we denote by < the generalized comparison operation). For example, we can write a fold operation that only considers such elements:

fun ifold (f: 'a * 'b -> 'b) (b0: 'b) (t:'a tree) (cmp: 'a * 'a -> order) (a0: 'a) (a1: 'a) =
  case t of
    Empty => b0
  | Node(l:'a tree, k:'a, r:'a tree) =>
      case (cmp(a0, k), cmp(a1, k)) of
        (LESS, LESS) => ifold f b0 l cmp a0 a1
      | (GREATER, GREATER) => ifold f b0 r cmp a0 a1
      | (LESS, EQUAL) => f(k, ifold f b0 l cmp a0 a1)
      | (EQUAL, GREATER) => ifold f (f(k, b0)) r cmp a0 a1
      | (LESS, GREATER) => ifold f (f(k, ifold f b0 l cmp a0 a1)) r cmp a0 a1

Function cmp(a, b) returns LESS if a < b, EQUAL if a = b, and GREATER if a > b.

This code will only visit the nodes in the tree that are within the range and the ancestors of those nodes, which is potentially quite efficient. You will understand this code if you consider the various possibilities that can occur (k is the value, or the key stored in the current node):

     Case           (cmp(a0, k), cmp(a1, k))   Location of values in [a0, a1]
1.   k < a0 < a1    (GREATER, GREATER)         right subtree
2.   k = a0 < a1    (EQUAL, GREATER)           current node, right subtree
3.   a0 < k < a1    (LESS, GREATER)            left subtree, current node, right subtree
4.   a0 < a1 = k    (LESS, EQUAL)              left subtree, current node
5.   a0 < a1 < k    (LESS, LESS)               left subtree

Note that the problem, as we specified it above, does not actually impose an order in which the values in the interval [a0, a1] are processed. For the solution, we chose to process these values in their natural order: the smallest value first, then the second-smallest value, and so on until the largest value in the interval. This ordering imposes a visiting order analogous to the in-order traversal; we might call it a "truncated" in-order traversal.

Cases (1) and (5), and (2) and (4), are analogous. Note, however, that the code that corresponds to these case pairs is different: in one case the function is applied to the current node before the recursive call, while in the other case the function is applied to the current node after the recursive call.

Let us reexamine for a minute the most complex case, which we reproduce below:

      | (LESS, GREATER) => ifold f (f(k, ifold f b0 l cmp a0 a1)) r cmp a0 a1

As we mentioned when discussing function pre_fold, we could have written this case in a more explicit format, like so:

      | (LESS, GREATER) => let
                             val b1 = ifold f b0 l cmp a0 a1
                             val b2 = f(k, b1)
                           in
                             ifold f b2 r cmp a0 a1
                           end

In general, the expression (cmp(a, b), cmp(c, d)) can have 9 values. As we have seen, for our problem, only 5 cases are relevant. Do the other four cases matter? Well, yes. If you typed in our solution, you might have noticed the non-exhaustive match warning that you got. The SML compiler does not know that four cases are excluded, and - in general - it has no way to determine this. Warnings should be treated as errors, not only because of our course policy, but also because they are likely to point to unsafe, error-prone programming patterns.

We can fix the problem by rewriting the last case as follows:

      | _ => ifold f (f(k, ifold f b0 l cmp a0 a1)) r cmp a0 a1

The "indifferent" (wildcard) pattern that we introduced solves the warning problem, but introduces another one, possibly more insidious. What if you make an error, and your function arguments are such that they violate your assumptions and "impossible" cases suddenly become possible? Well, in such a case the default branch would be executed. In the case of a complex program, whose result you might not necessarily be able to anticipate or check, you might not even notice that your algorithm is flawed. Your innocent change suppressed a warning but opened the door for silent errors.

Such situations are more frequent than you might think, and they explain many of the programming errors that occasionally make a computer user's life miserable. The right solution, of course, is not to sweep "impossible" cases under a default branch, but to explicitly test for them; this can often be done with minimal overhead in terms of code length and execution time. You will save yourself a lot of debugging time if you follow this principle, so make sure you understand it.

A solution that eliminates the warning, but does not silently suppress an otherwise easily detectable error case (when a0 >= a1, contrary to our assumptions), is given below:

fun ifold (f: 'a * 'b -> 'b) (b0: 'b) (t:'a tree) (cmp: 'a * 'a -> order) (a0: 'a) (a1: 'a) =
  case t of
    Empty => b0
  | Node(l:'a tree, k:'a, r:'a tree) =>
      case (cmp(a0, k), cmp(a1, k)) of
        (LESS, LESS) => ifold f b0 l cmp a0 a1
      | (GREATER, GREATER) => ifold f b0 r cmp a0 a1
      | (LESS, EQUAL) => f(k, ifold f b0 l cmp a0 a1)
      | (EQUAL, GREATER) => ifold f (f(k, b0)) r cmp a0 a1
      | (LESS, GREATER) => ifold f (f(k, ifold f b0 l cmp a0 a1)) r cmp a0 a1
      | _ => raise Fail "internal error; <impossible> condition occurred"
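As a usage sketch, the code below collects the elements of the example tree that fall in a range, and shows what happens when the assumption a0 < a1 is violated. The helper range and the use of an option to report the failure are our own choices for this illustration; ifold is restated so that the example is self-contained.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

fun ifold (f: 'a * 'b -> 'b) (b0: 'b) (t: 'a tree)
          (cmp: 'a * 'a -> order) (a0: 'a) (a1: 'a): 'b =
  case t of
    Empty => b0
  | Node(l, k, r) =>
      (case (cmp(a0, k), cmp(a1, k)) of
         (LESS, LESS) => ifold f b0 l cmp a0 a1
       | (GREATER, GREATER) => ifold f b0 r cmp a0 a1
       | (LESS, EQUAL) => f(k, ifold f b0 l cmp a0 a1)
       | (EQUAL, GREATER) => ifold f (f(k, b0)) r cmp a0 a1
       | (LESS, GREATER) => ifold f (f(k, ifold f b0 l cmp a0 a1)) r cmp a0 a1
       | _ => raise Fail "internal error; <impossible> condition occurred")

val t = Node(Node(Empty, 1, Empty), 2,
             Node(Node(Empty, 3, Empty), 4, Node(Empty, 5, Empty)))

(* Elements in [a0, a1], ascending; consing reverses the order, hence rev. *)
fun range (a0: int, a1: int): int list =
  rev (ifold (op ::) [] t Int.compare a0 a1)

val inRange = range (2, 4)                                (* [2, 3, 4] *)
val swapped = SOME (range (4, 2)) handle Fail _ => NONE   (* NONE *)
```

With the bounds swapped, the root produces the pair (GREATER, EQUAL), which none of the five expected patterns covers, so the error is reported instead of being silently folded into a wrong result.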

Printing a Tree

It is often useful to print out a tree in a form that allows for easy inspection of its structure. As it turns out, this is easy to do even in an environment where we can only generate text output. Here is one possible solution:

fun printTree (t: 'a tree) (p: 'a -> unit): unit = 
  let
    fun spaces (n: int): string = if n = 0 then "" else " " ^ (spaces (n - 1))
    fun helper (t: 'a tree) (p: 'a -> unit) (n: int): unit =
      case t of
        Empty => print ((spaces n) ^ "Empty\n")
      | Node(l, k, r) => (helper r p (n + 2);   
                          print (spaces n); p k; print "\n";
                          helper l p (n + 2))
  in
    helper t p 0
  end

Sample input:

printTree (Node(Node(Empty, 1, Empty), 2, Node(Node(Empty, 3, Empty), 4, Node(Empty, 5, Empty))))
          (fn x: int => print (Int.toString x))

Sample output (look at it while tilting your head 90 degrees to the left):

      Empty
    5
      Empty
  4
      Empty
    3
      Empty
2
    Empty
  1
    Empty

Function printTree "computes" unit (the 0-tuple). As there is only one unit value, all functions that return it are equivalent with respect to the value they compute. From this perspective, the following is a perfect replacement of the original definition:

fun printTree (t: 'a tree) (p: 'a -> unit): unit = ()

If the computed (i.e. returned) value of these two functions is identical, what makes the first version of the function more interesting than the second? Well, obviously, the first version is interesting because it processes and prints out information that reflects the structure of the tree. The printing produces nothing but unit values, but it is still useful. Such interactions of a program with the environment in which it runs are called side effects, to emphasize that they happen apart from (but possibly as a consequence of) computations. Another obvious example of a side effect would be reading in values from the keyboard. As we stated before, pure functional programming does not allow for side effects, but most practical implementations accommodate them.
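One way to keep side effects at the edge of a program is to compute the text as an ordinary value and print it only at the end. The following is a sketch of that idea, not a replacement for printTree; the name treeToString and the conversion parameter show are assumptions made for this example.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* Same layout as printTree, but the text is returned instead of printed.
   Assumption: the caller supplies show to convert an element to a string. *)
fun treeToString (show: 'a -> string) (t: 'a tree): string =
  let
    fun spaces (n: int): string =
      String.implode (List.tabulate (n, fn _ => #" "))
    fun helper (t: 'a tree) (n: int): string =
      case t of
        Empty => spaces n ^ "Empty\n"
      | Node(l, k, r) =>
          helper r (n + 2) ^ spaces n ^ show k ^ "\n" ^ helper l (n + 2)
  in
    helper t 0
  end

val s = treeToString Int.toString (Node(Node(Empty, 1, Empty), 2, Empty))

(* The caller decides whether and when to perform the side effect. *)
val () = print s
```

Unlike printTree, the function treeToString can be told apart from a trivial replacement by its result alone, which also makes it straightforward to test.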
