CS312 Lecture 8: Graphs. Trees. Binary Search Trees

Trees are one of the most important data structures in computer science. We can think of a tree both as a mathematical abstraction and as a very concrete data structure used to efficiently implement other abstractions such as sets and dictionaries. The ML language turns out to be very well designed for manipulating trees.

Graphs

We start with the more general concept of a graph. A graph consists of a set of nodes (also called vertices) and a set of edges that connect these nodes. In an undirected graph, any two distinct nodes are either connected by an edge or not (a node may not be connected to itself). A subgraph is a graph whose sets of vertices and edges are subsets of the respective sets of another (given) graph. A connected subgraph of a graph is called a connected component of the graph. A connected component to which we cannot add further nodes is called a maximal connected component. A graph with n nodes has at most n maximal connected components.

 The nodes may be any kind of object; typically, the node objects contain some additional information that is to be stored at that location in the graph. We may draw a graph pictorially using labeled dots or circles for nodes and lines to represent the edges that connect them.

There is another important kind of graph, directed graphs, which we will talk about later. In these graphs the edges have directionality and we draw them as arrows.

An undirected graph is connected if every node is reachable from every other node (a node is reachable from another node if it can be reached by following some sequence of edges). If some nodes are not reachable from other nodes, the graph is disconnected.
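The notion of reachability can be made concrete with a small sketch. The representation below (an adjacency list keyed by integer node names) and the function names neighbors, reachable, and connected are assumptions chosen for illustration; they are not part of the lecture's code.

```sml
(* A graph as an adjacency list: each node is paired with its neighbors.
   Hypothetical representation, chosen only for this sketch. *)
type graph = (int * int list) list

(* neighbors g n: the nodes adjacent to n (empty if n is not listed). *)
fun neighbors (g: graph) (n: int): int list =
  case List.find (fn (m, _) => m = n) g of
    SOME (_, ns) => ns
  | NONE => []

(* reachable g (a, b): can b be reached from a by following edges?
   A depth-first search that remembers the nodes already visited. *)
fun reachable (g: graph) (a: int, b: int): bool =
  let
    fun member (x, xs) = List.exists (fn y => y = x) xs
    fun search ([], _) = false
      | search (n :: todo, visited) =
          if n = b then true
          else if member (n, visited) then search (todo, visited)
          else search (neighbors g n @ todo, n :: visited)
  in
    search ([a], [])
  end

(* connected g: every node is reachable from the first listed node.
   For an undirected graph this implies every node reaches every other. *)
fun connected (g: graph): bool =
  case g of
    [] => true
  | (n0, _) :: _ => List.all (fn (n, _) => reachable g (n0, n)) g

(* Nodes 1, 2, 3 form a path; node 4 is isolated. *)
val g1: graph = [(1, [2]), (2, [1, 3]), (3, [2]), (4, [])]
val r13 = reachable g1 (1, 3)   (* true *)
val r14 = reachable g1 (1, 4)   (* false *)
val c1 = connected g1           (* false *)
```

Each undirected edge is listed in both directions here, which is one common way to store an undirected graph in an adjacency list.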

A cycle is a sequence of nodes in which an edge goes from each node in the sequence to the next, and an edge goes from the last node in the sequence to the first one. Pictorially it looks like a loop. If a graph has no cycles, it is said to be acyclic.

Trees

Trees are a particularly important kind of graph. A graph T with n vertices is a tree if any of the following conditions is satisfied:

1.      The graph is undirected, connected, and acyclic.

2.      The graph is connected and has exactly n-1 edges.

3.      The graph is maximally acyclic (i.e., it has no cycles, and adding any one edge creates a cycle).

4.      The graph is minimally connected (i.e., it is connected, and removing any one edge disconnects it).

These conditions are all equivalent; one can show, for example, that (1) => (2) => (3) => (4) => (1). Despite their apparent simplicity, trees have a lot of other interesting properties. Trees represent one of the most important data structures you will encounter. The interest in trees is partly explained by their extremal properties (see the definitions above), but also by the fact that the structure of binary trees maps naturally to a sequence of if…then…else decisions.

In many applications, but not always, one node of the tree is distinguished from the rest and is called the root node. Because a tree is a connected graph, every node is reachable from the root. Because the tree is acyclic, there is only one way to get from the root to any given node. It is a convention in computer science to draw trees upside down: the root node is drawn at the top and every other node is drawn below it, with any given node drawn below the other nodes on the path from the root to that node. The depth of a node is the number of edges that must be traversed to get from the root node to that node. The depth of the root is zero. The height of a tree is the largest depth of any node in the tree.

Every node N (except the root) is connected by an edge to exactly one node whose depth is one less. This node is called the parent node of N. The other nodes to which N is connected, if any, have depth one greater than N, and are called children of N. The nodes along the path from the root to a node N are called the ancestors of N, and the nodes whose paths to the root include N (other than N itself) are called the descendants of N. In general a node may have any number of children. The number of children of a node is called the degree of a node. If a node has degree zero (no children), it is called a leaf (or external) node. Other nodes are known as internal nodes. A subtree is the set of all descendants of a particular node, plus that node as the root of the subtree.

Binary Trees

Generally a tree structure by itself is not very useful. It becomes interesting when we attach information to the nodes of the tree. We can then navigate down the tree to find information of interest within the tree. In order to find data in a tree efficiently we will need to impose some restrictions on where data is placed within the tree, and we will need to keep more information about the ordering of the children of a given node.

A binary tree is a tree in which every node has degree at most two. Conventionally, each child of an internal node in a binary tree is called either the left child or the right child of that node (the names are obvious if you think of the graphical representation of a tree). A node of degree two must have one of each.

Trees are very easy to define in SML. A binary tree either contains no nodes at all (Empty), or it contains a root node with a left subtree and a right subtree. Here is how we might declare a tree that stores integers at every node:

datatype inttree = Empty | Node of inttree * int * inttree

Here is an example:

    2
   / \
  1   4        Node(Node(Empty, 1, Empty), 2, Node(Node(Empty, 3, Empty), 4, Node(Empty, 5, Empty)))
     / \
    3   5

If we felt that the tuple didn't document things enough, we could define the type using a record:

datatype inttree = Empty | Node of {left: inttree, value: int, right: inttree}

Binary trees can be generalized to trees in which every node has degree at most k. Such trees are called k-ary trees. Each node can have up to k children, and each child has a distinct index in the range 1..k (or 0..k-1). Thus, a binary tree is just a k-ary tree with k=2.
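One possible way to declare such a tree in SML is sketched below. It is a simplification: the children are kept in a list, so a child's index is just its position, the bound k is checked by a function rather than enforced by the type, and there is no empty tree. The names ktree, KNode, maxDegree, and isKary are hypothetical.

```sml
(* A k-ary tree sketch: each node stores a value and a list of children. *)
datatype 'a ktree = KNode of 'a * 'a ktree list

(* maxDegree t: the largest number of children of any node in t. *)
fun maxDegree (KNode (_, kids): 'a ktree): int =
  List.foldl (fn (c, m) => Int.max (maxDegree c, m)) (List.length kids) kids

(* isKary k t: does every node of t have at most k children? *)
fun isKary (k: int) (t: 'a ktree): bool = maxDegree t <= k

(* The root has three children; one of them has a single child. *)
val t3 = KNode (1, [KNode (2, []), KNode (3, [KNode (5, [])]), KNode (4, [])])
val d = maxDegree t3        (* 3 *)
val ternary = isKary 3 t3   (* true *)
val binary = isKary 2 t3    (* false *)
```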

There was nothing about the datatypes above that required them to contain integers. We can define a parameterized tree type just as easily:

datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree
 

A k-ary tree is full if every internal node has degree k and every leaf node has the same depth. Suppose that a tree has degree k at all internal nodes, and all leaf nodes have depth h. A tree of height 0 has 1 node, of height 1 has $k+1$ nodes, of height 2 has $k^2+k+1$ nodes, etc. That is, a full k-ary tree of height h has at least $k^h$ nodes; in fact, it has $\sum_{i=0}^{h} k^i$ nodes. Summing this geometric series (for $k \ge 2$), we see this is equal to $(k^{h+1}-1)/(k-1)$. Using this formula, we see that a full binary tree of height h contains $2^{h+1}-1$ nodes. Because this expression is exponential in h, even a relatively short path through a k-ary tree can get us to a huge amount of data.
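The node count for full binary trees can be checked on small cases with a short sketch. The helper names numNodes, height, full, and pow2 are assumptions made for this example, as is the convention that the empty tree has height ~1 (so that a single node has height 0, matching the definition of height given earlier).

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* numNodes t: the number of nodes in t. *)
fun numNodes (t: 'a tree): int =
  case t of
    Empty => 0
  | Node (l, _, r) => numNodes l + 1 + numNodes r

(* height t: the largest depth of any node in t.
   Convention assumed here: Empty has height ~1. *)
fun height (t: 'a tree): int =
  case t of
    Empty => ~1
  | Node (l, _, r) => 1 + Int.max (height l, height r)

(* full h: a full binary tree of height h, storing unit at every node. *)
fun full (h: int): unit tree =
  if h < 0 then Empty
  else let val sub = full (h - 1) in Node (sub, (), sub) end

(* pow2 n: 2 raised to the power n, for n >= 0. *)
fun pow2 (n: int): int = if n <= 0 then 1 else 2 * pow2 (n - 1)

val t = full 3
val n = numNodes t                   (* 15 *)
val h = height t                     (* 3 *)
val agrees = (n = pow2 (h + 1) - 1)  (* true *)
```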

Traversals

It is very easy to write a recursive function to traverse a binary (or k-ary) tree; that is, to visit all of its nodes. There are three obvious ways to write the traversal code. In each case the recursive function visits the subtrees of the current node recursively, and also inspects the value of the current node. However, there is a choice about when to inspect the value of the current node. For example, we can write three versions of a fold function that operates on trees:

In a pre-order traversal, the value at the node is considered before the two subtrees.

fun fold_pre (f: 'a*'b -> 'b) (b0: 'b) (t:'a tree) =
  case t of
    Empty => b0
  | Node(lf:'a tree, v:'a, rg:'a tree) =>
      let val b1:'b = f(v,b0)
          val b2:'b = fold_pre f b1 lf in
                      fold_pre f b2 rg
      end

In a post-order traversal, the value at the node is considered after the two subtrees:

fun fold_post (f: 'a*'b -> 'b) (b0: 'b) (t:'a tree) =
  case t of
    Empty => b0
  | Node(lf:'a tree, v:'a, rg:'a tree) =>
      let val b1:'b = fold_post f b0 lf
          val b2:'b = fold_post f b1 rg in
                      f(v, b2)
      end

In an in-order traversal, the value at the node is considered between the two subtrees:

fun fold_in (f: 'a*'b -> 'b) (b0: 'b) (t:'a tree) =
  case t of
    Empty => b0
  | Node(lf:'a tree, v:'a, rg:'a tree) =>
      let val b1:'b = fold_in f b0 lf
          val b2:'b = f(v, b1) in
                      fold_in f b2 rg
      end
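To see the difference between the three orders, we can fold with the list constructor and look at the order in which values were visited. The sketch below restates the three folds in a more compact clausal form (intended to behave the same as the versions above) and applies them to the example tree from earlier; the names pre, post, and inord are just illustrative.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* Value first, then left subtree, then right subtree. *)
fun fold_pre f b0 Empty = b0
  | fold_pre f b0 (Node (lf, v, rg)) =
      fold_pre f (fold_pre f (f (v, b0)) lf) rg

(* Left subtree, then right subtree, then value. *)
fun fold_post f b0 Empty = b0
  | fold_post f b0 (Node (lf, v, rg)) =
      f (v, fold_post f (fold_post f b0 lf) rg)

(* Left subtree, then value, then right subtree. *)
fun fold_in f b0 Empty = b0
  | fold_in f b0 (Node (lf, v, rg)) =
      fold_in f (f (v, fold_in f b0 lf)) rg

val t = Node (Node (Empty, 1, Empty), 2,
              Node (Node (Empty, 3, Empty), 4, Node (Empty, 5, Empty)))

(* Consing puts the most recently visited value in front, so reverse
   the result to read the values in visiting order. *)
val pre   = rev (fold_pre  (op ::) [] t)   (* [2, 1, 4, 3, 5] *)
val post  = rev (fold_post (op ::) [] t)   (* [1, 3, 5, 4, 2] *)
val inord = rev (fold_in   (op ::) [] t)   (* [1, 2, 3, 4, 5] *)
```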

Binary Search Trees

Of course, we don't really want to traverse the whole tree to find a data element. Suppose that we want to find a value in the tree, and assume that the base type of these values has an ordering, which allows us to compare any two values of that type. A binary search tree lets us exploit this ordering to find elements efficiently. A binary search tree is a binary tree that satisfies the following invariant:

For each node in the tree, the elements stored in its left subtree are all strictly less than the element of the node, and the elements stored in its right subtree are all strictly greater than the element of the node.

Note that for leaf nodes this property holds vacuously. Also, the definition above precludes the existence of duplicate values in the tree.

When a tree satisfies the data structure invariant, an in-order traversal inspects the value of each node in ascending order.
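This observation also gives a simple (if not the most efficient) way to test the invariant: a binary tree is a binary search tree exactly when its in-order sequence is strictly ascending. The sketch below assumes a comparison function cmp returning the standard order type; the names inorder, ascending, and isBST are hypothetical.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* inorder t: the elements of t in in-order, as a list. *)
fun inorder (t: 'a tree): 'a list =
  case t of
    Empty => []
  | Node (l, v, r) => inorder l @ (v :: inorder r)

(* ascending cmp xs: is every element strictly less than its successor? *)
fun ascending (cmp: 'a * 'a -> order) (xs: 'a list): bool =
  case xs of
    x :: (rest as y :: _) => cmp (x, y) = LESS andalso ascending cmp rest
  | _ => true

(* isBST cmp t: does t satisfy the binary search tree invariant? *)
fun isBST (cmp: 'a * 'a -> order) (t: 'a tree): bool =
  ascending cmp (inorder t)

val good = Node (Node (Empty, 1, Empty), 2,
                 Node (Node (Empty, 3, Empty), 4, Node (Empty, 5, Empty)))
val bad = Node (Node (Empty, 3, Empty), 2, Empty)   (* 3 sits left of 2 *)

val okGood = isBST Int.compare good   (* true *)
val okBad = isBST Int.compare bad     (* false *)
```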

Finding elements

This invariant allows efficient navigation of the tree to find an element e if it is present. Arriving at a given node, we can compare e to the value v stored at the node. If e is equal to v, then we have found it. Otherwise, e is either less than or greater than v, in which case we know that the element, if present, must be found in the left subtree or the right subtree respectively.

   fun contains (t:'a tree, e:'a, cmp:'a*'a->order) =
      case t of
         Empty => false
       | Node(lf, v, rg) =>
            case cmp(e, v) of
               LESS => contains (lf, e, cmp)
             | EQUAL => true
             | GREATER => contains(rg, e, cmp)

Given a binary tree of height h, this function will make at most h+1 recursive calls (one per level below the root, plus a final call on Empty) as it walks down a path from the root toward the leaves of the tree.
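As a usage sketch, here is contains applied to the example tree from earlier and to a small tree of strings. Int.compare and String.compare come from the SML Basis Library; the value names are illustrative only.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* contains, as defined above. *)
fun contains (t: 'a tree, e: 'a, cmp: 'a * 'a -> order) =
  case t of
    Empty => false
  | Node (lf, v, rg) =>
      case cmp (e, v) of
        LESS => contains (lf, e, cmp)
      | EQUAL => true
      | GREATER => contains (rg, e, cmp)

val t = Node (Node (Empty, 1, Empty), 2,
              Node (Node (Empty, 3, Empty), 4, Node (Empty, 5, Empty)))

val has3 = contains (t, 3, Int.compare)   (* true: 3 > 2, then 3 < 4 *)
val has6 = contains (t, 6, Int.compare)   (* false: falls off the right *)

(* The same function works for any type with a comparison function. *)
val words = Node (Node (Empty, "ant", Empty), "bee", Node (Empty, "cat", Empty))
val hasCat = contains (words, "cat", String.compare)   (* true *)
```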

Inserting elements

Suppose that we have a binary search tree, and we would like to create a new binary search tree that contains one additional element. We can write this recursively too:

   fun add(t:'a tree, e:'a, cmp:'a*'a->order): 'a tree =
      case t of
         Empty => Node(Empty, e, Empty)
       | Node(lf, v, rg) =>
            case cmp(e, v) of
               LESS => Node(add (lf, e, cmp), v, rg)
             | EQUAL => t
             | GREATER => Node(lf, v, add(rg, e, cmp))

This code is simple and will make at most h+1 recursive calls when inserting into a tree of height h. However, there is a lurking performance problem. Suppose that we insert a series of n elements that are always increasing in value. In this case the code will always follow the GREATER arm of the case expression and will build a tree that looks just like a linked list of length n. Looking up an element might therefore require looking at the entire tree. We will see later how to do a better job.

 1
  \
   2
    \
     3     A degenerate tree (essentially a list) results when we insert an ordered sequence.
      \    (inserted sequence: 1, 2, 3, 4, 5)
       4
        \
         5
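The effect of insertion order on the shape of the tree can be observed directly. The sketch below builds two trees from the same five elements; fromList and height are helper names assumed for this example, with height using the convention that Empty has height ~1.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* add: insert e into the binary search tree t, keeping v at each node. *)
fun add (t: 'a tree, e: 'a, cmp: 'a * 'a -> order): 'a tree =
  case t of
    Empty => Node (Empty, e, Empty)
  | Node (lf, v, rg) =>
      case cmp (e, v) of
        LESS => Node (add (lf, e, cmp), v, rg)
      | EQUAL => t
      | GREATER => Node (lf, v, add (rg, e, cmp))

(* height t: largest depth of any node; ~1 for the empty tree. *)
fun height (t: 'a tree): int =
  case t of
    Empty => ~1
  | Node (l, _, r) => 1 + Int.max (height l, height r)

(* fromList cmp xs: insert the elements of xs one at a time, left to right. *)
fun fromList (cmp: 'a * 'a -> order) (xs: 'a list): 'a tree =
  List.foldl (fn (x, t) => add (t, x, cmp)) Empty xs

val sorted = fromList Int.compare [1, 2, 3, 4, 5]
val mixed  = fromList Int.compare [3, 1, 4, 2, 5]

val hSorted = height sorted   (* 4: every node has only a right child *)
val hMixed  = height mixed    (* 2 *)
```

Both trees hold the same set of elements, but a lookup in the first may walk through all five nodes while a lookup in the second inspects at most three.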

Finding ranges

Recall that an in-order traversal visits nodes in ascending order of their elements. We can use this fact to efficiently find all the elements in a tree in a range between two elements. For example, we can write a fold operation that only considers such elements:

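(* Folds f, in ascending order, over the elements v of t with a0 <= v <= a1.
   Assumes a0 <= a1 according to cmp. *)
fun fold_range (f: 'a*'b -> 'b) (b0: 'b) (t:'a tree) (cmp:'a*'a->order) (a0:'a, a1:'a) =
  case t of
    Empty => b0
  | Node(lf:'a tree, v:'a, rg:'a tree) =>
      case (cmp(a0,v), cmp(a1,v)) of
        (* v is above the range: only the left subtree can contribute *)
        (LESS, LESS) => fold_range f b0 lf cmp (a0,a1)
        (* v is below the range: only the right subtree can contribute *)
      | (GREATER, GREATER) => fold_range f b0 rg cmp (a0,a1)
        (* v = a1: nothing in the right subtree is in range *)
      | (LESS, EQUAL) => f(v, fold_range f b0 lf cmp (a0, a1))
        (* v = a0: nothing in the left subtree is in range *)
      | (EQUAL, GREATER) => fold_range f (f(v,b0)) rg cmp (a0,a1)
        (* otherwise v is in range: left subtree, then v, then right subtree *)
      | (_, _) => fold_range f (f(v, fold_range f b0 lf cmp (a0,a1))) rg cmp (a0,a1)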

This code will only visit the nodes in the tree that are within the range and the ancestors of those nodes, which is potentially quite efficient.
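As a usage sketch, we can collect the elements of the earlier example tree that fall within a range by folding with the list constructor. The code below repeats fold_range so that it stands alone; the names inRange and low are illustrative, and both calls assume the lower bound does not exceed the upper bound.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* fold_range, as defined above; both ends of the range are included. *)
fun fold_range (f: 'a * 'b -> 'b) (b0: 'b) (t: 'a tree)
               (cmp: 'a * 'a -> order) (a0: 'a, a1: 'a) =
  case t of
    Empty => b0
  | Node (lf, v, rg) =>
      case (cmp (a0, v), cmp (a1, v)) of
        (LESS, LESS) => fold_range f b0 lf cmp (a0, a1)
      | (GREATER, GREATER) => fold_range f b0 rg cmp (a0, a1)
      | (LESS, EQUAL) => f (v, fold_range f b0 lf cmp (a0, a1))
      | (EQUAL, GREATER) => fold_range f (f (v, b0)) rg cmp (a0, a1)
      | (_, _) =>
          fold_range f (f (v, fold_range f b0 lf cmp (a0, a1))) rg cmp (a0, a1)

val t = Node (Node (Empty, 1, Empty), 2,
              Node (Node (Empty, 3, Empty), 4, Node (Empty, 5, Empty)))

(* Consing builds the list back to front, so reverse to get ascending order. *)
val inRange = rev (fold_range (op ::) [] t Int.compare (2, 4))   (* [2, 3, 4] *)
val low = rev (fold_range (op ::) [] t Int.compare (0, 1))       (* [1] *)

(* Any fold function works, for example summing the elements in range. *)
val total = fold_range (op +) 0 t Int.compare (2, 4)             (* 9 *)
```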

Printing a Tree

It is often useful to print out a tree in a form that allows for easy inspection of its structure. As it turns out, this is easy to do even in an environment where we can only generate text output. Here is one possible solution:

datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree
 
fun printTree (t: 'a tree) (p: 'a -> unit): unit =
let
  fun spaces (n: int): string = 
    if n = 0 then "" else " " ^ (spaces (n - 1))
  fun helper (t: 'a tree) (p: 'a -> unit) (n: int): unit = 
  case t of
    Empty => print ((spaces n) ^ "Empty\n")
  | Node(l, v, r) => (helper r p (n + 2);
                      print (spaces n); p v; print "\n";
                      helper l p (n + 2))
in
  helper t p 0
end
 
 

Sample input:

 
printTree (Node(Node(Empty, 1, Empty), 2, Node(Node(Empty, 3, Empty), 4, Node(Empty, 5, Empty))))
          (fn x: int => print (Int.toString(x)))
 

Sample output (look at it while tilting your head 90 degrees to the left):

 
      Empty
    5
      Empty
  4
      Empty
    3
      Empty
2
    Empty
  1
    Empty
 

The function printTree does not return any interesting value; it “computes” unit. As there is only one unit value, it would appear that all functions that return it must be equivalent. From the point of view of the value returned, the function above is indistinguishable from this one:

fun printTree (t: 'a tree) (p: 'a -> unit): unit = ()

If the computed (i.e. returned) value of these two functions is identical, what makes the first version of the function more interesting than the second? Well, obviously, the first version is interesting because it processes a tree and prints out information that reflects the structure of the tree. The printing does not produce anything but unit values, but it is still useful. Such interactions of a program with the environment in which it runs are called side effects, to emphasize that they happen apart from (but possibly as a consequence of) computations. Another obvious example of a side effect would be reading in values from the keyboard. Pure functional programming does not allow side effects, but most practical implementations accommodate them. Later in the course we will learn about more sophisticated side effects.
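One way to see the distinction is to separate the computation from the side effect. The sketch below is a hypothetical variant, showTree, that computes the text as a string value without printing anything; the only side effect is the single call to print at the end. The names showTree and show are assumptions made for this example.

```sml
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* spaces n: a string of n blanks. *)
fun spaces (n: int): string = if n <= 0 then "" else " " ^ spaces (n - 1)

(* showTree show t: the text that printTree would print, returned as a
   value. This function has no side effects. *)
fun showTree (show: 'a -> string) (t: 'a tree): string =
  let
    fun helper (t, n) =
      case t of
        Empty => spaces n ^ "Empty\n"
      | Node (l, v, r) =>
          helper (r, n + 2) ^ spaces n ^ show v ^ "\n" ^ helper (l, n + 2)
  in
    helper (t, 0)
  end

val s = showTree Int.toString (Node (Empty, 1, Empty))
(* s = "  Empty\n1\n  Empty\n" *)

(* The side effect happens only here. *)
val () = print s
```

Unlike the two versions of printTree, two different implementations of showTree can be told apart purely by the values they return.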