Administrivia: PS#3 due a week from now (less 9 hours).
Last time: graphs and BST's (intro).
This time: Big-O notation; Red-black trees
When we say that an algorithm runs in time T(n), we mean that T(n) is an upper bound on the running time that holds for all inputs of size n. This is called worst-case analysis. The algorithm may very well take less time on some inputs of size n, but it doesn't matter. If an algorithm takes T(n) = c*n^2 + k steps on only a single input of each size n and only n steps on the rest, we still say that it is a quadratic algorithm.
In estimating the running time of any program we don't know what the constants c or k are. We know that they are constants of moderate size, but beyond that their exact values are not important; asymptotic analysis gives us enough evidence to conclude that a linear algorithm is faster than a quadratic one, even though the constants may differ somewhat. (This does not always hold; the constants can sometimes make a difference, but in general it is a very good rule of thumb.)
Computer scientists have developed a convenient notation for hiding the constant factor. We write O(n) (read: "order n") instead of "cn for some constant c." Thus an algorithm is said to be O(n) or linear time if there is a fixed constant c such that for all sufficiently large n, the algorithm takes time at most cn on inputs of size n. An algorithm is said to be O(n^2) or quadratic time if there is a fixed constant c such that for all sufficiently large n, the algorithm takes time at most cn^2 on inputs of size n. O(1) means constant time.
Some common orders of growth seen often in complexity analysis are:

  O(1)        constant
  O(log n)    logarithmic
  O(n)        linear
  O(n log n)  "n log n"
  O(n^2)      quadratic
  O(n^3)      cubic
  n^O(1)      polynomial
  2^O(n)      exponential
Let f and g be functions from positive integers to positive integers. We say f is O(g) (read: "f is order g") if there exists a fixed constant c and a fixed n0 such that

  for all n > n0:  f(n) ≤ c·g(n)
Equivalently, f is O(g) if the function f(n)/g(n) is bounded above.
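To make the definition concrete, here is one worked check (the constants c = 4 and n0 = 5 are just one valid choice among many):

```
Claim: f(n) = 3n^2 + 5n is O(n^2).
Take c = 4 and n0 = 5. For all n > 5:
  5n < n*n = n^2, so
  3n^2 + 5n < 3n^2 + n^2 = 4n^2 = c*n^2.
Equivalently, f(n)/g(n) = 3 + 5/n, which is bounded above (by 4, for n >= 5).
```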
Sets are a very important and useful abstraction. In the last lecture we saw various ways to implement an abstract data type for sets. None of these implementations were state-of-the-art. Now that we're good at determining asymptotic running time, it's time to see an implementation of sets that is asymptotically more efficient and better in practice for most applications of interest.
In this lecture, we focus on sets of integers for simplicity. We can easily generalize the code we will write to manipulate elements of a type 'a by passing around a comparison function cmp: 'a * 'a -> order.
The signature that we will work with is a little different from that in the last lecture:
  signature INTSET = sig
    (* a "set" is a set of integers: e.g., {1,-11,0}, {}, and {1001} *)
    type set
    (* empty is the empty set *)
    val empty: set
    (* insert(x,s) is {x} union s. *)
    val insert: int * set -> set
    (* union is set union. *)
    val union: set * set -> set
    (* contains(x,s) is whether x is a member of s *)
    val contains: int * set -> bool
    (* size(s) is the number of elements in s *)
    val size: set -> int
    (* fold over the elements of the set *)
    val fold: ((int * 'b) -> 'b) -> 'b -> set -> 'b
  end
This differs from our earlier signature by replacing single with insert. This makes sense because insertion of a single element is a common use that often can be implemented more efficiently than general union. We can use insert to implement union as follows, inserting the elements from s2 into s1 one at a time:
fun union(s1, s2) = fold insert s1 s2
Our most efficient implementation of sets was as a list of integers with no repetition. What is the asymptotic running time of the operations, on a list of length n? Both contains and insert must scan the list, so each is O(n); union, which inserts every element of one list into the other, is O(n^2).
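As a reminder of why, here is a sketch of a list-based membership test (containsList is an illustrative name, not part of the INTSET signature); in the worst case it traverses the entire list:

```sml
(* containsList (n, s) is whether n is a member of the list s.
 * Worst case: n is absent and every element is examined, so O(n). *)
fun containsList (n: int, s: int list): bool =
  case s of
    [] => false
  | x :: rest => x = n orelse containsList (n, rest)
```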
So it's not very efficient. Is there a better way to implement a set than a sorted list? By better we mean as having asymptotically faster operations. Binary search trees are one approach.
  type value = int
  datatype btree =
      Empty
    | Node of {value: value, left: btree, right: btree}
  type set = btree
A binary search tree is a binary tree with the following rep invariant: for any node n, every node in n.left has a value less than that of n, and every node in n.right has a value greater than that of n.
Given such a tree, how do you perform a lookup operation? Start from the root, and at every node, if the value of the node is what you are looking for, you are done; otherwise, recursively lookup in the left or right subtree depending on the value stored at the node. In code:
  fun contains (n: int, t: btree): bool =
    case t of
      Empty => false
    | Node {value, left, right} =>
        (case Int.compare (value, n) of
           EQUAL => true
         | GREATER => contains (n, left)
         | LESS => contains (n, right))
Insertion is similar: you perform a lookup until you find the empty node that should contain the value. In code:
  fun insert (n: int, t: btree): btree =
    case t of
      Empty => Node {value=n, left=Empty, right=Empty}
    | Node {value, left, right} =>
        (case Int.compare (value, n) of
           EQUAL => t
         | GREATER => Node {value=value, left=insert (n, left), right=right}
         | LESS => Node {value=value, left=left, right=insert (n, right)})
What is the running time of those operations? Since insert is just a lookup with an extra node creation, we focus on the lookup operation. An analysis of the code shows that lookup is O(height of the tree). What's the worst-case height of a tree? A tree of n nodes can have all of them in a single long branch (imagine inserting the numbers 1,2,3,4,5,6,7 in order into a binary search tree), giving height n. So the worst-case running time of lookup is still O(n) (for n the number of nodes in the tree).
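For instance (assuming the insert function above; skewed is just an illustrative binding), inserting 1 through 7 in ascending order produces a tree in which every left child is Empty:

```sml
(* Each successive value is larger than all previous ones, so every
 * insertion walks to the bottom of the right spine: the result is a
 * single right-leaning branch of height 7. *)
val skewed = List.foldl insert Empty [1, 2, 3, 4, 5, 6, 7]
```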
What is a good shape for a tree that would allow for fast lookup? A balanced, "bushy" tree; for example:
                 50                ^
              /      \             |
            25        75           |
           /  \      /  \          |  height = 4
         10    30  60    90        |
        / \   / \  / \   / \       |
       4  12 27 40 55 65 80 99     v
If a tree with n nodes is kept balanced, its height is O(lg n), which leads to a lookup operation running in time O(lg n).
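This claim is easy to test empirically with a small helper (height is a name we introduce here, not part of the signature):

```sml
(* height(t) is the number of nodes on the longest path from the root
 * of t down to an Empty node: 0 for Empty, else 1 + the taller subtree. *)
fun height (t: btree): int =
  case t of
    Empty => 0
  | Node {left, right, ...} => 1 + Int.max (height left, height right)
```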
How can we keep a tree balanced? Many techniques involve inserting an element just like in a normal binary search tree, followed by some kind of tree surgery to rebalance the tree. Red-black trees are one such technique.
The idea is to strengthen the rep invariants of the binary search tree so that trees are always approximately balanced. To help enforce the invariants, we color each node of the tree either red or black:
  datatype color = Red | Black
  datatype rbtree =
      Empty
    | Node of {color: color, value: int, left: rbtree, right: rbtree}
  type set = rbtree
Here are the new conditions we add to the binary search tree rep invariant:

1. No red node has a red child; that is, both children of every red node are black.
2. Every path from the root to an empty node contains the same number of black nodes.
Note that empty nodes are considered always to be black. If a tree satisfies these two conditions, it must also be the case that every subtree of the tree also satisfies the conditions. If a subtree violated either of the conditions, the whole tree would also.
With these invariants, the longest possible path from the root to an empty node would alternately contain red and black nodes; therefore it is at most twice as long as the shortest possible path, which only contains black nodes. If n is the number of nodes in the tree, the longest possible path has length 2 lg n, which is O(lg n). Therefore, the tree has height O(lg n) and the operations are all asymptotically efficient.
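As a sanity check on invariant #2, one can compute the "black height" of a tree (blackHeight is a hypothetical helper, not part of the signature):

```sml
(* blackHeight(t) is SOME h if every root-to-Empty path in t contains
 * exactly h black nodes (so invariant #2 holds), and NONE otherwise. *)
fun blackHeight (t: rbtree): int option =
  case t of
    Empty => SOME 0
  | Node {color, left, right, ...} =>
      (case (blackHeight left, blackHeight right) of
         (SOME hl, SOME hr) =>
           if hl = hr
           then SOME (hl + (case color of Black => 1 | Red => 0))
           else NONE
       | _ => NONE)
```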
How do we check for membership in red-black trees? The same way as for general binary trees:
  fun contains (n: int, t: rbtree): bool =
    case t of
      Empty => false
    | Node {color, value, left, right} =>
        (case Int.compare (value, n) of
           EQUAL => true
         | GREATER => contains (n, left)
         | LESS => contains (n, right))
More interesting is the insert operation. We proceed as we said we would: we insert at the empty node that a standard insertion into a binary search tree indicates. We also color the inserted node red to ensure that invariant #2 is preserved. However, we may destroy invariant #1 in doing so, by producing two red nodes, one the parent of the other. The next figure shows all the possible cases that may arise:
        1               2               3               4

        Bz              Bz              Bx              Bx
       /  \            /  \            /  \            /  \
      Ry   d          Rx   d          a    Rz         a    Ry
     /  \            /  \                 /  \            /  \
    Rx   c          a    Ry              Ry   d          b    Rz
   /  \                 /  \            /  \                 /  \
  a    b               b    c          b    c               c    d
Notice that in each of these trees, the values of the nodes in a,b,c,d must have the same relative ordering with respect to x, y, and z: a<x<b<y<c<z<d. Therefore, we can perform a local "tree rotation" to restore the invariant locally, while possibly breaking invariant 1 one level up in the tree:
        Ry
       /  \
     Bx    Bz
    /  \  /  \
   a    b c    d
By performing a rebalance of the tree at that level, and all the levels above, we can do tree surgery to locally enforce invariant #1. In the end, we may end up with two red nodes, one of them the root and the other the child of the root; this we can easily correct by coloring the root black. The SML code (which really shows the power of pattern matching!) is as follows:
  fun insert (n: int, t: rbtree): rbtree =
    let
      (* Definition: a tree t satisfies the "reconstruction invariant" if it is
       * black and satisfies the rep invariant, or if it is red and its children
       * satisfy the rep invariant. *)

      (* makeBlack(t) is a tree that satisfies the rep invariant.
       * Requires: t satisfies the reconstruction invariant.
       * Algorithm: Make a tree identical to t but with a black root. *)
      fun makeBlack (t: rbtree): rbtree =
        case t of
          Empty => Empty
        | Node {color, value, left, right} =>
            Node {color=Black, value=value, left=left, right=right}

      (* Construct the result of a red-black tree rotation. *)
      fun rotate (x: value, y: value, z: value,
                  a: rbtree, b: rbtree, c: rbtree, d: rbtree): rbtree =
        Node {color=Red, value=y,
              left= Node {color=Black, value=x, left=a, right=b},
              right=Node {color=Black, value=z, left=c, right=d}}

      (* balance(t) is a tree that satisfies the reconstruction invariant and
       * contains all the same values as t.
       * Requires: the children of t satisfy the reconstruction invariant. *)
      fun balance (t: rbtree): rbtree =
        case t of
          (*1*) Node {color=Black, value=z,
                      left=Node {color=Red, value=y,
                                 left=Node {color=Red, value=x, left=a, right=b},
                                 right=c},
                      right=d} => rotate (x, y, z, a, b, c, d)
        | (*2*) Node {color=Black, value=z,
                      left=Node {color=Red, value=x, left=a,
                                 right=Node {color=Red, value=y, left=b, right=c}},
                      right=d} => rotate (x, y, z, a, b, c, d)
        | (*3*) Node {color=Black, value=x, left=a,
                      right=Node {color=Red, value=z,
                                  left=Node {color=Red, value=y, left=b, right=c},
                                  right=d}} => rotate (x, y, z, a, b, c, d)
        | (*4*) Node {color=Black, value=x, left=a,
                      right=Node {color=Red, value=y, left=b,
                                  right=Node {color=Red, value=z, left=c, right=d}}}
                  => rotate (x, y, z, a, b, c, d)
        | _ => t

      (* Insert n into t, returning a tree that satisfies the reconstruction
       * invariant. *)
      fun walk (t: rbtree): rbtree =
        case t of
          Empty => Node {color=Red, value=n, left=Empty, right=Empty}
        | Node {color, value, left, right} =>
            (case Int.compare (value, n) of
               EQUAL => t
             | GREATER => balance (Node {color=color, value=value,
                                         left=walk (left), right=right})
             | LESS => balance (Node {color=color, value=value,
                                      left=left, right=walk (right)}))
    in
      makeBlack (walk (t))
    end
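Putting it together, the same adversarial insertion order that produced a height-7 branch in the plain binary search tree now yields a balanced tree (s is just an illustrative binding):

```sml
(* Rotations fire inside balance as the walk returns, so the resulting
 * tree has height O(lg n) rather than n; contains (4, s) is true and
 * contains (8, s) is false, each found in logarithmically many steps. *)
val s = List.foldl insert Empty [1, 2, 3, 4, 5, 6, 7]
```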
This code walks back up the tree from the point of insertion, fixing the invariants at every level. At red nodes we don't try to fix the invariant; we let the recursive walk go back up until a black node is found. When the walk reaches the top, the color of the root node is restored to black, which is needed if balance rotates the root.
Deletion of elements from a red-black tree is also possible, but requires the consideration of many more cases.
An important property of any balanced search tree, red-black trees included, is that it can be used to implement an ordered set easily. This is a set that keeps its elements in some sorted order. Ordered sets generally provide operations for finding the minimum and maximum elements of the set, for iterating over all the elements between two elements, and for extracting the ordered subset of elements lying within a given range.
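For example, the minimum of a nonempty ordered set lives at the leftmost node of the search tree, reachable in time proportional to the height, i.e., O(lg n) (minimum is a hypothetical helper, not part of the signature):

```sml
(* minimum(t) is SOME of the smallest value in t, or NONE if t is empty.
 * By the BST rep invariant, the smallest value is at the leftmost node. *)
fun minimum (t: rbtree): int option =
  case t of
    Empty => NONE
  | Node {value, left=Empty, ...} => SOME value
  | Node {left, ...} => minimum left
```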
CS312 © 2002 Cornell University Computer Science