Yesterday you saw red-black trees as an example of a balanced binary search tree. A balanced tree, as you recall, is a tree that tries to approximate the shape of a complete tree. A complete tree of n nodes has depth at most log n, so the basic tree operations (find, insert, and delete) take only O(log n) time rather than the potentially linear time required for an unbalanced tree. To review:
These requirements force the longest root-leaf path to be no more than twice the length of the shortest such path, since the shortest possible path is a string of black nodes, and the longest possible path with the same number of black nodes uses alternating red nodes as extensions. But how does this guarantee that we can perform tree operations in O(log n) time? We'd like to prove that the height of a tree of n nodes is at most 2 log(n+1), which we can do by induction.
First define the black-height of a node x (denoted bh(x)) to be the number of black nodes on any path from, but not including, x to a black leaf. Just as the height of a tree is the maximum number of edges in any root-leaf path, you can think of the black height as the number of edges to black children in any root-leaf path.
We want to show that the subtree rooted at any node x contains at least 2^(bh(x)) - 1 internal (non-leaf) nodes. We'll do this by induction on the height of x.
If the height of x is 0, then x is a black leaf and must have black-height 0. 2^(bh(x)) - 1 = 0. The subtree rooted at x has at least 0 internal nodes, so we've proven the base case.
For the inductive step, consider a node x with positive height. (So x is an internal node.) As we said above, each child has black-height bh(x) or bh(x-1), depending on the child's color. Since the height of a child of x is less than the height of x, the inductive hypothesis says that each child has at least 2^(bh(x)-1) - 1 internal nodes, so the subtree rooted at x has at least 2 * (2^(bh(x)-1) - 1) + 1 = 2^(bh(x)) - 1 internal nodes. This proves the claim.
We have shown that the subtree rooted at any node x contains at least 2^(bh(x)) - 1 internal nodes. Now let h be the height of a tree. The black-height of the root must be at least h/2 (why?). Applying our result to the root, we see that n >= 2^(h/2) - 1. Algebraic manipulation yields h <= 2 log (n+1), which is what we wanted.
Thus the height of a red-black tree containing n nodes is O(log n), so we can find an arbitrary element in O(log n) time.
Red-black trees are only one of several types of balanced search trees. Another example is the 2-3 tree, which works by requiring every root-leaf path to have the same length, but some nodes may have three subtrees (and two values) rather than two subtrees (and one value). More general forms of the same idea are the 2-3-4 tree and the B-tree. Today we're going to show how we can implement 2-3 trees in ML. Just as we did yesterday, we're going to assume that the values in the tree are integers (extending this to polymorphic trees is trivial), and we're only going to implement insertion. (Finding an element is obvious, and deletion is very messy.)
As usual, when we do an insertion, we recursively search the tree until we find the location where the new value belongs, which will always be at a leaf node. Insertion into an empty tree is easy -- just create a node containing the one value. And insertion into a small node is easy -- promote it to a big node. The hard part is insertion into a big node. This requires us to split the node into two small nodes and create a new small parent for them. But this increases the depth of the tree at this point, so we have to "bubble up" the new parent until either it is absorbed by another small node or it gets to the root and we have to increase the depth of the tree.
For example, consider this tree:
9,19 / | \ 3 13 25,29 / \ / \ / | \ 1 5,7 11,12 15,17 21,23 26,27 33,35
To insert the number 2, we simply traverse the tree until we reach the node where it belongs. Since this node is small, we give it a second value and we're done.
Alternatively, to insert the number 14, we don't have room in the [15,17] node to add it, so we locally restructure it as
15 / \ 14 17
and then we have to bubble the 15 up to the next level, where it is absorbed into the small 13 node, producing
13,15 / | \ 11,12 14 17
But to insert the number 28, we have to keep bubbling until we reach the root. The result is that the new root will contain only 19, and its children will have values 9 and 27.
To implement this idea of bubbling up an extra value until a small node can absorb it, we'll define the datatype of 2-3 trees with an extra data constructor:
datatype tree23 = E | Tr2 of tree23 * int * tree23 | Tr3 of tree23 * int * tree23 * int * tree23 | Put of tree23 * int * tree23 exception BadTree
The exception is available if we find an ill-formed tree. Because we have nodes with two keys now, ordinary comparison functions (returning an order) are insufficient, so we'll create a new datatype that we can use for this:
datatype interval_order = BELOW | BETWEEN | ABOVE | EQ
Assume we have a function compare2 of type int*int -> int ->interval_order for comparing an integer with an interval like this. Remember that a Put represents two subtrees and a value that needs to be absorbed or bubbled up. We'll write a function tr2 that takes a small node (the same type as a parameter to the Tr2 data constructor) and absorbs any Put subtree:
fun tr2(t:tree23*int*tree23):tree23 = case t of (Put(t1,a,t2),b,t3) => Tr3(t1,a,t2,b,t3) | (t1,a,Put(t2,b,t3)) => Tr3(t1,a,t2,b,t3) | _ => Tr2(t)
Similarly, we can write a function tr3 that bubbles up Put subtrees:
fun tr3(t:tree23*int*tree23*int*tree23):tree23 = case t of (Put(t1,a,t2),b,t3,c,t4) => Put(Tr2(t1,a,t2),b,Tr2(t3,c,t4)) | (t1,a,Put(t2,b,t3),c,t4) => Put(Tr2(t1,a,t2),b,Tr2(t3,c,t4)) | (t1,a,t2,b,Put(t3,c,t4)) => Put(Tr2(t1,a,t2),b,Tr2(t3,c,t4)) | _ => Tr3(t)
Now we'll write a helper function put that adds a value to a tree. It's almost what we want insert to be.
fun put (t:tree23) (n:int):tree23 = case t of E => Put(E,n,E) | Tr2(t1,a,t2) => (case Int.compare(n,a) of EQUAL => t | LESS => tr2(put t1 n, a, t2) | GREATER => tr2(t1, a, put t2 n)) | Tr3(t1,a,t2,b,t3) => (case compare2 (a,b) n of EQ => t | BELOW => tr3(put t1 n, a, t2, b, t3) | BETWEEN => tr3(t1, a, put t2 n, b, t3) | ABOVE => tr3(t1, a, t2, b, put t3 n)) | _ => raise BadTree
The only problem is that put may propagate a Put node to the root of the tree. We'll need to change that to an ordinary Tr2 node; this is how we increase the depth of the tree.
fun insert (t:tree23) (n:int):tree23 = let val t' = put t n in case t' of Put(t1,a,t2) => Tr2(t1,a,t2) | _ => t' end
Cormen, Leiserson, and Rivest. Introduction to Algorithms.
Reade, Chris M.P. "Balanced trees with removals: an exercise in rewriting and proof." Science of Computer Programming, 18(2):181-204, April 1992.
The example 2-3 tree used is from some course notes copyright © 1997, Martin Beaugrand, Tomozumi Kanayama, Marc-Andre Sauve, and Marc van Kralingen.