Lecture 12: Imperative Data Structures: Disjoint Sets

Given a set of elements S, a partition of S is a set of nonempty subsets of S such that every element of S is in exactly one of the subsets. In other words, the subsets making up the partition are pairwise disjoint, and together contain all the elements of S (cover S). A disjoint set data structure is an efficient way of keeping track of such a partition. We are interested in two operations on disjoint sets:

  1. union - merge two sets of the partition into one, changing the partition structure
  2. find - determine which set of the partition contains a given element e, returning a canonical element of that set

Sometimes a disjoint set is also referred to as a union-find data structure because it supports these two operations. In addition, the create operation makes a partition where each element e is in its own set (all the subsets in the partition are singletons).

Efficient implementations of the union and find operations make use of the ability to change the values of variables; thus we use the refs and arrays introduced in recitation.

Disjoint sets are commonly used in graph algorithms. For instance, consider the problem of finding the connected components in an undirected graph (sets of nodes that are reachable from one another by some path). The following algorithm will label all the nodes in each component with the same identifier and nodes in different components with different identifiers:

  1. Create a new partition with one element corresponding to each node v in the graph.
  2. For each edge (u,v) in the graph, call the union operation with u and v.
  3. For each vertex v in the graph, call the find operation, which returns the component label for that vertex.
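
The three steps above can be sketched using the plain parent-array representation developed below. The 5-node graph and its edge list here are an assumed example, not from the original:

```ocaml
(* find: walk parent pointers up to the root of e's tree *)
let rec find (s : int array) (e : int) : int =
  if s.(e) = e then e else find s s.(e)

(* union: make the root of e1's tree a child of the root of e2's tree *)
let union (s : int array) (e1 : int) (e2 : int) : unit =
  s.(find s e1) <- find s e2

(* connected components of a graph with n nodes and the given edges *)
let components (edges : (int * int) list) (n : int) : int array =
  let s = Array.init n (fun i -> i) in          (* step 1: all singletons   *)
  List.iter (fun (u, v) -> union s u v) edges;  (* step 2: union each edge  *)
  Array.init n (fun v -> find s v)              (* step 3: label each node  *)

(* example graph: edges (0,1), (1,2), (3,4) give components {0,1,2}, {3,4} *)
let labels = components [(0, 1); (1, 2); (3, 4)] 5
```

Nodes in the same component receive the same label (the canonical element returned by find), and nodes in different components receive different labels.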

Representing Forests as Arrays

A common way of representing a disjoint set is as a forest, or collection of trees, with one tree corresponding to each set of the partition. When the nodes of the trees are labeled by consecutive natural numbers 0 through n − 1, it is straightforward to implement a forest using an array of length n, where the array index corresponds to the node and the array entry at that index specifies the node's parent. The root of a tree specifies itself as parent.

For instance, the forest:


        0
      / | \                6
     /  |  \              / \
    1   2   3            7   8
           / \
          4   5

would be represented by the array


index:    0   1   2   3   4   5   6   7   8
        +---+---+---+---+---+---+---+---+---+
parent: | 0 | 0 | 0 | 0 | 3 | 3 | 6 | 6 | 6 |
        +---+---+---+---+---+---+---+---+---+

With this representation of a disjoint set, the universe is an array of ints, and a new partition is simply:


type universe = int array

let createUniverse (size : int) : universe =
  Array.init size (fun i -> i)

Using this representation, the find operation checks the specified index of the array. If the value is equal to the index, it returns the index; the root has been found. Otherwise, it recursively calls find with the value in the array, which is the index of the parent. This searches from a node to the root of its tree in the forest:


let rec find s e =
  let p = s.(e) in
  if p = e then e
  else find s p

The union operation finds the roots of the trees for each of the two elements, then assigns one of the two roots to have the other as parent:


let union s e1 e2 =
  let r1 = find s e1
  and r2 = find s e2 in
  s.(r1) <- r2

Thus union simply does two finds and a pointer update, so its asymptotic running time is the same as that of find. In the worst case, find can take O(n) time for an array of n elements, because the forest could consist of a single tree that is one long path containing all n elements. In that case, starting at the leaf, find would visit every element of the array before reaching the root. Thus, as with balanced binary tree schemes such as red-black trees, we need a balancing scheme to keep the height small, preferably at most logarithmic in the number of nodes.
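
A minimal sketch of this worst case, using the naive find and union above (the element count 8 is an arbitrary choice): repeatedly unioning element 0's set with the next singleton builds one long chain, so a find from element 0 walks the entire array.

```ocaml
let rec find s e = if s.(e) = e then e else find s s.(e)

let union s e1 e2 = s.(find s e1) <- find s e2

let n = 8
let s = Array.init n (fun i -> i)

(* union {0,...}'s current tree with each singleton in turn: every step
   makes the old root a child of the newly added element, so the tree
   degenerates into the chain 0 -> 1 -> 2 -> ... -> 7 *)
let () =
  for i = 1 to n - 1 do
    union s 0 i
  done

(* find s 0 now follows n - 1 parent pointers before reaching root 7 *)
let root = find s 0
```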

With this representation using parent pointers, trees are relatively easy to balance, because they are not necessarily binary trees; there is no bound on the branching factor, and we can exploit this fact to keep the height down. The key trick is to make the shorter tree a child of the root of the taller in the union operation. If tree t1 is strictly shorter than t2, and if t1 is made a child of the root of t2, then the overall height of the resulting tree does not change. If the two trees are the same height, then the height increases by 1, but this is the only way the height can increase. Rather than using the actual height of the trees, we use a quantity referred to as the rank, which is an upper bound on the height. This balancing scheme is known as union by rank.

Our data structure now needs to store a rank for each node in addition to a parent pointer.


type node = {mutable parent : int; mutable rank : int}
type universe = node array

let createUniverse size =
  Array.init size (fun i -> {parent = i; rank = 0})

Now union finds the roots of the trees for both elements as before, except now we may also need to adjust the ranks. If the two roots are the same, there is nothing to do. If they are different, then the one with smaller rank is made a child of the root of the one with larger rank. If the ranks are equal, it does not matter which one is made a child of the other, but the rank of the root is incremented by 1.


let union (s : universe) (e1 : int) (e2 : int) : unit =
  let r1 = find s e1
  and r2 = find s e2 in
  let n1 = s.(r1)
  and n2 = s.(r2) in
  if r1 <> r2 then
    if n1.rank < n2.rank then
      n1.parent <- r2
    else
      (n2.parent <- r1;
       if n1.rank = n2.rank then
         n1.rank <- n1.rank + 1)

This process for constructing trees results in ranks that are logarithmic in the number of nodes, thus the running time of union and find operations is O(log n) for n nodes. This follows from the fact that if a node has rank k, then the subtree rooted at that node has at least 2^k nodes. We can prove this by induction on rank. Base case: a node of rank 0 is the root of a subtree that contains at least itself, thus has size at least 2^0 = 1. Inductive step: we wish to show that the subtree rooted at a node of rank k + 1 has at least 2^(k+1) nodes. A node u can have rank k + 1 only if, at some point in the past, it had rank k and its tree was joined with another tree whose root also had rank k, and u became the root of the union of the two trees. By the induction hypothesis, each tree had size at least 2^k, so u is the root of a tree of size at least 2^k + 2^k = 2^(k+1).
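
A small sanity check of this bound, using the union-by-rank code above (the choice k = 4, i.e. 16 elements, is an assumed example): merging 2^k singletons in balanced rounds, pairs, then pairs of pairs, and so on, produces a single tree whose root has rank exactly k.

```ocaml
type node = {mutable parent : int; mutable rank : int}
type universe = node array

let rec find (s : universe) (e : int) : int =
  if s.(e).parent = e then e else find s s.(e).parent

let union (s : universe) (e1 : int) (e2 : int) : unit =
  let r1 = find s e1
  and r2 = find s e2 in
  if r1 <> r2 then
    if s.(r1).rank < s.(r2).rank then s.(r1).parent <- r2
    else
      (s.(r2).parent <- r1;
       if s.(r1).rank = s.(r2).rank then s.(r1).rank <- s.(r1).rank + 1)

let k = 4
let n = 1 lsl k  (* 2^k = 16 elements *)
let s = Array.init n (fun i -> {parent = i; rank = 0})

(* merge in rounds: unions of equal-rank roots, doubling set size each round *)
let () =
  let step = ref 1 in
  while !step < n do
    let i = ref 0 in
    while !i < n do
      union s !i (!i + !step);
      i := !i + 2 * !step
    done;
    step := 2 * !step
  done

(* each round bumps the surviving root's rank by 1, so the final rank is k *)
let root_rank = s.(find s 0).rank
```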

The find operation can also be improved using a technique known as path compression. After doing a find starting from a node u, we can retrace the path from u up to the root and change all the parent pointers along the way to point directly to the root. This will pay off in subsequent finds starting at any node along that path. In effect, this makes the tree flatter and bushier.


let rec find (s : universe) (e : int) : int =
  let n = s.(e) in
  if n.parent = e then e
  else
    (n.parent <- find s n.parent;
     n.parent)
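
A small illustration of the effect, using the node-record representation and the path-compressing find above (the 4-node chain is an assumed example): after a single find from the bottom of the chain, every node on the path points directly at the root.

```ocaml
type node = {mutable parent : int; mutable rank : int}
type universe = node array

let rec find (s : universe) (e : int) : int =
  let n = s.(e) in
  if n.parent = e then e
  else
    (n.parent <- find s n.parent;
     n.parent)

(* build the chain 0 -> 1 -> 2 -> 3 by hand, rooted at 3
   (ranks play no role here, so they are left at 0) *)
let s = Array.init 4 (fun i -> {parent = min (i + 1) 3; rank = 0})

let root = find s 0
(* the tree is now flat: nodes 0, 1, and 2 all point directly at 3 *)
```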

A more involved analysis can establish that with union by rank and path compression, any sequence of m union and find operations on a set with n elements takes at most O((m + n) log* n) steps. The function log* n is the inverse of the tower function 2^2^...^2, where the stack of 2's is of height n. The tower function grows extremely fast: with a stack of 2's of height 5, its value is a decimal number with 19729 digits. Its inverse, the function log* n, is the number of times you have to apply the log function to n before you get a number less than or equal to 1, and is no more than 5 for all practical values of n. Such more detailed analyses are covered in CS 4820.
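
The definition of log* can be sketched directly; this `log_star` helper is hypothetical, for illustration only, and not part of the data structure:

```ocaml
(* log* x: how many times must log2 be applied before the value
   drops to at most 1? *)
let rec log_star (x : float) : int =
  if x <= 1.0 then 0
  else 1 + log_star (Float.log2 x)

(* log* 2 = 1, log* 4 = 2, log* 16 = 3, log* 65536 = 4; a tower of
   height 5 (that is, 2^65536) is the smallest input giving 5 *)
```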