Given a set of elements S, a partition of S is a set of nonempty subsets of S such that every element of S is in exactly one of the subsets. In other words, the subsets making up the partition are pairwise disjoint, and together they contain all the elements of S (they cover S). A disjoint set data structure is an efficient way of keeping track of such a partition. We are interested in two operations on disjoint sets: union, which merges the sets containing two given elements into one, and find, which returns a canonical representative of the set containing a given element.
Sometimes a disjoint set is also referred to as a union-find data structure because it supports these two operations. In addition, the create operation makes a partition where each element e is in its own set (all the subsets in the partition are singletons).
Efficient implementations of the union and find operations make use of the ability to change the values of variables, thus we make use of refs and arrays introduced in recitation.
Disjoint sets are commonly used in graph algorithms. For instance, consider the problem of finding the connected components in an undirected graph (sets of nodes that are reachable from one another by some path). The following algorithm will label all the nodes in each component with the same identifier and nodes in different components with different identifiers:
For each edge (u, v) of the graph, perform a union operation with u and v. Then, for each vertex, perform a find operation, which returns the component label for that vertex.
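The two steps above can be sketched as follows. This is a minimal illustration, not the notes' own code: it uses the array-based union-find developed below, and the edge-list graph representation and the function name components are assumptions made for the example.

```ocaml
(* Array-based union-find, as developed later in these notes. *)
type universe = int array

let createUniverse (size : int) : universe = Array.init size (fun i -> i)

let rec find (s : universe) (e : int) : int =
  let p = s.(e) in
  if p = e then e else find s p

let union (s : universe) (e1 : int) (e2 : int) : unit =
  let r1 = find s e1 and r2 = find s e2 in
  s.(r1) <- r2

(* Label the components of an n-vertex graph given by a list of edges:
   first union the endpoints of every edge, then find a label per vertex. *)
let components (n : int) (edges : (int * int) list) : int array =
  let s = createUniverse n in
  List.iter (fun (u, v) -> union s u v) edges;
  Array.init n (fun v -> find s v)

(* A 5-vertex graph with components {0, 1, 2} and {3, 4}. *)
let labels = components 5 [(0, 1); (1, 2); (3, 4)]
```

Vertices in the same component receive the same label, and vertices in different components receive different labels.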
A common way of representing a disjoint set is as a forest, or collection of trees, with one tree corresponding to each set of the partition. When the nodes of the trees are labeled by consecutive natural numbers 0 through n − 1, it is straightforward to implement a forest using an array of length n, where the array index corresponds to the node and the array entry at that index specifies the node's parent. The root of a tree specifies itself as parent.
For instance, the forest:
       0             6
     / | \          / \
    1  2  3        7   8
          / \
         4   5
would be represented by the array
+---+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 3 | 3 | 6 | 6 | 6 |
+---+---+---+---+---+---+---+---+---+
  0   1   2   3   4   5   6   7   8
With this representation of a disjoint set, a new partition is simply:
type universe = int array

let createUniverse (size : int) : universe =
  Array.init size (fun i -> i)
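As a quick sanity check (the example values here are mine), a freshly created partition is all singletons, so every node is its own parent:

```ocaml
type universe = int array

let createUniverse (size : int) : universe = Array.init size (fun i -> i)

(* Each node starts as its own parent, i.e. each node is a root. *)
let u = createUniverse 4
```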
Using this representation, the find
operation checks the
specified index of the array. If the value is equal to the index, it
returns the index; the root has been found. Otherwise, it
recursively calls find
with the value in
the array, which is the index of the parent.
This searches from a node to the root of its tree in the
forest:
let rec find s e =
  let p = s.(e) in
  if p = e then e else find s p
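A small usage sketch of find on the example forest pictured above (the array literal is copied from the figure):

```ocaml
type universe = int array

(* find from the notes: walk parent pointers up to the root. *)
let rec find (s : universe) (e : int) : int =
  let p = s.(e) in
  if p = e then e else find s p

(* The example forest: roots 0 and 6; node 3 is the parent of 4 and 5. *)
let s = [|0; 0; 0; 0; 3; 3; 6; 6; 6|]

(* Starting at 4, find follows 4 -> 3 -> 0 and returns the root 0. *)
let r = find s 4
```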
The union
operation finds the roots of the trees for each of the two
elements, then assigns one of the two roots to have the other as
parent:
let union s e1 e2 =
  let r1 = find s e1 and r2 = find s e2 in
  s.(r1) <- r2
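For instance (an example of mine, again using the forest from the figure), a union of any element of the first tree with any element of the second merges the two trees by a single pointer update at a root:

```ocaml
type universe = int array

let rec find (s : universe) (e : int) : int =
  let p = s.(e) in
  if p = e then e else find s p

let union (s : universe) (e1 : int) (e2 : int) : unit =
  let r1 = find s e1 and r2 = find s e2 in
  s.(r1) <- r2

let s = [|0; 0; 0; 0; 3; 3; 6; 6; 6|]

(* Merges the trees rooted at 0 and 6: root 0 gets parent 6. *)
let () = union s 4 7
```

Afterward every node of both original trees has the same root, 6.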
Thus union simply does two finds and a pointer update, so the asymptotic running time of union is the same as that of find.
In the worst case, find
can take O(n) time for an array
of n elements, because the forest could consist of a single tree with a
single path of depth n. In that case, starting at the leaf, find
would visit every element of the array before reaching the root.
Thus, as with the balanced binary tree schemes such as
red-black trees, we need a balancing scheme to keep the height small, preferably
at most logarithmic in the number of nodes.
With this representation using parent pointers, trees are relatively easy
to balance, because they are not necessarily binary trees; there is no bound on the
branching factor, and we can exploit this fact to keep the height down.
The key trick is to make the shorter tree a child of the root of the taller in the union
operation. If tree t1 is strictly shorter than t2,
and if t1 is made a child of the root of t2,
then the overall height of the resulting tree does not change. If the two trees are the same
height, then the height increases by 1, but this is the only way the height can increase.
Rather than using the actual height of the
trees, we use a quantity referred to as the rank, which is an
upper bound on the height. This balancing scheme is
known as union by rank.
Our data structure now needs to store a rank for each node in addition to a parent pointer.
type node = {mutable parent : int; mutable rank : int}
type universe = node array

let createUniverse size = Array.init size (fun i -> {parent = i; rank = 0})
Now union
finds the roots of the trees for
both elements as before, except now we may also need to adjust the ranks.
If the two roots are the same, there is nothing
to do. If they are different, then the one with smaller rank is made
a child of the root of the one with larger rank. If the ranks are equal,
it does not matter which one is made a child of the other, but the
rank of the root is incremented by 1.
let union (s : universe) (e1 : int) (e2 : int) : unit =
  let r1 = find s e1 and r2 = find s e2 in
  let n1 = s.(r1) and n2 = s.(r2) in
  if r1 <> r2 then
    if n1.rank < n2.rank then n1.parent <- r2
    else begin
      n2.parent <- r1;
      if n1.rank = n2.rank then n1.rank <- n1.rank + 1
    end
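A small worked example of union by rank (the sequence of unions is mine, and the plain, non-compressing find over the record representation is a stand-in until the path-compression version below):

```ocaml
type node = {mutable parent : int; mutable rank : int}
type universe = node array

let createUniverse size = Array.init size (fun i -> {parent = i; rank = 0})

(* A plain find for the record representation; path compression comes later. *)
let rec find (s : universe) (e : int) : int =
  let p = s.(e).parent in
  if p = e then e else find s p

let union (s : universe) (e1 : int) (e2 : int) : unit =
  let r1 = find s e1 and r2 = find s e2 in
  let n1 = s.(r1) and n2 = s.(r2) in
  if r1 <> r2 then
    if n1.rank < n2.rank then n1.parent <- r2
    else begin
      n2.parent <- r1;
      if n1.rank = n2.rank then n1.rank <- n1.rank + 1
    end

let s = createUniverse 4
let () = union s 0 1   (* equal ranks: 1 joins 0, and 0's rank becomes 1 *)
let () = union s 2 3   (* likewise: 3 joins 2, and 2's rank becomes 1 *)
let () = union s 1 3   (* roots 0 and 2, equal rank 1: 0's rank becomes 2 *)
```

Note that the rank only increases when two roots of equal rank are joined, exactly the case discussed above.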
This process for constructing trees results in ranks that are
logarithmic in the number of nodes, thus the running time of union
and find
operations is O(log n) for n nodes.
This follows from the fact that if a node has rank r, then the subtree
rooted at that node has at least 2^r nodes. We can prove this
by induction on rank. Base case: a node of
rank 0 is the root of a subtree that contains at least itself, thus
has size at least 2^0 = 1. Inductive step: we wish to show that a node of
rank k + 1 is the root of a subtree with at least 2^(k+1) nodes.
A node u can have rank k + 1 only if, at some
point in the past, it had rank k and it was joined
with another tree whose root also had rank k, and u became the root
of the union of the two trees. By the induction hypothesis, each tree
had size at least 2^k, so u is the root of a tree of size at least
2^k + 2^k = 2^(k+1).
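The invariant in this argument can also be checked empirically. The sketch below (the helper tree_size and the particular sequence of unions are mine) builds a universe with union by rank and verifies that every root of rank r heads a tree of at least 2^r nodes:

```ocaml
type node = {mutable parent : int; mutable rank : int}
type universe = node array

let createUniverse size = Array.init size (fun i -> {parent = i; rank = 0})

let rec find (s : universe) (e : int) : int =
  let p = s.(e).parent in
  if p = e then e else find s p

let union (s : universe) (e1 : int) (e2 : int) : unit =
  let r1 = find s e1 and r2 = find s e2 in
  let n1 = s.(r1) and n2 = s.(r2) in
  if r1 <> r2 then
    if n1.rank < n2.rank then n1.parent <- r2
    else begin
      n2.parent <- r1;
      if n1.rank = n2.rank then n1.rank <- n1.rank + 1
    end

(* Number of nodes whose root is r, i.e. the size of r's tree. *)
let tree_size (s : universe) (r : int) : int =
  let count = ref 0 in
  Array.iteri (fun i _ -> if find s i = r then incr count) s;
  !count

let s = createUniverse 8
let () =
  List.iter (fun (a, b) -> union s a b)
    [(0, 1); (2, 3); (0, 2); (4, 5); (6, 7); (4, 6); (0, 4)]

(* Every root's tree has at least 2^rank nodes. *)
let ok =
  Array.to_list (Array.init 8 (fun i -> i))
  |> List.for_all (fun i ->
       s.(i).parent <> i || tree_size s i >= 1 lsl s.(i).rank)
```

With this sequence of unions, all eight nodes end up in one tree whose root has rank 3 and size 8 = 2^3, the tight case of the bound.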
The find
operation can also be improved using a technique known
as path compression. After doing a find
starting from a node u,
we can retrace the path from u up to the root and change all the parent pointers
along the way to point directly to the root. This will pay off in subsequent finds starting at any node along that path. In effect, this makes the tree flatter and bushier.
let rec find (s : universe) (e : int) : int =
  let n = s.(e) in
  if n.parent = e then e
  else begin
    n.parent <- find s n.parent;
    n.parent
  end
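A small demonstration of the flattening effect (the hand-built chain is mine; ranks play no role here, so they are all left at 0):

```ocaml
type node = {mutable parent : int; mutable rank : int}
type universe = node array

(* find with path compression, as in the notes. *)
let rec find (s : universe) (e : int) : int =
  let n = s.(e) in
  if n.parent = e then e
  else begin
    n.parent <- find s n.parent;
    n.parent
  end

(* A chain 3 -> 2 -> 1 -> 0, with 0 as root. *)
let s = Array.init 4 (fun i -> {parent = max 0 (i - 1); rank = 0})

(* One find from the leaf repoints every node on the path at the root. *)
let r = find s 3
```

After this single call, nodes 1, 2, and 3 all point directly at 0, so any later find from them takes one step.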
A more involved analysis can establish that with union by rank and path compression, any sequence of m union and find operations on a set with n elements takes at most O((m + n) log* n) steps. The function log* n is the inverse of the tower function 2^2^...^2, where the stack of 2's is of height n. The tower function is extremely fast-growing: with a stack of 2's of height 5, the value is 2^65536, a decimal number with 19729 digits. Its inverse, the function log* n, is the number of times you have to apply the log (base 2) function to n before you get a number less than or equal to 1, and it is no more than 5 for all practical values of n. Such more detailed analyses are covered in CS 4820.
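To make the iterated-logarithm definition concrete, here is an illustrative integer version of log* (both log2 and log_star are helper names of mine, not library functions; log2 takes the floor of the base-2 log, which is enough to show how flat log* is):

```ocaml
(* Floor of the base-2 logarithm, by repeated halving. *)
let rec log2 (n : int) : int =
  if n <= 1 then 0 else 1 + log2 (n / 2)

(* log* n: how many times log2 must be applied before reaching 1 or less. *)
let rec log_star (n : int) : int =
  if n <= 1 then 0 else 1 + log_star (log2 n)

(* 65536 -> 16 -> 4 -> 2 -> 1, so log* 65536 = 4. *)
let v = log_star 65536
```

Even for 65536 = 2^16 the answer is only 4, and it is 5 for everything up to the 19729-digit tower value mentioned above.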