

Dataflow analysis

Control flow graphs for program analysis

We have seen that we can easily convert the lowered IR representation into a control flow graph (CFG) in which each node n has one of the following contents:

Node contents                      Lowered IR equivalent
An assignment x ← e                MOVE(TEMP(x), e)
A memory store [e1] ← e2           MOVE(MEM(e1), e2)
A conditional branch if e          CJUMP(e, ℓ1, ℓ2)
A start node start
A return node return e1, …, em

Expressions e can be any of the following, except that function calls can appear only at top level:

e  ::= e1 OP e2 | x | f(e1, …, en) | [e]
OP ::= + | − | × | / | mod | lshift | rshift | …

Other nodes such as LABEL and JUMP are represented by the graph structure. (Assuming that we can identify all the possible values of e as nodes in the CFG, we can treat a node JUMP(e) where e is a complex expression as a fancy kind of if node in which there are any number of exit edges.) It is handy for describing some analyses to have a distinguished start node, though often it is omitted when drawing CFGs.

A node has zero or more entry edges and either one or two exit edges; only an if node has two exit edges.

Program CFGs often have linear sequences of several nodes where each node except the first has a single entry edge, and each node except the last has a single exit edge. Such a sequence of nodes is a basic block. It is possible to do program analysis over a CFG of basic blocks instead of over individual nodes. This complicates the analysis but gives the same result; for some analyses, working at the granularity of basic blocks speeds up the analysis somewhat.
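
To make this concrete, here is a minimal Python sketch of grouping nodes into basic blocks. The dict-based CFG encoding (succ and pred map each node to its sets of successor and predecessor nodes) is an assumption made for illustration, not something prescribed by these notes.

    def basic_blocks(nodes, succ, pred):
        """Group CFG nodes into maximal straight-line sequences."""
        def is_leader(n):
            # n can be absorbed into a predecessor's block only if it has a
            # unique predecessor that has a unique successor (namely n)
            if len(pred[n]) != 1:
                return True
            (p,) = pred[n]
            return len(succ[p]) != 1

        blocks = []
        for n in nodes:
            if not is_leader(n):
                continue              # n will appear in its leader's block
            block, cur = [n], n
            while len(succ[cur]) == 1:
                (s,) = succ[cur]
                if is_leader(s):
                    break
                block.append(s)
                cur = s
            blocks.append(block)
        return blocks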

Dataflow equations

A variable is live on exit from a node if it is live on entry to any successor node n′. And it is live on entry to a node if it is used by the node, or if it is live on exit from the node and the node does not redefine the variable. These observations are captured by the following dataflow equations. Note that to denote that node n′ is a successor of n, we write either n′ ∈ succ(n) or n → n′.

in[n]  = use[n] ∪ (out[n] − def[n])
out[n] = ⋃ { in[n′] | n → n′ }

The functions use[n] and def[n] are defined according to the following table, where vars(e) refers to the set of variables used in the expression e:

n                   use[n]                      def[n]
x ← e               vars(e)                     {x}
[e1] ← e2           vars(e1) ∪ vars(e2)         ∅
if e                vars(e)                     ∅
start               ∅                           ∅
return e1, …, em    vars(e1) ∪ … ∪ vars(em)     ∅
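
To make the table concrete, here is a minimal Python sketch computing use[n] and def[n]. The tuple encodings of nodes and expressions are hypothetical, chosen only for this illustration.

    # Assumed encodings (illustrative only):
    #   expressions: ('var', x), ('const', k), ('op', op, e1, e2),
    #                ('call', f, [e1, ...]), ('mem', e)
    #   nodes:       ('assign', x, e), ('store', e1, e2), ('if', e),
    #                ('start',), ('return', [e1, ...])

    def vars_of(e):
        """vars(e): the set of variables used in expression e."""
        kind = e[0]
        if kind == 'var':
            return {e[1]}
        if kind == 'const':
            return set()
        if kind == 'op':
            return vars_of(e[2]) | vars_of(e[3])
        if kind == 'call':
            return set().union(*map(vars_of, e[2]))
        if kind == 'mem':
            return vars_of(e[1])
        raise ValueError(f'unknown expression: {kind}')

    def use_def(n):
        """Return (use[n], def[n]) following the table above."""
        kind = n[0]
        if kind == 'assign':                               # x ← e
            return vars_of(n[2]), {n[1]}
        if kind == 'store':                                # [e1] ← e2
            return vars_of(n[1]) | vars_of(n[2]), set()
        if kind == 'if':                                   # if e
            return vars_of(n[1]), set()
        if kind == 'start':
            return set(), set()
        if kind == 'return':                               # return e1, …, em
            return set().union(*map(vars_of, n[1])), set()
        raise ValueError(f'unknown node: {kind}')

For example, use_def(('assign', 'z', ('op', '+', ('var', 'x'), ('const', 1)))) returns ({'x'}, {'z'}).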

Worklist algorithm

How do we solve the dataflow equations? A first step is to substitute the definition of out[n] into that of in[n], eliminating half the equations:

in[n] = use[n] ∪ ((⋃ { in[n′] | n → n′ }) − def[n])

If we want to use this dataflow equation to define the value of in[n], it is clear that the information is flowing backward along the arrows in the CFG. For this reason, we say that live variable analysis is a backward analysis.

The usual way to solve dataflow equations is the worklist algorithm, which uses a FIFO queue called the worklist to keep track of nodes whose equations might not be satisfied at any given step. The algorithm is as follows for a backward analysis such as live variable analysis:

  1. Initialize the worklist to contain all nodes.
  2. Initialize the value of in[n] to some initial value (for live variable analysis, ∅).
  3. While the worklist contains some node n:
    • Remove n from the worklist.
    • Set the value of in[n] using the dataflow equation. For live variable analysis: in[n] ← use[n] ∪ ((⋃ { in[n′] | n → n′ }) − def[n])
    • If in[n] changed, push all predecessors of n, whose equations might have been invalidated, onto the worklist if they are not already there.

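Here is a minimal sketch of this worklist loop in Python, under an assumed dict-based CFG representation (succ, pred, use, and defs each map a node to a set); the representation is illustrative, not prescribed by the notes.

    from collections import deque

    def live_variables(nodes, succ, pred, use, defs):
        """Iterate until in[n] = use[n] ∪ (out[n] − def[n]) holds everywhere."""
        live_in = {n: set() for n in nodes}     # initialize every in[n] to ∅
        worklist = deque(nodes)                 # FIFO queue of all nodes
        pending = set(nodes)                    # nodes currently on the worklist
        while worklist:
            n = worklist.popleft()
            pending.discard(n)
            # out[n] = union of in[n′] over all successors n′
            out_n = set().union(*(live_in[s] for s in succ[n]))
            new_in = use[n] | (out_n - defs[n])
            if new_in != live_in[n]:
                live_in[n] = new_in
                # predecessors' equations may now be unsatisfied
                for p in pred[n]:
                    if p not in pending:
                        worklist.append(p)
                        pending.add(p)
        return live_in
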
Now let's look at why this algorithm works. Each node has its own dataflow equation, so for N total nodes, there are N dataflow equations. The worklist algorithm simply chooses to apply these equations iteratively in a particular order to update the value of in[n], until convergence. What we haven't explained is why this order of application ends in the correct result.

Monotonicity

The dataflow equation for a given node n has an interesting monotonicity property: adding more elements to in[n′] for its successor nodes n′ can only add elements to (or have no effect on) the value of in[n] given by the equation. Think about an iteration of the algorithm that updates in[n]. There was some previous iteration that set in[n] (which might be the original initialization to ∅). Suppose all changes that have occurred since then to the values in[n′] of the successor nodes n′ have been to add elements. In that case this update, if it has any effect, must also add elements.

Clearly the very first iteration of the worklist algorithm can only add elements (or have no effect), since every in[n] starts out as ∅. Therefore the second iteration of the algorithm can also only add elements, since all prior iterations have only added elements. Inductively, we can see that every iteration of the worklist algorithm, when using the live variable analysis dataflow equation, can only add elements.

Therefore a given set in[n] only increases in size during the execution of the worklist algorithm.

Complexity

There is a finite number of variables (call it V), and the set in[n] can only grow, so the maximum number of times it can change during execution of the algorithm is V. How many times can the main loop of the algorithm execute? Once for each time a node is pushed onto the worklist. How many times can a node be pushed onto the worklist? Once at the beginning, and once for each change to one of its successor nodes. Since a node has at most two successors, this means at most 2V+1 pushes per node. Therefore the running time is O(VN), or O(N²). With some reasonable assumptions about the structure of control flow graphs for real programs, this can be lowered to O(dN), where d corresponds to the loop nesting depth of the code and is typically no more than 4 or so.

Correctness

We've just seen that the algorithm must terminate. If it does terminate, we would like to know that all the dataflow equations are satisfied. To see this, notice that the worklist algorithm maintains a loop invariant: every node that is not on the worklist has its equation satisfied. This invariant holds at the beginning because all nodes are on the worklist. Each time a node n is removed from the worklist, its equation is made to hold by updating in[n]. And each time this is done, the nodes whose equations might have become unsatisfied (the predecessors of n) are pushed onto the worklist.

Therefore, when the worklist is empty, all dataflow equations are satisfied.

Available copies analysis

You may suspect that we can generalize live variable analysis into a framework for dataflow analysis in general. Before we try to do that, let's look at another dataflow analysis so we can identify what the two have in common. We will consider an analysis called available copies, which keeps track of variable copies. This is useful for the copy propagation optimization. It is really a special case of a more general analysis called available expressions, which we will see later.

Copy propagation

The idea of this optimization is to replace variables with other variables known to contain the same value, so that we aren't wasting registers on redundant information. For example, in the code below, the variables "x" and "y" hold the same value after the assignment. Assuming there are no assignments to either variable between that point and the assignment to "z", we can replace the use of "x" with "y", as shown in the second version. If that transformation means the variable "x" is no longer live, the assignment "x = y" becomes dead code and can be removed.

Before:

    x = y
    ...
    z = 2*x + 1

After:

    x = y
    ...
    z = 2*y + 1

Dataflow values

The information associated with each program point will be a set of equalities known to hold between different variables. Such a set might look like {x1 = y1, x2 = y2, …, xn = yn}, where each yi is a variable that was assigned its value earlier than the corresponding xi. We can use a set of equalities like this to determine whether two variables are definitely known to be equal, which is the information copy propagation needs. (A better but more complicated representation is to keep track of the equivalence classes of variables, which allows more equalities to be proved. This can be done by ensuring that none of the variables yi appears on the left-hand side of an equality; then each variable yi stands for the equivalence class of variables that are known to be copies, and path compression is used to support efficient updating and testing of equality. This approach is equivalent to value numbering.)

Dataflow equations

An equality only holds on entry to a node if it holds on exit from all predecessor nodes. Therefore, we have the following equation:

in[n] = ⋂ { out[n′] | n′ → n }

An equality holds on exit from a node n if either it is established by the node (we use gen[n] to represent the equalities introduced by node n), or it held on entry to the node and the node does not make it untrue (we use kill[n] to represent the equalities a node invalidates). The equation for out[n] is therefore:

out[n] = gen[n] ∪ (in[n] − kill[n])

The functions gen[·] and kill[·] are defined by the following table:

n                             gen[n]     kill[n]
x ← y                         {x = y}    x = z, z = x for all z
x ← e (e not a variable y)    ∅          x = z, z = x for all z
[e1] ← e2                     ∅          ∅
if e                          ∅          ∅
start                         ∅          all equalities
return e                      ∅          ∅
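
Continuing the hypothetical tuple encoding from the earlier use/def sketch, gen[n] and kill[n] might be computed as below, encoding each equality x = y as the pair (x, y); this representation is an illustrative assumption, not something the notes prescribe.

    def gen_kill(n, all_vars):
        """Return (gen[n], kill[n]) following the table above."""
        kind = n[0]
        if kind == 'assign':
            x, e = n[1], n[2]
            # an assignment to x kills every equality mentioning x
            kills = {(x, z) for z in all_vars} | {(z, x) for z in all_vars}
            if e[0] == 'var':                 # x ← y generates x = y
                return {(x, e[1])}, kills
            return set(), kills               # x ← e for non-variable e
        if kind == 'start':
            # nothing is available on entry: kill every possible equality
            return set(), {(a, b) for a in all_vars for b in all_vars}
        return set(), set()                   # store, if, and return nodes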

Worklist algorithm

We can see that available copies is a forward analysis because the value at entry to a node depends on the predecessor nodes. We update the worklist algorithm given earlier by using predecessors where it used successors, and vice versa, and by using out[n] where it used in[n]:
  1. Initialize the worklist to contain all nodes.
  2. Initialize the value of out[n] to some initial value (for available copies analysis, the set of all possible equalities).
  3. While the worklist contains some node n:
    • Remove n from the worklist.
    • Set the value of out[n] using the dataflow equations: in[n] ← ⋂ { out[n′] | n′ → n }, then out[n] ← gen[n] ∪ (in[n] − kill[n])
    • If out[n] changed, push the successors of n onto the worklist.

This analysis has an analogous monotonicity property to live variable analysis: here the sets out[n] can only shrink from their initial value. However, the space of possible values is larger, so the algorithm terminates but can take asymptotically longer.
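
A sketch of this forward loop, under the same assumed dict-based CFG encoding as the earlier backward sketch (gen and kill map each node to the sets from the table; top is the set of all possible equalities):

    from collections import deque

    def available_copies(nodes, succ, pred, gen, kill, top):
        """Iterate until out[n] = gen[n] ∪ (in[n] − kill[n]) holds everywhere."""
        out = {n: set(top) for n in nodes}    # initialize every out[n] to ⊤
        worklist = deque(nodes)
        pending = set(nodes)
        while worklist:
            n = worklist.popleft()
            pending.discard(n)
            # in[n] = intersection of out[n′] over predecessors (⊤ if none)
            preds = pred[n]
            in_n = set.intersection(*(out[p] for p in preds)) if preds else set(top)
            new_out = gen[n] | (in_n - kill[n])
            if new_out != out[n]:
                out[n] = new_out
                for s in succ[n]:             # successors may now be stale
                    if s not in pending:
                        worklist.append(s)
                        pending.add(s)
        return out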

Dataflow analysis framework

We can characterize both of these analyses within a common framework introduced by Gary Kildall (also known as the creator of the pioneering CP/M operating system for personal computers). A common dataflow framework has the advantage that it allows us to quickly decide whether a given analysis is guaranteed to complete, what its complexity is, and whether it always computes the best solution possible. In addition, a common framework for dataflow analysis enables us to implement a general algorithm for doing analyses in the compiler, rather than reimplementing each analysis from scratch.

Dataflow analysis components

A dataflow analysis framework has four components: the direction of analysis (forward or backward), the values being propagated, transfer functions for each of the nodes, and a meet operator.

Dataflow values

A key component of a dataflow analysis framework is the set of values that the analysis is computing on. In live variable analysis, the values were themselves sets of variables. In available copies, the values were sets of equalities. Let L be the set of all values that can be assigned to a program point. We will use ℓ to denote a single value contained in L.

What do dataflow values mean? We can usefully think of them as representing some proposition that must hold at the program point they are attached to. (We can also think of them as representing propositions that may hold at the program point, but this is just the same thing as saying the negation of the proposition must hold.) For example, in live variable analysis, the meaning of the set of variables attached to a program point is that all the variables in the set must be live at that point.

In the space of values there is one value that conveys the greatest possible information. We will denote this value by the symbol ⊤ (pronounced "top"). In live variable analysis, the greatest information is conveyed by the empty set ∅: it means no variables are live, and therefore that the entire program is dead code.

Transfer functions

The second part of a dataflow analysis is a set of transfer functions that describe how dataflow values are transformed by a node. For a forward analysis, the transfer function defines how out[n] depends on in[n]. For a backward analysis, it's the reverse.

Meet operator

The final part of a dataflow analysis is the meet operator, which defines how to combine values from multiple incoming edges. Suppose we have propositions corresponding to values ℓ1, ℓ2, and ℓ3 on various edges. At a common program point where all these edges meet, we don't know which edge we came in on, so at most we know the disjunction (or) of the three propositions. The meet operator ⊓ defines how to construct the value corresponding to this disjunction. For live variable analysis, the meet operator is union; for available copies analysis, it is intersection.

Summary

We've seen that a dataflow analysis framework can be characterized as a four-tuple (D, L, ⊓, F): the direction of analysis D, the space of values L, the meet operator ⊓, and the transfer functions Fn for the nodes n. We're not yet guaranteed that the worklist algorithm works, however.

With some reasonable conditions on L, Fn, and ⊓, the worklist algorithm is correct and efficient, and computes the best possible answer to the dataflow equations.
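
As an illustration of such a general implementation, here is a minimal sketch of one solver parameterized by the four components; the encoding (a dict of per-node transfer functions, a meet over a list of values, top as the most informative value) is an assumption made for this sketch, not a definitive design.

    from collections import deque

    def solve(nodes, succ, pred, transfer, meet, top, forward=True):
        """Generic worklist solver. transfer[n] maps the value flowing into
        node n to the value flowing out; meet combines the values arriving
        on incoming edges; top (⊤) is the most informative value."""
        into = pred if forward else succ      # edges a node's input comes from
        outof = succ if forward else pred     # edges its changes propagate to
        value = {n: top for n in nodes}       # out[n] if forward, in[n] if backward
        worklist, pending = deque(nodes), set(nodes)
        while worklist:
            n = worklist.popleft()
            pending.discard(n)
            inputs = [value[m] for m in into[n]]
            new = transfer[n](meet(inputs) if inputs else top)
            if new != value[n]:
                value[n] = new
                for m in outof[n]:
                    if m not in pending:
                        worklist.append(m)
                        pending.add(m)
        return value

    # Live variables:   forward=False, top=set(), meet=lambda vs: set().union(*vs),
    #                   transfer[n] behaves as: out ↦ use[n] | (out - defs[n])
    # Available copies: forward=True, top=all equalities,
    #                   meet=lambda vs: set.intersection(*vs),
    #                   transfer[n] behaves as: i ↦ gen[n] | (i - kill[n])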