Dataflow analysis
Control flow graphs for program analysis
We have seen that we can easily convert the lowered IR representation into a control flow graph (CFG) in which each node contains a single statement. The possible node contents and their lowered IR equivalents are:

| Node contents | Lowered IR equivalent |
|---|---|
| An assignment \(x = e\) | MOVE(TEMP(x), e) |
| A memory store \([e_1] = e_2\) | MOVE(MEM(\(e_1\)), \(e_2\)) |
| A conditional branch on \(x\) | CJUMP(\(x, \ell_1, \ell_2\)) |
| A start node | — |
| A return node | — |
Other nodes, such as unconditional jumps and labels, are not needed, because the flow of control they describe is captured by the edges of the graph. (One could also imagine a multiway-branch node in which there are any number of exit edges.) It is handy for describing some analyses to have a distinguished start node and a distinguished return node. A node has zero or more entry edges and either one or two exit edges; only a conditional branch node has more than one exit edge, as depicted in the figure below.
Program CFGs often have linear sequences of several nodes where each node except the first has a single entry edge, and each node except the last has a single exit edge. Such a sequence of nodes is a basic block. It is possible to do program analysis over a CFG of basic blocks instead of over individual nodes. This complicates the analysis but gives the same result; for some analyses, working at the granularity of basic blocks speeds things up somewhat.
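To make the graph structure concrete, here is a minimal sketch of a CFG representation in Python. The class and field names (`Node`, `succs`, `preds`, and the tuple encoding of statements) are illustrative choices, not part of any particular compiler:

```python
class Node:
    """One CFG node holding a single lowered-IR statement."""
    def __init__(self, stmt):
        self.stmt = stmt   # e.g. ("move", "x", "y") for x = y
        self.succs = []    # exit edges (at most two; only a branch has two)
        self.preds = []    # entry edges (any number)

def add_edge(src, dst):
    """Connect two nodes with a control flow edge."""
    src.succs.append(dst)
    dst.preds.append(src)

# A tiny CFG: start -> x = y -> z = 2*x + 1 -> return
start = Node(("start",))
n1 = Node(("move", "x", "y"))
n2 = Node(("binop", "z", "2*x+1"))
ret = Node(("return",))
add_edge(start, n1); add_edge(n1, n2); add_edge(n2, ret)
```

A basic-block CFG would simply let each node hold a list of statements instead of one.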
Dataflow equations
A variable is live on exit from a node if it is live on entry to any successor node. And it is live on entry to the node if it is used by the node, or if it is live on exit from the node and this node does not redefine the value of the variable. These observations are captured by the following dataflow equations:

\[
\mathit{out}(n) = \bigcup_{n' \in \mathit{succ}(n)} \mathit{in}(n')
\qquad
\mathit{in}(n) = \mathit{use}(n) \cup \bigl(\mathit{out}(n) \setminus \mathit{def}(n)\bigr)
\]

Note that to denote that node \(n'\) is a successor of node \(n\), we write \(n' \in \mathit{succ}(n)\). The functions \(\mathit{use}(n)\) and \(\mathit{def}(n)\) denote, respectively, the set of variables read by node \(n\) and the set of variables written by node \(n\).
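As a small worked example (mine, not from the notes above), consider a node \(n\) containing z = 2*x + 1 whose only successor \(n'\) has \(\mathit{in}(n') = \{x, z\}\):

\[
\mathit{use}(n) = \{x\}, \quad \mathit{def}(n) = \{z\}, \quad
\mathit{out}(n) = \mathit{in}(n') = \{x, z\}, \quad
\mathit{in}(n) = \{x\} \cup (\{x, z\} \setminus \{z\}) = \{x\}.
\]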
Worklist algorithm
How do we solve the dataflow equations? A first step is to substitute the definition of \(\mathit{out}(n)\) into the equation for \(\mathit{in}(n)\), giving a single equation per node:

\[
\mathit{in}(n) = \mathit{use}(n) \cup \Bigl(\bigl(\textstyle\bigcup_{n' \in \mathit{succ}(n)} \mathit{in}(n')\bigr) \setminus \mathit{def}(n)\Bigr)
\]

If we want to use this dataflow equation to define the value of \(\mathit{in}(n)\), there is a problem: because the CFG may contain cycles, the equations can be recursive, with \(\mathit{in}(n)\) depending (perhaps indirectly) on itself. Rather than solving the equations directly, we compute a solution iteratively.
The usual way to solve dataflow equations is the worklist algorithm, which uses a FIFO queue called the worklist to keep track of nodes whose equations might not be satisfied at any given step. The algorithm is as follows for a backward analysis such as live variable analysis:
- Initialize the worklist to contain all nodes.
- Initialize the value of \(\mathit{in}(n)\) for every node \(n\) to some initial value (for live variable analysis, \(\varnothing\)).
- While the worklist contains some node \(n\):
  - Remove \(n\) from the worklist.
  - Set the value of \(\mathit{in}(n)\) using the dataflow equation. For live variable analysis: \(\mathit{in}(n) = \mathit{use}(n) \cup \bigl(\bigl(\bigcup_{n' \in \mathit{succ}(n)} \mathit{in}(n')\bigr) \setminus \mathit{def}(n)\bigr)\).
  - If \(\mathit{in}(n)\) changed, push all predecessors of \(n\), whose equations might have been invalidated, onto the worklist if they are not already there.
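Here is a minimal runnable sketch of this backward worklist algorithm in Python. The CFG is represented as plain dictionaries; all names (`succ`, `use`, `defs`, and the example program) are illustrative assumptions, not from the notes:

```python
from collections import deque

def liveness(nodes, succ, use, defs):
    """Backward worklist algorithm for live variable analysis.

    nodes: list of node ids
    succ:  node id -> list of successor ids
    use:   node id -> set of variables read
    defs:  node id -> set of variables written
    Returns node id -> set of variables live on entry.
    """
    pred = {n: [] for n in nodes}          # derive predecessors
    for n in nodes:
        for s in succ[n]:
            pred[s].append(n)

    live_in = {n: set() for n in nodes}    # initial value: empty set
    worklist = deque(nodes)                # initially all nodes
    on_list = set(nodes)

    while worklist:
        n = worklist.popleft()
        on_list.discard(n)
        out_n = set().union(*(live_in[s] for s in succ[n])) if succ[n] else set()
        new_in = use[n] | (out_n - defs[n])
        if new_in != live_in[n]:
            live_in[n] = new_in
            # The predecessors' equations may now be violated.
            for p in pred[n]:
                if p not in on_list:
                    worklist.append(p)
                    on_list.add(p)
    return live_in

# Example: 1: x = y;  2: z = 2*x + 1;  3: return z
nodes = [1, 2, 3]
succ = {1: [2], 2: [3], 3: []}
use  = {1: {"y"}, 2: {"x"}, 3: {"z"}}
defs = {1: {"x"}, 2: {"z"}, 3: set()}
print(liveness(nodes, succ, use, defs))  # {1: {'y'}, 2: {'x'}, 3: {'z'}}
```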
Now let's look at why this algorithm works. Each node has its own dataflow equation, so for a CFG with \(N\) nodes the algorithm is solving a system of \(N\) simultaneous equations. Three questions arise: does the algorithm terminate, how long does it take, and when it stops, are all the equations actually satisfied?
Monotonicity
The dataflow equation for a given node \(n\) is monotonic: if any of the sets \(\mathit{in}(n')\) on its right-hand side grows, the set \(\mathit{in}(n)\) computed from them can only grow or stay the same; it can never shrink. Clearly the very first iteration of the worklist algorithm can only add elements (or have no effect), since every set \(\mathit{in}(n)\) starts out empty. By induction, every later update can also only add elements, because each update is a monotonic function of sets that have themselves only grown. Therefore a given set \(\mathit{in}(n)\) only grows as the algorithm runs.
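To spell out the monotonicity claim (a short derivation, not in the original notes): for fixed sets \(\mathit{use}(n)\) and \(\mathit{def}(n)\), the function \(f(S) = \mathit{use}(n) \cup (S \setminus \mathit{def}(n))\) satisfies

\[
S \subseteq S' \implies S \setminus \mathit{def}(n) \subseteq S' \setminus \mathit{def}(n)
\implies \mathit{use}(n) \cup (S \setminus \mathit{def}(n)) \subseteq \mathit{use}(n) \cup (S' \setminus \mathit{def}(n)),
\]

that is, \(f(S) \subseteq f(S')\). The union over successors is monotonic in each argument for the same reason.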
Complexity
There is a finite number of variables (call it \(v\)), so each set \(\mathit{in}(n)\) can grow at most \(v\) times. A node is pushed onto the worklist only at the start, or when the set of one of its successors actually grows, so only a bounded number of worklist iterations can occur: the algorithm must terminate.
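The bound can be made explicit (a sketch of the counting argument, using \(N\) for the number of nodes and \(E\) for the number of edges): each growth of \(\mathit{in}(n)\) pushes the predecessors of \(n\), so the total number of pushes is at most

\[
\underbrace{N}_{\text{initial pushes}} \;+\; \sum_{n} \underbrace{v}_{\text{growths of } \mathit{in}(n)} \cdot \underbrace{|\mathit{pred}(n)|}_{\text{pushes per growth}} \;=\; N + v \cdot E .
\]

Each iteration performs set operations whose cost depends on \(v\), so the total running time is polynomial in the size of the program.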
Correctness
We've just seen that the algorithm must terminate. If it does
terminate, we would like to know that all the dataflow equations are
satisfied. To see this, notice that the worklist algorithm maintains
a loop invariant: every node that is not on the worklist has its equation
satisfied. Clearly this invariant holds at the beginning because all
nodes are on the worklist. Each time that a node
Therefore, when the worklist is empty, all dataflow equations are satisfied.
Available copies analysis
You may suspect that we can generalize the live variable dataflow analysis into a general framework for dataflow analysis. Before we try to do that, let's look at another dataflow analysis so we can identify what the two have in common. We will consider an analysis called available copies, which keeps track of which variables are known to be copies of other variables. This is useful for the copy propagation optimization. It is really a special case of a more general analysis called available expressions, which we will see later.
Copy propagation
The idea of this optimization is that we replace variables with other variables known to contain the same information. This means we aren't wasting registers on redundant information. For example, in the code below, the variables "x" and "y" hold the same value after the assignment. Assuming there are no assignments to either variable between that point and the assignment to "z", we can replace the use of "x" with "y", as shown on the right. If that transformation means the variable "x" is no longer live, the assignment "x=y" becomes dead code and can be removed.
x = y                    x = y
...             ⇒        ...
z = 2*x + 1              z = 2*y + 1
Dataflow values
The information associated with each program point will be a set of equalities known to hold between different variables. Such a set might look like \(\{x = y,\; z = w\}\).

Dataflow equations
An equality only holds on entry to a node if it holds on exit from all predecessor nodes. Therefore, we have the following equation:

\[
\mathit{in}(n) = \bigcap_{n' \in \mathit{pred}(n)} \mathit{out}(n')
\]

The value on exit from a node depends on what the node does:

| Node \(n\) | \(\mathit{out}(n)\) |
|---|---|
| \(x = y\) (copy) | \(\bigl(\mathit{in}(n) \setminus \{\text{equalities mentioning } x\}\bigr) \cup \{x = y\}\) |
| \(x = e\) (other assignment) | \(\mathit{in}(n) \setminus \{\text{equalities mentioning } x\}\) |
| \([e_1] = e_2\) (memory store) | \(\mathit{in}(n)\) |
| conditional branch | \(\mathit{in}(n)\) |
| start node | \(\varnothing\) |
| all other nodes | \(\mathit{in}(n)\) |
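A sketch of these transfer functions in Python, representing each equality \(x = y\) as the pair frozenset({"x", "y"}) (this representation and the function names are my own, for illustration):

```python
def kill(copies, var):
    """Remove all equalities mentioning var."""
    return {eq for eq in copies if var not in eq}

def transfer(node, copies):
    """out(n) as a function of in(n) for available copies.

    node is a tuple: ("copy", x, y) for x = y,
    ("assign", x) for any other assignment to x,
    ("start",), or anything else (store, branch, return, ...).
    """
    kind = node[0]
    if kind == "copy":
        _, x, y = node
        return kill(copies, x) | {frozenset({x, y})}
    if kind == "assign":
        return kill(copies, node[1])
    if kind == "start":
        return set()          # no copies are available on entry
    return set(copies)        # all other nodes leave the set unchanged

# Example: after x = y, the equality {x, y} becomes available.
print(transfer(("copy", "x", "y"), set()))  # {frozenset({'x', 'y'})}
```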
Worklist algorithm
We can see that available copies is a forward analysis, because the value at entry to a node depends on the predecessor nodes. We update the worklist algorithm given earlier by using predecessors where it used successors, and vice versa, and by using the available copies equations:

- Initialize the worklist to contain all nodes.
- Initialize the value of \(\mathit{out}(n)\) for every node \(n\) to some initial value (for available copies, the set of all possible equalities).
- While the worklist contains some node \(n\):
  - Remove \(n\) from the worklist.
  - Set the value of \(\mathit{out}(n)\) using the dataflow equation, computing \(\mathit{in}(n) = \bigcap_{n' \in \mathit{pred}(n)} \mathit{out}(n')\) and applying the appropriate row of the table above.
  - If \(\mathit{out}(n)\) changed, push the successors of \(n\) onto the worklist if they are not already there.
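The forward variant, again as an illustrative Python sketch; it assumes the transfer function defined in the previous sketch, and the "all possible equalities" initial value is built from the program's variables:

```python
from collections import deque
from itertools import combinations

def available_copies(nodes, succ, stmts, variables):
    """Forward worklist algorithm for available copies.

    stmts maps node id -> a node tuple understood by transfer() above.
    """
    pred = {n: [] for n in nodes}
    for n in nodes:
        for s in succ[n]:
            pred[s].append(n)

    # Top value: the set of all possible equalities between variables.
    top = {frozenset(p) for p in combinations(variables, 2)}
    out = {n: set(top) for n in nodes}

    worklist, on_list = deque(nodes), set(nodes)
    while worklist:
        n = worklist.popleft()
        on_list.discard(n)
        in_n = set(top)                    # intersection over predecessors
        for p in pred[n]:
            in_n &= out[p]
        new_out = transfer(stmts[n], in_n)
        if new_out != out[n]:
            out[n] = new_out
            for s in succ[n]:              # forward: push successors
                if s not in on_list:
                    worklist.append(s)
                    on_list.add(s)
    return out

# Example: 1: start;  2: x = y;  3: return
nodes = [1, 2, 3]
succ = {1: [2], 2: [3], 3: []}
stmts = {1: ("start",), 2: ("copy", "x", "y"), 3: ("return",)}
# x = y is available on exit from nodes 2 and 3, nothing from node 1.
print(available_copies(nodes, succ, stmts, ["x", "y", "z"]))
```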
Dataflow analysis framework
We can characterize both of these analyses within a common framework introduced by Gary Kildall (also known as the creator of the pioneering CP/M operating system for personal computers). A common dataflow framework has the advantage that it allows us to quickly decide whether a given analysis is guaranteed to complete, what its complexity is, and whether it always computes the best solution possible. In addition, a common framework for dataflow analysis enables us to implement a general algorithm for doing analyses in the compiler, rather than reimplementing each analysis from scratch.
Dataflow analysis components
A dataflow analysis framework has four components: the direction of analysis (forward or backward), the values being propagated, transfer functions for each of the nodes, and a meet operator.
Dataflow values
A key component of a dataflow analysis framework is the set of values that the analysis is computing on. In live variable analysis, the values were themselves sets of variables. In available copies, the values were sets of equalities. Let us write \(V\) for the set of dataflow values used by an analysis.
What do dataflow values mean? We can usefully think of them as representing some proposition that must hold at the program point they are attached to. (We can also think of them as representing propositions that may hold at the program point; but saying that a proposition may hold only for the elements of some set is just the same thing as saying that its negation must hold for everything outside the set.) For example, in live variable analysis, the meaning of the set of variables attached to a program point is that all the variables in the set must be considered live at that point.
In the space of values there is one value that conveys the greatest possible information. We will denote this value by the symbol \(\top\), pronounced "top". For live variable analysis, \(\top\) is the empty set of variables; for available copies, it is the set of all possible equalities. Note that \(\top\) is exactly the initial value used by the worklist algorithm in each analysis.
Transfer functions
The second part of a dataflow analysis is a set of transfer functions \(F_n\) that describe how dataflow values are transformed by a node \(n\). For a forward analysis, the transfer function defines how the value on exit is computed from the value on entry: \(\mathit{out}(n) = F_n(\mathit{in}(n))\). For a backward analysis such as live variables, it works in the other direction: \(\mathit{in}(n) = F_n(\mathit{out}(n))\).
Meet operator
The final part of a dataflow analysis is the meet operator \(\sqcap\), which defines how to combine values from multiple incoming edges. As depicted in the figure, supposing we have propositions corresponding to values \(v_1\) and \(v_2\) arriving along two edges into the same node, the combined value is \(v_1 \sqcap v_2\): the strongest proposition implied by both \(v_1\) and \(v_2\). For live variables the meet operator is set union; for available copies it is set intersection.
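As a concrete illustration (the function names are mine, not from the notes), the meet operators for the two analyses we have seen are just set operations in Python:

```python
from functools import reduce

# Live variables: a "may"-style analysis, so meet is union.
def meet_live(values):
    return reduce(set.union, values, set())

# Available copies: a "must"-style analysis, so meet is intersection
# (starting from the top value, the set of all possible equalities).
def meet_copies(values, top):
    return reduce(set.intersection, values, set(top))

print(meet_live([{"x"}, {"y"}]))                     # {'x', 'y'}
print(meet_copies([{frozenset({"x", "y"})}, set()],
                  top={frozenset({"x", "y"})}))      # set()
```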
Summary
We've seen that a dataflow analysis framework can be characterized as a four-tuple \((D, V, F, \sqcap)\), where \(D\) is the direction of the analysis (forward or backward), \(V\) is the domain of dataflow values, \(F\) is the set of transfer functions, and \(\sqcap\) is the meet operator.

With some reasonable conditions on \(V\), \(F\), and \(\sqcap\), we can guarantee that the worklist algorithm terminates and that it computes the most precise solution to the dataflow equations that the framework allows.
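To show how the four components fit together, here is a generic worklist solver parameterized by the framework \((D, V, F, \sqcap)\). It is a sketch under my own naming choices (`direction`, `transfer`, `meet`, `top`), not code from the notes; both analyses above are instances of it:

```python
from collections import deque

def solve(nodes, succ, direction, transfer, meet, top):
    """Generic worklist solver for a dataflow framework (D, V, F, meet).

    direction: "forward" or "backward"            (D)
    transfer:  (node, value) -> value             (the F_n)
    meet:      (value, value) -> value            (the meet operator)
    top:       most informative value, used for initialization
    Returns the value on the "output side" of each node:
    out(n) for a forward analysis, in(n) for a backward one.
    """
    pred = {n: [] for n in nodes}
    for n in nodes:
        for s in succ[n]:
            pred[s].append(n)
    # A forward analysis pulls values from predecessors and pushes
    # successors when a value changes; a backward analysis swaps the roles.
    pull, push = (pred, succ) if direction == "forward" else (succ, pred)

    value = {n: top for n in nodes}
    worklist, on_list = deque(nodes), set(nodes)
    while worklist:
        n = worklist.popleft()
        on_list.discard(n)
        inputs = [value[m] for m in pull[n]]
        combined = inputs[0] if inputs else top
        for v in inputs[1:]:
            combined = meet(combined, v)
        new = transfer(n, combined)
        if new != value[n]:
            value[n] = new
            for m in push[n]:
                if m not in on_list:
                    worklist.append(m)
                    on_list.add(m)
    return value

# Live variables as an instance: values are frozensets of variable names.
nodes = [1, 2, 3]
succ = {1: [2], 2: [3], 3: []}
use  = {1: frozenset({"y"}), 2: frozenset({"x"}), 3: frozenset({"z"})}
defs = {1: frozenset({"x"}), 2: frozenset({"z"}), 3: frozenset()}
live = solve(nodes, succ, "backward",
             transfer=lambda n, out: use[n] | (out - defs[n]),
             meet=lambda a, b: a | b,
             top=frozenset())
print(live)  # {1: frozenset({'y'}), 2: frozenset({'x'}), 3: frozenset({'z'})}
```

Available copies fits the same mold with direction "forward", intersection as the meet, and the set of all possible equalities as top.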