Lesson 8: Loop Optimization
- discussion thread
- video
- tasks due March 22
Gist
- Recall from Lesson 5: identifying natural loops in the CFG.
- Loop optimization is important because programs, by definition, spend most of their time executing loops! (A program without any loops can’t run for very long.)
Loop-Invariant Code Motion
Also from Lesson 5, recall the loop-invariant code motion optimization, which transforms code like this:
let a = ...;
let b = ...;
for (let i = 0; i < 100; ++i) {
  f(a * b);
}
Into this, by moving code that does the same thing on every iteration to the loop’s pre-header block:
let a = ...;
let b = ...;
let c = a * b;
for (let i = 0; i < 100; ++i) {
  f(c);
}
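The analysis half of LICM is a simple fixed point: an instruction is loop-invariant if all of its arguments are either defined outside the loop or are themselves the results of loop-invariant instructions. Here is a minimal sketch in TypeScript, assuming a hypothetical three-address IR (the Instr type and both function parameters are made up for illustration) and SSA-style single definitions; a real pass also needs safety checks before actually hoisting, such as ruling out side effects:
type Instr = { dest: string; args: string[] };

// Iterate to a fixed point: marking one instruction invariant can make
// later instructions (which use its result) invariant too.
function findLoopInvariant(
  loopInstrs: Instr[],          // instructions inside the loop
  definedOutside: Set<string>,  // variables defined before the loop
): Set<Instr> {
  const invariant = new Set<Instr>();
  const invariantDests = new Set<string>();
  let changed = true;
  while (changed) {
    changed = false;
    for (const instr of loopInstrs) {
      if (invariant.has(instr)) continue;
      const allInvariantArgs = instr.args.every(
        (a) => definedOutside.has(a) || invariantDests.has(a),
      );
      if (allInvariantArgs) {
        invariant.add(instr);
        invariantDests.add(instr.dest);
        changed = true;
      }
    }
  }
  return invariant;
}
In the example above, a * b would be marked invariant (both of its arguments are defined outside the loop), which is what licenses moving it to the pre-header.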
Induction Variable Elimination
Induction variable elimination is an optimization that generally targets array indexing (and similar computations). Imagine this loop that accesses an array:
for (let i = 0; i < 100; ++i) {
  f(a[i]);
}
which, in C terms, is equivalent to:
for (let i = 0; i < 100; ++i) {
  f(a + i * stride);
}
where a is a base pointer and stride is the size of the array’s elements.
Induction variable elimination rewrites the code to just “bump the pointer” instead of multiplying every time:
let a_100 = a + 100 * stride;
for (let a_i = a; a_i < a_100; a_i += stride) {
  f(a_i);
}
This code only has to do an addition in the loop—it has eliminated a multiplication.
The basic idea is to find affine expressions like a + i * stride that depend on a loop induction variable (i here) and some loop-invariant values (a and stride here).
Here’s the recipe (a code sketch of the detection step appears after the list):
- Given a natural loop, identify the basic loop induction variable(s), like i in this example.
  - These are variables that have a single in-loop definition that looks like i += e for some loop-invariant value e.
  - Of course, there is no such thing as += in SSA form, so identifying this kind of assignment requires traversing ϕ-nodes.
- Find derived induction variables.
  - You’re looking for an affine expression like j = c * i + d where i is a basic induction variable and c and d are loop-invariant values.
  - Record the stride c, offset d, and the basic induction variable i for every induction variable you find this way.
- Replace the definition of each derived induction variable.
  - Assuming the induction variable is j = c * i + d, where i is a basic induction variable that gets updated like i += e…
  - In the preheader, initialize j with d. (That’s the “equation” for j with i set to zero.)
  - In the loop (after the update to i), replace the definition of j with an update like j += c * e (where that latter multiplication is a loop-invariant value). (And again, this will look a bit different in SSA. And in non-SSA form, you will need to be careful to compute the new value of j right after the update to i but without clobbering other, pre-update uses of j.)
After all this, you’ll want to do some copy propagation and dead code elimination to, for example, eliminate basic induction variables that are no longer necessary.
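Concretely, here is a sketch of the detection step in TypeScript, using the same hypothetical IR as the LICM sketch above (now with an op field) plus a made-up loopInvariant predicate; it glosses over SSA, chained offsets, and other details a real pass must handle:
type Instr = { op: string; dest: string; args: string[] };

// A derived induction variable j = c * i + d: a stride c, an offset d,
// and the basic induction variable i it is derived from.
type Derived = { basic: string; stride: string; offset: string };

function findDerived(
  loopInstrs: Instr[],
  basicIVs: Set<string>,                 // basic induction variables, like i
  loopInvariant: (v: string) => boolean, // is this variable loop-invariant?
): Map<string, Derived> {
  const derived = new Map<string, Derived>();
  for (const instr of loopInstrs) {
    if (instr.op === "mul") {
      // Match j = c * i (offset 0), trying both operand orders.
      const [x, y] = instr.args;
      const [c, i] = basicIVs.has(y) ? [x, y] : [y, x];
      if (basicIVs.has(i) && loopInvariant(c)) {
        derived.set(instr.dest, { basic: i, stride: c, offset: "0" });
      }
    } else if (instr.op === "add") {
      // Match j = k + d, where k is a derived variable found above.
      const [k, d] = instr.args;
      const base = derived.get(k);
      if (base && base.offset === "0" && loopInvariant(d)) {
        derived.set(instr.dest, { basic: base.basic, stride: base.stride, offset: d });
      }
    }
  }
  return derived;
}
The replacement step then consumes this map: for each derived variable j with stride c and offset d, emit j = d in the preheader and j += c * e after the update to the basic induction variable, hoisting the c * e product as a loop-invariant value.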
This optimization is also sometimes called strength reduction for induction variables (but I think that’s confusing because strength reduction is a much more general concept that doesn’t by itself have anything to do with loops). For more, check out some slides from CMU on induction variable elimination and two blog posts that cover this topic from CS 6120 in 2019: one by Yi Jing, Zhijing Li, and Neil Adit and another by Shaojie Xiang, Yi-Hsiang Lai, and Yuan Zhou.
Loop Unswitching
Loop unswitching “lifts” conditions out of loops, creating two loops. So this code:
for (let i = 0; i < 100; ++i) {
  if (c) { // Loop-invariant value.
    f();
  } else {
    g();
  }
}
Becomes:
if (c) {
  for (let i = 0; i < 100; ++i) {
    f();
  }
} else {
  for (let i = 0; i < 100; ++i) {
    g();
  }
}
The analysis, of course, requires identifying loop-invariant code. Then the heavy lifting in the transformation is “just” rearranging the CFG and duplicating the loop’s subgraph. For more, see a blog post by Sameer Lal from CS 6120 in 2019.
Loop Fusion & Fission
Loop fusion or loop jamming takes two sequential loops and combines them into one. Its main purpose again has to do with arrays, particularly when feeding the output from one loop into the input of the next loop, like this:
for (let i = 0; i < 100; ++i) {
  b[i] = f(a[i]);
}
for (let i = 0; i < 100; ++i) {
  c[i] = g(b[i]);
}
Fusing these loops yields one that calls both f and g:
for (let i = 0; i < 100; ++i) {
  b[i] = f(a[i]);
  c[i] = g(b[i]);
}
Then, the real win comes if the intermediate array b is “dead” outside of this context, in which case we can eliminate it:
for (let i = 0; i < 100; ++i) {
  c[i] = g(f(a[i]));
}
In this case, fusion eliminated a lot of memory accesses (and perhaps even reduced the program’s heap footprint, if we no longer even have to allocate b).
This optimization is extremely important in machine learning compilers and other data-intensive domains—for example, it is far more efficient to fuse a fully-connected layer with a ReLU nonlinearity than it is to compute them serially, with a temporary tensor in between.
The opposite transformation, loop fission or distribution, can be useful to optimize for locality—i.e., to make multiple, smaller loop bodies that each have a smaller memory footprint.
The key analysis you need in both cases is independence, i.e., being certain that f and g do not interfere with each other so it’s legal to change the order they execute in.
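For example (a made-up variation on the loops above), a loop-carried dependence between the two bodies makes fusion illegal:
for (let i = 0; i < 100; ++i) {
  b[i] = f(a[i]);
}
for (let i = 0; i < 99; ++i) {
  c[i] = g(b[i + 1]);
}
In the original, every element of b is written before any call to g runs; in the fused loop, iteration i would read b[i + 1] before it is ever written. The dependence analysis has to prove this kind of interference is absent before fusion is legal.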
Global Optimizations Galore!
This is not an exhaustive list of loop optimizations—there are so many more! For example:
- Loop unrolling (duplicating the loop body to reduce the number of branches executed at the expense of code size; see the sketch after this list).
- Loop interchange (replacing row-major iteration with column-major iteration, for example).
- Loop splitting or peeling (carving off the first few iterations of the loop and running them separately, leaving you with a simpler main loop body).
- Loop transformations for data locality optimization (see also Kathryn McKinley’s slides on this topic).
See overviews of these and still more in Steve Chong’s slides on loop optimizations.
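To make the first of these concrete, here is the array-access loop from earlier unrolled by a factor of four, assuming (for simplicity) that the trip count is a multiple of the unroll factor; otherwise you also need a small cleanup loop for the leftover iterations:
for (let i = 0; i < 100; i += 4) {
  f(a[i]);
  f(a[i + 1]);
  f(a[i + 2]);
  f(a[i + 3]);
}
Now one branch covers four iterations’ worth of work, at the cost of a larger loop body.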
More broadly, this is where we will stop going down the standard function-level optimization rabbit hole. Before the class moves on to farther-flung topics in language implementation, for those who are curious about this sort of thing, I wanted to leave you with a few links to read about even more fancy global optimizations:
- Global value numbering, the turbocharged global version of local value numbering, which you will remember from Lesson 3. See also this CS 6120 blog post by Alexa VanHattum and Gregory Yauney.
- Partial redundancy elimination.
- Lazy code motion. See also these slides from CMU.
Tasks
Your task is to implement and evaluate a loop optimization.
- First, pick Bril or LLVM as your starting point.
- You can work in the SSA form of Bril if you want.
- If you pick Bril, you’ll need to find the natural loops yourself. If you pick LLVM, you can use a LoopPass to skip that step, but of course other parts of the implementation will be trickier.
- Then, pick your optimization. You can either:
- Choose one of the loop-based optimizations mentioned above in this lesson. (Including one of the ones I just mentioned at the end without describing in detail.) (If you’re having trouble deciding, I recommend LICM.)
- Propose your own. (DM me on Zulip before you start working on it to clear your idea.)
- Implement your optimization.
- Rigorously evaluate its performance impact. See the SIGPLAN empirical evaluation guidelines for the definition of “rigorous.”
- In Bril, you can use the Bril benchmarks.
- If you choose LLVM, select an existing (small!) benchmark suite such as Embench. Please violate the SIGPLAN guideline about using complete benchmark suites, and cherry-pick a convenient subset, to make this evaluation tractable.