Strength Reduction Pass in LLVM
Strength reduction is an optimization technique that substitutes expensive operations with computationally cheaper ones. For example, even a very simple strength reduction pass can replace the instruction `b = a * 4` with `b = a << 2`.
. For this course project, we implemented the loop strength reduction algorithm described in Prof. Pingali's lecture slides as an LLVM pass. Our pass identifies induction variables and reduces the expensive computation inside the loop to improve the execution time of the compiled program. The source code can be found here.
Methodology
Preprocessing
While it is possible to implement a strength reduction pass without any preprocessing of the code generated by the compiler front-end, it is much easier to perform effective loop strength reduction on properly optimized code. Aside from gathering sufficient information from the CDFG, memory-to-register promotion, constant propagation, copy propagation, and loop canonicalization greatly reduce the complexity of loop strength reduction. Because we are using LLVM to implement our pass, we leverage existing LLVM passes to perform these optimizations for us. The LLVM loop simplification pass canonicalizes each loop to have a preheader block, a header block, a single exit block, and only one backedge. After executing this pass, all loops in the program have the same regular structure, which we leverage to simplify our implementation.
Loop Strength Reduction
We illustrate our loop strength reduction algorithm using a simple example. The following code shows an unoptimized loop:
```c
int i = 0;
while (i < 10) {
    int j = 3 * i + 2;
    a[j] = a[j] - 2;
    i = i + 2;
}
```
We can see that the integer `j` is recomputed in every iteration, and each computation of `j` uses one multiplication and one addition. However, since `j` depends linearly on `i`, and `i` increments by two in every iteration, it is possible to compute a proper initial value of `j` outside the loop and increment `j` by a pre-computed stride in every iteration. With this optimization, the update of `j` uses only one addition per iteration, as shown in the following code snippet. The optimized code is computationally much cheaper than the unoptimized version.
```c
int i = 0;
int j = 2;          // j = 3 * 0 + 2
while (i < 10) {
    a[j] = a[j] - 2;
    i = i + 2;
    j = j + 6;      // j = j + 3 * 2
}
```
Finding Induction Variables
The first step of performing loop strength reduction is to locate all loop induction variables. We use a map to store the names of the induction variables and their corresponding coefficients. Our map has the following structure:
```
Map { Value => (basic induction variable, multiplicative factor, additive factor) }
```
For example, the basic induction variable `i` in the example code will have an entry `i => (i, 1, 0)` in the map, while the variable `j` will have an entry `j => (i, 3, 2)` after we finalize all the induction variables and their coefficients. We identify the basic induction variables by checking the Phi nodes in the header block of the loop, and then repeatedly scan the loop body to find all other induction variables using the following two rules:
- If we find an assignment of the form `k = b * j`, where `j` is an induction variable with triple `(i, c, d)`, then add `k => (i, b * c, b * d)` to the map (since `k = b * (c * i + d) = (b * c) * i + b * d`);
- If we find an assignment of the form `k = j + b`, where `j` is an induction variable with triple `(i, c, d)`, then add `k => (i, c, d + b)` to the map.
The algorithm runs until convergence, i.e., it stops when the size of the induction variable map no longer increases. For our simple example, the first iteration adds the basic induction variable `i => (i, 1, 0)` to the map, the second iteration adds `j => (i, 3, 2)`, and the algorithm terminates in the third iteration.
Update the Program
Computing the stride and initial value of each induction variable from its entry in the map is straightforward. Note that for the initial value, we do not directly compute an integer constant for each variable, but instead express each induction variable in terms of its basic induction variable. We can then update the program to perform loop strength reduction with the following four steps:
- Initialize a non-basic induction variable with the computed initial value in the loop preheader;
- Add one Phi node for this non-basic induction variable into the loop header, and set the incoming value of the Phi node to the initial value we just computed;
- At the end of the loop body, insert an update instruction for this non-basic induction variable, and set it as the other incoming value of the Phi node we just inserted;
- Replace all uses of the original induction variable with the Phi node we inserted.
The procedure is repeated for each non-basic induction variable we have identified. After the optimization, dead code elimination is performed to clean up the instructions used for computing the original induction variables.
Implementation Details
Preprocessing
For the loop preprocessing, we create a function pass manager that includes all the passes we want to apply. The pass manager is instantiated with the module encapsulating the LLVM function.
```cpp
legacy::FunctionPassManager FPM(module);
FPM.add(createConstantPropagationPass());
FPM.add(createIndVarSimplifyPass());
FPM.add(createDeadCodeEliminationPass());
FPM.add(createLoopSimplifyPass());
FPM.doInitialization();
bool changed = FPM.run(F);
FPM.doFinalization();
```
We also run `LoopInfoWrapperPass` to retrieve all the loop information.
```cpp
void getAnalysisUsage(AnalysisUsage &AU) const override {
    AU.setPreservesCFG();
    AU.addRequired<LoopInfoWrapperPass>();
    AU.addRequired<TargetLibraryInfoWrapperPass>();
}
```
Locate Basic Induction Variables
For each loop in the program, we create a map to record all induction variables as described earlier. For each loop, we find basic induction variables by simply collecting all the Phi nodes in the loop header:
```cpp
map<Value*, tuple<Value*, int, int>> IndVarMap;
// Collect all basic indvars by visiting all phi nodes in the header.
for (auto &I : *b_header) {
    if (PHINode *PN = dyn_cast<PHINode>(&I)) {
        IndVarMap[PN] = make_tuple(PN, 1, 0);
    }
}
```
Note that this step may collect more variables than just the basic induction variables we need. However, since every collected variable is initialized as a "basic induction variable", any variable not recognized by our induction variable analysis simply remains "basic" and is left untouched when we perform the optimization.
Induction Variable Analysis
We collect all induction variables by repeatedly iterating through the loop, adding a value to the map if one of the following conditions is satisfied:
- the value is computed with an Add / Sub instruction on a constant and an induction variable that is already in the map;
- the value is computed with a Mul instruction on a constant and an induction variable that is already in the map.
The algorithm iterates over the loop until the map size no longer increases.
Strength Reduction
We iterate through all Phi nodes in the loop header and find all induction variables corresponding to each Phi node. For each induction variable, we insert instructions (one Mul and one Add) in the loop preheader to compute its initial value. A new Phi node is created to replace the original Phi node, and the mapping between the new Phi node and the induction variable is stored in a map. Note that when computing the initial value of an induction variable, we need to retrieve the initial value of its basic induction variable (the one defined by the Phi node). Since we have performed loop simplification, we can safely assume that every Phi node in the loop header has exactly two incoming values: one from the loop body and one from outside the loop. We always retrieve the latter as the initial value.
```cpp
for (auto &I : *b_header) {
    // We insert at the first phi node.
    if (PHINode *PN = dyn_cast<PHINode>(&I)) {
        int num_income = PN->getNumIncomingValues();
        assert(num_income == 2);
        // Find the preheader value of the phi node.
        for (int i = 0; i < num_income; i++) {
            if (PN->getIncomingBlock(i) == b_preheader) {
                preheader_val = PN->getIncomingValue(i);
            } else {
                b_body = PN->getIncomingBlock(i);
            }
        }
        ...
```
We then insert the update instructions into the loop body and set their results as the other incoming values of the new Phi nodes. After creating all the Phi nodes for induction variable substitution, we replace all uses of the original values with the newly created Phi nodes to complete the optimization.
Experiment Results
We evaluate our pass on the embench-iot benchmark suite, which is designed to test the performance of deeply embedded systems. The strength reduction pass is performed on each program to evaluate its correctness and efficiency. Experiments are performed on a server with a 2.20GHz Intel Xeon processor and 128GB of memory. All programs are single-threaded.
To run the optimized program, we first use clang to emit LLVM IR of the original program with all optimization passes disabled. The IR is then passed to LLVM opt, optimized with our pass, and compiled into bitcode and object files. Finally, we link the object files into a binary and run it on the physical machine. For each benchmark, a macro `CPU_MHZ` controls the number of times the top-level benchmark function executes; we set it to 1000 to ensure each benchmark runs a sufficient number of times.
```shell
# generate LLVM IR for the original program
clang -c -emit-llvm -O0 -Xclang -disable-O0-optnone benchmark.c $EMBENCH_DIR/support/*.c \
    -I/path/to/benchmark -I$EMBENCH_DIR/support -DCPU_MHZ=1000

for f in *.bc
do
    llvm-dis ${f}
done

for f in *.ll
do
    # optimize with llvm opt
    opt -S -load build/skeleton/libSkeletonPass.so -mem2reg -sr -dce ${f} -o opt_${f}
    llc -filetype=obj opt_${f}
done

gcc *.o -lm
time ./a.out
```
The LLVM pass is first compiled into a shared library, which is loaded into LLVM opt as a compiler pass (via the `-sr` argument, the custom name of our pass). The experiment results are shown in the following table. To make a fair comparison, we also commented out everything except the preprocessing and postprocessing parts of our strength reduction pass, compiled it into a "dummy pass", and used the same commands to compile the benchmarks. The results of executing this version of the benchmarks are recorded as our baseline in the "Original" column of the table.
| Benchmark | Original (s) | Optimized (s) | Speedup |
|---|---|---|---|
| aha-mont64 | 0.261 | 0.244 | 1.070 |
| crc32 | 0.598 | 0.598 | 1 |
| cubic | 0.018 | 0.018 | 1 |
| edn | 0.439 | 0.389 | 1.129 |
| huffbench | 0.564 | 0.027 | 20.889 |
| matmult-int | 0.628 | 0.579 | 1.085 |
| minver | 0.067 | 0.078 | 0.859 |
| nbody | 0.01 | 0.015 | 0.667 |
| nettle-aes | 0.316 | wrong | N/A |
| nettle-sha256 | 0.348 | 0.364 | 0.956 |
| nsichneu | 0.312 | 0.314 | 0.994 |
| picojpeg | 0.545 | stuck | N/A |
| qrduino | 1.183 | 1.136 | 1.041 |
| sglib-combined | 0.612 | wrong | N/A |
| slre | 0.729 | wrong | N/A |
| st | 0.064 | 0.049 | 1.306 |
| statemate | 0.372 | 0.401 | 0.928 |
| ud | 0.55 | 0.538 | 1.022 |
| wikisort | 0.215 | 0.276 | 0.779 |
| Geomean | | | 1.211 |
Except for a few programs, the average speedup is 1.211x over the original baseline. Unfortunately, our pass does not optimize all programs correctly: `nettle-aes`, `sglib-combined`, and `slre` produce wrong results after our optimization, while `picojpeg` does not terminate after applying our pass. We have not yet determined the cause of these errors. Possible reasons are:
- Our analysis and optimization failed to filter out all the non-induction variables collected at the beginning of the pass, causing the program to behave incorrectly;
- When inserting the update instructions, we may have inserted them at incorrect positions or inserted them multiple times. This may also explain the 20x speedup on the huffbench program (i.e., the loop exits before finishing all the work, but the verification function still returns the correct result).