Reminder: Modern processors
Wide: start / retire multiple instructions per cycle
Pipelined: overlap instruction executions
Out-of-order: dynamically schedule instructions
Reminder: Modern processors
Want lots of instruction-level parallelism (ILP)
Complicated! Compiler should handle details
Implication: we should give the compiler
Good instruction mixes
Independent operations
Vectorizable operations
Reminder: Memory systems
Memory accesses are expensive!
Flop time \(\ll\) bandwidth\(^{-1}\) \(\ll\) latency
Caches provide intermediate cost/capacity points
Cache benefits from
Spatial locality (regular local access)
Temporal locality (small working sets)
Goal: (Trans)portable performance
Attention to detail has orders-of-magnitude impact
Systems differ in micro-architectures, caches
Want (trans)portable performance across HW
Need principles for high-perf code along with tricks
Basic principles
Think before you write
Time before you tune
Stand on the shoulders of giants
Help your tools help you
Tune your data structures
Think before you write
Premature optimization
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
– Don Knuth
Premature optimization
Wrong reading: “Performance doesn’t matter”
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
– Don Knuth
Premature optimization
What he actually said (with my emphasis)
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
– Don Knuth
Premature optimization
Don’t forget the big efficiencies!
Don’t forget the 3%!
Your code is not premature forever!
Don’t sweat the small stuff
Speed-up from tuning a fraction \(\epsilon\) of the code is \(< (1-\epsilon)^{-1} \approx 1 + \epsilon\); OK if
High-level stuff in Matlab or Python
Configuration file reader is un-tuned
\(O(n^2)\) prelude to \(O(n^3)\) algorithm is un-tuned
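A quick sanity check of that bound: if a fraction \(\epsilon\) of the runtime is tuned all the way down to zero cost, the best possible speed-up is
\[S = \frac{T}{(1-\epsilon)T} = (1-\epsilon)^{-1} = 1 + \epsilon + \epsilon^2 + \cdots \approx 1 + \epsilon \quad \mbox{for small } \epsilon\]
So tuning 3% of the code buys at most about 3%.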
Lay-of-the-land thinking
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
What are the “big computations” in my code?
What are the natural algorithmic variants?
Vary loop orders? Different interpretations!
Lower complexity algorithm (Strassen?)
Should I rule out some options in advance?
How can I code so it is easy to experiment?
How big is \(n\)?
Typical analysis: time is \(O(f(n))\)
Meaning: \(\exists C, N : \forall n \geq N, T_n \leq C f(n)\).
Says nothing about constant factors: \(O(10 n) = O(n)\)
Ignores lower-order terms: \(O(n^3 + 1000 n^2) = O(n^3)\)
Beware asymptotic complexity arguments about
small-\(n\) codes!
Avoid work
#include <stdbool.h>

// Scans the whole array, even after the answer is known
bool any_negative1(int* x, int n)
{
    bool result = false;
    for (int i = 0; i < n; ++i)
        result = (result || x[i] < 0);
    return result;
}

// Returns as soon as a negative entry is found: no wasted work
bool any_negative2(int* x, int n)
{
    for (int i = 0; i < n; ++i)
        if (x[i] < 0)
            return true;
    return false;
}
Be cheap
Fast enough, right enough
\(\implies\)
Approximate when you can get away with it.
Do more with less (data)
Want lots of work relative to data loads:
Keep data compact to fit in cache
Use short data types for better vectorization
But be aware of tradeoffs!
For integers: may want 64-bit ints sometimes!
For floating-point: more in other lectures
Remember the I/O!
Example: Explicit PDE time stepper on \(256^2\) mesh
0.25 MB per frame (three fit in L3 cache)
Constant work per element (a few flops)
Time to write to disk \(\approx\) 5 ms
If I write once every 100 frames, how much time is I/O?
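A back-of-the-envelope answer, with assumed (not given) numbers of 10 flops per mesh point and \(10^{10}\) flop/s sustained:
\[T_{\mbox{compute}} \approx \frac{100 \cdot 256^2 \cdot 10 \mbox{ flops}}{10^{10} \mbox{ flop/s}} \approx 6.5 \mbox{ ms}, \qquad \frac{T_{\mbox{I/O}}}{T_{\mbox{I/O}} + T_{\mbox{compute}}} \approx \frac{5}{5 + 6.5} \approx 40\%\]
Under those assumptions, the occasional 5 ms write is far from negligible.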
Time before you tune
Hot spots and bottlenecks
Often a little bit of code takes most of the time
Usually called a “hot spot” or bottleneck
Goal: Find and eliminate
Cute coinage: “de-slugging”
Practical timing
Need to worry about:
System timer resolutions
Wall-clock time vs CPU time
Size of data collected vs how informative it is
Cross-interference with other tasks
Cache warm-start on repeated timings
Overlooked issues from too-small timings
Manual instrumentation
Basic picture:
Identify stretch of code to be timed
Run it several times with “characteristic” data
Accumulate the total time spent
Caveats: Effects from repetition, “characteristic” data
Manual instrumentation
Hard to get portable high-resolution wall-clock time!
Solution: omp_get_wtime()
Requires OpenMP support (at this writing, still not in Clang)
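A minimal sketch of the basic picture, assuming a made-up kernel to be timed and a compiler with OpenMP enabled (e.g. -fopenmp):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

// Hypothetical stand-in for the stretch of code being timed
void kernel(double* x, int n)
{
    for (int i = 0; i < n; ++i)
        x[i] = 2.0*x[i] + 1.0;
}

int main(void)
{
    int n = 1000000, ntrials = 100;
    double* x = malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i)
        x[i] = i;

    double t0 = omp_get_wtime();        // wall-clock time in seconds
    for (int trial = 0; trial < ntrials; ++trial)
        kernel(x, n);                   // repeat to beat timer resolution
    double t1 = omp_get_wtime();

    printf("Mean time per call: %g s\n", (t1-t0)/ntrials);
    free(x);
    return 0;
}

The caveats above apply: after the first trial, the cache is warm.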
Types of profiling tools
Sampling vs instrumenting:
Sampling: Interrupt every \(t_{\mathrm{profile}}\) cycles and record where the program is
Instrumenting: Insert timing code into the program, by hand or via tooling
Stand on the shoulders of giants
Level 3 BLAS (matrix-matrix operations) are key for high-performance, transportable linear algebra.
Other common kernels
Apply sparse matrix (or sparse matrix powers)
Compute an FFT
Sort a list
Kernel trade-offs
Critical to get properly tuned kernels
Interface is consistent across HW types
Implementation varies according to arch
General kernels may leave performance on the table
Ex: General matrix ops for structured matrices
Overheads may be an issue for small \(n\) cases
Kernel trade-offs
Building on kernel functionality is not perfect.
But: ideally, someone else writes the kernel!
(Or it may be automatically tuned)
Help your tools help you
How can compiler help?
In decreasing order of effectiveness:
Local optimization
Especially restricted to a “basic block”
More generally, in “simple” functions
Loop optimizations
Global (cross-function) optimizations
Local optimizations
Register allocation: compiler > human
Instruction scheduling: compiler > human
Branch joins and jump elim: compiler > human?
Constant folding and propagation: humans OK
Common subexpression elimination: humans OK
Algebraic reductions: humans definitely help
Loop optimizations
Mostly leave these to modern compilers
Loop invariant code motion
Loop unrolling
Loop fusion
Software pipelining
Vectorization
Induction variable substitution
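As one illustration, here is loop-invariant code motion done by hand (the function and names are invented for the example; in practice the compiler does this for you):

// Hoist the loop-invariant product a*b out of the loop,
// as the compiler's loop-invariant code motion pass would
void scale_all(int n, double a, double b, const double* x, double* y)
{
    double ab = a*b;            // computed once, not every iteration
    for (int i = 0; i < n; ++i)
        y[i] = x[i] * ab;       // was: y[i] = x[i] * (a*b);
}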
Obstacles for the compiler
Long dependency chains
Excessive branching
Pointer aliasing
Complex loop logic
Cross-module optimization
Obstacles for the compiler
Function pointers and virtual functions
Unexpected FP costs
Missed algebraic reductions
Lack of instruction diversity
Let’s look at a few...
Ex: Long dependency chains
Sometimes these can be decoupled (e.g. reduction loops)
// Version 0
float s = 0;
for (int i = 0; i < n; ++i)
    s += x[i];
Apparent linear dependency chain. Compilers might handle this, but let’s try ourselves...
Ex: Long dependency chains
Key: Break up chains to expose parallel opportunities
// Version 1
float ss[4] = {0, 0, 0, 0};
int i;

// Sum start of list in four independent sub-sums
for (i = 0; i < n-3; i += 4)
    for (int j = 0; j < 4; ++j)
        ss[j] += x[i+j];

// Combine sub-sums and handle trailing elements
float s = (ss[0]+ss[1]) + (ss[2]+ss[3]);
for (; i < n; ++i)
    s += x[i];

Re-associating a floating-point sum changes the rounding, which is why compilers only do this on their own under relaxed FP flags (e.g. -ffast-math).
Ex: Pointer aliasing
Why can this not vectorize easily?
void add_vecs(int n, double* result, double* a, double* b)
{
    for (int i = 0; i < n; ++i)
        result[i] = a[i] + b[i];
}
Q: What if result overlaps a or b?
Ex: Pointer aliasing
C99: Use restrict keyword
void add_vecs(int n, double* restrict result,
              double* restrict a, double* restrict b);
Implicit promise: these point to different things in memory.
Fortran forbids aliasing, which is part of why naive Fortran speed beats naive C speed!
Ex: “Black box” function calls
Compiler must assume arbitrary wackiness from “black box” function calls
double foo(double* restrict x)
{
    double y = *x;  // Load x once
    bar();          // Assume bar is a 'black box' fn
    y += *x;        // Must reload x
    return y;
}
Ex: Floating point issues
Several possible optimizations available:
Use different precisions
Use more/less accurate special function routines
Underflow is flush-to-zero or gradual
Ex: Floating point issues
Problem: This changes semantics!
Compiler pretends floats are reals and hopes?
This will break some of my codes!
Human intervention is indicated
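A classic example of why: Kahan summation. The compensation term is exactly zero in real arithmetic, so an optimizer allowed to treat floats as reals (e.g. under -ffast-math) may silently delete it. This sketch is the standard algorithm, shown here for illustration:

// Kahan (compensated) summation: c captures the rounding error
// of each addition. Algebraically c is always zero, so "floats
// are reals" optimizations can remove the compensation entirely.
float kahan_sum(const float* x, int n)
{
    float s = 0.0f, c = 0.0f;
    for (int i = 0; i < n; ++i) {
        float y = x[i] - c;
        float t = s + y;
        c = (t - s) - y;   // nonzero only in floating point
        s = t;
    }
    return s;
}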
Optimization flags
-O[0123] (no optimization – aggressive optimization)
-O2 is usually the default
-O3 is useful, but might break FP codes (for example)
Optimization flags
Architecture targets
“Native” mode (e.g. -march=native for GCC/Clang) targets the current architecture
Not always the right choice (e.g. if head and compute nodes differ)
Optimization flags
Specialized optimization flags
Turn on/off specific optimization features
Often the basic -Ox has reasonable defaults
Auto-vectorization and compiler reports
Good compilers try to vectorize for you
Intel is pretty good at this
GCC / Clang are OK, not as strong
Can get reports about what prevents vectorization
(e.g. -fopt-info-vec-missed for GCC, -Rpass-missed=loop-vectorize for Clang)
Not necessarily by default!
Helps a lot for tuning
Profile-guided optimization
Basic workflow:
Compile code with optimizations
Run in a profiler
Compile again, provide profiler results
Helps the compiler optimize branches based on observed behavior
(e.g. with GCC: build with -fprofile-generate, run a representative workload, then rebuild with -fprofile-use)
Data layout matters
“Speed-of-light” analysis
For compulsory misses to load cache:
\[T_{\mbox{data}} \mbox{ (s)} \quad \geq \quad \frac{\mbox{data required (bytes)}}{\mbox{peak BW (bytes/s)}}\]
Possible optimizations:
Shrink working sets to fit in cache (pay this once)
Use simple unit-stride access patterns
Reality is generally more complicated...
When and how to allocate
Why is this an \(O(n^2)\) loop?
x = [];
for i = 1:n
    x(i) = i;
end
When and how to allocate
Access is not the only cost!
Allocation / de-allocation also costs something
So does garbage collection (where supported)
Beware hidden allocation costs (e.g. on resize)
Often bites naive library users
When and how to allocate
Two thoughts to consider
Pre-allocation (avoid repeated alloc/free)
Lazy allocation (if alloc will often not be needed)
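A minimal C sketch of the pre-allocation point (function names invented for the example; a real version would check the allocation results):

#include <stdlib.h>

// Grows one element at a time: each realloc may copy everything
// inserted so far, giving O(n^2) copying in the worst case
int* fill_growing(int n)
{
    int* x = NULL;
    for (int i = 0; i < n; ++i) {
        x = realloc(x, (i+1) * sizeof(int));
        x[i] = i;
    }
    return x;
}

// Pre-allocates once up front: O(n) total work
int* fill_preallocated(int n)
{
    int* x = malloc(n * sizeof(int));
    for (int i = 0; i < n; ++i)
        x[i] = i;
    return x;
}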
Storage layout
Desiderata:
Compact (fit lots into cache)
Traverse with simple access patterns
Avoid pointer chasing
Multi-dimensional arrays
Two standard formats:
Col-major (Fortran): Store columns consecutively
Row-major (C/C++): Store rows consecutively
Ideally, traverse with unit stride! Layout affects choice.
Can use more sophisticated multi-dim array layouts...
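As a small illustration in C (macro names invented for the example), the same logical \(n \times n\) array indexed both ways:

// Column-major (Fortran-style): column j is contiguous in memory
#define A_COL(A, i, j, n) ((A)[(i) + (j)*(n)])

// Row-major (C-style): row i is contiguous in memory
#define A_ROW(A, i, j, n) ((A)[(i)*(n) + (j)])

For unit stride, the inner loop should walk the contiguous index: with column-major storage, that means looping over \(i\) innermost.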
Blocking / tiling
Classic example: Matrix multiply
Load \(b \times b\) block of \(A\)
Load \(b \times b\) block of \(B\)
Compute product of blocks
Accumulate into \(b \times b\) block of \(C\)
Have \(O(b^3)\) work for \(O(b^2)\) memory references!
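A minimal sketch in C, using the same column-major indexing as the earlier loop nest; it assumes \(b\) divides \(n\), and a real version would handle the leftover edge blocks:

// Blocked (tiled) matrix multiply: C += A*B, column-major n-by-n
void matmul_blocked(int n, int b, double* C,
                    const double* A, const double* B)
{
    for (int jb = 0; jb < n; jb += b)
        for (int kb = 0; kb < n; kb += b)
            for (int ib = 0; ib < n; ib += b)
                // Block product: C(ib,jb) += A(ib,kb) * B(kb,jb)
                for (int j = jb; j < jb+b; ++j)
                    for (int k = kb; k < kb+b; ++k)
                        for (int i = ib; i < ib+b; ++i)
                            C[i+j*n] += A[i+k*n] * B[k+j*n];
}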
Data alignment and vectorization
Vector loads/stores are faster if aligned (i.e. starting at a memory address that is a multiple of the vector width, e.g. 16, 32, or 64 bytes)
Can ask for aligned blocks of memory from allocator
Then want aligned offsets into aligned blocks
Have to help compiler recognize aligned pointers!
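A sketch of both steps, using C11 aligned_alloc plus the GCC/Clang builtin __builtin_assume_aligned; the function names are invented, and 64-byte alignment is an assumption (one cache line, enough for AVX-512 vectors):

#include <stdlib.h>

double* make_aligned_array(size_t n)
{
    // aligned_alloc requires size to be a multiple of the alignment
    size_t bytes = ((n * sizeof(double) + 63) / 64) * 64;
    return aligned_alloc(64, bytes);   // 64-byte aligned block
}

void scale_aligned(size_t n, double* x)
{
    // Promise the compiler that x is 64-byte aligned so it can
    // emit aligned vector loads/stores
    double* xa = __builtin_assume_aligned(x, 64);
    for (size_t i = 0; i < n; ++i)
        xa[i] *= 2.0;
}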
Data alignment and cache contention
Issue: What if strided access causes conflict misses?
Example: Walk across row of col-major matrix
Example: Parallel arrays of large-power-of-2 size
Not the most common problem, but one to watch for.
Structure layouts
Want \(b\)-byte types on \(b\)-byte memory boundaries
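For example, under a typical 64-bit ABI (the sizes here are common outcomes, not guarantees):

// Likely 24 bytes: 1 + 7 padding + 8 + 1 + 7 tail padding
struct Wasteful { char c; double x; char d; };

// Likely 16 bytes: 8 + 1 + 1 + 6 tail padding
struct Compact  { double x; char c, d; };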