CS 5220

Basic Code Optimization

David Bindel

2024-09-05

Reminder

  • Modern CPUs are wide, pipelined, out-of-order
    • Want good instruction mixes, independent operations
    • Want vectorizable operations
  • Communication (including with memory) is slow
    • Caches provide intermediate cost/capacity points
    • Designed for spatial and temporal locality

(Trans)portable Performance

  • Details have orders-of-magnitude impacts
  • But systems differ in micro-arch, caches, etc
  • Want transportable performance across HW
  • Need principles for high-perf code (+ tricks)

Principles

  • Think before you write
  • Time before you tune
  • Stand on shoulders of giants
  • Help your tools help you
  • Tune your data structures

Think Before You Write

Premature Optimization

We should forget about small efficiencies, say 97% of the time: premature optimization is the root of all evil.
- Knuth, Structured programming with go to statements, Computing Surveys 6(4), 1974.

Premature Optimization

… Yet we should not pass up our opportunities in that critical 3%.
- Knuth, Structured programming with go to statements, Computing Surveys 6(4), 1974.

Premature Optimization

  • At design time, think big efficiencies
  • Don’t forget the 3%!
  • And the time is not premature forever!

Functionality First

No prize for speed of wrong answers.

Lay-of-the-Land Thinking

for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
  • What are the “big computations” in my code?
  • What are natural algorithmic variants?
    • Vary loop orders? Different interpretations! (see sketch below)
    • Lower complexity algorithm (Strassen?)
  • Should I rule out some options in advance?
  • How can I code so it is easy to experiment?
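
For example, one reordering (same result, different access pattern): the innermost loop now runs at unit stride down columns of \(A\) and \(C\) in this column-major layout.

for (int j = 0; j < n; ++j)
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            C[i+j*n] += A[i+k*n] * B[k+j*n];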

Don’t Sweat the Small Stuff

  • Fine to have high-level logic in Python and company
  • Probably fine not to tune configuration file readers
  • Maybe OK not to tune \(O(n^2)\) prelude to \(O(n^3)\) algorithm?
    • Depending on \(n\) and on the constants!

How Big?

Typical analysis: time is \(O(f(n))\)

  • Meaning: \(\exists C, N : \forall n \geq N, T_n \leq C f(n)\)
  • Says nothing about constants: \(O(10n) = O(n)\)
  • Ignores lower-order terms: \(O(n^3 + 1000n^2) = O(n^3)\)

Beware asymptotic complexity analysis for small \(n\)!

Avoid Work

Asymptotic complexity is not everything, but:

  • Quicksort beats bubble sort for modest \(n\)
  • Counting sort even faster for modest key space
  • No time at all if data is already sorted!

Pick algorithmic approaches thoughtfully.

Be Cheap

Our motto: Fast enough, right enough

  • Want: time saved in compute \(\gg\) time taken in tuning
    • Your time costs more than compute cycles
    • No shame in a slow workhorse that gets the job done
  • Maybe an approximation is good enough?
    • Depends on application context
    • Answer usually requires error analysis, too

Do More with Less (Data)

Want lots of work relative to data loads:

  • Keep data compact to fit in cache
  • Short data types for better vectorization
  • But be aware of tradeoffs!
    • For integers: May want 64-bit ints sometimes!
    • For floating point: More in other lectures

Remember the I/O

Example: Explicit PDE time stepper on \(256^2\) mesh

  • 0.25 MB per frame (three fit in L3 cache)
  • Constant work per element (a few flops)
  • Time to write to disk \(\approx\) 5 ms

If I write once every 100 frames, how much time is I/O?
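
One back-of-envelope answer (with made-up but plausible rates): at 5 flops per element and \(10^9\) flops/s sustained, a frame costs about \(5 \cdot 256^2 / 10^9 \approx 0.33\) ms of compute, so 100 frames take \(\approx 33\) ms against 5 ms of I/O. I/O is then roughly \(5/38 \approx 13\%\) of the total.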

Time Before You Tune

Back to Knuth

It is often a mistake to make a priori judgements about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail.
- Knuth, Structured programming with go to statements, Computing Surveys 6(4), 1974.

Hot Spots and Bottlenecks

  • Often a little bit of code takes most of the time
  • Usually called a “hot spot” or bottleneck
  • Goal: Find and remove (“de-slugging”)

Practical Timing

Things to consider:

  • Want high-resolution timers
  • Wall-clock time vs CPU time
  • Size of data collected vs how informative it is
  • Cross-interference with other tasks
  • Cache warm-start on repeated timings
  • Overlooked issues from too-small timings

Manual Instrumentation

Basic picture:

  • Identify stretch of code to be timed
  • Run several times with “characteristic” data
  • Accumulate time spent

Caveats: Effects from repetition, “characteristic” data

Manual Instrumentation
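
A minimal sketch of the pattern from the previous slide, assuming POSIX clock_gettime; do_work is a hypothetical stand-in for the code stretch being timed:

#include <time.h>

void do_work(void);  // hypothetical code stretch to be timed

double avg_time_work(int ntrials)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int trial = 0; trial < ntrials; ++trial)
        do_work();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    return elapsed / ntrials;  // average seconds per call
}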

Profiling Tools

  • Sampling: Interrupt every \(t_{\mathrm{profile}}\) cycles
  • Instrumenting: Rewrite code to insert timers
    • May happen at binary or source level

Time Attribution

May time at function level or line-by-line

  • Function: Can still get mis-attribution from inlining
  • Line-by-line: Attribution is harder; needs debug symbols (-g)

More Profiling Details

  • Distinguish full call stack or not?
  • Time full run, or just part?
  • Just timing, or get other info as well?

Hardware Counters

  • Counters track cache misses, instruction counts, etc
  • Present on most modern chips
  • But may require significant permissions to access

Symbolic Execution

  • Main current example: llvm-mca
  • Symbolically execute assembly on model of core
  • Usually only practical for short segments
  • Can give detailed feedback on (assembly) quality

Shoulders of Giants

What Makes a Good Kernel?

Computational kernels are

  • Small and simple to describe
  • General building blocks (amortize tuning work)
  • Ideally high arithmetic intensity
    • Arithmetic intensity = flops/byte
    • Amortizes memory costs
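
For example, a dot product of two length-\(n\) double-precision vectors does about \(2n\) flops while reading \(16n\) bytes:

\[\mbox{intensity} = \frac{2n \mbox{ flops}}{16n \mbox{ bytes}} = 0.125 \mbox{ flops/byte}\]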

Case Study: BLAS

Basic Linear Algebra Subroutines

  • Level 1: \(O(n)\) work on \(O(n)\) data
  • Level 2: \(O(n^2)\) work on \(O(n^2)\) data
  • Level 3: \(O(n^3)\) work on \(O(n^2)\) data

Level 3 BLAS are key for high-perf transportable LA.
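
For example, a sketch of calling the level-3 kernel dgemm through the CBLAS interface (assuming an implementation such as OpenBLAS is linked):

#include <cblas.h>

// C = A*B for square column-major matrices, as in the loop example
void square_matmul(int n, const double* A, const double* B, double* C)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}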

Other Common Kernels

  • Apply sparse matrix (or sparse matrix powers)
  • Compute an FFT
  • Sort an array

Kernel Tradeoffs

  • Critical to get properly tuned kernels
  • Interface is consistent across HW types
  • Implementation varies by architecture
  • General kernels may leave performance on table
    • Ex: General matrix ops for structured matrices
  • Overheads may be an issue for small \(n\) cases

Kernel Tradeoffs

Building on kernel functionality is not perfect.
But ideally, someone else writes the kernel!

(Or it may be automatically tuned)

Help Tools Help You

How can Compiler Help?

In decreasing order of effectiveness:

  • Local optimization
    • Especially restricted to a “basic block”
    • More generally, in “simple” functions
  • Loop optimizations
  • Global (cross-function) optimizations

Local Optimizations

  • Register allocation: compiler > human
  • Instruction scheduling: compiler > human
  • Branch joins and jump elim: compiler > human?
  • Constant folding and propagation: humans OK
  • Common subexpression elimination: humans OK
  • Algebraic reductions: humans definitely help

Loop Optimization

Mostly leave these to modern compilers

  • Loop invariant code motion
  • Loop unrolling
  • Loop fusion
  • Software pipelining
  • Vectorization
  • Induction variable substitution

Obstacles for the Compiler

  • Long dependency chains
  • Excessive branching
  • Pointer aliasing
  • Complex loop logic
  • Cross-module optimization

Obstacles for the Compiler

  • Function pointers and virtual functions
  • Unexpected FP costs
  • Missed algebraic reductions
  • Lack of instruction diversity

Let’s look at a few…

Long Dependency Chains

Sometimes these can be decoupled. Ex:

// Version 0
float s = 0;
for (int i = 0; i < n; ++i)
    s += x[i];

Apparently linear dependency chain.

Long Dependency Chains

// Version 1
float ss[4] = {0, 0, 0, 0};
int i;

// Sum start of list in four independent sub-sums
for (i = 0; i < n-3; i += 4)
    for (int j = 0; j < 4; ++j)
        ss[j] += x[i+j];

// Combine sub-sums, handle trailing elements
float s = (ss[0] + ss[1]) + (ss[2] + ss[3]);
for (; i < n; ++i)
    s += x[i];

Pointer Aliasing

Why can this not vectorize easily?

void add_vecs(int n, double* result, double* a, double* b)
{
    for (int i = 0; i < n; ++i)
        result[i] = a[i] + b[i];
}

Q: What if result overlaps a or b?

Pointer Aliasing

void add_vecs(int n, double* restrict result,
    double* restrict a, double* restrict b)
{
    for (int i = 0; i < n; ++i)
        result[i] = a[i] + b[i];
}
  • C restrict promise: no overlaps in access
  • Many C++ compilers have __restrict__
  • Fortran forbids aliasing – part of why naive Fortran speed often beats naive C speed!

“Black Box” Calls

Compiler assumes arbitrary wackiness:

double foo(double* x)
{
    double y = *x;  // Load *x once
    bar();    // Assume bar is a 'black box' fn
    y += *x;  // Must reload *x after the call
    return y;
}

Floating Point

Several possible optimizations:

  • Use different precisions
  • Use more/less accurate special function routines
  • Underflow as flush-to-zero vs gradual

But these change semantics! Needs a human.

Optimization Flags

-O0 to -O3: from no optimization to aggressive optimization

  • -O2 is usually the default
  • -O3 is useful, but might break FP codes (for example)

Optimization Flags

Architectural targets

  • “Native” mode targets current architecture
  • Not always the right choice (e.g. head/compute)

Optimization Flags

Specialized flags

  • Turn on/off specific optimization features
  • Often the basic -Ox has reasonable defaults

Auto-Vectorization Reports

  • Good compilers try to vectorize for you
    • Vendors are pretty good at this
    • GCC/Clang are OK, but not as strong
  • Can get reports about what prevents vectorization
    • Not necessarily by default!
    • Helps a lot for tuning
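
For example, GCC's -fopt-info-vec-missed and Clang's -Rpass-missed=loop-vectorize report which loops failed to vectorize and why.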

Profile-Guided Optimization

Basic workflow:

  • Compile code with optimizations
  • Run in a profiler
  • Compile again, provide profiler results

Helps with branch optimization.
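
With GCC, for example: build with -fprofile-generate, run a representative workload, then rebuild with -fprofile-use.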

Data Layout Matters

“Speed-of-Light”

For compulsory misses:

\[T_{\mathrm{data}} \mbox{ (s)} \geq \frac{\mbox{data required (bytes)}}{\mbox{peak BW (bytes/s)}}\]

Possible optimizations:

  • Shrink working sets to fit in cache (pay this once)
  • Use simple unit-stride access patterns

Reality is more complicated…

When and How to Allocate

Access is not the only cost!

  • Allocation/de-allocation also costs something
  • So does GC (where supported)
  • Beware hidden allocation costs (e.g. on resize)
  • Often bites naive library users

When and How to Allocate

Two thoughts to consider:

  • Preallocation (avoid repeated alloc/free)
  • Lazy allocation (if alloc will often not be needed)

Storage Layout

Desiderata:

  • Compact (fits lots into cache)
  • Traverse with simple access patterns
  • Avoids pointer chasing

Multi-Dimensional Arrays

Two standard formats:

  • Column major (Fortran): Store columns consecutively
  • Row major (C/C++): Store rows consecutively

Ideally, traverse with unit stride! Layout affects choice.
Can use more sophisticated multi-dim array layouts…

Blocking / Tiling

Classic example: matrix multiply

  • Load \(b \times b\) block of \(A\)
  • Load \(b \times b\) block of \(B\)
  • Compute product of blocks
  • Accumulate into \(b \times b\) block of \(C\)

Have \(O(b^3)\) work for \(O(b^2)\) memory references!
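
A sketch of the corresponding loop nest, column-major as in the earlier example, assuming for simplicity that \(n\) is a multiple of the block size \(b\):

void matmul_blocked(int n, int b, const double* A,
                    const double* B, double* C)
{
    for (int j = 0; j < n; j += b)
        for (int k = 0; k < n; k += b)
            for (int i = 0; i < n; i += b)
                // C(i:i+b, j:j+b) += A(i:i+b, k:k+b) * B(k:k+b, j:j+b)
                for (int jj = j; jj < j+b; ++jj)
                    for (int kk = k; kk < k+b; ++kk)
                        for (int ii = i; ii < i+b; ++ii)
                            C[ii+jj*n] += A[ii+kk*n] * B[kk+jj*n];
}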

Alignment and Vectorization

  • Vector loads/stores are faster if aligned (e.g. starting at memory addresses that are multiples of the vector width, such as 32 bytes for 256-bit AVX)
  • Can ask for aligned blocks of memory from allocator
  • Then want aligned offsets into aligned blocks
  • Have to help compiler recognize aligned pointers!
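
A sketch of the allocation side, using C11 aligned_alloc and the GCC/Clang __builtin_assume_aligned hint (the 64-byte alignment is one typical choice):

#include <stdlib.h>

// Allocate n doubles starting on a 64-byte boundary.
// C11 requires the size to be a multiple of the alignment.
double* alloc_aligned_doubles(size_t n)
{
    size_t bytes = ((n * sizeof(double) + 63) / 64) * 64;
    return (double*) aligned_alloc(64, bytes);
}

// Tell GCC/Clang the pointer is aligned so loads can vectorize.
void scale(size_t n, double* x, double c)
{
    double* xa = (double*) __builtin_assume_aligned(x, 64);
    for (size_t i = 0; i < n; ++i)
        xa[i] *= c;
}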

Cache Conflicts

Issue: What if strided access causes conflict misses?

  • Example: Walk across row of col-major matrix
  • Example: Parallel arrays of large-power-of-2 size

Not the most common problem, but one to watch for

Structure Layouts

  • Want \(b\)-byte types on \(b\)-byte memory boundaries
  • Compiler may pad structures to enforce this
  • Arrange structure fields in decreasing size order
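
A sketch of why order matters (sizes are for a typical LP64 system; the type names are illustrative):

#include <stdint.h>

typedef struct {
    char    tag;  // 1 byte, then 7 bytes padding before d
    double  d;    // 8 bytes, wants an 8-byte boundary
    int32_t id;   // 4 bytes, then 4 bytes tail padding
} bad_t;          // typically 24 bytes

typedef struct {
    double  d;    // largest field first
    int32_t id;
    char    tag;  // 3 bytes tail padding
} good_t;         // typically 16 bytes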

SOA vs AOS

// Structure of arrays (parallel arrays)
typedef struct {
    double* x;
    double* y;
} soa_points_t;

// Array of structs
typedef struct {
    double x;
    double y;
} point_t;
typedef point_t* aos_points_t;

SOA vs AOS

SoA: Structure of Arrays

  • Friendly to vectorization
  • Poor locality to access all of one item
  • Awkward for lots of libraries and programs

SOA vs AOS

AoS: Array of Structs

  • Naturally supported default
  • Not very SIMD-friendly

Can use C++23 std::views::zip to iterate over SoA like AoS.

Copy Optimizations

Can copy between formats to accelerate, e.g.

  • Copy piece of AoS to SoA format
  • Perform vector operations on SoA data
  • Copy back out

Performance gains > copy costs?
Plays great with tiling!
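
A minimal sketch of the pattern (the function, tile size, and workload are made up for illustration; point_t is the AoS type from the earlier slide):

#include <math.h>

typedef struct { double x, y; } point_t;  // as in the AoS example

#define TILE 256

void normalize_points(int n, point_t* pts)
{
    double xs[TILE], ys[TILE];
    for (int start = 0; start < n; start += TILE) {
        int m = (n - start < TILE) ? n - start : TILE;
        for (int i = 0; i < m; ++i) {  // copy in: AoS -> SoA
            xs[i] = pts[start+i].x;
            ys[i] = pts[start+i].y;
        }
        for (int i = 0; i < m; ++i) {  // unit-stride, vectorizable
            double s = 1.0 / sqrt(xs[i]*xs[i] + ys[i]*ys[i]);
            xs[i] *= s;
            ys[i] *= s;
        }
        for (int i = 0; i < m; ++i) {  // copy out: SoA -> AoS
            pts[start+i].x = xs[i];
            pts[start+i].y = ys[i];
        }
    }
}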

For the Control Freak

Can get (some) programmer control over

  • Pre-fetching
  • Uncached memory stores

But usually best left to compiler / HW.

Summary

Strategy

  • Think some about performance before writing
  • After coding, time to identify what needs tuning
  • Tune data layouts and access patterns together
  • Work with compiler on low-level optimizations