Want good instruction mixes, independent operations
Want vectorizable operations
Communication (including with memory) is slow
Caches provide intermediate cost/capacity points
Designed for spatial and temporal locality
(Trans)portable Performance
Details have orders-of-magnitude impacts
But systems differ in micro-arch, caches, etc.
Want transportable performance across HW
Need principles for high-perf code (+ tricks)
Principles
Think before you write
Time before you tune
Stand on shoulders of giants
Help your tools help you
Tune your data structures
Think Before You Write
Premature Optimization
We should forget about small efficiencies, say 97% of the time: premature optimization is the root of all evil.
- Knuth, Structured programming with go to statements, Computing Surveys 6(4), 1974.
Beware asymptotic complexity analysis for small \(n\)!
Avoid Work
Asymptotic complexity is not everything, but:
Quicksort beats bubble sort for modest \(n\)
Counting sort even faster for modest key space (sketch below)
No time at all if data is already sorted!
Pick algorithmic approaches thoughtfully.
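For instance, a minimal counting sort sketch (assuming each key lies in a small range [0, nkey); the function name is illustrative):

#include <string.h>

// Counting sort: O(n + nkey) time, no comparisons.
// Assumes every keys[i] lies in [0, nkey).
void counting_sort(int n, int nkey, const int* keys, int* sorted)
{
    int counts[nkey];                      // VLA; fine for modest nkey
    memset(counts, 0, nkey * sizeof(int));
    for (int i = 0; i < n; ++i)            // Histogram the keys
        ++counts[keys[i]];
    int j = 0;                             // Emit each key counts[k] times
    for (int k = 0; k < nkey; ++k)
        for (int c = 0; c < counts[k]; ++c)
            sorted[j++] = k;
}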
Be Cheap
Our motto: Fast enough, right enough
Want: time saved in compute \(\gg\) time taken in tuning
Your time costs more than compute cycles
No shame in a slow workhorse that gets the job done
Maybe an approximation is good enough?
Depends on application context
Answer usually requires error analysis, too
Do More with Less (Data)
Want lots of work relative to data loads:
Keep data compact to fit in cache
Short data types for better vectorization (see sketch below)
But be aware of tradeoffs!
For integers: May want 64-bit ints sometimes!
For floating point: More in other lectures
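As a rough illustration (a sketch, not a benchmark; dot_f is a made-up name): with 256-bit vector registers, the float loop below admits 8 lanes per vector operation versus 4 for a double version, and moves half the bytes through the cache hierarchy.

// Same loop in single precision: twice the lanes, half the traffic
// of a double version (at the cost of precision -- a tradeoff!).
float dot_f(int n, const float* x, const float* y)
{
    float s = 0;
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}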
Remember the I/O
Example: Explicit PDE time stepper on \(256^2\) mesh
0.25 MB per frame (three fit in L3 cache)
Constant work per element (a few flops)
Time to write to disk \(\approx\) 5 ms
If I write once every 100 frames, how much time is I/O?
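One back-of-envelope answer, assuming (made-up numbers) 4 flops per element at \(10^9\) flop/s: each frame costs \(256^2 \times 4 \approx 2.6 \times 10^5\) flops, or about 0.26 ms, so 100 frames take roughly 26 ms of compute against 5 ms of I/O. I/O is then \(5/31 \approx 16\%\) of the total.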
Time Before You Tune
Back to Knuth
It is often a mistake to make a priori judgements about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail.
- Knuth, Structured programming with go to statements, Computing Surveys 6(4), 1974.
Hot Spots and Bottlenecks
Often a little bit of code takes most of the time
Usually called a “hot spot” or bottleneck
Goal: Find and remove (“de-slugging”)
Practical Timing
Things to consider:
Want high-resolution timers
Wall-clock time vs CPU time
Size of data collected vs how informative it is
Cross-interference with other tasks
Cache warm-start on repeated timings
Overlooked issues from too-small timings
Manual Instrumentation
Basic picture:
Identify stretch of code to be timed
Run several times with “characteristic” data
Accumulate time spent
Caveats: Effects from repetition, “characteristic” data
Manual Instrumentation
It used to be hard to get portable high-resolution wall-clock time!
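These days, POSIX clock_gettime is one widely available option. A minimal sketch (NREPS and work are placeholders for the repetition count and the stretch of code under test):

#include <stdio.h>
#include <time.h>

#define NREPS 100   // Repetition count (placeholder)
void work(void);    // Stretch of code to be timed (placeholder)

void time_work(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);  // Wall-clock, not CPU time
    for (int rep = 0; rep < NREPS; ++rep)
        work();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("Mean time per rep: %g s\n", elapsed / NREPS);
}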
Stand on Shoulders of Giants
Level 3 BLAS are key for high-performance, transportable linear algebra.
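For instance, a single CBLAS call (assuming a linked BLAS implementation; gemm_example is an illustrative wrapper) gives a tuned matrix-matrix multiply without writing the loops yourself:

#include <cblas.h>

// C = A*B via the level 3 BLAS (row-major; A is m-by-k, B is k-by-n).
void gemm_example(int m, int n, int k,
                  const double* A, const double* B, double* C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, k, B, n, 0.0, C, n);
}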
Other Common Kernels
Apply sparse matrix (or sparse matrix powers)
Compute an FFT
Sort an array
Kernel Tradeoffs
Critical to get properly tuned kernels
Interface is consistent across HW types
Implementation varies by architecture
General kernels may leave performance on table
Ex: General matrix ops for structured matrices
Overheads may be an issue for small \(n\) cases
Kernel Tradeoffs
Building on kernel functionality is not perfect –
But: Ideally, someone else writes the kernel!
(Or it may be automatically tuned)
Help Tools Help You
How Can the Compiler Help?
In decreasing order of effectiveness:
Local optimization
Especially restricted to a “basic block”
More generally, in “simple” functions
Loop optimizations
Global (cross-function) optimizations
Local Optimizations
Register allocation: compiler > human
Instruction scheduling: compiler > human
Branch joins and jump elim: compiler > human?
Constant folding and propagation: humans OK
Common subexpression elimination: humans OK
Algebraic reductions: humans definitely help
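For example, here is an algebraic reduction a compiler will not make by default, because it changes rounding (a sketch; the function names are illustrative):

void scale_div(int n, double c, double* x)
{
    for (int i = 0; i < n; ++i)
        x[i] /= c;               // One (slow) division per element
}

void scale_mul(int n, double c, double* x)
{
    double ci = 1.0 / c;         // Human accepts the rounding change
    for (int i = 0; i < n; ++i)
        x[i] *= ci;              // Multiplies are far cheaper
}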
Loop Optimization
Mostly leave these to modern compilers
Loop invariant code motion
Loop unrolling
Loop fusion (sketch below)
Software pipelining
Vectorization
Induction variable substitution
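As a sketch of one of these, loop fusion (illustrative names; modern compilers can often do this themselves):

// Two passes over the data...
void update0(int n, double* x, double* y)
{
    for (int i = 0; i < n; ++i) x[i] += 1.0;
    for (int i = 0; i < n; ++i) y[i] += x[i];
}

// ...fused into one pass: x[i] is still in register when y[i] needs it.
void update1(int n, double* x, double* y)
{
    for (int i = 0; i < n; ++i) {
        x[i] += 1.0;
        y[i] += x[i];
    }
}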
Obstacles for the Compiler
Long dependency chains
Excessive branching
Pointer aliasing
Complex loop logic
Cross-module optimization
Obstacles for the Compiler
Function pointers and virtual functions
Unexpected FP costs
Missed algebraic reductions
Lack of instruction diversity
Let’s look at a few…
Long Dependency Chains
Sometimes these can be decoupled. Ex:
// Version 0
float s = 0;
for (int i = 0; i < n; ++i)
    s += x[i];
Apparently linear dependency chain.
Long Dependency Chains
// Version 1
float ss[4] = {0, 0, 0, 0};
int i;
// Sum start of list in four independent sub-sums
for (i = 0; i < n-3; i += 4)
    for (int j = 0; j < 4; ++j)
        ss[j] += x[i+j];
// Combine sub-sums, handle trailing elements
float s = (ss[0] + ss[1]) + (ss[2] + ss[3]);
for (; i < n; ++i)
    s += x[i];
Pointer Aliasing
Why can this not vectorize easily?
void add_vecs(int n, double* result, double* a, double* b)
{
    for (int i = 0; i < n; ++i)
        result[i] = a[i] + b[i];
}
Q: What if result overlaps a or b?
Pointer Aliasing
void add_vecs(int n, double* restrict result,
              double* restrict a, double* restrict b)
{
    for (int i = 0; i < n; ++i)
        result[i] = a[i] + b[i];
}
C restrict promise: no overlaps in access
Many C++ compilers have __restrict__
Fortran forbids aliasing – part of why naive Fortran speed often beats naive C speed!
“Black Box” Calls
Compiler assumes arbitrary wackiness:
double foo(double* restrict x)
{
    double y = *x; // Load x once
    bar();         // Assume bar is a 'black box' fn
    y += *x;       // Must reload x
    return y;
}
Floating Point
Several possible optimizations:
Use different precisions (sketch below)
Use more/less accurate special function routines
Underflow as flush-to-zero vs gradual
But these change semantics! Needs a human.
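For instance, a sketch of the precision point (rms_f is a made-up name): calling sqrt on float data silently promotes to double, while sqrtf keeps the whole computation in single precision.

#include <math.h>

float rms_f(int n, const float* x)
{
    float s = 0;
    for (int i = 0; i < n; ++i)
        s += x[i] * x[i];
    return sqrtf(s / n);   // sqrtf, not sqrt: no promotion to double
}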
Optimization Flags
-O0 to -O3: from no optimization to aggressive optimization
-O2 is usually the default
-O3 is useful, but might break FP codes (for example)
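For instance, with a GCC-style driver (flag spellings vary by compiler):

gcc -O2 -c kernel.c              # the usual safe default
gcc -O3 -ffast-math -c kernel.c  # faster, but relaxes IEEE FP semantics

-ffast-math is a typical culprit: it lets the compiler reorder and reassociate floating point operations, which can change results.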