Applications of Parallel Computers
Single-Core Architecture
Prof David Bindel
Please click the play button below.
Welcome to another CS 5220 lecture! Today's topic is single-core
architecture, with an emphasis on two features that are
particularly relevant to HPC: instruction-level parallelism and
the memory hierarchy.
Just for fun
Let's start with something fun. This is an Nvidia promotional
thing that the MythBusters guys did a couple years ago.
Is this fair?
See: “Should I port my code to a GPU?”
I told you it would be fun, right? Unfortunately, this is not a
fair representation of what happens with well-tuned codes. See
Rich Vuduc's talk, linked. GPUs do have an edge, but a lot
of the advantage people attribute to GPUs comes from them taking
more care in writing fast GPU code than they took in writing
fast CPU code.
The real world
The video clip leans on some of the stereotypes we talked about
last time: namely, the idea that p processors give you a p-fold
speedup. Well, we'll get to the parallel case. But for today,
we're talking about serial architecture. And people make equally
naive assumptions about the serial case.
The idealized machine
Address space of named words
Basic ops: register read/write, logic, arithmetic
Everything runs in the program order
High-level language \(\rightarrow\) “obvious” machine code
All operations take about the same amount of time
The problem isn't so much that people are naïve. It's that when we
tell people about computers in introductory programming classes,
we tell them lies. We tell them that memory is an address space
of named words laid out in linear order. Basic operations like
register reading, logic, and arithmetic take about the same amount
of time as each other. Everything runs in program order, and
high-level language statements translate to "obvious" machine
code. But none of this is really true. These abstractions are
adequate for thinking about correctness, but they fail to predict
performance.
The real world
Memory operations are not all the same!
In the real world, memory is not all the same. The one
dimensional address space we think of is an abstraction layered on
much more complicated hardware. Different parts of the memory
hardware have different speeds, and memory layout dramatically
affects performance.
The real world
Instructions are non-obvious!
Pipelining allows instructions to overlap
Functional units run in parallel (and out of order)
Instructions take different amounts of time
Cost depends on order, instruction mix
In the real world, instructions don't all take the same amount
of time, they don't all run in program order, and they run
willy-nilly on top of each other. Well, not willy-nilly. We
need rules to provide programmers the illusion of the idealized
machine. But the reality is a complicated mess, and that shows
up when we think about performance.
The real world
Our goal:
enough understanding to help the compiler out.
So, our idealized picture is too simple, and the real world is a
complicated mess. What are we going to do? Fortunately, we don't
need to understand the whole complicated mess to write
high-performance code. We just need to know enough to manage the
high-level design decisions that matter, and then we can try to
hand off the low-level details to the compiler.
Prelude
You shouldn't be surprised that I'm climbing on a soapbox before
we get to the technical stuff.
Prelude
We hold these truths to be self-evident:
One should not sacrifice correctness for speed
One should not re-invent (or re-tune) the wheel
Your time matters more than computer time
Some things are obvious. First, getting the wrong answer fast is
not success. Second, torturing yourself to do something that somebody
already has done better is not productive. Third, your time matters
more than the computer time does.
Prelude
Less obvious, but still true:
Most of the time goes to a few bottlenecks
The bottlenecks are hard to find without measuring
Communication is expensive (often a bottleneck)
Maybe less obvious but still true: most of your code does not
matter to performance. There are probably a few bottleneck
sections that account for most of the time. You really don't
know where those bottlenecks are in advance; you need to
measure. Often, though, the bottlenecks are associated with
communication, which costs a lot more than computation.
Prelude
A little good hygiene will save your sanity
Automate testing
Time carefully
Use version control
Finally, when you tune your code to run fast on modern
architectures, your code may get more complicated, or at least
change a lot. Don't spend all your time debugging problems you
introduce during tuning. Start with good test cases. Start with
good timing. And make sure you use version control, so you can
roll back to a working version when you break something.
All right. Climbing off my soap box.
A sketch of reality
Today, a play in two acts:
One core is not so serial
Memory matters
Now, on to the technical meat of the lecture. With apologies
to This American Life, today is a play in two acts. Act 1: One
Core is Not So Serial, in which we find out what parallelism lurks
in a single core. And Act 2: Memory Matters, in which we delve into
some of the messy details in the memory subsystem.
Act 1
One core is not so serial.
Let's start by talking about all the parallelism in a single core.
Parallel processing at the laundromat
Three stages to laundry: wash, dry, fold.
Three loads: darks, lights, underwear
How long will this take?
We'll start with pipelining. I love an overused metaphor, so
let's go with one of the most overused: pipelining as
laundry. There are three stages to laundry: washing, drying, and
folding. Suppose for simplicity that each takes an hour; I'm a
slow folder, I guess. And suppose I have three loads: darks,
lights, and underwear. How long will it take me to do my
laundry?
Parallel processing at the laundromat
Serial version:
hours 1–3: darks (wash, dry, fold)
hours 4–6: lights (wash, dry, fold)
hours 7–9: underwear (wash, dry, fold)
The strawman approach, used by nobody who has ever done laundry
in practice, is to do each load of laundry in sequence: first
darks, then lights, then underwear. Each load takes three hours;
therefore, it's nine hours in total.
Parallel processing at the laundromat
Pipeline version:
hour 1: wash darks
hour 2: dry darks, wash lights
hour 3: fold darks, dry lights, wash underwear
hour 4: fold lights, dry underwear
hour 5: fold underwear
(Dinner? Cat videos? Gym and tanning?)
The smarter approach, probably used by most of you, is to make
sure the washer, drier, and folding table are all used at once, as
much as possible. This is called the laundry pipeline. Of
course, we won't be using the folding table at the beginning, or
the washer at the very end; it takes time to start and drain the
pipeline. But still, by overlapping the loads of laundry we
take only five hours, and have plenty of time left for cat
videos or dinner.
Pipelining
Pipelining works for instructions as well as for laundry. Each
instruction, or load of laundry, takes the same time to complete
with or without the pipeline; the latency does not change. But
by overlapping the instructions or the laundry, we can improve
the throughput, or completion rate per unit time. With enough
instructions, the potential speedup is equal to the number of
overlapping stages, though with fewer instructions we care about
the time to start and drain the pipeline.
Unfortunately, pipelining requires a systematic pattern. If we
are not sure what stage comes next, perhaps because of a branch
in our code, we cannot take advantage of the pipeline. We have
to introduce a so-called bubble, and that reduces our effective
throughput.
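To make that concrete in code, here is a minimal sketch (my own illustration,
not part of the lecture materials) of how a dependence chain limits pipelining
when summing an array, and how independent accumulators expose work that can
overlap. The function names and the unrolling factor of four are arbitrary
choices.

    #include <stddef.h>

    /* One accumulator: every add waits on the previous add, so the
       floating-point pipeline spends most of its time stalled on a
       single dependence chain. */
    double sum_simple(const double* x, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i)
            s += x[i];
        return s;
    }

    /* Four independent accumulators: the four running sums do not
       depend on each other, so their adds can be in flight in the
       pipeline at the same time.  (Assumes n is a multiple of 4 to
       keep the sketch short.) */
    double sum_unrolled(const double* x, size_t n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (size_t i = 0; i < n; i += 4) {
            s0 += x[i+0];
            s1 += x[i+1];
            s2 += x[i+2];
            s3 += x[i+3];
        }
        return (s0 + s1) + (s2 + s3);
    }

Note that the two versions may round differently, which is one reason a
compiler will not reassociate the sum for you unless you allow it to (for
example with GCC's -ffast-math).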
Pipelining
Different pipelines for different functional units
Modern processors actually have several different pieces, each
with their own pipelines. These functional units do simple
operations like floating point addition or multiplication. Some
functional units may not be pipelined (like dividers).
SIMD
Single Instruction, Multiple Data
Cray-1 (1976): 8 registers \(\times\) 64 words of 64 bits each
Resurgence in mid-late 90s (for graphics)
Now short vectors (256-512 bit) are ubiquitous
Pipelining is one way to get parallelism. Another is SIMD or
vector instructions. These have a long history, going back to
before I was born. When the old vector supercomputers gave way
to networked commodity systems, people focused on other things
for a while. But then there was a resurgence of interest in
short vector ops in the late 90s, mostly for graphics. After
that, vector operations became part of the computer architecture
mainstream. These instructions are ubiquitous now.
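For a peek at what short-vector code looks like, here is a hedged sketch of an
array sum written directly with AVX intrinsics (256-bit registers holding four
doubles). In practice the compiler will usually auto-vectorize the plain loop,
so treat this as illustration rather than recommendation.

    #include <immintrin.h>   /* AVX intrinsics; compile with -mavx or -march=native */
    #include <stddef.h>

    /* Sum n doubles four at a time in a 256-bit vector register.
       Assumes n is a multiple of 4; real code would handle the tail. */
    double sum_avx(const double* x, size_t n)
    {
        __m256d acc = _mm256_setzero_pd();
        for (size_t i = 0; i < n; i += 4)
            acc = _mm256_add_pd(acc, _mm256_loadu_pd(x + i));

        double lanes[4];
        _mm256_storeu_pd(lanes, acc);   /* spill the four vector lanes */
        return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    }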
Wide front-end
Fetch/decode or retire multiple ops at once
Modern processors can also start or finish multiple operations
within a single cycle. The operations usually have to use
different parts of the processor, so a mix of instructions is
needed. Making everything more complicated, the instructions you
write are not the instructions the processor ultimately sees.
The front end translates your x86 instructions into an internal
micro-op format that nobody but Intel knows about.
Unless you aren't using an Intel chip, of course.
Hyperthreading
Support multiple HW threads / core
Independent registers, program counter
Shared functional units
Helps feed core independent work
One program might not give us all the different types of
instructions needed to efficiently use a core. For that reason,
hyperthreading was introduced. Hyperthreading gives us the
illusion of two cores by providing multiple program counters and
register sets that sit in front of a common set of functional
unit resources. The idea is that if one program is only using
some of the functional units, another program might make use of
the rest.
Out-of-order execution
Internally reorder operations
Have to commit instructions in order
May throw away uncommitted results
(speculative execution)
Limited by data dependencies
Each program looks like it's executed in sequential
order. That's a fib. Behind the scenes, instructions coming into
the functional units are scheduled in whatever order makes sense,
and are completed (or committed) in sequential order.
Sometimes, we might start running an instruction to keep a
functional unit busy, only to decide later that the result
should not be committed. This is the idea of speculative
execution.
We can reorder instructions any way we want if all the
instructions are independent. Usually, though, our out-of-order
scheduler is limited by data dependencies.
All together, now...
Front-end reads several ops at once
Ops may act on vectors (SIMD)
Break into mystery micro-ops (and cache)
Out-of-order scheduling to functional units
Pipelining within functional units
In-order commit of finished ops
Can discard before commit (speculative execution)
Putting everything together: modern processor can read several
operations at once. Each operation might act on a full vector of
data. We break those operations into mystery micro operations
and cache them. Within the chip, an out-of-order scheduling unit
dispatches the micro operations to functional units. The
functional units are pipelined and can execute multiple
instructions simultaneously. At the end, we commit results in
order, to retain the illusion of sequential execution. To keep
the functional units from being left idle, we might start work
that we don't know we will need, deciding whether to complete
the instruction or not at commit time. This is called
speculative execution.
Does this sound complicated? It should!
Punchline
Compiler understands CPU in principle
Rearranges instructions to get a good mix
Tries to make use of FMAs, SIMD instructions, etc
The compiler understands the low-level details about instruction
mixes and vector operations. At least, it should. People have
been talking about sufficiently smart compilers for as long as
I've been around, but the compilers are not always as smart as
I'd like. Still, it's best to leave the low level details
to the compiler if we can.
Punchline
Needs help in practice:
Set optimization flags, pragmas, etc
Make code obvious and predictable
Expose local independent work
Use special intrinsics or library routines
Data layouts, algorithms to suit machine
The compiler won't understand high-level algorithmic changes
that might map well to the architecture of a modern machine.
The compiler does not need our help scheduling instructions. It
does need our help figuring out the layouts and algorithms that
suit our machines. We can also help it by calling special
intrinsics or library routines, and by making our code as
obviously parallelizable as possible, maximizing local
independent work and minimizing unpredictable branches.
And we can play with different
optimization flags for cases when the right approach is neither
obvious to the compiler nor to us.
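As one small, hedged example of what this help looks like, consider an
axpy-style loop: the restrict qualifiers promise the compiler that the arrays
don't alias, the OpenMP simd pragma asserts that the iterations are
independent, and flags like -O3 -march=native (plus -fopenmp-simd with GCC)
invite it to use FMA and SIMD instructions. The function name and flag choices
here are just illustrative.

    #include <stddef.h>

    /* y <- y + a*x.  The restrict qualifiers and the simd pragma give
       the compiler the independence guarantees it needs to vectorize
       and pipeline this loop aggressively. */
    void axpy(size_t n, double a,
              const double* restrict x, double* restrict y)
    {
        #pragma omp simd
        for (size_t i = 0; i < n; ++i)
            y[i] += a * x[i];
    }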
Punchline
The goal:
You handle high-level optimization
Compiler handles low-level stuff
The end goal to all of this is that you should let the compiler
deal with the low-level details as much as possible. Your job
is to lay out your algorithms and data structures so that the
compiler and the processor can best do their jobs.
Act 2
Memory matters.
Now, let's move on to act two: the memory hierarchy.
Basic problem
Memory latency = how long to get a requested item
Memory bandwidth = steady-state rate
Bandwidth improving faster than latency
Inverse bandwidth (time per byte) remains worse than time per flop
Last time, we talked about the difficulty of slow memory. There
are two important numbers here: latency, or the time to get the
first item from memory; and bandwidth, or the steady state rate at which
items are delivered. Bandwidth is improving faster than
latency, but the inverse bandwidth (time per byte) remains
consistently worse than the time per flop.
My machine
Theoretical peak flop rate: 51.2 GFlop/s (w/o turbo)
Peak memory bandwidth: 31.79 GB/s (2 banks)
Arithmetic intensity = flops / memory accesses
Example: Sum several million doubles (AI = 1)?
So what can we do?
The arithmetic intensity of a code is the ratio of flops to
memory accesses. For example, a dot product has arithmetic
intensity one: 2n operations on 2n data items.
What does this imply? Consider my laptop, with a theoretical
peak flop rate of 51 GFlop/s and peak memory bandwidth of about
32 GB/s. At arithmetic intensity 1, we are limited by bandwidth
to about 4 GFlop/s, much less than the peak. And matters are worse
if we have lots of latency costs in addition to the bandwidth costs.
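To pin down the arithmetic, here is the dot product with the
back-of-the-envelope bound written out in the comments, using the roughly
32 GB/s and 51.2 GFlop/s numbers from the slide.

    #include <stddef.h>

    /* Dot product: 2n flops (n multiplies + n adds) on 2n doubles,
       i.e. 16n bytes of traffic if nothing stays in cache.  That is
       1/8 flop per byte, so ~32 GB/s of bandwidth caps the rate near
       32/8 = 4 GFlop/s, far below the 51.2 GFlop/s peak. */
    double dot(const double* x, const double* y, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i)
            s += x[i] * y[i];
        return s;
    }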
Locality
Programs usually have locality
Cache hierarchy built to use locality.
Fortunately, we can get around the long latency and low
bandwidth of main memory by taking advantage of locality in our
programs. Two types of locality of access matter
to us: spatial locality, or the tendency to access things close
to each other in memory at around the same time; and temporal
locality, or the tendency to re-use the same piece of data
repeatedly in a short period.
Modern machines introduce a set of small, fast memories called
caches in order to speed up average memory access times. Caches
are designed to take advantage of spatial and temporal locality
in our codes. One implication of this is that if our codes
don't exhibit temporal or spatial locality, maybe they should;
it will let them run faster!
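As a small sketch of what the two kinds of locality look like in code,
consider summing an n-by-n matrix stored in row-major order. The two
traversals below do exactly the same arithmetic but see very different memory
behavior once the matrix is larger than cache; the function names are mine.

    #include <stddef.h>

    /* Row-wise traversal: consecutive accesses are adjacent in memory
       (stride 1), so every cache line loaded is fully used. */
    double sum_rowwise(const double* a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j)
                s += a[i*n + j];
        return s;
    }

    /* Column-wise traversal: consecutive accesses are n doubles apart,
       so for large n each access touches a new cache line and most of
       each loaded line goes unused before it is evicted. */
    double sum_colwise(const double* a, size_t n)
    {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j)
            for (size_t i = 0; i < n; ++i)
                s += a[i*n + j];
        return s;
    }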
How caches help
This is mostly automatic and implicit.
Caches help us in several ways. First, they let us re-use data
loads; that is, they help us exploit temporal locality. Second,
they let us load data in big chunks instead of one byte at a
time, reducing the number of main memory latencies we suffer.
Data is typically loaded one cache line at a time, and features
like pre-fetching let the memory system grab a bunch of data
from memory without involving the processor.
For the most part, programmers do not directly control the
cache. Rather, what is cached or evicted (kicked out of cache)
depends on the memory access pattern that the hardware sees.
Cache basics
Store cache line s of several bytes
Cache hit when copy of needed data in cache
Cache miss otherwise. Three basic types:
Compulsory miss: never used this data before
Capacity miss: filled the cache with other things since this was last used – working set too big
Conflict miss: insufficient associativity for access pattern
We say a memory read hits the cache if we can find the data we
need in cache. Otherwise, we have a cache miss. There are
three basic types of misses. Compulsory misses, also called
cold-start misses, happen the first time that we touch a piece
of data. Capacity misses happen because we don't have room for
all the data we want, so something gets evicted that we need
again later. Conflict misses happen because of insufficient
associativity; though to make sense of that, we first need to
say what associativity is.
Cache associativity
Where can data go in cache?
Direct-mapped: each address can only go in one cache location (e.g. store address xxxx1101 only at cache location 1101)
\(n\) -way: each address can go into one of \(n\) possible cache locations (store up to 16 words with addresses xxxx1101 at cache location 1101).
Higher associativity is more expensive.
Associativity has to do with where we can cache a given address
from main memory. In a fully associative cache, we can put the
data anywhere in cache that we want. That gives us the most
possible flexibility, but the hardware is expensive. In a
direct-mapped cache, each address in main memory can only be
stored in one location in the cache. In an n-way set associative
cache, each address in main memory can be stored in one of n
locations in the cache. Usually, the last few bits of the
address are used to determine the cache line or set of cache
lines where a given chunk of data can be stored.
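Here is a hedged sketch of that index arithmetic with made-up but plausible
parameters: 64-byte lines and 64 sets, which is what a 32 KB, 8-way L1 works
out to. The low address bits pick the byte within a line, and the next bits
pick the set.

    #include <stdint.h>
    #include <stdio.h>

    enum { LINE_BYTES = 64, NUM_SETS = 64 };   /* 32 KB / 64 B / 8 ways */

    /* Which set can a given address live in? */
    unsigned cache_set(uintptr_t addr)
    {
        return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
    }

    int main(void)
    {
        /* Addresses 4096 bytes apart map to the same set, which is why
           large power-of-two strides can cause conflict misses even
           when the total amount of data touched is small. */
        printf("%u %u %u\n",
               cache_set(0x0000), cache_set(0x1000), cache_set(0x2000));
        return 0;
    }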
Teaser
We have \(N = 10^6\) two-dimensional coordinates, and want their centroid. Which of these is faster and why?
Store an array of \((x_i, y_i)\) coordinates. Loop \(i\) and simultaneously sum the \(x_i\) and the \(y_i\) .
Store an array of \((x_i, y_i)\) coordinates. Loop \(i\) and sum the \(x_i\) , then sum the \(y_i\) in a separate loop.
Store the \(x_i\) in one array, the \(y_i\) in a second array. Sum the \(x_i\) , then sum the \(y_i\) .
All right. Let's see how much we understand. Suppose I have a
million points in the plane, and I want their centroid. Which
of these ways of computing the centroid is fastest, and why? I
suggest trying it out for yourself. We'll discuss what you find
at the next meeting.
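If you do want to try it, here is one way the three variants might be set up;
this is just a sketch with my own function names, and it deliberately does not
tell you which variant wins.

    #include <stddef.h>

    typedef struct { double x, y; } point;   /* interleaved (x, y) layout */

    /* Variant 1: one pass over the interleaved pairs, both sums at once. */
    void centroid_aos_onepass(const point* p, size_t n, double* cx, double* cy)
    {
        double sx = 0.0, sy = 0.0;
        for (size_t i = 0; i < n; ++i) { sx += p[i].x; sy += p[i].y; }
        *cx = sx / n;  *cy = sy / n;
    }

    /* Variant 2: two passes over the interleaved pairs. */
    void centroid_aos_twopass(const point* p, size_t n, double* cx, double* cy)
    {
        double sx = 0.0, sy = 0.0;
        for (size_t i = 0; i < n; ++i) sx += p[i].x;
        for (size_t i = 0; i < n; ++i) sy += p[i].y;
        *cx = sx / n;  *cy = sy / n;
    }

    /* Variant 3: separate x and y arrays (struct-of-arrays layout). */
    void centroid_soa(const double* x, const double* y, size_t n,
                      double* cx, double* cy)
    {
        double sx = 0.0, sy = 0.0;
        for (size_t i = 0; i < n; ++i) sx += x[i];
        for (size_t i = 0; i < n; ++i) sy += y[i];
        *cx = sx / n;  *cy = sy / n;
    }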
Caches on my laptop
32 KB L1 data and instruction caches (per core),
8-way associative
256 KB L2 cache (per core),
4-way associative
2 MB L3 cache (per core),
12-way associative
What do caches look like in practice? On my Macbook Air, there
are three levels of cache. Fastest are the two 32 KB L1 caches,
one for code and one for data on each core. Then there is a 256
KB L2 cache and a 2 MB L3 cache. We have set associativity at
each level: 8-way for L1, 4-way for L2, and 12-way for L3.
All the caches are arranged into 64B lines.
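If you want the corresponding numbers for your own machine, one quick route is
sysconf. This sketch assumes Linux with glibc; the _SC_ names below are glibc
extensions, may report 0 on some systems, and do not exist on macOS, where
sysctl is the usual tool.

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        printf("L1d size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L1d line: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
        printf("L2 size:  %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
        printf("L3 size:  %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
        return 0;
    }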
A memory benchmark (membench )
for array A of length L from 4 KB to 8 MB by 2x
  for stride s from 4 bytes to L/2 by 2x
    time the following loop
      for i = 0 to L by s
        load A[i] from memory
A useful way to see the effects of the memory system is with a
simple benchmark that repeatedly accesses different numbers of
array entries at different strides (distances apart in memory).
The stride is relevant to spatial locality, as the last few
bits of the address
are often used to determine the set of cache lines where data
can be stored. So accessing with a stride that is a large
power of two causes us to use the same few sets of cache lines
over and over. The number of locations we access is important
to temporal locality: more locations means more pressure on the
cache system.
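Here is a stripped-down sketch of the idea; it is not the linked membench
code, which handles timing and repetition more carefully, and it counts length
and stride in array elements rather than bytes. It uses the OpenMP wall-clock
timer, so compile with -fopenmp (GCC).

    #include <omp.h>   /* omp_get_wtime() */

    /* Time reps passes of strided reads over the first len ints of a,
       and return the average time per access in seconds.  The volatile
       qualifier keeps the compiler from optimizing the loads away. */
    double time_strided(volatile int* a, long len, long stride, long reps)
    {
        long accesses_per_pass = (len + stride - 1) / stride;
        double t0 = omp_get_wtime();
        for (long r = 0; r < reps; ++r)
            for (long i = 0; i < len; i += stride)
                (void)a[i];
        double t1 = omp_get_wtime();
        return (t1 - t0) / ((double)reps * accesses_per_pass);
    }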
I've linked to a repository for the membench code from the
slide. I suggest trying to build it on whatever machine you
plan to use for development. I used the OpenMP timing routines,
so it is not necessarily trivial to build everywhere. On a Mac,
for example, you need to be sure to use GCC rather than Clang,
since the Clang compiler in Xcode doesn't support OpenMP.
Anyhow, give it a try, and post on Piazza if you run into trouble.
membench on my laptop
Raw timings (CSV)
The membench results are usually presented like this: one line
for each overall size, the stride on the x axis, and the time
in nanoseconds on the y axis. You can see the graph has a lot
of structure. Maybe those of you with good color vision can
suss out what is going on in detail. For myself I decided a
couple years ago that I prefer a different visualization.
membench on my laptop
Raw timings (CSV)
This is a heatmap picture of the same membench results. The
vertical axis represents array size, and the horizontal
represents the stride. The color represents the observed
latency. Maybe I could use better colors, but let me tell you
what I see here.
Latencies remain lower for strides less than 2^5. That is
partly because a cache line on this machine is 64 bytes, so short
strides result in multiple hits per line loaded.
There are also three diagonals near the edge that remain low
latency. This is because near the diagonal, we have a small
working set. If we only ever read eight or fewer elements, then
it is fine if they all land on the same eight-line set within
the cache.
Each core has a 2 MB L3 cache, and nothing goes too badly wrong
as long as all the data fits in at least one of the caches. We can also
faintly see a change in color at the vertical line corresponding
to 2^18 (256K), which is the L2 cache size.
We can see another diagonal-ish pattern about ten diagonals in.
It turns out that there is another part of the memory system, a
cache called the translation lookaside buffer (TLB). It has 512
entries, each corresponding to a 4K page. Missing in the TLB is
rather expensive, too.
Note that the ballpark estimate of 100 ns to go to main memory
seems like it is probably pessimistic. The worst memory times
I see on this machine, at least in this plot, are more like 30
ns. That is still an overhead I'd rather not pay too often.
membench on my laptop
Vertical: 64B line size (\(2^5\) ), 4K page size (\(2^{12}\) )
Horizontal: 32K L1 (\(2^{15}\)), 256K L2 (\(2^{18}\)), 2 MB L3 (\(2^{21}\))
Diagonal: 8-way cache associativity, 512 entry L2 TLB
OK, here are the numbers for my machine, but without the graph.
Really, though, I suggest that you download and run membench on
your own machine, and see what you can see about your own
memory system. If you do something better with the
visualization, tell me! I expect there is still a better way to
plot this data.
The moral
Even for simple programs, performance is a complicated function of architecture!
Need to know a little to write fast programs
Want simple models to understand efficiency
Want tricks to help design fast codes
Example: blocking (also called tiling )
So what is the moral of this whirlwind tour of architecture? I
think it is this: take care, but do not despair. The landscape is
complicated, but you only need to know the general contours of
that landscape to be able to write fast code. Design patterns
like tiling, which we will discuss in the next few lectures, can
help us use the memory subsystem efficiently. And to get the most
instruction-level parallelism for our buck, we just need to write
the computation in a way that exposes lots of independent work that the
compiler (and the processor) can then schedule efficiently.
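As a tiny preview of the blocking idea, here is a hedged sketch of a blocked
matrix transpose; we will treat blocking properly in the coming lectures. The
tile size B is a tuning parameter, and the value 64 below is only a
placeholder.

    #include <stddef.h>

    #define B 64   /* tile size: a placeholder, meant to be tuned */

    /* Transpose an n-by-n row-major matrix tile by tile.  Within one
       B-by-B tile, the pieces of A and At being touched stay in a
       nearby cache level, so loaded lines get reused instead of being
       evicted between visits. */
    void transpose_blocked(double* restrict At, const double* restrict A,
                           size_t n)
    {
        for (size_t ii = 0; ii < n; ii += B)
            for (size_t jj = 0; jj < n; jj += B)
                for (size_t i = ii; i < ii + B && i < n; ++i)
                    for (size_t j = jj; j < jj + B && j < n; ++j)
                        At[j*n + i] = A[i*n + j];
    }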