Parallel HW and Models
2024-09-05
c4-standard-2 (vs e2)
Log-log plot showing memory/compute bottlenecks.
Roofline: An Insightful Visual Performance Model for Multicore Architectures, Communications of the ACM, 2009, 52(4).
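As a rough illustration of what the roofline plot encodes: attainable performance is bounded by the smaller of peak compute rate and memory bandwidth times arithmetic intensity. The sketch below uses a made-up machine (100 GFLOP/s peak, 20 GB/s bandwidth); the function name and the numbers are assumptions for illustration, not measurements of any real system.

```python
def roofline_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s at a given arithmetic intensity (flop/byte)."""
    return min(peak_gflops, bandwidth_gbs * intensity)

# Hypothetical machine parameters (assumed, not measured).
PEAK = 100.0  # GFLOP/s
BW = 20.0     # GB/s
ridge = PEAK / BW  # intensity at which the two bounds meet

for ai in (0.25, 1.0, ridge, 16.0):
    bound = "memory" if ai < ridge else "compute"
    rate = roofline_gflops(ai, PEAK, BW)
    print(f"AI = {ai:5.2f} flop/byte -> {rate:6.1f} GFLOP/s ({bound}-bound)")
```

Plotting `roofline_gflops` against intensity on log-log axes gives the sloped memory-bound segment and the flat compute-bound segment.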
Basic components: processors, memory, interconnect.
Programming model through languages, libraries.
For performance, need cost models (these involve the HW)!
How can we parallelize dot product?
Program consists of threads of control.
Consider pdot on \(p \ll n\) processors:
Of course, it can’t be that simple…
A race condition is when:
Consider s += partial on two CPUs (s shared).
Processor 1     Processor 2
load S          …
add partial     …
…               load S
store S         …
…               add partial
…               store S
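The interleaving listed above can be replayed deterministically to see the lost update. This is a minimal sketch, assuming each processor keeps its loaded value in a private register; the `replay` helper and the partial values 3.0 and 4.0 are hypothetical.

```python
def replay(schedule, s0, partials):
    """Replay (cpu, op) steps against one shared value S.

    Each CPU has a private register; only 'store' changes S.
    """
    shared = s0
    reg = {}
    for cpu, op in schedule:
        if op == "load":
            reg[cpu] = shared
        elif op == "add":
            reg[cpu] += partials[cpu]
        elif op == "store":
            shared = reg[cpu]
    return shared

partials = {1: 3.0, 2: 4.0}  # hypothetical partial sums

# The interleaving from the two-processor schedule above.
interleaved = [(1, "load"), (1, "add"), (2, "load"),
               (1, "store"), (2, "add"), (2, "store")]
# One processor finishes before the other starts.
serial = [(1, "load"), (1, "add"), (1, "store"),
          (2, "load"), (2, "add"), (2, "store")]

lost = replay(interleaved, 0.0, partials)
ok = replay(serial, 0.0, partials)
print(lost, ok)
```

In the interleaved schedule, processor 2 loads S before processor 1 stores, so processor 1's contribution is overwritten.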
Implicitly assumed sequential consistency:
Can consider s += partial a critical section
Dot product with mutex:
1. Create global mutex l
2. Compute partial
3. Lock l
4. s += partial
5. Unlock l
Still need to synchronize on return…
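A minimal sketch of a mutex-protected dot product, using Python threads as a stand-in for the shared-memory model. The function `pdot` here, its block partitioning, and the test vectors are assumptions for illustration. In CPython the global interpreter lock means this shows the synchronization pattern rather than a real speedup.

```python
import threading

def pdot(x, y, p):
    """Dot product of x and y using p threads and one mutex."""
    n = len(x)
    s = 0.0
    lock = threading.Lock()  # the mutex guarding s

    def worker(rank):
        nonlocal s
        # Each thread owns a contiguous block of indices.
        lo, hi = rank * n // p, (rank + 1) * n // p
        partial = sum(x[i] * y[i] for i in range(lo, hi))
        with lock:  # critical section: one thread at a time
            s += partial

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # synchronize before returning s
    return s

x = [float(i) for i in range(1000)]
y = [2.0] * 1000
result = pdot(x, y, 4)
print(result)
```

The `join` calls are the synchronization on return: without them, `pdot` could return `s` before every thread has added its partial sum.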
Processor 1
Processor 2
What if both processors execute step 1 simultaneously?
Shared memory correctness is hard
And this is before we talk performance!
Shared memory is expensive!
Processor 1
Processor 2
What could go wrong?
Processor 1
Processor 2
Better, but what if more than two processors?
MPI_Sendrecv
MPI_Allreduce
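To illustrate the result an all-reduce delivers (every rank ends up with the global sum), here is a small simulation using threads and queues in place of real message passing. This is not MPI and not how `MPI_Allreduce` is implemented; `simulated_allreduce` and its naive all-to-all exchange are assumptions made for the sketch.

```python
import queue
import threading

def simulated_allreduce(partials):
    """Each 'rank' sends its partial to all others, then sums what it receives."""
    p = len(partials)
    inbox = [queue.Queue() for _ in range(p)]  # one mailbox per rank
    results = [None] * p

    def rank_main(rank):
        # "Send": unbounded queues mean a put never waits for a receiver.
        for other in range(p):
            if other != rank:
                inbox[other].put(partials[rank])
        # "Receive": get blocks until a message arrives.
        total = partials[rank]
        for _ in range(p - 1):
            total += inbox[rank].get()
        results[rank] = total

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

totals = simulated_allreduce([1.0, 2.0, 3.0, 4.0])
print(totals)
```

The sketch relies on sends that never block, which real message-passing systems need not guarantee. It also uses p(p-1) messages, so it shows only the semantics, not an efficient communication pattern.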
Parallel performance is limited by:
Overcome these limits by understanding common patterns of parallelism and locality in applications.
Can get more parallelism / locality through modeling
Often get parallelism at multiple levels
More about parallelism and locality in simulations!