CS 5220

Applications of Parallel Computers

Parallel Machines and Models

Prof David Bindel


Parallel computer hardware

Have processors, memory, interconnect.

  • Where is memory physically?
  • Is it attached to processors?
  • What is the network connectivity?

Parallel programming model

Programming model through languages, libraries.

  • What are the control mechanisms?
  • What data semantics? Private, shared?
  • What synchronization constructs?

For performance, need cost models!

Simple example

/* Serial dot product: returns the sum of x[i]*y[i] for i = 0..n-1. */
double dot(int n, double* x, double* y)
{
    double s = 0;
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

Simple example

/* Conceptual parallel dot product: NUM_PROC (assumed defined
   elsewhere) is the number of processors; each iteration of the
   outer loop is the work we would hand to one processor. */
double pdot(int n, double* x, double* y)
{
    double s = 0;
    for (int p = 0; p < NUM_PROC; ++p) { // Loop to parallelize
        int i = p*n/NUM_PROC;            // start of processor p's block
        int inext = (p+1)*n/NUM_PROC;    // one past the end of the block
        double partial = dot(inext-i, x+i, y+i);
        s += partial;
    }
    return s;
}

Simple example

How can we parallelize?

  • Where do arrays \(x\) and \(y\) live? One CPU? Partitioned?
  • Who does what work?
  • How do we combine to get a single final result?

Shared memory model

Shared memory model

Program consists of threads of control.

  • Can be created dynamically
  • Each has private variables (e.g. local)
  • Each has shared variables (e.g. heap)
  • Communication through shared variables
  • Coordinate by synchronizing on variables
  • Examples: OpenMP, pthreads

Shared memory dot product

Dot product of two length-\(n\) vectors on \(p \ll n\) processors:

  1. Each CPU: partial sum (\(n/p\) elements, local)
  2. Everyone tallies partial sums
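
A minimal OpenMP sketch of this scheme, reusing the serial dot from earlier (the function name pdot_naive is mine, and the unprotected update of s is deliberate; it is exactly the problem discussed next):

#include <omp.h>

double dot(int n, double* x, double* y); // serial dot from earlier

double pdot_naive(int n, double* x, double* y)
{
    double s = 0;
    #pragma omp parallel
    {
        int p = omp_get_thread_num();    // this thread's index
        int P = omp_get_num_threads();   // total number of threads
        int i     = p*n/P;               // my block: [i, inext)
        int inext = (p+1)*n/P;
        double partial = dot(inext-i, x+i, y+i);
        s += partial;                    // unprotected shared update!
    }
    return s;
}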

Can we go home now?

Race condition

A race condition:

  • Two threads access the same variable
  • At least one access is a write
  • Accesses are concurrent – no ordering guarantees
    • Could happen simultaneously!

Race to the dot

Consider S += partial on two CPUs

Race to the dot

  P1             P2
  load S
  add partial
                 load S
  store S
                 add partial
                 store S

Sequential consistency

  • Idea: Looks like processors take turns, in order
  • Convenient for thinking through correctness
  • Really hard for performance!
  • Will talk about memory models later

Shared memory dot with locks

Solution: treat S += partial_sum as a critical section

  • Only one CPU at a time allowed in critical section
  • Can violate invariants locally
  • Enforce via a lock or mutex

Shared memory dot with locks

Dot product with mutex:

  1. Create global mutex l
  2. Compute partial_sum
  3. Lock l
  4. S += partial_sum
  5. Unlock l
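
A sketch of these steps in OpenMP, using its lock routines as the mutex (pdot_lock and the block partitioning mirror the earlier sketch; one possible realization, not the only one):

#include <omp.h>

double dot(int n, double* x, double* y); // serial dot from earlier

double pdot_lock(int n, double* x, double* y)
{
    double s = 0;
    omp_lock_t l;                        // 1. create global mutex l
    omp_init_lock(&l);
    #pragma omp parallel
    {
        int p = omp_get_thread_num();
        int P = omp_get_num_threads();
        int i     = p*n/P;
        int inext = (p+1)*n/P;
        double partial = dot(inext-i, x+i, y+i);  // 2. partial sum
        omp_set_lock(&l);                // 3. lock l
        s += partial;                    // 4. critical section
        omp_unset_lock(&l);              // 5. unlock l
    }
    omp_destroy_lock(&l);
    return s;
}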

A problem

Processor 1:

  1. Acquire lock 1
  2. Acquire lock 2
  3. Do something
  4. Release locks

Processor 2:

  1. Acquire lock 2
  2. Acquire lock 1
  3. Do something
  4. Release locks

What if both processors execute step 1 simultaneously?
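
In terms of the OpenMP lock calls above, the hazard looks like this (l1, l2, and worker are illustrative names; if each thread finishes its step 1 before the other reaches step 2, both block forever):

#include <omp.h>

omp_lock_t l1, l2;   // initialized with omp_init_lock elsewhere

void worker(int tid)
{
    if (tid == 0) {
        omp_set_lock(&l1);       // step 1: acquire lock 1
        omp_set_lock(&l2);       // step 2: blocks if thread 1 holds l2
        /* ... do something ... */
        omp_unset_lock(&l2);
        omp_unset_lock(&l1);
    } else {
        omp_set_lock(&l2);       // step 1: acquire lock 2
        omp_set_lock(&l1);       // step 2: blocks if thread 0 holds l1
        /* ... do something ... */
        omp_unset_lock(&l1);
        omp_unset_lock(&l2);
    }
}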

Shared memory with barriers

  • Lots of scientific codes have phases (e.g. time steps)
  • Communication only needed at end of phases
  • Idea: synchronize on end of phase with barrier
    • More restrictive (less efficient?) than small locks
    • But easier to think through! (e.g. less chance of deadlocks)
  • Sometimes called bulk synchronous programming

Dot with barriers

  1. partial[threadid] = local partial sum
  2. barrier
  3. sum = sum(partial)
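
One way to render this in OpenMP (the partial array and the MAX_THREADS bound are my additions; the barrier guarantees every partial sum is written before one thread tallies them):

#include <omp.h>

#define MAX_THREADS 64   // assumes we never run with more threads

double dot(int n, double* x, double* y); // serial dot from earlier

double pdot_barrier(int n, double* x, double* y)
{
    static double partial[MAX_THREADS];  // shared scratch, one slot per thread
    double s = 0;
    #pragma omp parallel
    {
        int p = omp_get_thread_num();
        int P = omp_get_num_threads();
        int i     = p*n/P;
        int inext = (p+1)*n/P;
        partial[p] = dot(inext-i, x+i, y+i); // 1. local partial sum
        #pragma omp barrier                  // 2. wait for everyone
        #pragma omp single                   // 3. one thread tallies
        for (int j = 0; j < P; ++j)
            s += partial[j];
    }
    return s;
}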

Punchline

Shared memory correctness is hard

  • Too little synchronization: races
  • Too much synchronization: deadlock

And this is before we talk performance!

Shared memory machines

Uniform shared memory

  • Processors and memories talk through a bus
  • Symmetric Multiprocessor (SMP)
  • Hard to scale to lots of processors (think \(\leq 32\))
    • Bus becomes bottleneck
    • Cache coherence via snooping

Multithreaded processor machine

  • Maybe threads > processors!
  • Idea: Switch threads on long latency ops.
  • Called hyperthreading by Intel
  • Cray MTA was an extreme example

Distributed shared memory

  • Non-Uniform Memory Access (NUMA)
  • Memory logically shared, physically distributed
  • Any processor can access any address
  • Close accesses are faster than far accesses
  • Cache coherence is still a pain
  • Most big modern chips are NUMA
  • Many-core accelerators tend to be NUMA as well

Punchline

Shared memory is expensive!

  • Uniform access means bus contention
  • Non-uniform access scales better
    (but now access costs vary)
  • Cache coherence is tricky
  • May forgo sequential consistency for performance

Message passing model

Message-passing programming model

  • Collection of named processes
  • Data is partitioned
  • Communication by send/receive of explicit messages
  • Lingua franca: MPI (Message Passing Interface)

Message passing dot product: v1

Processor 1:

  1. Partial sum s1
  2. Send s1 to P2
  3. Receive s2 from P2
  4. s = s1 + s2

Processor 2:

  1. Partial sum s2
  2. Send s2 to P1
  3. Receive s1 from P1
  4. s = s1 + s2

What could go wrong? Think of phones vs letters...
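
A two-rank MPI sketch of v1 (the function name and the assumption of exactly two ranks are mine). With a blocking MPI_Send that waits for a matching receive, both ranks can stall at step 2 (the letters analogy):

#include <mpi.h>

/* v1: symmetric exchange between ranks 0 and 1; call between
   MPI_Init and MPI_Finalize with exactly two ranks.  If MPI_Send
   blocks until a matching receive is posted (no buffering), both
   ranks wait in the send and neither reaches its receive. */
double pdot_mpi_v1(double s_local)
{
    int rank;
    double s_other;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int other = 1-rank;
    MPI_Send(&s_local, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
    MPI_Recv(&s_other, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    return s_local + s_other;
}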

Message passing dot product: v2

Processor 1:

  1. Partial sum s1
  2. Send s1 to P2
  3. Receive s2 from P2
  4. s = s1 + s2

Processor 2:

  1. Partial sum s2
  2. Receive s1 from P1
  3. Send s2 to P1
  4. s = s1 + s2

Better, but what if more than two processors?
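
For general \(p\), the idiomatic answer is a collective operation; a sketch using MPI_Allreduce, assuming each rank already holds its own slice of x and y:

#include <mpi.h>

double dot(int n, double* x, double* y); // serial dot from earlier

/* Each rank computes the dot product of its local slice; the
   reduction sums the partials and gives the result to every rank. */
double pdot_mpi(int n_local, double* x_local, double* y_local)
{
    double s_local = dot(n_local, x_local, y_local);
    double s = 0;
    MPI_Allreduce(&s_local, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return s;
}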

MPI: the de facto standard

  • Pro: Portability
  • Con: a least common denominator of mid-80s message-passing systems

The “assembly language” (or C?) of parallelism...
but, alas, assembly language can be high performance.

Punchline

  • Message passing hides less than shared memory
  • But correctness is still subtle

Distributed memory machines

Distributed memory machines

  • Each node has local memory
    • ... and no direct access to memory on other nodes
  • Nodes communicate via network interface
  • Example: most modern clusters!

Back of the envelope

  • \(c\) is about \(3 \times 10^8\) m/s.
  • One light-ns is about 0.3 m
    (about a foot)
  • A big machine is often over 300 feet across
  • May still be dominated by NIC latency (microseconds)
  • A message across a big machine will always be order(s) of magnitude slower than a local memory access
  • Another reason locality matters!
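
A rough worked version of the estimate (the machine size and latency figures are ballpark assumptions, not measurements of any particular system): 300 feet is roughly 90 m, so light alone needs about \(90 / (3 \times 10^8) \approx 0.3\) microseconds (300 ns) to cross the machine one way. Add NIC and switch latencies of a microsecond or more per message, and compare with roughly 100 ns for a local DRAM access: remote data is one to two orders of magnitude farther away in time, before bandwidth even enters the picture.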

Paths to Parallel Performance

Reminder: what do we want?

  • High-level: solve big problems fast
  • Start with good serial performance
  • Given \(p\) processors, could then ask for
    • Good speedup: run in \(p^{-1}\) times the serial time
    • Good scaled speedup: \(p \times\) the work in the same time
  • Easiest to get speedup from bad serial code!
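
In symbols (the standard definitions, nothing specific to this course): with serial time \(T_1\) and time \(T_p\) on \(p\) processors,

\[
\text{speedup} = \frac{T_1}{T_p}, \qquad \text{efficiency} = \frac{T_1}{p\,T_p},
\]

so good speedup means speedup near \(p\) (efficiency near 1), and scaled speedup asks for nearly constant \(T_p\) as the work and \(p\) grow together.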

The story so far

Parallel performance is limited by:

  • Single-core performance
  • Communication and synchronization costs
  • Non-parallel work (Amdahl)
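
For the Amdahl bullet, the standard statement: if a fraction \(s\) of the work is inherently serial and the rest parallelizes perfectly over \(p\) processors, then

\[
\text{speedup}(p) \le \frac{1}{s + (1-s)/p} \le \frac{1}{s},
\]

so even with unlimited processors the speedup never exceeds \(1/s\) (e.g. 10% serial work caps speedup at 10).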

Plan now: talk about how to overcome these limits for some types of scientific applications

Parallelism and locality

Can get more parallelism / locality through the model

  • Limited range of dependency between time steps
  • Can neglect or approximate far-field effects

Parallelism and locality

Often get parallelism at multiple levels

  • Hierarchical circuit simulation
  • Interacting models for climate
  • Parallelizing individual experiments in Monte Carlo or optimization

Next up

Parallel patterns in simulation

  • Discrete events and particle systems
  • Differential equations (ODEs and PDEs)