Applications of Parallel Computers
Prof David Bindel
Please click the play button below.
Welcome to CS 5220, Applications of Parallel Computers, the Fall
2020 edition. I am your host, David Bindel.
This is the first technical slide deck of the semester. There is
a previous slide deck on logistics, and I recommend it to you if
you have not yet watched it. The logistics deck, like this deck,
is linked from the schedule on the course home page.
The Computational Science & Engineering Picture
(Figure: a diagram relating Application, Analysis, and Computation)
The computer science department assigns the first two digits of
course numbers based on the level and the area. So 5220 is a
master's-level scientific computing course. Unlike the other
scientific computing courses that we offer, though, 5220 is not
focused on numerical methods or analysis, but on high-performance
implementation. This course also differs from some of our other
courses in the way in which we interact with applications. But I
always figured that it is the combination of computation and
applications with analysis that makes for really interesting
computational science work.
Applications Everywhere!
Climate modeling
CAD tools (computers, buildings, airplanes, ...)
Computational biology
Computational finance
Machine learning and statistical models
Game physics and movie special effects
Medical imaging
Indeed, there are applications of high-performance scientific
computing ideas everywhere across sciences, engineering, medicine,
finance, and many other areas. When I say we are interested in
scientific computing applications in this class, I mean "science"
in a very big sense.
Question for Discussion
Take a minute to Google "HPC X" where X is your favorite
application. What comes up?
If you have no favorite applications, you might poke through
the front page of HPCWire
to see some things that others care about!
Many of you came to this class because you were interested in
applying high-performance computing to some other area of
interest. Others of you came because the title sounded
interesting. Whatever the case, I suggest pausing the slides
here and looking around for some articles or web pages that talk
about applications of HPC to something you might find interesting.
Don't worry, I'll wait for you to come back! I suggest making
some notes and sharing them with your group when we next meet.
Why Parallel Computing?
Scientific computing went parallel long ago
Want an answer that is right enough, fast enough
Either of those might imply a lot of work!
We like to ask for more as machines get bigger
We have a lot of data, too
All right, you're back. I hope you found an application that
tickled your interest.
When we talk about high-performance computing, we often (though
not always!) mean trying to scale computations up to big parallel
computers with lots of processors. This has been a revered path
forward in scientific computing for a long time. When we simulate
physical systems, we often need a lot of resolution, and that's
expensive. For 3D time-stepping problems, we often have to scale
the work with every dimension, so simulating on a 1-meter grid
might be ten thousand times more expensive than simulating on a
10-meter grid. Those expensive simulations also generate a lot of
data, and parallelism may be necessary both for the computation
and for storing that data and
crunching it down to a reasonable size.
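To spell out that arithmetic (my gloss, not on the slide): refining the grid by a factor of ten multiplies the work in each of the three space dimensions, and stability or accuracy usually forces the time step to shrink by about the same factor, so
\[\mbox{work} \propto h^{-4}, \qquad
\frac{\mbox{work}(1\,\mbox{m grid})}{\mbox{work}(10\,\mbox{m grid})} = 10^4.\]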
Why Parallel Computing?
Today: Hard to get a non-parallel computer!
How many cores are in your laptop?
How many in NVidia's latest accelerator?
What's the biggest single node EC2 instance?
Even if you think you aren't interested in big physics
simulations, though, it isn't like you can get away from parallel
computing. The speed of individual processors stalled many years
ago now; we only get faster because of parallelism. I'm going to
encourage you to pause here again to get a sense for how parallel
the modern world is. Do some Googling. How many cores are in
your laptop? What about NVidia's latest GPU? Or what about the
biggest single node you can rent on Amazon's EC2? Not even your
phone is serial these days!
I'm serious, go Google! I will wait.
Lecture Plan
Basics: architecture, parallel concepts, locality and parallelism in scientific codes
Technology: OpenMP, MPI, CUDA/OpenCL, cloud systems, compilers and tools
Patterns: Monte Carlo, dense and sparse linear algebra and PDEs, graph partitioning and load balancing, fast multipole, fast transforms
All right, welcome back from your Googling.
We've talked a bit about the why of the class. Now, let's talk
about the topics that we will cover and the overarching themes
that I hope will come across.
The semester involves three parts, though they interlace with each
other. In the first part of the semester, we will cover basic
ideas of high-performance computing, including some elements of
computer architecture, basic parallel concepts, and ideas of how
to think about parallelism and locality in scientific codes.
In the second part of the semester, we will talk about programming
technologies for high-performance computing. That means MPI and
OpenMP, but also how to effectively use tools like compilers and
profilers.
Finally, we will talk about common algorithmic patterns in
high-performance scientific computing, including pleasingly
parallel workloads, dense and sparse linear algebra, numerical
PDE methods, graph partitioning and load balancing, fast multipole
methods, and fast transforms.
Objectives
Reason about code performance
Many factors: HW, SW, algorithms
Want simple “good enough” models
So that's the plan in terms of topics. What do I hope you really
understand at the end of the semester? One thing is how to reason
about the performance of codes. It is much more complicated than
just making things faster by adding more processors; indeed,
sometimes more processors cause our codes to slow down! But
fairly simple models can often give us an idea how fast we should
expect our codes to be on different machines, and they can help
guide our algorithmic choices and implementation decisions.
Objectives
Learn about high-performance computing (HPC)
Learn parallel concepts and vocabulary
Experience parallel platforms (HW and SW)
Read/judge HPC literature
Apply model numerical HPC patterns
Tune existing codes for modern HW
Beyond learning how to think about performance, I want you to
learn how to really do HPC. That means learning the language and
reading the literature, but it also means getting your hands dirty
and tuning real codes.
Objectives
Apply good software practices
Finally, I want you to learn some good modern practices for working with
scientific software. Knowing what actually causes code to go fast
or slow will help you avoid the trap of flailing around at a code
in an attempt to accelerate it and only making it harder to debug
and maintain.
How Fast Can We Go?
Speed records for the Linpack benchmark:
http://www.top500.org
Speed measured in flop/s (floating point ops / second):
Giga (\(10^9\)) – a single core
Tera (\(10^{12}\)) – a big machine
Peta (\(10^{15}\)) – current top 10 machines
Exa (\(10^{18}\)) – favorite of funding agencies
All right. Before we start trying to make things go fast, it's
always worth understanding the fundamental limits we might run
into. A good way to learn about the speed of light for different
classes of machines is to look at the top 500 machines according
to the Linpack benchmark, which tests how fast different machines
can solve giant linear systems of equations. To solve a system of
n equations in n unknowns takes us about n^3 / 3 floating point adds and a
similar number of multiplies. We call these operations flops,
short for "floating point operations." Your laptop is a gigaflop
machine, and you can easily get access to teraflop machines.
Petaflop machines still count as pretty big, and the funding
agencies are all spending time these days talking about how we
should get to exaflops. There are some people who think that we
are at exaflops already, but they usually use a definition that
some of us consider cheating -- this is only with lower-precision
arithmetic than standard IEEE single or double precision.
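If you want a feel for what a gigaflop means on your own machine, below is a minimal sketch of a flop-rate microbenchmark. It is not the Linpack benchmark; the loop, sizes, and names are my own illustrative choices, and a real measurement would repeat the timing and use a proper high-resolution timer.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        int n = 1 << 24;                      /* vector length (illustrative) */
        double* x = malloc(n * sizeof(double));
        double* y = malloc(n * sizeof(double));
        for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

        clock_t start = clock();
        double a = 0.5;
        for (int i = 0; i < n; ++i)           /* one multiply + one add per trip */
            y[i] += a * x[i];
        double sec = (double) (clock() - start) / CLOCKS_PER_SEC;

        /* Note: this loop streams memory, so it measures bandwidth as much
           as arithmetic and will underestimate the peak flop rate. */
        printf("about %g Gflop/s (y[0] = %g)\n", 2.0 * n / sec / 1e9, y[0]);
        free(x); free(y);
        return 0;
    }

For context, solving \(n = 10{,}000\) equations Linpack-style costs roughly \(2n^3/3 \approx 6.7 \times 10^{11}\) flops counting both adds and multiplies, so a 10 Gflop/s laptop would need about a minute.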
Fujitsu Fugaku
Look at the report.
What does it say about:
Peak flop rate, Linpack rate, HPCG rate?
Energy use and cooling?
Individual processor architecture?
Network organization?
Software stack?
According to the Linpack benchmark, the fastest computer these
days is the Fujitsu Fugaku machine in Japan. It seems like
Jack Dongarra writes a report every time there is a new top
machine, and this machine is no exception. I've hyperlinked the
report from this slide. I suggest going to look through it to see
what it says about the hardware and software stack. Look not only
at the Linpack performance; what does the report say about the
high-performance conjugate gradient (aka HPCG) benchmark? Also,
how much energy does the machine use?
If you have some time left over after looking at the Fujitsu
Fugaku, I recommend looking at this information for a couple of
the other top machines. We can talk about it the next time we
meet.
Graph 500
Graph processing benchmark (data-intensive)
Metric: traversed edges per second (TEPS)
What is Fujitsu Fugaku in GTEPS?
How do the top machines compare between
Top 500 and
Graph 500?
Some types of algorithms are much more challenging than others
when it comes to raw performance. The Graph 500 benchmark was
designed as a more realistically challenging alternative to the
Linpack benchmark, as very few things run as fast as dense linear
algebra on modern machines. This is a benchmark consisting of
various common graph operations, and the measure of performance is
the number of edges traversed per second, or TEPS. It turns out
that the Fujitsu Fugaku is at the top of the Graph 500 list, too.
Poke around a bit in the links above. Are the other machines at
the top of the Graph 500 list the same as those at the top of the
Linpack Top 500 list?
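To make the TEPS metric concrete, here is a minimal sketch that times a breadth-first search and reports edges traversed per second. The tiny hard-coded graph and all the names are illustrative assumptions, not part of the Graph 500 specification, which uses enormous generated graphs precisely so that timings like this are meaningful.

    #include <stdio.h>
    #include <time.h>

    #define NV 5   /* a toy 5-vertex graph in compressed sparse row form */
    static const int xadj[NV+1] = {0, 2, 4, 6, 7, 8};
    static const int adj[8]     = {1, 2, 0, 3, 0, 4, 1, 2};

    int main(void)
    {
        int level[NV], queue[NV], head = 0, tail = 0;
        long long edges = 0;

        for (int i = 0; i < NV; ++i) level[i] = -1;
        level[0] = 0;
        queue[tail++] = 0;                       /* BFS from vertex 0 */

        clock_t start = clock();
        while (head < tail) {
            int u = queue[head++];
            for (int k = xadj[u]; k < xadj[u+1]; ++k) {
                ++edges;                         /* every scanned edge counts */
                int v = adj[k];
                if (level[v] < 0) {
                    level[v] = level[u] + 1;
                    queue[tail++] = v;
                }
            }
        }
        double sec = (double) (clock() - start) / CLOCKS_PER_SEC;
        if (sec <= 0) sec = 1e-9;                /* clock() is far too coarse
                                                    for a graph this small */
        printf("%lld edges in %g s: %g TEPS\n", edges, sec, edges / sec);
        return 0;
    }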
Punchline
Some high-end machines look like high-end clusters
Achievable performance is \(\ll\) peak performance
Application-dependent
Peak is hard on more modest platforms, too!
If you poke through the descriptions of the Top 500 machines, you
will see that many of them look similar to clusters that you might
have encountered, with commodity processors and memory at each
node. The thing that makes them really different is the custom
networks. At least, that was the case for a long time. We seem
to be entering another period where people are trying out
different creative hardware solutions, and Intel does not dominate
the entire world quite as much as it once did.
The other thing that you will see as you look through this data is
that it is not so easy to get anywhere close to the theoretical
peak performance of a modern supercomputer. For that matter, it
is not so easy to get close to the peak performance of much more
modest machines, either! The best we can do depends a lot on the
application, and dense linear algebra -- as in Linpack -- is
easier to make fast than many other computational patterns are.
Practical Performance
So how fast can I make my computation?
Peak \(>\) Linpack \(>\) Gordon Bell \(>\) Typical
All right. So we probably are not going to be able to reach
Linpack levels of performance when we tune our own codes. We
might make heroic efforts and get up to ten percent of the
capacity of a big modern machine, but even that is hard, and might
get us something like the Gordon Bell prize -- a prize given at
Supercomputing to the most impressively fast science codes. More
likely is that we will get to a few percent of the theoretical
peak performance.
Practical Performance
Measuring performance of real applications is hard
What figure of merit (flops, TEPS, ...?)
Typically a few bottlenecks slow things down
Why they slow down can be tricky!
Of course, figuring out how close we are to the theoretical peak
performance is itself hard. We have to figure out the right
measure for speed, for one thing. Also, our intuitive understanding
of performance is pretty bad. We might think we know what is
keeping us from going fast, but often we only really understand
the problem after we do some careful measurement and modeling.
Often there are bottlenecks in the code that slow things down, and
it is hard to get around those bottlenecks without fundamentally
changing the way the code works. It isn't just a matter of
fiddling with low-level instructions, as we will see in the coming weeks.
Practical Performance
Really care about time-to-solution
Moreover, measures like flops and TEPS are only proxies for what
we really want. Unfortunately, I don't know how to measure
scientific insight per unit time. But it is worth watching out
for people who talk about how fast they can run simple
algorithms. Often, more complex and sophisticated algorithms get
the same answer in fewer flops or TEPS or whatever -- but they
may be harder to parallelize, and so look worse in benchmarks that
concentrate on those numbers.
Practical Performance
See also David Bailey’s comments (linked from this slide).
A great read on this front is David Bailey's papers on ways to
fool the masses in reporting performance. I've linked them from
the slide. Go read -- I won't spoil it for you.
Quantifying Performance
Starting point: good serial performance.
OK, so if the right measure of performance is application
dependent, how can we systematically understand the benefits of
parallelism? The answer is that we start with the best serial
code we can find, and use that as a reference. Using a well-tuned
serial code is key; speeding up good serial code is a bigger challenge
than speeding up bad serial code!
Quantifying Performance
Strong scaling: compare parallel to serial time on the same
problem instance as a function of the number of processors (\(p\)):
\[\begin{aligned}
\mbox{Speedup} &= \frac{\mbox{Serial time}}{\mbox{Parallel time}} \\[2mm]
\mbox{Efficiency} &= \frac{\mbox{Speedup}}{p}
\end{aligned}\]
Once we have a serial code that we like, we look at the benefits
of parallelism -- or other changes to the code -- in terms of how
much it speeds the code up. The speedup in a parallel code is the ratio of
the serial reference time to the parallel code time. In an ideal
setting, we would get speedup equal to the number of processors we
used. But life is rarely ideal. We measure how close to ideal we
are in terms of parallel efficiency, which is the ratio of the
speedup to the ideal number p.
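For concreteness, with made-up numbers: if a tuned serial code runs in 100 seconds and the parallel version runs in 20 seconds on \(p = 8\) processors, then
\[\mbox{Speedup} = \frac{100~\mbox{s}}{20~\mbox{s}} = 5, \qquad
\mbox{Efficiency} = \frac{5}{8} \approx 62\%.\]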
Quantifying Performance
Ideally, speedup = \(p\).
Usually, speedup \(< p\), because:
Serial work limits speedup (Amdahl's law)
Parallel overheads: communication, synchronization, extra bookkeeping
Why don't we get perfect efficiency in most parallel codes? There
are two big pieces to the puzzle. First, some code is hard to
parallelize, and any serial work that we do limits how fast the
code can go overall. That's Amdahl's law. Second, parallel code
comes with its own overheads. We spend time communicating and
synchronizing, and sometimes we need extra data structures to keep
track of who is doing what. That extra work can also hurt our
efficiency.
Amdahl’s Law
\[\begin{aligned}
p = & \mbox{ number of processors} \\
s = & \mbox{ fraction of work that is serial} \\
t_s = & \mbox{ serial time} \\
t_p = & \mbox{ parallel time} \geq s t_s + (1-s) t_s / p
\end{aligned}\]
\[\mbox{Speedup} =
\frac{t_s}{t_p} = \frac{1}{s + (1-s) / p} < \frac{1}{s}\]
So \(1\%\) serial work \(\implies\) max speedup \(< 100\times\), regardless of \(p\).
Amdahl's law is a simple observation about the effect of serial
work on performance. Suppose some fraction s of the work in a
given computation is intrinsically serial. Then no matter how
many processors we use, a little algebra shows us that we cannot
get a speedup greater than 1/s.
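To see the bound bite with concrete (made-up) numbers: with \(s = 0.01\) and \(p = 100\),
\[\mbox{Speedup} = \frac{1}{0.01 + 0.99/100} \approx 50.3,\]
already only about half of the limiting value \(1/s = 100\), no matter how many more processors we add.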
A Thought Experiment
Let’s try a simple parallel attendance count:
Parallel computation: Rightmost person in each row counts number in row.
Synchronization: Raise your hand when you have a count
Communication: When all hands are raised, each row representative adds their count to a tally and says the sum (going front to back).
To see the effect of parallel overheads on speedup, let's play
with a little toy algorithm. Suppose we were all sitting in a
classroom together -- wouldn't that be something! -- and I decided
to count the number of students in the room. I could do this by
counting for myself, or I could ask the students at the end of
each row to give me a count, and then add those counts together.
The latter algorithm is parallel, and sounds like it ought to be
faster. But is it?
A Toy Analysis
Parameters: \[\begin{aligned}
n = & \mbox{ number of students (80)} \\
r = & \mbox{ number of rows} \\
t_c = & \mbox{ time to count one student (0.3 s)} \\
t_t = & \mbox{ time to say tally (1 s)} \\
t_s \approx & ~n t_c \\
t_p \approx & ~n t_c / r + r t_t
\end{aligned}\]
How much could I possibly speed up?
To answer this question, we'll build a simple performance model.
It takes me a fixed amount of time to count each student, and I
assume that time is about the same for any of you. Let's say it's
about a third of a second. If a representative from each row does
the count, then I can divide the total work by the number of rows
(assuming that there are about the same number of students in each
row -- a load balancing problem). But then it takes some time for
each of the row representatives to report their count and for me
to add it up. That's communication overhead.
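Here is a minimal sketch of this model as code, using the numbers from the slide (\(n = 80\), \(t_c = 0.3\) s, \(t_t = 1\) s); the variable names and the range of \(r\) are my own choices.

    #include <stdio.h>

    int main(void)
    {
        double n = 80.0, tc = 0.3, tt = 1.0;
        double ts = n * tc;                     /* serial counting time: 24 s */
        for (int r = 1; r <= 16; ++r) {
            double tp = n * tc / r + r * tt;    /* row counts + serial tally */
            printf("r = %2d: t_p = %5.2f s, speedup = %4.2f\n",
                   r, tp, ts / tp);
        }
        return 0;
    }

Running this shows the predicted speedup peaking around 2.4 near \(r = 5\) rows and then declining as the tally time takes over, which is the behavior of the plot described below.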
A Toy Analysis
Based on this model, we can do a plot of the speedup as a function
of the number of rows. I did this for a class of 80; as you can
see, we only ever manage about a two-fold speedup, and at some
point adding more rows only hurts the performance of our parallel
counting algorithm. I suppose one could try to show off the benefits
of our method by scaling up the number of students at the same
time we scaled up the number of rows; that would be a so-called
weak-scaling study, unlike the strong-scaling case where we keep
the problem size fixed. Of course, I'm not going to get an
arbitrary number of students to sign up for this class. Sometimes
strong scaling is really the right measure.
Modeling Speedup
\[\mathrm{speedup} <
\frac{1}{2} \sqrt{\frac{n t_c}{t_t}}\]
Speedup is poor when:
The problem size \(n\) is small
The communication cost is relatively large
The serial computation cost is relatively large
Common suspects for parallel performance problems!
We can actually do a little algebra and find a simple bound on the
maximum speedup possible in this problem. The bound increases as
we increase the number of students and the counting time per
student relative to the time to report a count. That is, we can
only get good speedup if the computation is expensive relative to
the communication. This is pretty typical in a lot of parallel methods.
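The algebra here (standard, though not spelled out on the slide) is the arithmetic-geometric mean inequality applied to the two terms of \(t_p\):
\[t_p = \frac{n t_c}{r} + r t_t \geq 2 \sqrt{n t_c t_t},
\qquad \mbox{so} \qquad
\mathrm{speedup} = \frac{n t_c}{t_p} \leq \frac{1}{2} \sqrt{\frac{n t_c}{t_t}}.\]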
Summary: Thinking about Parallel Performance
We have (arguably) exaflop machines
But codes rarely get peak performance
Better comparison: tuned serial performance
Common measures: speedup and efficiency
OK, let's wrap up our discussion. We are getting close to having
exaflop machines in the world, but the speed of our codes doesn't
keep up with the peak performance of these machines. In fact,
peak performance may be the wrong measure; the best we can do is
often to improve the performance relative to a fast
single-processor code. We use measures like speedup and parallel
efficiency to describe this type of relative performance.
Summary: Thinking about Parallel Performance
Strong scaling: study speedup with increasing \(p\)
Weak scaling: increase both \(p\) and \(n\)
Serial overheads, communication kill speedup
Simple models help us understand scaling
Unfortunately, we cannot usually get perfect efficiency in our
codes. Serial overheads and communication costs get in our way.
Experiments like strong scaling studies (where the problem size is
fixed) and weak scaling studies (where we scale the problem size
with the number of processors) are necessary to understand the
real performance of our codes, but simple models can often help us
tell what we should expect from these studies.