Introduction and Performance Basics
2024-08-27
Title: Applied High-Performance and Parallel Computing
Web: https://www.cs.cornell.edu/courses/cs5220/2024fa
When: TR 1:25-2:40
Where: Gates G01
Who: David Bindel, Caroline Sun, Evan Vera
Basic logistical constraints:
Fine if you’re not a numerical C hacker!
Reason about code performance
Learn about high-performance computing (HPC)
Apply good software practices
Introduce yourself to a neighbor:
Jot down answers (part of HW0).
Scientific computing went parallel long ago:
Today: Hard to get non-parallel hardware!
Speed records for Linpack benchmark
Speed measured in flop/s (floating point ops / second):
What do these machines look like?
An alternate benchmark: Graph 500
What do these machines look like?
So how fast can I make my computation?
See also David Bailey’s comments:
How can we speed up summing an array of length \(n\) with \(p \leq n\) processors?
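One standard answer is a tree reduction: each processor sums a chunk of about \(n/p\) elements, then the \(p\) partial sums are combined pairwise over \(\lceil \log_2 p \rceil\) rounds, for roughly \(n/p + \log_2 p\) steps instead of \(n\). A minimal serial simulation of the idea (the chunking scheme is illustrative, not from the lecture):

```javascript
// Parallel-sum sketch (simulated serially): p "processors" each sum a
// contiguous chunk of ~n/p elements, then the p partial sums are combined
// pairwise, halving their number each round -- ceil(log2(p)) combine rounds
// instead of p - 1 sequential additions.
function treeSum(a, p) {
  const n = a.length;
  // Phase 1: local sums, ~n/p work per simulated processor.
  let sums = [];
  for (let i = 0; i < p; i++) {
    const lo = Math.floor(i * n / p);
    const hi = Math.floor((i + 1) * n / p);
    let s = 0;
    for (let j = lo; j < hi; j++) s += a[j];
    sums.push(s);
  }
  // Phase 2: pairwise tree combine, ceil(log2(p)) rounds.
  while (sums.length > 1) {
    const next = [];
    for (let i = 0; i < sums.length; i += 2) {
      next.push(i + 1 < sums.length ? sums[i] + sums[i + 1] : sums[i]);
    }
    sums = next;
  }
  return sums[0];
}

// 1 + 2 + ... + 8 = 36, with 4 simulated processors.
console.log(treeSum([1, 2, 3, 4, 5, 6, 7, 8], 4)); // 36
```

The count phase parallelizes perfectly, but the combine phase adds a \(\log_2 p\) term that no amount of hardware removes — a first hint at the barriers below.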
\[\begin{aligned} \mbox{Speedup} &= \frac{\mbox{Serial time}}{\mbox{Parallel time}} \\ \mbox{Efficiency} &= \frac{\mbox{Speedup}}{p} \end{aligned}\]
Ideally, speedup = \(p\). Usually, speedup \(< p\).
Barriers to perfect speedup:
\[\begin{aligned} p = & \mbox{ number of processors} \\ s = & \mbox{ fraction of work that is serial} \\ t_s = & \mbox{ serial time} \\ t_p = & \mbox{ parallel time} \geq s t_s + (1-s) t_s / p \end{aligned}\]
Amdahl’s law: \[\mbox{Speedup} = \frac{t_s}{t_p} \leq \frac{1}{s + (1-s) / p} < \frac{1}{s}\]
So \(1\%\) serial work \(\implies\) max speedup < \(100 \times\), regardless of \(p\).
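Plugging numbers in makes the ceiling concrete. A small sketch (the \(p\) values are illustrative, not from the lecture):

```javascript
// Amdahl's law: speedup = 1 / (s + (1 - s) / p).
function amdahl(s, p) { return 1 / (s + (1 - s) / p); }

// With s = 0.01 (1% serial work), speedup saturates well below p:
console.log(amdahl(0.01, 100));      // ~50.25 -- half the processors wasted
console.log(amdahl(0.01, 1000));     // ~90.99
console.log(amdahl(0.01, 1000000));  // ~99.99 -- never reaches 1/s = 100
```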
Let’s try a simple parallel attendance count:
Parallel computation: Rightmost person in each row counts number in row.
Synchronization: Raise your hand when you have a count.
Communication: When all hands are raised, each row representative adds their count to a tally and says the sum (going front to back).
(Somebody please time this.)
Parameters: \[\begin{aligned} n = & \mbox{ number of students} \\ r = & \mbox{ number of rows} \\ t_c = & \mbox{ time to count one student} \\ t_t = & \mbox{ time to say tally} \\ t_s \approx & ~n t_c \\ t_p \approx & ~n t_c / r + r t_t \end{aligned}\]
How much could I possibly speed up?
Student count:
function tserial(n, r) { return n * 0.3; }                   // t_s = n t_c, with t_c = 0.3
function tparallel(n, r) { return n * 0.3 / r + r * 1.0; }   // t_p = n t_c / r + r t_t, with t_t = 1
function speedup(n, r) { return tserial(n, r) / tparallel(n, r); }
rows = Array.from({length: 20}, (_, i) => i + 1);            // r = 1, ..., 20
data = ({
  "rows": rows,
  "speedup": rows.map((r) => speedup(nstudents, r))          // nstudents defined elsewhere in the notebook
})
(Parameters: \(t_c = 0.3\), \(t_t = 1\).)
Mostly-tight bound: \[\mathrm{speedup} < \frac{1}{2} \sqrt{\frac{n t_c}{t_t}}\]
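The bound follows from minimizing \(t_p = n t_c / r + r t_t\) over \(r\): by AM-GM, \(t_p \geq 2\sqrt{n t_c t_t}\), with equality at \(r = \sqrt{n t_c / t_t}\). A quick numeric check over integer row counts (\(n = 120\) is a hypothetical class size, not from the lecture; \(t_c\), \(t_t\) as above):

```javascript
// Verify numerically: t_p = n*tc/r + r*tt >= 2*sqrt(n*tc*tt) by AM-GM,
// so speedup <= (1/2)*sqrt(n*tc/tt), attained at r = sqrt(n*tc/tt).
const tc = 0.3, tt = 1.0;   // parameters from the model above
const n = 120;              // hypothetical class size (not from the lecture)
const tserial = n * tc;
const bound = 0.5 * Math.sqrt(n * tc / tt);   // = 3 for these numbers
let best = 0, bestR = 0;
for (let r = 1; r <= n; r++) {
  const sp = tserial / (n * tc / r + r * tt);
  if (sp > best) { best = sp; bestR = r; }
}
console.log(bestR, best, bound);  // 6 3 3 -- r* = sqrt(n*tc/tt) = 6 attains the bound
```

Here \(r^* = \sqrt{120 \cdot 0.3 / 1} = 6\) happens to be an integer, so the scan hits the bound exactly; in general the best integer \(r\) falls slightly short, which is why the bound is only "mostly" tight.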
Poor speedup occurs because:
Some of the usual suspects for parallel performance problems!
Things would look better if I allowed both \(n\) and \(r\) to grow — that would be a weak scaling study.
This probably does not make sense for a classroom setting…
Today:
http://www.cs.cornell.edu/courses/cs5220/2024fa/
... and please enroll and submit HW0!