CS 5220

Applications of Parallel Computers

Parallel graph algorithms

Plan

Some background on graphs
Applications and building blocks
Basic parallel graph algorithms
Representations and performance
Graphs and LA
Frameworks

We have a bit of a potpourri today. After reminding you about different types of graphs and their applications to various problems, we’ll talk about basic parallel graph algorithms. This is different from our earlier discussion of graph theory for load balancing, in that this time we are talking about actually parallelizing the graph computations instead of using them to reason about another parallel computation! We’ll do a couple examples here to highlight some common ideas that show up when transitioning from the serial to the parallel settings. In particular, randomization shows up often.

In the back half of the lecture, we’ll switch from talking about algorithms to talking about some of the nuts-and-bolts of making things run fast on modern machines. That means understanding representations of graphs that we might want to use, and also frameworks that allow us to program graph algorithms fast and at scale (or that people think allow us to do so).

Graphs

Mathematically: \(G = (V,E)\) where \(E \subset V \times V\)

Convention: \(|V| = n\) and \(|E| = m\)
May be directed or undirected
May have weights \(w_V : V \rightarrow \mathbb{R}\) or \(w_E : E \rightarrow \mathbb{R}\)
May have other node or edge attributes as well
Path is \(\left[ \, (u_i,u_{i+1}) \, \right]_{i=1}^\ell \in E^*\), sum of weights is length
Diameter is \(\max_{s, t \in V} d(s, t)\), where \(d(s,t)\) is the shortest path length from \(s\) to \(t\)

Generalizations

Hypergraph (edges in \(V^d\))
Multigraph (multiple copies of edges)

Types of graphs

Many possible structures:

Lines and trees
Completely regular grids
Planar graphs (no edges need cross)
Low-dimensional Euclidean
Power law graphs
...

Algorithms are not one-size-fits-all!

Ends of a spectrum

	Planar	Power law
Vertex degree	Uniformly small	\(P(\mathrm{deg} = k) \sim k^{-\gamma}\)
Radius	\(\Omega(\sqrt{n})\)	Small
Edge sep	\(O(\sqrt{n})\)	nothing small
Linear solve	Direct OK	Iterative
Apps	PDEs	Social networks

Calls for different methods!

Applications: Routing and shortest paths

Applications: Traversal, ranking, clustering

Web crawl / traversal
PageRank, HITS
Clustering similar documents

Applications: Sparse solvers

Ordering for sparse factorization
Partitioning
Coarsening for AMG

Applications: Dimensionality reduction

Common building blocks

Traversals
Shortest paths
Spanning tree
Flow computations
Topological sort
Coloring
...

... and most of sparse linear algebra.

Over-simple models

Let \(t_p =\) idealized time on \(p\) processors

\(t_1 =\) work
\(t_\infty =\) span (or depth, or critical path length)
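
As a reminder of why both quantities matter, here is the standard Brent-style bound. It assumes an idealized greedy scheduler with no communication cost, which is an assumption of this sketch rather than something stated on the slide:

\[
  \max\left( \frac{t_1}{p}, \, t_\infty \right) \leq t_p \leq \frac{t_1}{p} + t_\infty,
  \qquad \mbox{so} \qquad
  \frac{t_1}{t_p} \leq \frac{t_1}{t_\infty}.
\]

For example, an algorithm with work \(O(n+m)\) and span \(\Omega(n)\) can speed up by at most \(O(1 + m/n)\), roughly the average degree, no matter how many processors we use.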

Don’t bother with parallel DFS! Span is \(\Omega(n)\).
Let’s spend a few minutes on more productive algorithms...

Serial BFS

Push seed node onto queue and mark
While Q nonempty
- Pop node from queue
- Visit node
- Push unmarked neighbors on queue
- Mark all neighbors
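
The steps above can be sketched in Python as follows. This is a minimal illustration, assuming the graph is given as a dict from each node to a list of neighbors; the function name `bfs_distances` is made up for this example, and "visiting" a node just means recording its hop count.

```python
from collections import deque

def bfs_distances(adj, seed):
    """Return hop counts from seed; adj maps node -> list of neighbors."""
    dist = {seed: 0}            # membership in dist doubles as the "mark"
    queue = deque([seed])
    while queue:
        u = queue.popleft()     # pop node from queue (FIFO order)
        for v in adj[u]:
            if v not in dist:             # unmarked neighbor
                dist[v] = dist[u] + 1     # mark it with its distance
                queue.append(v)           # and push it on the queue
    return dist

# Hypothetical example graph: the path 0-1-2-3 plus the chord 0-2.
example = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(bfs_distances(example, 0))
```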

Parallel BFS

Simple idea: parallelize across frontiers

Pro: Simple to think about
Pro: Lots of parallelism with small radius?
Con: What if frontiers are small?
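
To make the frontier idea concrete, here is a rough level-synchronous sketch. It assumes a dict-of-lists adjacency structure, and the function name and `workers` parameter are hypothetical. Python threads will not actually speed this up; the point is only to show which work is independent (expanding each frontier vertex) and where the synchronization sits (merging into the next frontier).

```python
from concurrent.futures import ThreadPoolExecutor

def frontier_bfs(adj, seed, workers=4):
    """Level-synchronous BFS: one parallel expansion step per frontier."""
    dist = {seed: 0}
    frontier = [seed]
    level = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            level += 1
            # Expand every frontier vertex independently; dist is only read here.
            expand = lambda u: [v for v in adj[u] if v not in dist]
            candidate_lists = list(pool.map(expand, frontier))
            # Serial merge: drop duplicates found by more than one parent.
            next_frontier = []
            for candidates in candidate_lists:
                for v in candidates:
                    if v not in dist:
                        dist[v] = level
                        next_frontier.append(v)
            frontier = next_frontier
    return dist
```

The number of loop iterations equals the number of BFS levels, so the span is at least proportional to the eccentricity of the seed. That is the "what if frontiers are small?" problem in code form.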

Parallel BFS: Ullman-Yannakakis

Assuming a high-diameter graph:

Form set \(S\) with start + random nodes, \(|S| = \Theta(\sqrt{n} \log n)\)
- long shortest paths go through \(S\) w.h.p.
Take \(\sqrt{n}\) steps of BFS from each seed in \(S\)
Form aux graph for distances between seeds
Run all-pairs shortest path on aux graph

OK, but what if diameter is not large?

An idea due to Ullman and Yannakakis involves parallel exploration of different parts of the graph, followed by a method for connecting things together. The idea is to take the start node together with a lot of other randomly selected nodes, and expand a small BFS of length \(\sqrt{n}\) from each of those seeds. Then we show that, with high probability, any shortest path that is longer than \(\sqrt{n}\) must go through one of the seeds. From there, we only need to consider the graph of seed-to-seed distances, which we can compute with an all-pairs shortest path computation on an auxiliary graph.
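
A rough version of the "with high probability" step goes as follows. This is a sketch under assumptions that are not in the slides: the \(k = c \sqrt{n} \ln n\) random seeds are drawn independently and uniformly with replacement, and \(c\) is a hypothetical constant. Fix a shortest path with at least \(\sqrt{n}\) vertices. Each draw lands on that path with probability at least \(\sqrt{n}/n = 1/\sqrt{n}\), so

\[
  P(\mbox{no seed on the path})
  \leq \left( 1 - \frac{1}{\sqrt{n}} \right)^{k}
  \leq e^{-k/\sqrt{n}}
  = n^{-c}.
\]

A union bound over the paths of interest (at most \(n^2\) if we fix one shortest path per pair of endpoints) then leaves a failure probability of at most \(n^{2-c}\), which is small once \(c > 2\).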

The good thing about this algorithm is that it has lots of parallelism and is relatively simple to code. The bad thing is that the analysis is not so intuitive, and it only works for graphs with large diameter.

So what should we do if we want to do breadth-first search in a “small world” graph?

Serial BFS: Bottom-up

Set \(d[v] = \infty\) for all vertices
Set \(d[s] = 0\) for seed \(s\)
Until \(d\) stops changing
- For each \(u \in V\)
  - \(d[u] = \min(d[u], \min_{w \in N(u)} d[w]+1)\)
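
A direct transcription of this iteration might look like the following. It is a sketch assuming an undirected graph stored as a dict of neighbor lists; the name `bottom_up_bfs` is made up for this example.

```python
import math

def bottom_up_bfs(adj, seed):
    """Repeat d[u] = min(d[u], min over neighbors w of d[w] + 1) until d is stable."""
    d = {u: math.inf for u in adj}
    d[seed] = 0
    changed = True
    while changed:                  # "until d stops changing"
        changed = False
        for u in adj:               # every vertex asks its neighbors for a better distance
            best = min((d[w] + 1 for w in adj[u]), default=math.inf)
            if best < d[u]:
                d[u] = best
                changed = True
    return d
```

Each sweep touches every edge, so this does more work than the queue-based version, but the inner loop over vertices has no queue to contend over. Unreachable vertices keep the distance `math.inf`.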

Parallel BFS

Key ideas:

At some point, switch from top-down expanding frontier (“are you my child?”) to bottom-up checking for parents (“are you my parent?”)
Use 2D blocking of adjacency
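
The switching idea can be sketched as below, for an undirected graph stored as a dict of neighbor lists. The switching rule here (compare frontier size to a fraction of \(n\)) and the parameter `switch_fraction` are simplifying assumptions for illustration; practical implementations use more careful heuristics, and this sketch says nothing about 2D blocking.

```python
def direction_optimizing_bfs(adj, seed, switch_fraction=0.05):
    """BFS that goes top-down on small frontiers and bottom-up on large ones."""
    n = len(adj)
    dist = {seed: 0}
    frontier = {seed}
    level = 0
    while frontier:
        level += 1
        if len(frontier) < switch_fraction * n:
            # Top-down: frontier vertices claim unvisited neighbors ("are you my child?").
            nxt = {v for u in frontier for v in adj[u] if v not in dist}
        else:
            # Bottom-up: unvisited vertices look for a frontier neighbor ("are you my parent?").
            nxt = {v for v in adj
                   if v not in dist and any(w in frontier for w in adj[v])}
        for v in nxt:
            dist[v] = level
        frontier = nxt
    return dist
```

In the bottom-up branch, `any` stops at the first parent found, which is where the savings come from when most of the graph is already in the frontier.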

Single-source shortest path

Classic algorithm: Dijkstra

Dequeue closest point to frontier, expand frontier
Update priority queue of distances (in parallel)
Repeat

Or run serial Dijkstra from different sources for APSP.

Alternate idea: label correcting

Initialize \(d[u]\) with distance over-estimates to source

\(d[s] = 0\)
Repeatedly relax \(d[u] := \min_{(v,u) \in E} d[v] + w(v,u)\)

Converges (eventually) as long as all nodes visited repeatedly, updates are atomic. If serial sweep in a consistent order, call it Bellman-Ford.
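
Here is a minimal serial label-correcting sketch in the Bellman-Ford style. It assumes vertices are numbered \(0, \ldots, n-1\), edges are given as `(v, u, w)` triples meaning a directed edge from `v` to `u` with weight `w`, and there are no negative cycles. The function name is made up for this example.

```python
import math

def bellman_ford(n, edges, s):
    """Sweep over all edges in a fixed order, relaxing until nothing changes."""
    d = [math.inf] * n          # over-estimates of distance from s
    d[s] = 0
    for _ in range(n - 1):      # n - 1 sweeps suffice without negative cycles
        changed = False
        for v, u, w in edges:
            if d[v] + w < d[u]: # relax edge (v, u)
                d[u] = d[v] + w
                changed = True
        if not changed:         # converged early
            break
    return d

# Hypothetical example: prints [0, 3, 1, 8]
print(bellman_ford(4, [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5)], 0))
```

In a parallel version, the inner loop is what gets split across workers, which is why the relaxations need to be atomic.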

Single-source shortest path: \(\Delta\)-stepping

Alternate approach: hybrid algorithm

Process a “bucket” at a time
Relax “light” edges (wt < \(\Delta\)), might add to bucket
When bucket empties, relax “heavy” edges a la Dijkstra
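
A serial sketch of the bucket structure is below. It assumes nonnegative edge weights and an adjacency dict mapping each node to a list of `(neighbor, weight)` pairs; the function and helper names are hypothetical. In a parallel implementation, the relaxations inside each batch are the part that runs concurrently.

```python
import math
from collections import defaultdict

def delta_stepping(adj, s, delta):
    """Single-source shortest paths with buckets of width delta."""
    d = {u: math.inf for u in adj}
    buckets = defaultdict(set)

    def relax(v, dist):
        if dist < d[v]:
            if d[v] < math.inf:
                buckets[int(d[v] // delta)].discard(v)   # leave the old bucket
            d[v] = dist
            buckets[int(dist // delta)].add(v)           # join the new bucket

    relax(s, 0)
    while any(buckets.values()):
        i = min(k for k, b in buckets.items() if b)      # first nonempty bucket
        settled = set()
        while buckets[i]:                                # light edges may refill bucket i
            batch = buckets[i]
            buckets[i] = set()
            settled |= batch
            for u in batch:
                for v, w in adj[u]:
                    if w < delta:                        # light edge
                        relax(v, d[u] + w)
        for u in settled:                                # bucket i is now final
            for v, w in adj[u]:
                if w >= delta:                           # heavy edge, lands in a later bucket
                    relax(v, d[u] + w)
    return d
```

With a very small `delta` this behaves like Dijkstra with a coarse priority queue, and with a very large `delta` every edge is light and it behaves like a label-correcting method on one bucket.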

Maximal independent sets (MIS)

\(S \subset V\) independent if none are neighbors.
Maximal if no others can be added and remain independent.
Maximum if no other MIS is bigger.
Maximum is NP-hard; maximal is easy (serial)

Simple greedy MIS

Start with \(S\) empty
For each \(v \in V\) sequentially, add \(v\) to \(S\) if possible.
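
In code, the greedy pass is only a few lines. This sketch assumes a dict of neighbor lists with no self-loops; the optional `order` argument is an addition for illustration, since the result depends on the visiting order.

```python
def greedy_mis(adj, order=None):
    """Visit vertices in order; keep each one that has no neighbor already in S."""
    S = set()
    for v in (order if order is not None else adj):
        if not any(u in S for u in adj[v]):
            S.add(v)
    return S

# Hypothetical example: on the path 0-1-2-3-4, the natural order gives {0, 2, 4}.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_mis(path))
```

Visiting the same path in the order `[1, 3, 0, 2, 4]` gives `{1, 3}`, which is maximal but not maximum.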

Luby’s algorithm

Init \(S := \emptyset\)
Init candidates \(C := V\)
While \(C \neq \emptyset\)
- Label each \(v\) with a random \(r(v)\)
- For each \(v \in C\) in parallel, if \(r(v) < \min_{u \in \mathcal{N}(v)} r(u)\)
  - Move \(v\) from \(C\) to \(S\)
  - Remove neighbors of \(v\) from \(C\)

Very probably finishes in \(O(\log n)\) rounds.
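
A serial simulation of the rounds might look like this. It assumes a dict of neighbor lists with no self-loops, compares each candidate only against neighbors that are still candidates, and returns the round count alongside the set. The function name and the `seed` parameter are made up for this example.

```python
import random

def luby_mis(adj, seed=0):
    """Simulate Luby's algorithm; returns (independent set, number of rounds)."""
    rng = random.Random(seed)
    S = set()
    C = set(adj)                                   # remaining candidates
    rounds = 0
    while C:
        rounds += 1
        r = {v: rng.random() for v in C}           # fresh random labels each round
        # A candidate wins if its label beats all of its candidate neighbors.
        winners = {v for v in C
                   if all(r[v] < r[u] for u in adj[v] if u in C)}
        S |= winners                               # move winners from C to S
        C -= winners
        C -= {u for v in winners for u in adj[v]}  # remove neighbors of winners
    return S, rounds
```

Each candidate's test reads only the labels of its neighbors, so all the tests in one round can run in parallel. Two adjacent vertices cannot both win, because the comparison is strict.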

Luby’s algorithm (round 1)

Luby’s algorithm (round 2)

A fundamental problem

Many graph ops are

Computationally cheap (per node or edge)
Bad for locality

Memory bandwidth as a limiting factor.

At this point in the lecture, perhaps you’re starting to get suspicious. All the graph theory and randomization ideas and bottom-up-vs-top-down stuff sounds very much like what you’d see in an algorithms class – or maybe a parallel algorithms class – but it’s missing a lot of what we often focus on when we do HPC. What about the data structures? Issues of memory locality and computational intensity?

Well, we’ll talk about the data structures in a moment. In terms of memory locality and computational intensity, though, the news is mostly bad. Many graph operations are computationally cheap per node or edge, requiring only one or a small number of visits before completing (and with each of the visits only involving cheap operations). So the main bottleneck is generally not computing on the graph, but getting the graph out of memory.

Big data?

Consider:

323 million in US (fits in 32-bit int)
About 350 Facebook friends each
Compressed sparse row: about 450 GB

Topology (no metadata) on one big cloud node...

There’s good news here, though, too. Compared to many of the problems we’re used to dealing with, even “big” graphs may not be that big. For example, consider the Facebook social graph in the US (or the graph for a comparable network – I don’t mean to be old-fashioned here, I just happen to know more numbers in this case). We can identify everyone in the US by a 32-bit int, and if we just store connectivity (assuming about 350 friends per person), the entire topology can be represented in compressed sparse row form with less than half a terabyte of memory. That may seem like a lot, but it easily fits on a single big machine; maybe not your laptop, but a cloud node that you can easily get time on.
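
The back-of-the-envelope arithmetic behind that estimate can be written out as below. The breakdown is an assumption of this sketch: 4-byte column indices (one per stored neighbor) and 8-byte row offsets, since the total neighbor count exceeds what a 32-bit offset can hold. The function name is hypothetical.

```python
def csr_topology_bytes(n_vertices, avg_degree, index_bytes=4, offset_bytes=8):
    """Bytes needed for the CSR column-index and row-pointer arrays (no weights)."""
    nnz = n_vertices * avg_degree                # stored neighbor entries
    col_idx = nnz * index_bytes                  # one vertex id per entry
    row_ptr = (n_vertices + 1) * offset_bytes    # one offset per row, plus the end
    return col_idx + row_ptr

total = csr_topology_bytes(323_000_000, 350)
print(f"{total / 1e9:.1f} GB")   # prints 454.8 GB
```

Almost all of the space is the column-index array; the row pointers add only about 2.6 GB.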

So – is CSR the Right Way to represent things? The answer, of course, is “it depends.” Let’s talk about a few options.

Graph rep: Adj matrix

Pro: efficient for dense graphs
Con: wasteful for sparse case...

Graph rep: Coordinate

Tuples: \((i,j,w_{ij})\)
Pro: Easy to update
Con: Slow for multiply

Graph rep: Adj list

Linked lists of adjacent nodes
Pro: Still easy to update
Con: May cost more to store than coord?

Graph rep: CSR

Pro: traversal?
Con: updates
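
For reference, here is a small sketch of building CSR from a coordinate-style edge list and reading off a neighbor list. It assumes vertices are numbered \(0, \ldots, n-1\) and edges are `(i, j)` pairs without weights; the names `build_csr`, `row_ptr`, and `col_idx` are conventional but chosen for this example.

```python
def build_csr(n, edges):
    """Convert a list of (i, j) edges into CSR arrays (row_ptr, col_idx)."""
    row_ptr = [0] * (n + 1)
    for i, _ in edges:                 # count the entries in each row
        row_ptr[i + 1] += 1
    for i in range(n):                 # prefix sum turns counts into offsets
        row_ptr[i + 1] += row_ptr[i]
    col_idx = [0] * len(edges)
    next_slot = row_ptr[:-1]           # slicing copies: next free slot per row
    for i, j in edges:
        col_idx[next_slot[i]] = j
        next_slot[i] += 1
    return row_ptr, col_idx

def neighbors(row_ptr, col_idx, i):
    """Neighbors of vertex i are one contiguous slice of col_idx."""
    return col_idx[row_ptr[i]:row_ptr[i + 1]]
```

The contiguous slice is what makes traversal friendly to the memory system. Inserting an edge, on the other hand, means shifting everything after it in `col_idx` and updating the later offsets, which is why updates are the weak point.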

Graph rep: implicit

Idea: Never materialize a graph data structure
Key: Provide traversal primitives
Pro: Explicit rep’n sometimes overkill for one-off graphs?
Con: Hard to use canned software (except NLA?)

Graph algorithms and LA

Really is standard LA
- Spectral partitioning and clustering
- PageRank and some other centralities
- “Laplacian Paradigm” (Spielman, Teng, others...)
Looks like LA
- Floyd-Warshall
- Breadth-first search?

We have been flirting throughout the lecture with the relationship between graph theory and linear algebra. Let’s take a moment now to clarify this relationship.

Lots of graph operations that we care about really are based on standard linear algebra over the real numbers. This includes various algorithms for partitioning, clustering, and ranking, as well as anything that can be built on the “graph Laplacian paradigm” popularized by Spielman, Teng, and others.

At the same time, there are also operations that end up looking suspiciously like linear algebra in structure, even if the details aren’t quite right. Examples include the Floyd-Warshall algorithm for all-pairs shortest path, or the bottom-up breadth-first search algorithm. Well, it turns out that these may not exactly be linear algebra, but they aren’t exactly not linear algebra, either.

Perhaps that deserves some more explanation…

Graph algorithms and LA

Semirings have \(\oplus\) and \(\otimes\) s.t.

Addition is commutative+associative with a 0
Multiplication is associative with identity 1
Both are distributive
\(a \otimes 0 = 0 \otimes a = 0\)
But no subtraction or division

Technically modules over semirings

In abstract algebra, we have vector spaces over fields. The most frequently-used fields are the real and complex numbers or subsets thereof (the rationals, the algebraic numbers, etc), but there are also plenty of applications of finite field arithmetic. But we can still do a lot with a weaker abstraction than a field called a semiring. A semiring has addition and multiplication, each with the usual identity element. Addition is commutative and associative, multiplication is associative (but maybe not commutative), and the two together satisfy the usual distributive law. But we have no subtraction (unlike a ring), and no division (unlike a division ring or a field).

We have something like a vector space in this setting, too. It just happens to be called a module rather than a vector space in the case when you don’t have an underlying field.

Have I jumped off the deep end and started talking pure math when I should be telling you about HPC? Well, bear with me for one more slide, and perhaps things will become more clear.

Graph algorithms and LA

Example: min-plus

\(\oplus = \min\) and additive identity \(0 \equiv \infty\)
\(\otimes = +\) and multiplicative identity \(1 \equiv 0\)
Useful for shortest distance: \(d = A \otimes d\)
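
To see the min-plus product in action, here is a dense sketch that iterates \(d = A \otimes d\) to a fixed point. The conventions are assumptions of this example: `A[i][j]` holds the weight of the edge from `j` to `i`, missing edges are the additive identity (infinity), the diagonal is the multiplicative identity (zero) so that current distances are retained, and there are no negative cycles. The function names are made up.

```python
import math

INF = math.inf

def min_plus_matvec(A, d):
    """(A (x) d)[i] = min over j of A[i][j] + d[j]: 'multiply' is +, 'add' is min."""
    n = len(d)
    return [min(A[i][j] + d[j] for j in range(n)) for i in range(n)]

def min_plus_sssp(A, s):
    """Iterate d = A (x) d from the source indicator vector until it stops changing."""
    d = [INF] * len(A)
    d[s] = 0
    while True:
        new_d = min_plus_matvec(A, d)
        if new_d == d:
            return d
        d = new_d

# Hypothetical example: edges 0->1 (4), 0->2 (1), 2->1 (2), 1->3 (5).
A = [[0,   INF, INF, INF],
     [4,   0,   2,   INF],
     [1,   INF, 0,   INF],
     [INF, 5,   INF, 0]]
print(min_plus_sssp(A, 0))   # prints [0, 3, 1, 8]
```

Structurally this is just repeated matrix-vector multiplication, and each product is one synchronous round of label correcting. A sparse version would skip the infinite entries, exactly as sparse matvec skips zeros.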

Graph BLAS

http://www.graphblas.org/

Version 1.3.0 (final) as of 2019-09-25
(Opaque) internal sparse matrix data structure
Allows operations over misc semirings

Graph frameworks

Several to choose from!

Pregel, Apache Giraph, Stanford GPS, ...
GraphLab family
- GraphLab: Original distributed memory
- PowerGraph: For “natural” (power law) networks
- GraphChi: Chihuahua – shared mem vs distributed
Outperformed by Galois, Ligra, BlockGRACE, others
But... programming model was easy
GraphIt - best of both worlds?

Alas, not everyone thinks that linear algebra is the right way to think about graph algorithms! Over the past decade or so, a lot of work has been done on other frameworks for programming graph algorithms “at scale.” Maybe the earliest such was Google’s Pregel system, followed by the GraphLab family of systems from CMU. Later frameworks out-performed these earlier systems. Indeed, I was involved in some of this work – the GRACE system was Cornell’s entry into this world, back when Johannes Gehrke was still on the faculty, and I was involved in work that added blocked algorithms into GRACE. But some of the performance improvements came at the cost of the so-called “think like a vertex” abstraction that the original designers of Pregel and GraphLab liked so much.

As a brief aside: there’s a project at MIT called GraphIt that involves a domain-specific language for programming graph algorithms that have high performance. It seems like the logical successor to the Pregel/GraphLab/etc line of work, but it has much higher performance than those systems did.

Graph frameworks

“Think like a vertex”
- Each vertex updates locally
- Exchanges messages with neighbors
- Runtime actually schedules updates/messages
Message sent at super-step \(S\) arrives at \(S+1\)
Looks like BSP
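
A toy, single-process imitation of this model is sketched below, using minimum-label propagation to find connected components. It is not the API of Pregel or any real framework; the function name, the message format, and the `max_supersteps` cap are all invented for illustration. The graph is assumed undirected, stored as a dict of neighbor lists with comparable vertex ids.

```python
def pregel_min_label(adj, max_supersteps=100):
    """Each vertex keeps the smallest vertex id it has heard of."""
    value = {v: v for v in adj}
    inbox = {v: [] for v in adj}
    active = set(adj)                          # every vertex runs in superstep 0
    step = 0
    while active and step < max_supersteps:
        outbox = {v: [] for v in adj}
        for v in active:                       # conceptually in parallel
            new = min([value[v]] + inbox[v])   # local update from own state and messages
            if step == 0 or new < value[v]:
                value[v] = new
                for u in adj[v]:
                    outbox[u].append(new)      # arrives at superstep step + 1
        inbox = outbox                         # barrier: deliver all messages at once
        active = {v for v in adj if inbox[v]}  # only vertices with mail wake up
        step += 1
    return value
```

Because a vertex reads only its own value and its inbox, the order of updates within a superstep does not matter, which is what lets a runtime schedule them freely. The barrier between supersteps is the BSP part.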

At what COST?

“Scalability! But at what COST?”
McSherry, Isard, Murray, HotOS 15

You can have a second computer once you’ve shown you know how to use the first one.
– Paul Barham (quoted in intro)

Configuration that Outperforms a Single Thread
Observation: many systems have unbounded COST!

One of my very favorite things to come out of some of the craze in the early 2010s for simple-and-scalable analytics frameworks — things like Pregel and MapReduce and the like — is watching people eventually realize that there’s more to performance than parallelism, and that often a well-written code on a laptop could do what industry players were attempting to do with a framework code distributed across a giant cluster or cloud environment. McSherry, Isard, and Murray made the point very nicely in this 2015 paper, where they talked about Configuration that Outperforms a Single Thread. The punchline for the paper was that in many cases, there was no configuration that outperforms a single thread! This is not to be too dismissive of the framework work, which often was pretty good about parallelizing disk head use (which is not an entirely trivial matter). But it is good for perspective.

My personal punchline to all this: if I were trying to do high-performance combinatorial graph operations for something these days, I would probably reach for GraphBLAS before reaching for any of the graph processing engine frameworks.