Graphs—sets of nodes connected by edges—are extremely important mathematical objects that arise quite often in real-life applications. For instance:
In this chapter, we will discuss some basic properties of graphs, and present some simple algorithms for exploring them. (We return to some more advanced algorithms in chapter 6.)
2.1 Basic Definitions
Let us start by defining the notions of directed and undirected graphs. See Figure 2.1 for an illustration.
Figure 2.1: A basic example of directed and undirected graphs.
Definition 2.1. A directed graph, or simply graph, , is a pair where is a set of vertices (or nodes), and is a set of edges. We say that is an undirected graph if for every pair of nodes , if and only if (i.e., all edges are “bidirectional”).
Given an undirected graph , to simplify notation, we often represent the pair of edges as a single unordered pair ; that is, for undirected graphs , . Given an edge in a graph , we say that is outgoing from and incoming to . Note that it can be the case that in which case we refer to the edge as a self-loop. When we refer to the size of a graph (denoted ), we mean the number of vertices in the graph; that is, is defined as . A complete graph is a graph where ; that is, we have an edge between any two nodes .
A subgraph of a graph is a smaller graph on a subset of nodes from consistent with the edges from . More formally,
Definition 2.2. Given a graph , a subgraph of is a graph where and . We use the notation to denote that is a subgraph of .
We turn to defining the notion of node degree:
Definition 2.3. In a graph , the in-degree of a node , denoted , is the number of edges incoming to (i.e., edges of the form ); the out-degree of , denoted , is the number of edges outgoing from (i.e., edges of the form ). The degree of , denoted , is the sum of the in-degree and the out-degree of .
2.2 Connectivity and Reachability
We turn to consider the notion of connectivity. We first define what it means for a node to be reachable from (or connected to) a node .
Definition 2.4. A path (or walk) in a graph is a sequence of vertices such that there exists an edge between any two consecutive vertices (i.e., for ). We refer to as the length of the path. If and (i.e., the starting vertex is the same as the ending vertex), we refer to the path as a cycle. We say that a node is reachable from (or connected to ) if there exists a path from to .
A graph without cycles is called acyclic. We turn to defining what it means for a graph to be connected. For simplicity, we will restrict our attention to undirected graphs.
Definition 2.5. An undirected graph is connected if there exists a path between any two nodes . (Note that a graph containing a single node is considered connected via the length 0 path .) If an undirected graph is not connected, we say that it is disconnected.
When an undirected graph is disconnected, we may decompose the graph into smaller connected components.
Definition 2.6. A connected component of an undirected graph is a maximal connected subgraph—that is, a subgraph that is connected, and any larger subgraph (satisfying , ) must be disconnected.
See Figure 2.2 for an illustration of connected components of a graph.
Figure 2.2: Example of connected components.
Let us note an intersting application of connectivity to social networks. It has been experimentally observed that in social networks (e.g., Facebook), although not everyone is in the same connected component, most people typically are: we informally refer to this component as the giant component. One reason for why most people are part of the same giant component is that if we had two (or more) large components in the graph, then all it takes to “merge” two large components into a single larger component is for one of the people in each component to become friends—it is very unlikely that this does not happen for any such pair of people. (This argument can be formalized in Erdős and Renyi’s “random graph” model [ER59, Bol84]; we discuss the random graph model in more detail below.)
2.3 Breadth-first Search and Shortest Paths
How do we check if there is a path from to ? One efficient way to do this is with breadth-first search (BFS). Breadth first search is a basic graph search algorithm that given a graph and a starting node proceeds as follows: visit all the neighbors of , then visit all the currently unvisited neighbors of those nodes , and so on and so forth, while keeping track of which nodes have previously been visited. More precisely,
See Figure 2.3 for an illustration of the execution of the BFS algorithm.
Figure 2.3: An illustration of breadth-first search. Starting from node , in the first step we visit all of the neighbors of (marked with a 1), in the second step we visit all of the unvisited neighbors (marked with a 2) of each node visited in the first step, and so on. Notice that, in four steps, BFS visits every node except for the top-right node (unlabeled), which clearly has no path from .
Claim 2.1. For all vertices , if there exists a path from to , the BFS algorithm starting from vertex traverses some shortest path from to (and this shortest path can be retrieved by following the “back-pointers” from ).
Proof. Assume for contradiction that there exists a path from to , but that the BFS algorithm does not traverse any shortest path from to . Let be a shortest path from to , and let be the first node on the path such that BFS does not traverse a path of length (at most) from to . Such a node must exist because, by assumption, BFS does not traverse a path of length from and ; additionally , since is the starting point.
Since is the first node on the path for which BFS fails to traverse a path of length at most , it must have traversed a path of length at most to get to . Consequently, after visiting , BFS will visit since it is a neighbor of that still is unvisited at this point (or else BFS would have traversed a path of length at most to get from to ). Thus, BFS traverses a path of length at most from to , which is a contradiction.
■
As a consequence of the above claim, and observing that BFS will only visit each node once (and thus its running time is polynomial in the number of nodes in the graphs), we have the following theorem.
Theorem 2.1. Shortest paths in a graph can be found in polynomial time in .
Since running BFS from a node visits all the nodes that are reachable from (by Claim 2.1), we also have the following result.
Theorem 2.2. Given any node in , the set of nodes that are reachable from in can be found in polynomial time in .
We can also use BFS to find the connected components of an undirected graph : Start at any node and use BFS to find the set of nodes that are reachable from ; by the definition of reachability, the subgraph corresponding to those nodes is the maximal connected subgraph containing and thus a connected component. Mark all those nodes as component 1. Next, take some currently unmarked node , and again use BFS to find the set of nodes that are reachable from ; mark all those notes as component 2. Continue in the same manner until all nodes have been marked.
Some applications of BFS and shortest paths to social networks include the following:
To get some intuition for why small-world effects happen, let us consider Erdős and Renyi’s “random graph” model [ER59]:
The random graph model Consider a graph with nodes, where every pair of nodes has an edge between them (determined independently and with no self-loops) with probability . To gain some intuition about how many nodes have “short paths” between them, we start by asking the following question: What is the probability that two vertices and have a path of length at most 2 between them?
Given any specific third vertex , the probability of a path is only . However, since there are possible “third vertices” for this path, the probability that two nodes are not connected by any path of length exactly 2 is
By the Union Bound (see Corollary A.2 in Appendix A), we thus have that the probability that there exists some pair of nodes that is more than distance 2 apart is
This quantity decreases very quickly as the number of vertices, , increases. Therefore, it is extremely likely that every pair of nodes is at most distance 2 apart.
Needless to say, in a real social network, people are far less likely than probability to be connected, but the number of nodes is extremely large. Hence, similar arguments apply to show that the average separation between two nodes is relatively small.
Notes
We refer the reader to [CLRS09] and [KT05] for a more in-depth study of graph algorithms, and to [EK10] for connections between graph properties and social networks.