Lecture 11:
Properties of Large Graphs

Preliminaries about Graphs

A directed graph G consists of a set of nodes (or vertices) V and a set of edges (or arcs) E.  Each edge (u,v) in E is an ordered pair of nodes representing a connection from node u to node v.  The out-degree of a node u is the number of edges (u,vi) in E for all nodes vi in V.  Conversely, the in-degree of a node u is the number of edges (vi,u) in E for all nodes vi in V.  Note that the number of edges |E| can be up to |V|^2.  A graph where |E| << |V|^2 is called sparse, whereas a graph where |E| is close to |V|^2 is called dense.

The edges between nodes may have weights or other annotations. For instance, it is common to denote the weight of edge (u,v) by w_{u,v}.  Similarly, the nodes may be annotated in some form.

A common representation of a graph is the adjacency list, which for each node u in V lists all nodes vi in V such that there is an edge (u,vi). If each node is identified by a natural number, then a vector of lists can be used, where the i-th entry of the vector is the list of all numbers j such that there is an edge (i,j). Adjacency lists are generally a reasonable representation when the out-degree of the nodes is relatively small (i.e., when the graph is small or when the out-degree is generally << |V|), in which case the list for each node is short. For example, for sparse graphs adjacency lists can often be a good representation.

Consider the following simple example:

[Figure: a small directed graph on the nodes 0 through 4, whose edges are listed below.]

The adjacency list representation is
0:
1: 2 3
2: 1 3
3: 4
4: 0 1

Note that the list for each node corresponds precisely to the set of out links for that node.
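
As a concrete sketch (in Python, though any language with arrays or lists works equally well), the example above could be stored as a dictionary mapping each node to its list of out-links:

# Adjacency list for the example graph above: node -> list of nodes it links to.
graph = {
    0: [],
    1: [2, 3],
    2: [1, 3],
    3: [4],
    4: [0, 1],
}

# The out-degree of a node is simply the length of its list.
print({u: len(vs) for u, vs in graph.items()})   # {0: 0, 1: 2, 2: 2, 3: 1, 4: 2}

The later code sketches in these notes assume this adjacency-list form.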

A path from a node u to v is a sequence of edges (u,u1), (u1,u2), ..., (uk,v). One can think of traversing the edges from node to node to get from the start u to the destination v. The length of a path is the number of edges in the path.  Note that there may be multiple paths from u to v and that there need not be any path from u to v.  Note also that there may be a path from u to v and not vice versa (or the opposite). A shortest path, which need not be unique, is one that minimizes the number of edges among all possible paths from u to v (if any). The distance from u to v is the length of a shortest path, if one exists, and otherwise is infinity (the distance from a node to itself is 0).  For example, in the graph above (1,3),(3,4),(4,0) is a shortest path from 1 to 0 and the distance is 3.

A strongly connected component (SCC) of a directed graph is a maximal set of nodes C ⊆ V such that for every pair of vertices vi,vj in C there is a path from vi to vj (note this means there is also a path from vj to vi, because all ordered pairs of vertices are considered).  The property of being a maximal set means that C cannot be extended with additional nodes while preserving the property.  Thus if C is an SCC of a graph G then no proper subset of C is an SCC (any such subset could be enlarged to C, so it is not maximal). What is an SCC in the example graph above?

An undirected graph G consists of a set of nodes (or vertices) V and a set of edges (or arcs) E where each edge {u,v} in E is an unordered pair of nodes representing a connection between nodes u and v.  Note that every directed graph defines a corresponding undirected graph where the directions of the edges are ignored (i.e., the ordered pairs (u,v) are treated as unordered pairs {u,v}).  The degree of a node in an undirected graph is simply the number of incident edges, and a path between two nodes u and v is a sequence of unordered pairs that connects u and v.  Note that unlike directed graphs, paths in the undirected case are symmetric.  A connected component (or weakly connected component or WCC, in contrast with an SCC) is a maximal set of nodes C ⊆ V such that for every pair of vertices vi,vj in C there is a (undirected) path between vi and vj.
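
For graphs that fit in memory these components can be computed directly; the following sketch uses the networkx library (assuming it is installed) on the example graph above:

import networkx as nx

# Build the example graph above as a directed networkx graph.
G = nx.DiGraph()
G.add_edges_from([(1, 2), (1, 3), (2, 1), (2, 3), (3, 4), (4, 0), (4, 1)])

# SCCs: maximal sets in which every node can reach every other by directed paths.
print(list(nx.strongly_connected_components(G)))   # e.g. [{0}, {1, 2, 3, 4}]
# WCCs: connected components when edge directions are ignored.
print(list(nx.weakly_connected_components(G)))     # [{0, 1, 2, 3, 4}]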

One of the most basic graph operations is to traverse a graph. The key idea in graph traversal is to mark the vertices as we visit them so that we know which vertices we have not yet seen. That is, we maintain both a set of vertices that need to be explored and a set of vertices that have already been visited.  Initially the set of vertices to be explored and the set of vertices that have been visited each contain just the start vertex.  The main processing loop is to remove a vertex from the set that needs to be explored, consider each of the vertices that are adjacent to that vertex in the graph (e.g., its adjacency list), and for each such adjacent vertex that has not already been visited, add it both to the visited set and to the set of vertices to be explored.
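
A sketch of this generic traversal in Python, using the adjacency-list dictionary from the earlier example; the only detail left open is the order in which vertices are removed from the to-explore collection, which is discussed next:

def reachable(graph, start):
    """Return the set of nodes reachable from start by following edges."""
    visited = {start}        # vertices we have already seen
    to_explore = [start]     # vertices whose neighbors we still need to process
    while to_explore:
        u = to_explore.pop()             # remove some vertex to process
        for v in graph.get(u, []):       # each vertex adjacent to u
            if v not in visited:         # not yet visited:
                visited.add(v)           # mark it visited
                to_explore.append(v)     # and schedule it for exploration
    return visited

print(reachable(graph, 1))   # {0, 1, 2, 3, 4}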

The order in which we consider vertices in the set to be explored controls the form of the search. If we use a queue, we explore the "oldest" unexplored vertices first. This is known as a breadth-first search, or BFS. Each vertex (except the starting vertex) is discovered during the processing of one other vertex, so this defines a breadth-first search tree where a child was discovered during the processing of its parent. One can think of breadth-first search as building up the set of nodes reachable from a starting node u in layers. Layer 1 consists of all nodes that are pointed to by an arc from u.  Layer k consists of all nodes to which there is an arc from some vertex in layer k-1, but that are not in any earlier layer. Notice that by definition, layers are disjoint. The distance (length of the shortest path) corresponds to the layer. Notice that a shortest path from u to any other vertex is actually a path in this tree.
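
Using a queue, the same traversal yields the layers (and hence the distances) directly; a sketch, again on the adjacency-list dictionary from above:

from collections import deque

def bfs_distances(graph, start):
    """Map each node reachable from start to its distance (layer number)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()              # oldest unexplored vertex first
        for v in graph.get(u, []):
            if v not in dist:            # first discovery gives the shortest distance
                dist[v] = dist[u] + 1    # v lies in the layer after u
                queue.append(v)
    return dist

print(bfs_distances(graph, 1))   # {1: 0, 2: 1, 3: 1, 4: 2, 0: 3}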

If we instead use a stack to maintain the set of vertices to be explored, we venture along a single path away from the starting vertex until there are no more undiscovered vertices in front of us. This is known as a depth-first search. Just as with breadth-first search, we can define a depth-first search tree that results from our traversal.

Of course, such a traversal (whether BFS or DFS) will only traverse a single connected component.  If a graph has multiple components, we need to do a traversal of each component.  We then end up with a set of trees, one per component, which is often referred to as a forest.

Traditionally, the diameter of a graph is the maximum over all ordered pairs of nodes (u,v) of the distance from u to v (i.e., the maximum pairwise shortest-path distance). If the six-degrees-of-separation myth were really true, then the diameter of the graph of all "friendships" would be 6.  In practice when studying graphs, there are pairs (u,v) for which there is no path.  Thus one can study the maximum over pairs for which a path exists, or other statistics such as the average or median. People have also defined other measures of separation in graphs.
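
One simple (if expensive) way to measure this, ignoring pairs with no path, is to run a BFS from every node and take the largest finite distance found; a sketch building on the bfs_distances function above. This costs O(|V|(|V|+|E|)), so on graphs the size of the web one would sample start nodes instead.

def diameter(graph):
    """Largest finite shortest-path distance over all ordered pairs of nodes."""
    best = 0
    for u in graph:
        dist = bfs_distances(graph, u)        # distances from u to all reachable nodes
        best = max(best, max(dist.values()))
    return best

print(diameter(graph))   # 3 for the example graph (e.g., from 1 to 0)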

Some Properties of the Web and Other Large Graphs

Degree distributions

In-degree and out-degree in large graphs both tend to exhibit a power-law form of distribution.  Recall that for a population of individuals, each with some property, a distribution is the number of individuals exhibiting each value of the property (i.e., the count or histogram of individuals for each value of the property).  When the property is a scalar value, a common form of distribution is the normal (or Gaussian) distribution, which has a familiar bell shape and is characterized by its mean and variance.  In contrast, a power-law distribution is a linear function when plotted on a log-log scale (that is, taking the log of both the property value and the number of individuals).  When plotted on a standard linear plot, such a distribution appears "pinned to the axes", with a very large number of very small values and a very small number of very large values.  That is, the distribution is well characterized by 1/x^p for some p>0.  (Note that there are other so-called heavy-tailed distributions that produce fairly similar results to power laws, such as lognormal distributions.  We will not consider the difference between these here, and will just consider power laws.)
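
To check whether a measured distribution looks like a power law, one usually counts how many nodes have each degree value and plots the counts on log-log axes, looking for a roughly straight line; a sketch using matplotlib, where the in_degrees list is just a placeholder for real data:

from collections import Counter
import matplotlib.pyplot as plt

in_degrees = [0, 1, 1, 1, 2, 2, 3, 5, 9, 40]    # placeholder: one entry per node

counts = Counter(in_degrees)                    # degree value -> number of nodes
xs = sorted(d for d in counts if d > 0)         # degree 0 cannot appear on a log axis
ys = [counts[d] for d in xs]

plt.loglog(xs, ys, marker='o', linestyle='none')
plt.xlabel('in-degree')
plt.ylabel('number of nodes')
plt.show()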

The intuition of a mean, or even median, that comes from distributions with a central tendency like a Gaussian is not that useful for characterizing power law distributions.

Distributions with central tendencies, such as Gaussians, often occur when measuring properties of individuals, such as height, weight, grades, etc.   In contrast, power-law distributions often occur when measuring properties of networks or large numbers of autonomous agents, such as degree distributions, incomes, word usage.

One of the first large-scale studies of the Web was done by Broder et al. and published at the 9th WWW conference in 2000.  This study examined a crawl from the AltaVista crawler, with over 200M pages and about 1.5B links.  Treating pages as nodes and hyperlinks as directed edges, they examined a number of properties of the graph of the web.

First they considered the degree distribution for both the in-degree and out-degree of the nodes, finding both to be well fit by a power law.  The in-degree is clearly a "network process", where many individual pages each decide what pages to link to.  In a rich-get-richer kind of environment, one would naturally expect a small number of pages with many in-links and a very large number of pages with few in-links.  It is perhaps a bit more surprising that the out-degree is also relatively well fit by a power law, although with a different exponent (note the lower maximum degree).

[Figure: log-log plots of the in-degree and out-degree distributions, each well fit by a power law.]

Broder et al. also considered only those links that point to another site, calling these "remote only" links, and saw a similar pattern.

Strongly and weakly connected components

In addition to considering the pattern of node degree, it makes sense to try to characterize a large graph by considering how many connected components it has, and the distribution of their sizes.  For instance, one could postulate that the web is more or less one large SCC, that is, that it is possible to navigate from nearly any page to any other page by following a series of hyperlinks.  On the other hand, one could postulate that the web is composed of many disconnected sub-communities of pages (i.e., many SCCs).  What do people think?

In fact the largest SCC contains about 56 million of the roughly 203 million pages, meaning that only about 25% of all the pages on the web are mutually reachable from one another by following hyperlinks.  The largest WCC, however, contains over 90% of the nodes.  Thus if one could browse the web by following both hyperlinks and "referring pages" (turning links around), the web would be very well connected.

The SCC and WCC sizes also exhibit power-law distributions.

[Figure: log-log plots of the distributions of SCC and WCC sizes, each well fit by a power law.]

Broder et al. asked whether this widespread connectivity can simply be explained by a few nodes of large in-degree acting as "junctions" (e.g., nearly everyone links to or is linked from certain sites).  This turns out not to be the case. They tried removing all links to pages with in-degree 5 or higher, which certainly includes every well-known page on the web.  Nonetheless the graph still contains a giant weak component of size 59 million. This suggests that the connectivity of the web graph as an undirected graph is extremely resilient and does not depend on the existence of high in-degree nodes.

Structure of the Web Graph

The web graph consists of one large WCC that contains over 90% of the nodes, and within that is an SCC (where all pages are reachable from one another) that contains about 25% of the nodes.  This makes it natural to ask how one might characterize the remaining nodes outside the largest SCC, as well as those not in the WCC.  Broder et al. described this with a figure that has now come to be called the "bow tie structure of the web".

The circle labeled SCC in the center corresponds to the largest SCC.  IN consists of pages that can reach the SCC by following hyperlinks but cannot be reached from it, and OUT consists of pages that can be reached from the SCC but cannot reach back into it.  One can pass from any node of IN through SCC to any node of OUT.  Hanging off of IN and OUT are what they called TENDRILS, containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passing through SCC.  It is possible for a TENDRIL hanging off of IN to be hooked into a TENDRIL leading into OUT, forming what they called a TUBE -- a passage from a portion of IN to a portion of OUT without touching SCC.  Disconnected components are those that are not in the largest WCC.

They derived this picture by doing BFS from several hundred randomly chosen start nodes, and for each keeping track of how large the reached set of nodes was.  Following the directed edges, either forward or backward, about half the time these searches found small components of 100 nodes or fewer, and the other half of the time they found very large components of 100 million nodes or more.  Using undirected edges, about 10% of the time they found small components (50 nodes or fewer) and the rest of the time very large ones of 100M nodes or more.  This led them to the notion of the IN and OUT sets, each of which together with the SCC contains about 100M nodes, as well as to the presence of other smaller "tendril" sets.
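
A rough sketch of that measurement, reusing the reachable() function from earlier: pick random start nodes and record the size of the set each search reaches.  Running it on forward, reversed, and undirected adjacency lists (the names below are illustrative, assumed to be built from the crawl) produces the strongly bimodal sizes, mostly tiny or mostly enormous, that suggested the IN / SCC / OUT decomposition.

import random

def sample_reach_sizes(adj, trials=500):
    """Sizes of the sets reached from randomly chosen start vertices."""
    nodes = list(adj)
    return [len(reachable(adj, random.choice(nodes))) for _ in range(trials)]

# forward_adj, reverse_adj, and undirected_adj are assumed to be adjacency-list
# dicts built from the crawl (reverse_adj turns every link around).
# sizes = sample_reach_sizes(forward_adj)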

Models of formation of large graphs

There are a number of models of how large graphs that exhibit power-law degree structure could form.  One of the simplest and most widely used models is commonly referred to as preferential attachment.  In this model, with some probability q, a link from node u to some node v is formed by choosing v uniformly at random from the set of all possible nodes.  With probability 1-q, the link is formed by choosing v with probability proportional to its in-degree.  In the context of the Web, some of the time a page links to another page "at random from the set of all pages" and other times it links preferentially to pages that are already important.  Obviously when q is 1 there is no preference for already popular pages, and when q is 0 there is no chance of random links.
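
A minimal simulation of this model (the function name and parameters here are illustrative, not from the lecture): each new node adds a single out-link, choosing its target uniformly at random with probability q and in proportion to current in-degree with probability 1-q.

import random
from collections import Counter

def preferential_attachment(n, q):
    """Grow a directed graph one node at a time; return its list of edges."""
    edges = [(1, 0)]          # tiny seed graph so some node has nonzero in-degree
    targets = [0]             # each node appears here once per in-link it has
    for u in range(2, n):
        if random.random() < q:
            v = random.randrange(u)        # uniformly random existing node
        else:
            v = random.choice(targets)     # probability proportional to in-degree
        edges.append((u, v))
        targets.append(v)
    return edges

edges = preferential_attachment(100_000, q=0.2)
in_degree = Counter(v for _, v in edges)
print(in_degree.most_common(5))            # a handful of nodes get very large in-degree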

Friendship in LiveJournal

In Problem Set 4 you will examine the properties of the friendship graph from LiveJournal, a social networking site.  There are approximately 5 million users and 100 million (directed) friendship links on the site.  We have selected a subset of the users, and have also gathered information about "communities" that they belong to.