CS 6241: Numerics for Data Science
Centrality measures
Centrality and ranking
For a given network, how important is any given node or edge? The answer to this question depends a great deal on context. Do we care about the immediate neighbors of a node (degree centrality), how close the node is to other nodes (closeness centrality), or whether a node sits at a crossroads for traffic through the network (node betweenness centrality)? Do we care about the edges that are most important in keeping a graph together (edge betweenness centrality)? Do we want an egocentric measure that focuses on the graph near a particular target node (personalized PageRank)? All these different approaches lead to different, though sometimes related, measures of centrality or importance of a node or edge.
When we think about computations involving centrality measures, we care about a number of different aspects:
Does a particular notion of centrality capture the notion of importance that matters for our application? Does it make sense given how the network is constructed?
Is the measure stable, or can “small” changes to the network (for some application-dependent notion of “small”) change it dramatically?
Can we efficiently compute the centrality for the types of graphs we care about? If we are concerned with large graphs, we may seek a cheaper approximate computation instead of an exact computation; how good is the approximation?
In the rest of the lecture, we discuss these different aspects of centrality computation for several different types of centrality.
Degree centrality
The simplest measure of centrality is (weighted) node degree. In a directed graph, we distinguish between in-degree and out-degree. For example, in a social network, we might declare someone to be famous if they have many followers (a high in-degree). This measure is easy to compute, and its stability under small changes to the network is easy to understand. Unfortunately, the degree is also a purely local property, and may miss network features that matter a great deal for the bigger picture. For example, a paper in a citation network may be heavily cited by the authors and their students; but if nobody else cites the work of the group, it has little importance.
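As a concrete illustration, here is a minimal NumPy sketch that computes weighted in- and out-degrees from a small directed adjacency matrix; the graph itself is a made-up example.

```python
import numpy as np

# Hypothetical 4-node directed graph: A[i, j] is the weight of edge i -> j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

out_degree = A.sum(axis=1)   # row sums: weighted out-degree
in_degree = A.sum(axis=0)    # column sums: weighted in-degree ("followers")
print(in_degree)             # node 2 is the most-followed node here
```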
Spectral centrality measures
Many centrality measures are based on the idea of a balance or exchange between neighboring nodes, and these centrality measures typically lead to eigenvalue problems, usually involving non-negative matrices. A useful preliminary to discussing these measures is the Perron-Frobenius theorem.
In the least restrictive case, the Perron-Frobenius theorem says that every non-negative square matrix has a real eigenvalue equal to its spectral radius (i.e. with absolute value at least as great as that of any other eigenvalue), with associated elementwise non-negative row and column eigenvectors. We typically assume the non-negative matrix is irreducible (i.e. it cannot be permuted into block upper triangular form); in this case, this largest eigenvalue (the Perron-Frobenius eigenvalue) is positive, has algebraic and geometric multiplicity one, and has associated elementwise positive row and column eigenvectors. The Perron eigenvalue is bounded below by the smallest column sum and above by the largest column sum (and likewise for row sums).
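A quick numerical sanity check of the column-sum bounds, using a random entrywise-positive (hence irreducible) matrix; this is a sketch for illustration, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 5))        # entrywise positive, hence irreducible

# The Perron eigenvalue is the spectral radius of A.
lam = np.max(np.abs(np.linalg.eigvals(A)))
colsums = A.sum(axis=0)
assert colsums.min() <= lam <= colsums.max()
print(colsums.min(), lam, colsums.max())
```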
One of the simplest centrality measures is eigenvector centrality (also known as Bonacich centrality). The idea of eigenvector centrality is that the importance of a node is determined by whether it has important neighbors; that is, if $x$ is the vector of node centralities and $A$ is the (weighted) adjacency matrix, we want each node's centrality to be proportional to the combined centrality of its neighbors:
\[ \lambda x_i = \sum_j a_{ij} x_j, \quad \mbox{i.e.} \quad Ax = \lambda x. \]
Taking $\lambda$ to be the Perron eigenvalue guarantees, by the Perron-Frobenius theorem, that the centrality vector $x$ can be chosen elementwise non-negative (and positive in the irreducible case).
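A minimal power iteration sketch for eigenvector centrality; the 1-norm scaling and the small example graph are choices made for this illustration, not fixed by the method.

```python
import numpy as np

def eigenvector_centrality(A, niter=1000, tol=1e-12):
    """Power iteration for the Perron eigenvector of a non-negative matrix A."""
    x = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(niter):
        y = A @ x
        y /= y.sum()                  # 1-norm scaling; entries stay non-negative
        if np.abs(y - x).sum() < tol:
            return y
        x = y
    return x

# Triangle with a pendant node; the triangle node attached to the pendant wins.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(eigenvector_centrality(A))
```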
The HITS algorithm generalizes the notion of eigenvector centrality to the case where we have two complementary centralities: hubs and authorities. A hub is important if it points to important authorities, and an authority is important if it is pointed to by important hubs. So if $h$ and $a$ are the vectors of hub and authority scores, we want
\[ h \propto Aa \quad \mbox{and} \quad a \propto A^T h; \]
that is, $h$ and $a$ are the dominant left and right singular vectors of $A$, satisfying $AA^T h = \sigma^2 h$ and $A^T A a = \sigma^2 a$ for the largest singular value $\sigma$.
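The alternating update below is a sketch of the HITS iteration, equivalently power iteration for the dominant singular pair; the example matrix is hypothetical.

```python
import numpy as np

def hits(A, niter=100):
    """Alternating iteration for hub scores h and authority scores a."""
    h = np.ones(A.shape[0])
    for _ in range(niter):
        a = A.T @ h; a /= np.linalg.norm(a)   # authorities: pointed to by good hubs
        h = A @ a;   h /= np.linalg.norm(h)   # hubs: point to good authorities
    return h, a

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
h, a = hits(A)
print(h, a)    # node 0 is the best hub; node 2 the best authority
```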
If we scale the edge weights on the graph so that all nodes have weighted out-degree one, we are left with the stationary distribution for a Markov chain, i.e. the scaled adjacency matrix is a row-stochastic transition matrix $P$, and the centrality vector is the stationary distribution $\pi$ satisfying
\[ \pi^T P = \pi^T, \qquad \sum_i \pi_i = 1, \]
so that $\pi$ is a Perron eigenvector of $P^T$ with eigenvalue one.
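A small self-contained sketch that forms the transition matrix and extracts the stationary distribution from a dense eigendecomposition; this is fine at toy scale, while power iteration is the scalable route.

```python
import numpy as np

# Hypothetical strongly connected directed graph.
A = np.array([[0, 2, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)    # rows scaled to weighted out-degree one

# Stationary distribution: eigenvector of P^T for the eigenvalue closest to 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()
print(pi, pi @ P - pi)                  # residual should be ~0
```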
The eigenvector centrality vector, the stationary state of a random walk on a graph, and the dominant singular pair for a graph can all be computed by power iteration or by more rapidly convergent Krylov subspace iterations (Lanczos and Arnoldi). However, convergence may be slow if another eigenvalue is close to the Perron eigenvalue, which may happen in networks that exhibit strong clusters with weak connections between them. In this case, the slow convergence relates to the fact that the solution may be rather sensitive to small changes in the network.
The PageRank algorithm can be seen as a regularized version of computing the usual stationary distribution for a random walk on a graph. Rather than the usual model of a random walker that transitions only along edges in a graph, we consider a random surfer who usually behaves like a walker, but sometimes (with probability $1-\alpha$) teleports to a node drawn from a fixed reference distribution $v$. The stationary distribution of the modified chain satisfies
\[ x = \alpha P^T x + (1-\alpha) v, \]
a nonsingular linear system that can be solved by simple fixed point iteration; the error contracts by a factor of $\alpha$ at each step, independent of the spectral gap of the original chain.
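A sketch of the resulting fixed point iteration; the damping value 0.85 is the conventional choice, and the uniform teleportation vector is an assumption of this example.

```python
import numpy as np

def pagerank(P, v, alpha=0.85, tol=1e-12, maxiter=1000):
    """Fixed point iteration for x = alpha * P^T x + (1 - alpha) * v.
    P is row-stochastic; the 1-norm error contracts by alpha per step."""
    x = v.copy()
    for _ in range(maxiter):
        xnew = alpha * (P.T @ x) + (1 - alpha) * v
        if np.abs(xnew - x).sum() < tol:
            return xnew
        x = xnew
    return x

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)
n = P.shape[0]
print(pagerank(P, np.ones(n) / n))
```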
Path-based centrality measures
Recall that if $A$ is the adjacency matrix for a graph, then $(A^k)_{ij}$ counts the number of length-$k$ walks (paths in which repeated nodes are allowed) from node $i$ to node $j$. Path-based centrality measures score a node by a weighted count of the walks leaving it. If we downweight walks of length $k$ by $\alpha^k$, summing over all lengths gives the Katz centrality
\[ x = \sum_{k=1}^{\infty} \alpha^k A^k \mathbf{1} = \left( (I - \alpha A)^{-1} - I \right) \mathbf{1}, \]
where the geometric series converges provided $\alpha < 1/\rho(A)$; hence, computing the Katz centrality requires only a single linear solve.
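Because the series collapses to one linear solve, a Katz centrality sketch is short; choosing $\alpha$ as a fraction of $1/\rho(A)$ below is a heuristic of this example, not part of the definition.

```python
import numpy as np

def katz_centrality(A, alpha):
    """Katz centrality via one linear solve: x = ((I - alpha*A)^{-1} - I) 1."""
    n = A.shape[0]
    ones = np.ones(n)
    return np.linalg.solve(np.eye(n) - alpha * A, ones) - ones

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rho = np.max(np.abs(np.linalg.eigvals(A)))
print(katz_centrality(A, alpha=0.5 / rho))   # alpha safely below 1/rho
```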
An alternative is to consider not all paths in the network, but only closed paths. With the weights $1/k!$ on closed walks of length $k$, the score for node $i$ becomes
\[ \sum_{k=0}^{\infty} \frac{(A^k)_{ii}}{k!} = \left( e^A \right)_{ii}, \]
the diagonal of the matrix exponential; this is known as subgraph centrality (the Estrada index is the sum of these scores over all nodes).
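With SciPy, the subgraph centrality is one call to the dense matrix exponential; this is a toy-scale sketch (for large graphs one would instead approximate the diagonal of $e^A$).

```python
import numpy as np
from scipy.linalg import expm

def subgraph_centrality(A):
    """Diagonal of the matrix exponential: weighted counts of closed walks."""
    return np.diag(expm(A))

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(subgraph_centrality(A))    # triangle nodes score above the pendant node
```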
Closeness centrality measures
Yet another notion of centrality involves how close nodes are to each other in the graph. If $d(i,j)$ is the geodesic (shortest path) distance between nodes $i$ and $j$ in a graph on $n$ nodes, the closeness centrality of node $i$ is the reciprocal of its average distance to the other nodes,
\[ c(i) = \frac{n-1}{\sum_{j \neq i} d(i,j)}, \]
while the harmonic centrality $\sum_{j \neq i} 1/d(i,j)$ averages reciprocal distances instead, and so remains well defined even when the graph is disconnected.
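A sketch using SciPy's all-pairs shortest path routine; the BFS-based unweighted solve and the toy matrix are assumptions of this example.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def closeness_and_harmonic(A):
    """Geodesic centralities from one all-pairs shortest path computation."""
    D = shortest_path(A, method='D', unweighted=True)   # hop-count distances
    n = A.shape[0]
    closeness = (n - 1) / D.sum(axis=1)
    np.fill_diagonal(D, np.inf)                # drop self-distances before 1/D
    harmonic = (1.0 / D).sum(axis=1)           # 1/inf = 0 handles disconnection
    return closeness, harmonic

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(closeness_and_harmonic(A))
```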
The closeness centrality and harmonic centrality consider only the geodesic (shortest path) distances. The current flow centrality (or, equivalently, information centrality) instead uses the resistance distance. Let $L = D - W$ be the weighted graph Laplacian, where $D$ is the diagonal matrix of weighted node degrees; the resistance distance between nodes $i$ and $j$ is
\[ R_{ij} = (e_i - e_j)^T L^{\dagger} (e_i - e_j), \]
where $L^{\dagger}$ is the Moore-Penrose pseudoinverse of $L$. Replacing $d(i,j)$ with $R_{ij}$ in the closeness formula gives the current flow closeness centrality.
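All pairwise resistance distances come from one Laplacian pseudoinverse; the dense sketch below is for small graphs (a real implementation would use sparse Laplacian solvers rather than pinv).

```python
import numpy as np

def resistance_distances(W):
    """R[i, j] = (e_i - e_j)^T pinv(L) (e_i - e_j) for symmetric weights W."""
    L = np.diag(W.sum(axis=1)) - W
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
R = resistance_distances(W)
print((len(W) - 1) / R.sum(axis=1))     # current flow closeness scores
```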
The random walk closeness centrality is computed in terms of the mean first hitting time (or mean first passage time) for a random walker in the graph. Let $H(i,j)$ be the expected number of steps before a walker started at node $i$ first reaches node $j$; the hitting times to a fixed target $j$ satisfy the linear system
\[ H(j,j) = 0, \qquad H(i,j) = 1 + \sum_k p_{ik} H(k,j) \quad \mbox{for } i \neq j. \]
The random walk closeness of node $j$ is then the reciprocal of the average hitting time $\frac{1}{n-1} \sum_{i \neq j} H(i,j)$.
Recognizing that the coefficient matrix of this system is $I - P$ with the target row and column deleted, and that $I - P = D^{-1} L$ for an undirected graph, we see that hitting times are again obtained from Laplacian-like linear solves. Indeed, the commute time $H(i,j) + H(j,i)$ equals the resistance distance $R_{ij}$ scaled by the total weighted degree of the graph, which ties the random walk closeness back to the current flow closeness.
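A sketch of the hitting time solve for one target node, obtained by deleting the target's row and column from $I - P$; the function and example are illustrative, not a library routine.

```python
import numpy as np

def mean_hitting_times(P, j):
    """Expected steps to first reach node j, from the system (I - Q) h = 1,
    where Q is the row-stochastic P with row and column j deleted."""
    n = P.shape[0]
    keep = np.arange(n) != j
    Q = P[np.ix_(keep, keep)]
    h = np.zeros(n)
    h[keep] = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
    return h                           # h[j] = 0 by convention

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)
H0 = mean_hitting_times(P, 0)
print(1.0 / H0[1:].mean())             # random walk closeness of node 0
```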
Betweenness centrality measures
Where closeness centrality is used to find nodes that are close to many other nodes, betweenness centrality measures are used to find nodes that appear on many paths between other nodes. The most common form of betweenness centrality focuses on shortest paths:
\[ c_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}, \]
where $\sigma_{st}$ counts the shortest paths from $s$ to $t$ and $\sigma_{st}(v)$ counts those that pass through $v$. As in the case of closeness centrality, the expensive part of betweenness centrality computations is an all-pairs shortest path computation. A method due to Brandes interleaves the shortest path computations with the accumulation of path counts, running in $O(nm)$ time for an unweighted graph with $n$ nodes and $m$ edges; a later approximation scheme that samples intermediate pivot (or landmark) vertices runs more quickly, in $O(km)$ time for $k$ pivots.
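Below is a compact sketch of Brandes' algorithm for unweighted graphs, following the BFS-plus-dependency-accumulation scheme just described; the adjacency list input format is a choice of this example.

```python
from collections import deque

def brandes_betweenness(adj):
    """Brandes' algorithm, O(nm) for an unweighted graph given as an
    adjacency list {node: [neighbors]}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward phase: BFS counting shortest paths (sigma) and predecessors.
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        pred = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        # Backward phase: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc   # halve the scores for an undirected graph

# Hypothetical undirected star: node 1 lies on all paths between leaves.
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(brandes_betweenness(adj))
```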
Just as we can transition from geodesic distances to flow distances or random walk distances in closeness centrality, we can do the same for betweenness centrality. Similar techniques to those we saw with closeness centrality can also be used to accelerate these betweenness centrality computations.
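If NetworkX is available, its centrality module already provides current flow betweenness along with a pivot-style approximation; a brief usage sketch (the karate club graph is just a convenient built-in test case):

```python
import networkx as nx

G = nx.karate_club_graph()
exact = nx.current_flow_betweenness_centrality(G)
approx = nx.approximate_current_flow_betweenness_centrality(G, epsilon=0.1)
print(max(exact, key=exact.get), max(approx, key=approx.get))
```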