Chapter 15

Web Search

In this chapter, we consider the problem of Web search—that is, the problem of finding the “most relevant” webpage to a given search query. This problem is related to the sponsored search problem considered in Section 10.6, but it is different in that we consider a model without payments: that is, webpages are not paying to get included in the search results. The question is thus how to design a mechanism that ensures “good search results” even if webpages try to manipulate their content to show up as a response to search queries for which they are not relevant.

For simplicity, when studying this problem, we will focus on a setting where search queries are simply keywords and assume the existence of some “naïve” scoring method for determining whether the content of a webpage is relevant with respect to that keyword—for instance, this score might depend on whether the page contains the keyword (or how many times it contains the keyword), whether it contains synonyms, spelling-corrected versions, or other related features.

Early web search algorithms relied mostly on such “content-scoring” methods to score webpages, and as a result were easy to fool by creating pages with massive lists of keywords. Modern methods for web search instead make use of the network structure of the Internet to “amplify” the naïve content scoring method into a more accurate score: Recall that we can view the Internet as a directed graph G = (V,E)—the nodes V represent webpages, and we have a directed edge (v,v′) between nodes (i.e., webpages) v,v′ if and only if v links to v′. Given such a graph representation, we can specify the content of a webpage through a function c(·) that maps nodes in V to strings; we refer to the tuple (G,c) as a web structure.

15.1  Weighted Voting and PageRank

Given a naïve rule for scoring, how can we use the network structure of the Internet to improve it? The key insight is to think of the web search problem as an instance of voting.

First approach: using in-links as votes Given some web structure (G,c), begin by removing all pages deemed irrelevant (by the content-scoring algorithm) from the graph—that is, remove all pages v whose content c(v) gets a score below some threshold. Now, let a webpage’s “actual score” be the number of webpages that link to it in the reduced Internet graph:

Score(v) = in-deg(v).

This already works much better than simple content-based metrics. However, it is still easy to artificially boost a page’s relevance: a spammer can simply create a lot of fake webpages (each containing enough keywords to be deemed relevant) and have them link to the page they wish to promote.

Weight-based voting The obvious problem with the previous approach is that all webpages’ “votes” are counted equally. While this is a desirable feature for elections where the voters are people, it becomes problematic in the current setting, as it is easy to create fake websites (and thus fake “voters”). To circumvent this problem, we can assign a weight w(v) to each webpage (voter) v and define the score of a webpage as follows:

Score(v) = Σ_{v′ : (v′,v) ∈ E} w(v′).

(Note that this bears some similarities to scoring-based voting rules, where the score of a candidate was defined as the sum of the weights assigned by voters to the candidates; the difference here is that voters are not assigning weights to the candidates, rather voters themselves are weighted.)
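
To make the two voting-based scoring rules above concrete, here is a minimal sketch in Python (the graph encoding, function names, and toy example are illustrative assumptions, not part of any standard library):

    from collections import defaultdict

    def in_link_score(edges):
        """Score(v) = in-deg(v): count how many pages link to v."""
        score = defaultdict(int)
        for u, v in edges:
            score[v] += 1
        return dict(score)

    def weighted_score(edges, weight):
        """Score(v) = sum of w(u) over all pages u with an edge (u, v)."""
        score = defaultdict(float)
        for u, v in edges:
            score[v] += weight[u]
        return dict(score)

    # Toy example: a and b both link to c; a also links to b.
    edges = [("a", "c"), ("b", "c"), ("a", "b")]
    print(in_link_score(edges))                                   # {'c': 2, 'b': 1}
    print(weighted_score(edges, {"a": 0.5, "b": 0.1, "c": 1.0}))  # {'c': 0.6, 'b': 0.5}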

But how do we assign weights to webpages/voters so that relevant nodes are heavily weighted and spammers’ “fake” webpages get a lower weight?1 One approach—called Hubs–and–Authorities—introduced by Jon Kleinberg [Kle99], determines the weight of a node based on how well it is able to “predict” the relevance of other nodes, in the sense that the node links to the “relevant” webpages. More precisely, nodes are assigned two different scores: one score captures a node’s ability as a predictor, and the second captures its relevance as an answer to the search query. The relevance score is computed using the above-mentioned weighted scoring method, where the weight of a node is its prediction score. The prediction score, in turn, is computed so as to give a high prediction score to nodes that link to nodes with a high relevance score: concretely, the prediction score of a node is simply the sum of the relevance scores of the nodes it links to. Nodes that score high as predictors are called hubs, and nodes that score high in terms of relevance are called authorities. For instance, pages like Yahoo or Google News are considered hubs, whereas a content provider such as the New York Times would be considered an authority.
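
As a rough illustration of the Hubs–and–Authorities idea, the following sketch iterates hub and authority scores on a small graph; the dictionary-based graph encoding and the normalization choice are assumptions made for this example and are not meant to reproduce Kleinberg’s exact formulation:

    def hubs_and_authorities(edges, nodes, iters=50):
        """Iterate hub/authority scores (a sketch of the idea):
          authority(v) = sum of the hub scores of the pages linking to v
          hub(v)       = sum of the authority scores of the pages v links to
        Scores are renormalized each round so they stay bounded."""
        hub = {v: 1.0 for v in nodes}
        auth = {v: 1.0 for v in nodes}
        for _ in range(iters):
            auth = {v: sum(hub[u] for (u, w) in edges if w == v) for v in nodes}
            hub = {v: sum(auth[w] for (u, w) in edges if u == v) for v in nodes}
            a_total = sum(auth.values()) or 1.0
            h_total = sum(hub.values()) or 1.0
            auth = {v: s / a_total for v, s in auth.items()}
            hub = {v: s / h_total for v, s in hub.items()}
        return hub, auth

    # A tiny "hub" h pointing at two content pages that also cite each other.
    edges = [("h", "x"), ("h", "y"), ("x", "y"), ("y", "x")]
    hub, auth = hubs_and_authorities(edges, ["h", "x", "y"])
    print(hub)    # h gets the highest hub score
    print(auth)   # x and y get the highest authority scores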

The PageRank algorithm [PBMW98] originally used by Google takes a more symmetric approach to weighting nodes: the weight of each node is simply defined to be its (relevance) score divided by its number of outgoing links. In other words, we assign more weight to links from webpages that are themselves considered “relevant,” which in turn are those that are linked to by “relevant” pages, and so forth; and the relevance score gets scaled by the number of out-links a webpage has (or else a relevant webpage could cast an arbitrary number of full-strength “votes”). Formally, we have:

Score(v) = Σ_{v′ : (v′,v) ∈ E} Score(v′) / out-deg(v′).

We can furthermore add the conditions that Score(v) ≥ 0 for all v ∈ V and that Σ_{v∈V} Score(v) = 1.

But have we gained anything? To define the score of a node, we need to know the scores of all nodes linking to it, so isn’t the definition circular? It is, but just as in the case of the definition of a Nash equilibrium, what we are looking for is a fixed point of the above equations—in other words, an assignment of scores to nodes such that all of the equations hold. (Recall that a Nash equilibrium can be viewed as a strategy profile that is a fixed point of the players’ best-response operator.)

Example 15.1. Consider a three-node “triangle” with nodes a,b,c and edges (a,b), (b,c), (c,a). A fixed point here would be to set all pages’ scores to 1/3—each node has one in-link and one out-link; thus,

Score(v) = Σ_{v′ : (v′,v) ∈ E} Score(v′) / out-deg(v′) = (1/3) / 1 = 1/3

holds for all nodes, and in addition all nodes’ scores sum to 1.

An iterative PageRank procedure Just as we used BRD to find PNE, we can consider an analogue of BRD in the context of PageRank. Consider the following iterative procedure, initialized with Score_0(v) = 1/n for every node v:

Score_{i+1}(v) = Σ_{v′ : (v′,v) ∈ E} Score_i(v′) / out-deg(v′).

So, at each step, the score of a node v is effectively split evenly among all of its out-links. We can, intuitively, consider this process in terms of a “flow” of score, where each node starts with 1/n unit of flow and equally distributes its flow among all outward edges.

However, this gives rise to a potential problem: If a webpage does not have any out-links, all of the “flow” it carries gets lost—there is a leak in the system! Since we must ensure that the sum of all pages’ scores remains 1, we will assume that nodes without out-links at least link to themselves, and thus that all nodes in our graphs will have at least one outgoing edge.
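
A naive sketch of this iterative procedure might look as follows (the graph encoding and function name are illustrative; dangling nodes are given a self-loop, as assumed above):

    def pagerank_iterate(edges, nodes, iters=100):
        """Basic (unscaled) iterative PageRank sketch.

        Every node starts with score 1/n; in each round it splits its current
        score evenly among its out-links.  Nodes without out-links are given a
        self-loop, so no score "leaks" out of the system."""
        out = {v: [w for (u, w) in edges if u == v] for v in nodes}
        for v in nodes:
            if not out[v]:
                out[v] = [v]                # dangling node: add a self-loop
        n = len(nodes)
        score = {v: 1.0 / n for v in nodes}
        for _ in range(iters):
            nxt = {v: 0.0 for v in nodes}
            for v in nodes:
                share = score[v] / len(out[v])
                for w in out[v]:
                    nxt[w] += share         # v's score flows along its out-edges
            score = nxt
        return score

    # On the triangle of Example 15.1 the scores stay at the fixed point 1/3 each.
    print(pagerank_iterate([("a", "b"), ("b", "c"), ("c", "a")], ["a", "b", "c"]))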

As the following simple claim shows, once we make this restriction, the total score in the graph is preserved after each step of the procedure (and thus actually remains 1, due to the initialization step).

Claim 15.1. For every graph G = (V,E) in which each node has out-degree at least 1 and every i ≥ 0, it holds that

Σ_{v∈V} Score_{i+1}(v) = Σ_{v∈V} Score_i(v).

Additionally, if for all v ∈ V it holds that Score_i(v) ≥ 0, then for all v ∈ V, Score_{i+1}(v) ≥ 0.

Proof. To prove the first part of the claim, we simply expand out the definition of Score_{i+1}(·):

Σ_{v∈V} Score_{i+1}(v) = Σ_{v∈V} Σ_{v′:(v′,v)∈E} Score_i(v′) / out-deg(v′)
  = Σ_{(v′,v)∈E} Score_i(v′) / out-deg(v′)
  = Σ_{v′∈V} Σ_{v:(v′,v)∈E} Score_i(v′) / out-deg(v′)
  = Σ_{v′∈V} out-deg(v′) · Score_i(v′) / out-deg(v′)
  = Σ_{v′∈V} Score_i(v′).

Note that the last two equalities rely on the fact that every node v′ ∈ V has positive out-degree (so that the inner sum has exactly out-deg(v′) terms and the out-degree cancels). The second part of the claim follows directly from the definition of Score_{i+1}(·).

Some issues with PageRank While restricting to graphs where every node has a positive out-degree ensured that the iterative PageRank procedure is well defined, there are still some problems with PageRank, as illustrated in the next examples.

Example 15.2. Consider the three-node graph in the prior example, but this time add a node z and edges (a,z) and (z,z) (a self-loop). (See Figure 15.1 for an illustration.) At every stage of the iterative process, z’s score will increase, as it will receive “flow” from a; however, since none of z’s flow will ever leave it, over time z’s score will converge to 1, and the other nodes’ scores to 0! In addition, this is not only a property of self-loops in the graph; we can instead add to the three-node graph two nodes z1 and z2, together with edges (a,z1), (a,z2), (z1,z2), and (z2,z1). As the imaginary “flow” can never leave the network of z1 and z2, eventually these nodes’ scores will converge to 1/2 and the others’ scores approach 0.

Figure 15.1: The example graphs consisting of a triangle with additional nodes. Notice that, if we think of PageRank score as a “fluid,” it can only enter the group of purple nodes and not leave; eventually, if we iterate this process, the purple nodes’ total scores will converge to 1, while the others’ will approach 0.
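
A quick numeric check (a throwaway sketch; the graph encoding is just for illustration) makes the “leak” of Example 15.2 visible:

    # Triangle a -> b -> c -> a, plus a sink z (edges (a, z) and (z, z)).
    out = {"a": ["b", "z"], "b": ["c"], "c": ["a"], "z": ["z"]}
    score = {v: 1.0 / len(out) for v in out}            # everyone starts at 1/4
    for _ in range(200):                                # iterate the update rule
        nxt = {v: 0.0 for v in out}
        for v, links in out.items():
            for w in links:
                nxt[w] += score[v] / len(links)
        score = nxt
    print({v: round(s, 3) for v, s in score.items()})   # z has soaked up ~all the score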

Example 15.3. Yet another issue with this algorithm is that the fixed points of scores in a particular graph are not uniquely defined. For instance, consider a graph consisting of two disjoint three-node triangles, a,b,c and a′,b′,c′ (see Figure 15.2). While the iterative process gives us a fixed point where all nodes have score 1/6, there exist infinitely many equally viable fixed points—any assignment of scores where a,b,c have score α and a′,b′,c′ have score β, where α,β ≥ 0 and 3α + 3β = 1, is a fixed point of this network!

Figure 15.2: Two admissible assignments of fixed point PageRank scores for the example graph consisting of two disjoint triangles. In fact, as long as the scores of every node in a triangle are identical and the sum of all nodes’ scores is 1, any such assignment is a fixed point.

15.2  Scaled PageRank

Fortunately, we can refine PageRank to deal with these problems.

𝜖-scaled PageRank Modify the PageRank procedure to include a constant parameter 𝜖 > 0, so that nodes share only a (1 − 𝜖) fraction of their own score with their out-links, while the remaining 𝜖 fraction “evaporates” and is redistributed evenly among all nodes. So, each node gets 𝜖/n score plus whatever it receives from its in-neighbors. (It is widely believed that Google uses 𝜖 = 1/7 in their implementation.) In other words, we are looking for a fixed point of the equations described by

Score(v) = 𝜖/n + (1 − 𝜖) · Σ_{v′ : (v′,v) ∈ E} Score(v′) / out-deg(v′)

such that Σ_{v∈V} Score(v) = 1, where n = |V|. As before, we can attempt to find this fixed point by the following modified iterative procedure (again initialized with Score_0(v) = 1/n):

Score_{i+1}(v) = 𝜖/n + (1 − 𝜖) · Σ_{v′ : (v′,v) ∈ E} Score_i(v′) / out-deg(v′).

It follows, using essentially the same proof as for Claim 15.1, that the total score in the graph is preserved after each step of the procedure. This time, however, this is only true if the total score in the graph is one.

Claim 15.2. For every graph G = (V,E) in which each node has out-degree at least one and every i ≥ 0, if Σ_{v∈V} Score_i(v) = 1, then Σ_{v∈V} Score_{i+1}(v) = 1. Additionally, if for all v ∈ V it holds that Score_i(v) ≥ 0, then for all v ∈ V, Score_{i+1}(v) ≥ 0.

Proof. As in the proof of Claim 15.1, we expand out the definition of Score_{i+1}(·):

Σ_{v∈V} Score_{i+1}(v) = Σ_{v∈V} ( 𝜖/n + (1 − 𝜖) · Σ_{v′:(v′,v)∈E} Score_i(v′) / out-deg(v′) )
  = 𝜖 + (1 − 𝜖) · Σ_{(v′,v)∈E} Score_i(v′) / out-deg(v′)
  = 𝜖 + (1 − 𝜖) · Σ_{v′∈V} Σ_{v:(v′,v)∈E} Score_i(v′) / out-deg(v′)
  = 𝜖 + (1 − 𝜖) · Σ_{v′∈V} Score_i(v′)
  = 𝜖 + (1 − 𝜖) = 1.

The second part of the claim again follows directly from the definition of Score_{i+1}(·).

Let us now revisit the example of the two disjoint triangles—the only fixed point is such that all nodes receive score 1/6. And, returning to the example with the node z attached to the triangle, z will no longer end up with all of the score, but instead with significantly less.
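
The following sketch (plain Python; the graph encodings are illustrative, and 𝜖 = 1/7 is just the value mentioned above) implements the 𝜖-scaled iteration and can be used to check both examples numerically:

    def scaled_pagerank(out, eps=1/7, iters=200):
        """Iterate: Score(v) = eps/n + (1-eps) * sum of Score(u)/out-deg(u)
        over the in-neighbors u of v.  Assumes every node has an out-link."""
        n = len(out)
        score = {v: 1.0 / n for v in out}
        for _ in range(iters):
            nxt = {v: eps / n for v in out}
            for u, links in out.items():
                for w in links:
                    nxt[w] += (1 - eps) * score[u] / len(links)
            score = nxt
        return score

    # Two disjoint triangles: every node ends up with score ~1/6.
    two_triangles = {"a": ["b"], "b": ["c"], "c": ["a"],
                     "a'": ["b'"], "b'": ["c'"], "c'": ["a'"]}
    # Triangle plus sink z: z's score now stays well below 1.
    triangle_plus_z = {"a": ["b", "z"], "b": ["c"], "c": ["a"], "z": ["z"]}
    print(scaled_pagerank(two_triangles))
    print(scaled_pagerank(triangle_plus_z))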

Additionally, as we now show, with this modification, scaled PageRank scores are guaranteed to exist and are uniquely defined.

Theorem 15.1. For every 𝜖 ∈ (0,1) and every graph G in which each node has out-degree at least one, the 𝜖-scaled PageRank scores are uniquely defined for each node.

Proof. It will be convenient to adopt linear-algebraic notation to represent the network. Let us assume (without loss of generality) that the nodes of the graph are called 1, 2, …, n.

  • Let r denote a vector describing the score of each node, where r_i = Score(i); we say that a vector r is a probability vector if (just like a proper score vector) each r_i ≥ 0 and Σ_{i∈[n]} r_i = 1.
  • Let N be the matrix where N_{j,i} = 0 if there is no edge (j,i); if there is such an edge, let N_{j,i} = 1/out-deg(j).

So, given this notation, for the original (non-scaled) notion of PageRank, the fixed point we are looking for is a probability vector r such that:

r_i = Σ_j N_{j,i} r_j

or, equivalently, such that

r = N^T r,

where N^T denotes the transpose of the matrix N. Hence, what we are looking for is an eigenvector of N^T with eigenvalue 1. (Recall that a vector v is an eigenvector with eigenvalue λ of a matrix M if Mv = λv.) We must find an eigenvector r that additionally satisfies the constraint that r is a probability vector. The question is whether such an eigenvector always exists.

A powerful theorem from linear algebra, originally proved by Oskar Perron in 1907 (see, e.g., [MR18]), shows that such an eigenvector r always exists, and is uniquely defined, for any matrix M satisfying the following two conditions:

  • every entry of M is a positive real number; and
  • M is a stochastic matrix;2 that is, multiplying any probability vector by M again yields a probability vector.

With regard to the second condition, note that multiplication by the matrix N^T corresponds to applying a single iteration of the iterative PageRank algorithm; hence, by Claim 15.1, multiplying a probability vector by N^T will always produce a probability vector, so the second condition is satisfied.

However, the first condition is where the unscaled PageRank fails—recall that we set a lot of the entries of N to be zero. Now, consider scaled PageRank. We are now looking for r such that:

r_i = 𝜖/n + (1 − 𝜖) Σ_j N_{j,i} r_j.

But Σ_j r_j = ||r||_1 = 1, so, equivalently:

r_i = Σ_j ((1 − 𝜖) N_{j,i} + 𝜖/n) r_j.

So, let us define a new matrix Ñ such that

Ñ_{j,i} = (1 − 𝜖) N_{j,i} + 𝜖/n.

Given this new matrix, the vector r we are looking for should satisfy

r_i = Σ_j Ñ_{j,i} r_j

or, equivalently, r = Ñ^T r. This time, however, by our construction of Ñ, we know that all of the entries of Ñ^T are positive reals. And because multiplication by Ñ^T represents an iteration of the iterative scaled PageRank process, by Claim 15.2, we know that Ñ^T is also a stochastic matrix. So, by Perron’s theorem above, there is a unique fixed point r, which additionally is a probability vector.
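
For intuition, the fixed point can also be computed directly as the Perron eigenvector of Ñ^T; the sketch below (using numpy, on the illustrative triangle-plus-z graph from Example 15.2) recovers the same scores that the iterative procedure converges to:

    import numpy as np

    eps = 1/7
    # Out-links for the triangle a -> b -> c -> a plus sink z (a -> z, z -> z).
    out_links = {0: [1, 3], 1: [2], 2: [0], 3: [3]}    # 0=a, 1=b, 2=c, 3=z
    n = len(out_links)

    # N[j, i] = 1/out-deg(j) if (j, i) is an edge, else 0.
    N = np.zeros((n, n))
    for j, links in out_links.items():
        for i in links:
            N[j, i] = 1.0 / len(links)

    N_tilde = (1 - eps) * N + eps / n                  # every entry now positive

    # Eigenvector of N_tilde^T with eigenvalue 1, rescaled to a probability vector.
    vals, vecs = np.linalg.eig(N_tilde.T)
    r = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    r = r / r.sum()
    print(np.round(r, 3))   # the unique epsilon-scaled PageRank scores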

Next, we show that the iterative procedure for scaled PageRank not only converges (to the unique scaled PageRank scores), but does so quickly. To determine the “distance” between score vectors, we will employ the L1 norm: given a vector r, let ||r||_1 = Σ_i |r_i| denote the L1 norm of r.

Theorem 15.2. The iterative procedure described above for 𝜖-scaled PageRank converges to the unique fixed-point solution. Furthermore, at each iteration, the L1-distance to the solution decreases by at least a factor of (1 − 𝜖).

Proof. [Advanced] Let us first redefine the iterative procedure in terms of the linear-algebraic notation we used in the previous proof, as follows:

  • r^0 = (1/n) · 1, where 1 denotes the all-ones vector;
  • r^{t+1} = Ñ^T r^t = (Ñ^T)^{t+1} r^0.

Let r be the uniquely defined 𝜖-scaled PageRank score vector, and let Err(t) = ||r^t − r||_1. Given a vector r = (r_1, r_2, …, r_n), let |r| denote (|r_1|, |r_2|, …, |r_n|). We show that Err(t) converges to 0 as t increases, as follows:

Err(t + 1) = ||r^{t+1} − r||_1 = ||Ñ^T r^t − Ñ^T r||_1
  = ||((1 − 𝜖) N^T r^t + (𝜖/n)(Σ_i r_i^t) · 1) − ((1 − 𝜖) N^T r + (𝜖/n)(Σ_i r_i) · 1)||_1
  = ||((1 − 𝜖) N^T r^t + (𝜖/n) ||r^t||_1 · 1) − ((1 − 𝜖) N^T r + (𝜖/n) ||r||_1 · 1)||_1
  = ||((1 − 𝜖) N^T r^t + (𝜖/n) · 1) − ((1 − 𝜖) N^T r + (𝜖/n) · 1)||_1
  = (1 − 𝜖) ||N^T r^t − N^T r||_1
  = (1 − 𝜖) ||N^T (r^t − r)||_1
  ≤ (1 − 𝜖) ||N^T (|r^t − r|)||_1
  = (1 − 𝜖) || |r^t − r| ||_1
  = (1 − 𝜖) ||r^t − r||_1
  = (1 − 𝜖) · Err(t)

as desired, where the third-to-last equality follows from the fact that a single step of the unscaled PageRank algorithm (that is, multiplication by N^T) preserves the total score in the system (by Claim 15.1), and thus the L1 norm of any vector with non-negative components (not just those with norm 1). (In contrast, for multiplication by Ñ^T, this may not necessarily hold, although, as shown in Claim 15.2, it does hold for vectors with norm 1, i.e., for probability vectors.)
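
A quick sanity check of this contraction (a throwaway numpy sketch; the graph and 𝜖 are the same illustrative choices as before) confirms that the L1 error shrinks by a factor of at most (1 − 𝜖) in each round:

    import numpy as np

    eps = 1/7
    out_links = {0: [1, 3], 1: [2], 2: [0], 3: [3]}    # triangle plus sink z
    n = len(out_links)
    N = np.zeros((n, n))
    for j, links in out_links.items():
        N[j, links] = 1.0 / len(links)
    N_tilde = (1 - eps) * N + eps / n

    # Approximate the fixed point by iterating many times, then measure errors.
    r_star = np.full(n, 1.0 / n)
    for _ in range(10_000):
        r_star = N_tilde.T @ r_star

    r = np.full(n, 1.0 / n)
    for t in range(5):
        err, r = np.abs(r - r_star).sum(), N_tilde.T @ r
        new_err = np.abs(r - r_star).sum()
        print(t, new_err / err <= (1 - eps) + 1e-12)   # prints True each round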

An alternate interpretation of (scaled) PageRank Consider a process where you start off at a random node in the Internet graph, walk to a random one of its neighbors by traversing an outgoing edge, and randomly walk through the Internet in this manner (i.e., exploring the Internet at random by clicking on random links). Note that if we start off with a probability distribution over nodes, described as a vector r (each component of which specifies the probability of the node corresponding to the component), then the distribution we obtain after one step of this process (i.e., traversing a random outgoing edge) is exactly N^T r. We can now use this observation to note that the probability that you end up at a certain page after k steps is, in fact, given by the k-iteration unscaled PageRank score of that page:

Score_k(v) = ((N^T)^k r^0)_v = Pr[a k-step random walk started at a uniformly random page ends at page v].

So the k-iteration PageRank score of a webpage is the probability that you will end up at that page after k steps of a random walk with a random starting point. Similarly, we can think of scaled PageRank scores as characterizing a different type of random walk: At each step, with probability 𝜖, you are instead transported to a random page; otherwise, you click a random outgoing link as normal.

Personalized PageRank In the above description of scaled PageRank, we start at a uniformly random node, and each time we are randomly “transported” (with probability 𝜖) we are again set back to a uniformly random node. In an alternative version, we can attempt to obtain a “personalized score” for a certain type of user by changing the distribution from being uniformly random to something else—for instance, based on how popular each webpage is (or is estimated to be) among the user’s demographics.
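
A sketch of this variant: replace the uniform teleportation step with a jump drawn from a chosen distribution p over pages; the long-run visit frequencies of the resulting walk then estimate the personalized scores. The teleport distribution, graph, and parameters below are made-up examples, not any particular provider’s choices:

    import random

    def personalized_pagerank_walk(out, teleport, eps=1/7, steps=200_000):
        """Estimate personalized PageRank scores by simulating the random walk:
        with probability eps, jump to a page drawn from `teleport`;
        otherwise, follow a uniformly random out-link of the current page."""
        pages, weights = zip(*teleport.items())
        visits = {v: 0 for v in out}
        v = random.choices(pages, weights)[0]
        for _ in range(steps):
            if random.random() < eps:
                v = random.choices(pages, weights)[0]   # teleport step
            else:
                v = random.choice(out[v])               # follow a random link
            visits[v] += 1
        return {v: c / steps for v, c in visits.items()}

    # A user assumed to favor page "a": teleports land on "a" 70% of the time.
    out = {"a": ["b", "z"], "b": ["c"], "c": ["a"], "z": ["z"]}
    print(personalized_pagerank_walk(out, {"a": 0.7, "b": 0.1, "c": 0.1, "z": 0.1}))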

15.3  Impossibility of Non-manipulable Web Search

As mentioned above, pure content-scoring methods are easy to manipulate to enhance the score of certain webpages—this type of manipulation is usually referred to as search engine optimization. The PageRank method seems to have made the process more robust, but it is not hard to see that this procedure can also be manipulated; indeed, there is a whole industry of companies—referred to as SEOs (search engine optimizers)—providing advice on search engine optimization. As a result, search providers (such as Google) are very secretive about the actual algorithm that they are using, and they frequently change the details of it. A natural question is whether such a “cat-and-mouse” game (between the search providers and the SEOs) is needed, or whether it is feasible to have a fully public web search algorithm that cannot be manipulated.

We here provide a partial answer to this question. We show that “natural” web search algorithms, of the type we have seen in this chapter, can always be manipulated. More precisely, we consider a class of canonical web search algorithms, which proceed as follows given a web structure (G = (V,E), c) and a search query Q:

  • First, a content-scoring method assigns each node v a content score, based only on the node’s content c(v) and the query Q.
  • Then, the final score of each node is computed as some function of the graph structure and of the nodes’ content scores, but not of the nodes’ names.3

For instance, for the case of PageRank, the content-scoring method outputs either 0 or 1, based on whether the content was “relevant” to the search query, and the second step computes the PageRank of a node v by applying some function to the restricted graph obtained by removing all nodes with a content score of 0.

We now have the following result with respect to canonical web search algorithms: consider some web structure (G,c), some node v, and some query Q such that the web search algorithm assigns v the highest score. Then, either the content-scoring method itself is already “robust” in the sense that we cannot find some “bad/spam” web content c̃ that would get the same content score as the real content c(v), or an SEO can “manipulate” the outcome of the search algorithm as follows: the SEO adds an independent “fake” copy of the whole web to the graph—so the new Internet graph is now G ∪ G′, where G′ is an independent copy of G—except that the content of the “fake” node v′ in G′ corresponding to v in G is set to c̃. Since a canonical web search algorithm assigns a score to a node only as a function of the graph structure and the content scores of the nodes, it must assign the same score to both v and v′ (since, from the perspective of the search algorithm, the two nodes look identical; formally, this follows by switching the names of all “real” nodes with those of their “fake” relatives, and vice versa, and noticing that this permuted graph is identical to the original, and thus v and v′ must get the same scores). Thus, either of the following two things must happen, each of which is a success for the SEO:

  • the fake page v′ (carrying the spam content c̃) is displayed among the top results for the query Q; or
  • the fake page v′ is not displayed, but then, since v and v′ received identical scores, the real page v cannot be guaranteed its top spot either, so the SEO has succeeded in displacing the page that ought to have been the most relevant result.

Thus, unless the content-scoring method already is robust (which, as we argued before, seems hard to achieve), the web search algorithm can be manipulated by an SEO.
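
To see the symmetry argument in action, here is a toy sketch; the small “web,” the use of scaled PageRank as the scoring step, and all names are illustrative assumptions. It clones the graph and checks that the real top page and its fake twin receive identical scores:

    def scaled_pagerank(out, eps=1/7, iters=500):
        """Iterative epsilon-scaled PageRank on a dict-of-out-links graph."""
        n = len(out)
        score = {v: 1.0 / n for v in out}
        for _ in range(iters):
            nxt = {v: eps / n for v in out}
            for u, links in out.items():
                for w in links:
                    nxt[w] += (1 - eps) * score[u] / len(links)
            score = nxt
        return score

    # "Real" web: v is the top page; every other page links to it, v links to itself.
    real = {"v": ["v"], "p": ["v"], "q": ["v"]}
    # Cloning attack: add an independent copy of the whole graph (primed names),
    # where the clone v' would carry spam content that passes the content filter.
    fake = {u + "'": [w + "'" for w in links] for u, links in real.items()}
    cloned = {**real, **fake}

    scores = scaled_pagerank(cloned)
    print(round(scores["v"], 6) == round(scores["v'"], 6))   # True: v and v' tie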

Of course, this attack is not easy to mount, as it requires producing a copy of the whole web; so if we can impose a cost on obtaining IP addresses, its power can be limited. Another approach is to use a content-scoring method that incorporates some human intervention: for instance, the score of a webpage may become higher if the webpage has received an SSL certificate that requires manual verification; this, again, would make it much harder for an SEO to simply copy the web. Finally, we can bootstrap the process by relying on human intervention (e.g., employees at Google) to score certain specific webpages, and simply give lower scores to webpages that are not part of the connected components containing manually scored pages—this prevents the above-mentioned “cloning” attacks, as the “fake web” can now easily be distinguished from the “real web” (the real web is connected to webpages with manual scores, whereas the fake one is not).

Notes

As mentioned above, the “Hubs–and–Authorities” approach was introduced by Kleinberg [Kle99]. Google’s PageRank approach was first introduced and analyzed by Page, Brin, Motwani, and Winograd [PBMW98]. The result on the manipulability of web search is new.

1It is not absurd to ask the same question for the case of standard voting—one could potentially weigh the votes of more informed voters higher.

2An equivalent definition of a stochastic matrix is that every column of M is a probability vector.

3What we mean by this is that the final score cannot depend on the names of the nodes in G; this is formalized by requiring that the final score stays the same if we permute the names of the nodes, but keep everything else (i.e., the edges and the content score) the same with respect to the new names.