Keywords: rank aggregation, ranking functions, meta-search, multi-word queries, spam
Firstly, there is the question of "spam" --- devious manipulation by authors of Web pages in an attempt to achieve undeservedly high rank. No single ranking function can be trusted to perform well for all queries. A few years ago, query term frequency was the single main heuristic in ranking Web pages; since the influential work of Kleinberg [Kleinberg 1998, 1999] and Brin and Page [Brin and Page 1998], link analysis has come to be identified as a very powerful technique in ranking Web pages and other hyperlinked documents. Several other heuristics have been added, including anchor-text analysis [Chakrabarti et al. 1998], page structure (headers, etc.) analysis, the use of keyword listings and the URL text itself, etc. These well-motivated heuristics exploit a wealth of information, but are often prone to manipulation by devious parties.
Secondly, in a world governed by (frequently changing) commercial interests and alliances, it is not clear that users have any form of protection against the biases/interests of individual search engines. As a case in point, note that paid placement and paid inclusion (see [Article in Search Engine Watch]) appear to be gaining popularity among search engines.
In some cases, individual ranking functions are inadequate for a more fundamental reason: the data being ranked are simply not amenable to simple ranking functions. This is the case with querying about multimedia documents, e.g. "find a document that has information about Greek islands with pictures of beautiful blue beaches." This is a problem conventionally studied in database middleware (see [Fagin 1999]). Several novel approaches have been invented for this purpose, but this problem cannot be considered well-solved by any measure. Naturally, these problems fall under the realm of rank aggregation.
Thus, our first motivation for studying rank aggregation in the context of the Web is to provide users a certain degree of robustness of search, in the face of various shortcomings and biases --- malicious or otherwise --- of individual search engines. That is, to find robust techniques for meta-search.
There is a second, very broad, set of scenarios where rank aggregation is called for. Roughly described, these are the cases where the user preference includes a variety of criteria, and the logic of classifying a document as acceptable or unacceptable is too complicated or too nebulous to encode in any simple query form. As prototypical examples, we list some cases that Web users experience frequently. Broadly, these can be classified as multi-criteria selection and word association queries. Examples of multi-criteria selection arise when trying to choose a product from a database of products, such as restaurants or travel plans. Examples of word association queries arise when a user wishes to search for a good document on a topic; the user knows a list of keywords that collectively describe the topic, but isn't sure that the best document on the topic necessarily contains all of them. (See Section 5 for specific examples of both categories.) This is a very familiar dilemma for Web search users: when we supply a list of keywords to a search engine, do we ask for documents that contain all the keywords, or do we ask for documents that contain any of the keywords? Notice that the former may produce no useful document, or too few of them, while the latter may produce an enormous list of documents where it is not clear which one to choose as the best. We propose the following natural approach to this problem:
Associations Ranking
Rank the database with respect to several small subsets of the queries, and aggregate these rankings.
The first reason is a particularly acute problem in doing meta-search: the coverage of various search engines is different; it is unlikely that all search engines will (eventually) be capable of ranking the entire collection of pages on the Web, which is growing at a very high rate. Secondly, search engines routinely limit access to the first few hundred pages in their rank-ordering. This is done both to ensure the confidentiality of their ranking algorithm, and in the interest of efficiency. The issue of efficiency is also a serious bottleneck in performing rank aggregation for multi-criteria selection and word association queries.
Therefore, any method for rank aggregation for Web applications must be capable of dealing with the fact that only the top few hundred entries of each ranking are available. Of course, if there is absolutely no overlap among these entries, there isn't much any algorithm can do; the challenge is to design rank aggregation algorithms that work when there is limited but non-trivial overlap among the top few hundreds or thousands of entries in each ranking. Finally, in light of the amount of data, it is implicit that any rank aggregation method has to be computationally efficient.
Our algorithms for initial aggregation are based on two broad principles. The first principle is to achieve optimality not with respect to the Kemeny guidelines, but with respect to a different, closely related, measure, for which it is possible to find an efficient solution. The second principle is through the use of Markov chains as a means of combining partial comparison information --- derived from the individual rankings --- into a total ordering. While there is no guarantee on the quality of the output, the latter methods are extremely efficient, and usually match or outperform the first method.
We report experiments and quantitative measures of quality for the meta-search problem, and give several illustrations of our methods applied for the problems of spam resistance and word association queries.
Depending on the kind of information present in t, three situations arise:
(1) If t contains all the elements in U, then it is said to be a full list. Full lists are, in fact, total orderings (permutations) of U. For instance, if U is the set of all pages indexed by a search engine, it is easy to see that a full list emerges when we rank pages (say, with respect to a query) according to a fixed algorithm.
(2) There are situations where full lists are not convenient or even possible. For instance, let U denote the set of all Web pages in the world. Let t denote the results of a search engine in response to some fixed query. Even though the query might induce a total ordering of the pages indexed by the search engine, since the index set of the search engine is almost surely only a subset of U, we have a strict inequality |t| < |U|. In other words, there are pages in the world which are unranked by this search engine with respect to the query. Such lists that rank only some of the elements in U are called partial lists.
(3) A special case of partial lists is the following. If S is the set of all the pages indexed by a particular search engine and if t corresponds to the top 100 results of the search engine with respect to a query, clearly the pages that are not present in list t can be assumed to be ranked below 100 by the search engine. Such lists that rank only a subset of S, and where it is implicit that each ranked element is above all unranked elements, are called top d lists, where d is the size of the list. A natural operation of projection will be useful. Given a list t and a subset T of the universe U, the projection of t with respect to T (denoted t|T) will be a new list that contains only elements from T. Notice that if t happens to contain all the elements in T, then t|T is a full list with respect to T.
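For concreteness, the projection operation can be transcribed directly; this is a sketch in which a list is represented as a Python sequence of elements in rank order (the representation and function name are ours, not part of the formal development):

```python
def project(t, T):
    """Projection t|T: the sublist of t consisting of the elements
    of T, in the order in which they appear in t."""
    T = set(T)
    return [x for x in t if x in T]
```

In particular, if t contains every element of T, the result is a full list with respect to T, as noted above.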
(1) The Spearman footrule distance is the sum, over all elements i, of the absolute difference between the positions of i in the two lists; that is, the distance between two full lists s and t is

F(s, t) = Σi |s(i) - t(i)|.

After dividing this number by the maximum value (1/2)|S|^2, one can obtain a normalized value of the footrule distance, which is always between 0 and 1. The footrule distance between two lists can be computed in linear time.
(2) The Kendall tau distance counts the number of pairwise disagreements between two lists; that is, the distance between two full lists s and t is
K(s, t) = |{(i, j) : i < j, s(i) < s(j) but t(i) > t(j)}|.

Dividing this number by the maximum possible value (1/2)|S|(|S| - 1), we obtain a normalized version of the Kendall distance. The Kendall distance for full lists is the "bubble sort" distance, i.e., the number of pairwise adjacent transpositions needed to transform one list into the other. The Kendall distance between two lists of length n can be computed in O(n log n) time using simple data structures.
The above measures are metrics and extend in a natural way to several lists. Given several full lists s, t1, ..., tk, for instance, the normalized footrule distance of s to t1, ..., tk is given by

F(s, t1, ..., tk) = (1/k) Σi F(s, ti).

One can define generalizations of these distance measures to partial lists. If t1, ..., tk are partial lists, let U denote the union of elements in t1, ..., tk, and let s be a full list with respect to U. Now, given s, the idea is to consider the distance between ti and the projection of s with respect to ti. Then, for instance, we have the induced footrule distance:
F(s, t1, ..., tk) = (1/k) Σi F(s|ti, ti).

In a similar manner, the induced Kendall tau distance can be defined. Finally, we define a third notion of distance that measures the distance between a full list and a partial list on the same universe:
(3) Given one full list and a partial list, the scaled footrule distance weights contributions of elements based on the length of the lists they are present in. More formally, if s is a full list and t is a partial list, then:
SF(s, t) = Σi in t |(s(i)/|s|) - (t(i)/|t|)|.

We will normalize SF by dividing by |t|/2.
Note that these distances are not necessarily metrics.
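These definitions translate directly into code. The following is a sketch (representation and names are ours): a list is a Python sequence of elements in rank order, so an element's position is its 1-based index.

```python
from itertools import combinations

def pos(lst):
    """Map each element to its 1-based position (rank) in the list."""
    return {x: r for r, x in enumerate(lst, start=1)}

def footrule(s, t):
    """Spearman footrule F(s, t) for full lists over the same universe:
    the sum of |s(i) - t(i)| over all elements i."""
    ps, pt = pos(s), pos(t)
    return sum(abs(ps[x] - pt[x]) for x in ps)

def kendall(s, t):
    """Kendall tau K(s, t): the number of pairs ordered differently by
    the two full lists. Quadratic for clarity; an O(n log n)
    merge-sort-based variant exists."""
    ps, pt = pos(s), pos(t)
    return sum(1 for x, y in combinations(ps, 2)
               if (ps[x] - ps[y]) * (pt[x] - pt[y]) < 0)

def scaled_footrule(s, t):
    """Scaled footrule SF(s, t) between a full list s and a partial
    list t, normalized by |t|/2."""
    ps, pt = pos(s), pos(t)
    raw = sum(abs(ps[x] / len(s) - pt[x] / len(t)) for x in pt)
    return raw / (len(t) / 2)
```

For example, for s = [1, 2, 3] and its reversal t = [3, 2, 1], footrule gives 4 and Kendall gives 3, the maximum possible for three elements.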
To a large extent, our interpretations of experimental results will be in terms of these distance measures. While these distance measures seem natural, the question of why they are good ones is open to debate. We do not delve into such discussions here; the interested reader can find such arguments in the books by Diaconis [Diaconis 1988], Critchlow [Critchlow 1985], or Marden [Marden].
Kemeny optimal aggregations have a maximum likelihood interpretation. Suppose there is an underlying "correct" ordering s of S, and each order t1, ..., tk is obtained from s by swapping two elements with some probability less than 1/2. Thus, the t's are "noisy" versions of s. A Kemeny optimal aggregation of t1, ..., tk, is one that is maximally likely to have produced the t's (it need not be unique) [Young 1988]. Viewed differently, Kemeny optimal aggregation has the property of eliminating noise from various different ranking schemes. Furthermore, Kemeny optimal aggregations are essentially the only ones that simultaneously satisfy natural and important properties of rank aggregation functions, called neutrality and consistency in the social choice literature, and the so-called Condorcet property [Young and Levenglick 1978]. Indeed, Kemeny optimal aggregations satisfy the extended Condorcet criterion. In Section 3 we establish a strong connection between satisfaction of the extended Condorcet criterion and fighting search engine "spam."
Given that Kemeny optimal aggregation is useful, but computationally hard, how do we compute it? The following relation shows that Kendall distance can be approximated very well via the Spearman footrule distance.
Proposition 1 [Diaconis-Graham]. For any two full lists s, t, K(s, t) ≤ F(s, t) ≤ 2K(s, t).

This leads us to the problem of footrule optimal aggregation. This is the same as before, except that the optimizing criterion is the footrule distance. In Section 4 we exhibit a polynomial time algorithm to compute optimal footrule aggregation (scaled footrule aggregation for partial lists). Therefore we have:

Proposition 2. If s is the Kemeny optimal aggregation of full lists t1, ..., tk, and s' optimizes the footrule aggregation, then K(s', t1, ..., tk) ≤ 2K(s, t1, ..., tk).

Later, in Section 4, we develop rank aggregation methods that do not optimize any obvious criteria, but turn out to be very effective in practice.
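The Diaconis-Graham inequality of Proposition 1 is easy to confirm numerically. The sketch below (representations and names are ours) checks K(s, t) ≤ F(s, t) ≤ 2K(s, t) exhaustively over all pairs of permutations of five elements:

```python
from itertools import permutations

def positions(t):
    """Element -> 0-based position in the list."""
    return {x: i for i, x in enumerate(t)}

def K(s, t):
    """Kendall tau distance between two full lists."""
    ps, pt = positions(s), positions(t)
    return sum((ps[a] - ps[b]) * (pt[a] - pt[b]) < 0
               for a in ps for b in ps if a < b)

def F(s, t):
    """Spearman footrule distance between two full lists."""
    ps, pt = positions(s), positions(t)
    return sum(abs(ps[a] - pt[a]) for a in ps)

# Exhaustive check of K <= F <= 2K over all pairs of permutations
# of 5 elements (120 x 120 pairs).
base = tuple(range(5))
assert all(K(s, t) <= F(s, t) <= 2 * K(s, t)
           for s in permutations(base) for t in permutations(base))
```

Of course, this verifies only small cases; the proposition itself holds for all n.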
A bipartite graph G = (U, V, E) consists of two disjoint sets of nodes U, V such that each edge e in E has one node from U and the other node from V. A bipartite graph is complete if each node in U is connected to every node in V. A matching is a subset of edges such that for each edge in the matching, there is no other edge that shares a node with it. A maximum matching is a matching of largest cardinality. A weighted graph is a graph with a (non-negative) weight for every edge e. Given a weighted graph, the minimum weight maximum matching is the maximum matching with minimum weight. The minimum weight maximum matching problem for bipartite graphs can be solved in time O(n^2.5), where n is the number of nodes.
A directed graph consists of nodes and edges, but this time an edge is an ordered pair of nodes (u, v), representing a connection from u to v. A directed path is said to exist from u to v if there is a sequence of nodes u = w0, ..., wk = v such that (wi, wi+1) is an edge, for all i = 0, ..., k - 1. A directed cycle is a non-trivial directed path from a node to itself. A strongly connected component of a graph is a set of nodes such that for every pair of nodes in the component, there is a directed path from one to the other. A directed acyclic graph (DAG) is a directed graph with no directed cycles. In a DAG, a sink node is one with no directed path to any other node.
An important observation here is that the entries in y define a natural ordering on S. We call such an ordering the Markov chain ordering of M. A technical point to note while using Markov chains for ranking is the following. A Markov chain M defines a weighted graph with n nodes such that the weight on edge (u, v) is given by Muv. The strongly connected components of this graph form a DAG. If this DAG has a sink node, then the stationary distribution of the chain will be entirely concentrated in the strongly connected component corresponding to the sink node. In this case, we only obtain an ordering of the alternatives present in this component; if this happens, the natural extended procedure is to remove these states from the chain and repeat the process to rank the remaining nodes. Of course, if this component has sufficiently many alternatives, one may stop the aggregation process and output a partial list containing some of the best alternatives. If the DAG of connected components is (weakly) connected and has more than one sink node, then we will obtain two or more clusters of alternatives, which we could sort by the total probability mass of the components. If the DAG has several weakly connected components, we will obtain incomparable clusters of alternatives. Thus, when we refer to a Markov chain ordering, we refer to the ordering obtained by this extended procedure.
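The concentration of the stationary distribution on a sink component can be seen in a toy example. The sketch below (the chain and helper are ours, chosen only for illustration) approximates the stationary distribution by power iteration for a three-state chain in which state 2 is absorbing, so the DAG of strongly connected components has the single sink {2}:

```python
def stationary(M, steps=200):
    """Approximate the stationary distribution of a Markov chain with
    row-stochastic transition matrix M by power iteration from the
    uniform distribution."""
    n = len(M)
    x = [1.0 / n] * n
    for _ in range(steps):
        x = [sum(x[i] * M[i][j] for i in range(n)) for j in range(n)]
    return x

# Toy chain: states 0 and 1 leak probability into the absorbing
# state 2, so the stationary distribution concentrates there.
M = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25],
     [0.0, 0.0, 1.0]]
print(stationary(M))  # almost all mass ends up on state 2
```

In such a case the extended procedure above would output state 2 first, remove it, and repeat on the remaining states.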
Intuitively, a search engine has been spammed by a page in its index, on a given query, if it ranks the page "too highly" with respect to other pages in the index, in the view of a "typical" user. Indeed, in accord with this intuition, search engines are both rated (see [Media Metrix]) and trained by human evaluators. This approach to defining spam: (1) permits an author to raise the rank of her page by improving the content; (2) puts ground truth about the relative value of pages into the purview of the users --- in other words, the definition does not assume the existence of an absolute ordering that yields the "true" relative value of a pair of pages on a query; (3) does not assume unanimity of users' opinions or consistency among the opinions of a single user; and (4) suggests some natural ways to automate training of engines to incorporate useful biases, such as geographic bias.
We believe that reliance on evaluators in defining spam is unavoidable. (If the evaluators are human, the typical scenario during the design and training of search engines, then the eventual product will incorporate the biases of the training evaluators.) We model the evaluators by the search engine ranking functions. That is, we make the simplifying assumption that for any pair of pages, the relative ordering by the majority of the search engines comparing them is the same as the relative ordering by the majority of the evaluators. Our intuition is that if a page spams all or even most search engines for a particular query, then no combination of these search engines can defeat the spam. This is reasonable: Fix a query; if for some pair of pages a majority of the engines is spammed, then the aggregation function is working with overly bad data --- garbage in, garbage out. On the other hand, if a page spams strictly fewer than half the search engines, then a majority of the search engines will prefer a "good" page to a spam page. In other words, under this definition of spam, the spam pages are the Condorcet losers, and will occupy the bottom partition of any aggregated ranking that satisfies the extended Condorcet criterion. Similarly, assuming that good pages are preferred by the majority to mediocre ones, these will be the Condorcet winners, and will therefore be ranked highly.
Many of the existing aggregation methods (see Section 4) do not ensure the election of the Condorcet winner, should one exist. Our aim is to obtain a simple method of modifying any initial aggregation of input lists so that the Condorcet losers (spam) will be pushed to the bottom of the ranking during this process. This procedure is called local Kemenization and is described next.
A full list p is a locally Kemeny optimal aggregation of partial lists t1, t2, ..., tk, if there is no full list p' that can be obtained from p by performing a single transposition of an adjacent pair of elements and for which K(p', t1, t2, ..., tk) < K(p, t1, t2, ..., tk). In other words, it is impossible to reduce the total distance to the t's by flipping an adjacent pair.
Every Kemeny optimal aggregation is also locally Kemeny optimal, but the converse is false. Nevertheless, we show that a locally Kemeny optimal aggregation satisfies the extended Condorcet property and can be computed (see Appendix A) in time O(kn log n).
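The defining condition can be checked directly: flipping an adjacent pair (x, y) reduces the total Kendall distance exactly when, among the input lists ranking both, strictly more prefer y to x than x to y. A sketch (representations and names are ours):

```python
def prefers(t, x, y):
    """True iff the partial list t ranks both x and y, with x above y."""
    return x in t and y in t and t.index(x) < t.index(y)

def is_locally_kemeny_optimal(p, lists):
    """A full list p is locally Kemeny optimal w.r.t. the partial
    lists iff no adjacent transposition reduces the total Kendall
    distance, i.e. for no adjacent pair (x, y) of p does a strict
    majority of the lists ranking both prefer y to x."""
    for x, y in zip(p, p[1:]):
        swap = sum(prefers(t, y, x) for t in lists)  # votes to flip
        keep = sum(prefers(t, x, y) for t in lists)  # votes to keep
        if swap > keep:
            return False
    return True
```

Note that the check is local: it inspects only adjacent pairs, which is why local optimality is so much cheaper to achieve than full Kemeny optimality.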
We have discussed the value of the extended Condorcet criterion in increasing resistance to search engine spam and in ensuring that elements in the top partitions remain highly ranked. However, specific aggregation techniques may add considerable value beyond simple satisfaction of this criterion; in particular, they may produce good rankings of alternatives within a given partition (as noted above, the extended Condorcet criterion gives no guidance within a partition). We now show how, using any initial aggregation m of partial lists t1, t2, ..., tk --- one that is not necessarily Condorcet --- we can efficiently construct a locally Kemeny optimal aggregation of the t's that is in a well-defined sense maximally consistent with m. For example, if the t's are full lists then m could be the Borda ordering on the alternatives (see Section 4.1 for Borda's method). Even if a Condorcet winner exists, the Borda ordering may not rank it first. However, by applying our "local Kemenization" procedure (described below), we can obtain a ranking that is maximally consistent with the Borda ordering but in which the Condorcet winners are at the top of the list.
A local Kemenization (LK) of a full list m with respect to t1, ..., tk is a procedure that computes a locally Kemeny optimal aggregation of t1, ..., tk that is (in a precise sense) maximally consistent with m. Intuitively, this approach also preserves the strengths of the initial aggregation m. Thus:
(1) the Condorcet losers receive low rank, while the Condorcet winners receive high rank (this follows from local Kemeny optimality);
(2) the result disagrees with m on the order of any given pair (i,j) of elements only if a majority of those t's expressing opinions disagrees with m on (i,j).
(3) for every d between 1 and |m|, the length d prefix of the output is a local Kemenization of the top d elements in m.
Thus, if m is an initial meta-search result, and we have some faith that the top, say, 100 elements of m contain enough good pages, then we can build a locally Kemeny optimal aggregation of the projections of the t's onto the top 100 elements in m.
The local Kemenization procedure is a simple inductive construction. Without loss of generality, let m = (1, 2, ..., |m|). Assume inductively that we have constructed p, a local Kemenization of the projection of the t's onto the elements 1, ..., l-1. Insert element x = l into the lowest-ranked "permissible" position in p: just below the lowest-ranked element y in p such that (a) no majority among the (original) t's prefers x to y and (b) for all successors z of y in p there is a majority that prefers x to z. In other words, we try to insert x at the end (bottom) of the list p; we bubble it up toward the top of the list as long as a majority of the t's insists that we do.
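This inductive construction is short in code. The following sketch (representations and names are ours) processes the elements of m in order, appending each to the bottom of the partial output and bubbling it up while a strict majority of the input lists insists:

```python
def majority_prefers(x, y, lists):
    """True iff, among the input lists ranking both x and y, a strict
    majority places x above y."""
    above = below = 0
    for t in lists:
        if x in t and y in t:
            if t.index(x) < t.index(y):
                above += 1
            else:
                below += 1
    return above > below

def local_kemenize(m, lists):
    """Local Kemenization of an initial full aggregation m with
    respect to the partial lists: insert each element of m, in order,
    at the bottom, then bubble it up while a strict majority of the
    lists prefers it to the element just above it. Ties (no strict
    majority) are broken in favor of m's ordering."""
    p = []
    for x in m:
        p.append(x)
        i = len(p) - 1
        while i > 0 and majority_prefers(x, p[i - 1], lists):
            p[i], p[i - 1] = p[i - 1], p[i]
            i -= 1
    return p
```

Because an element moves up only on a strict majority, the output disagrees with m on a pair only when a majority of the opinionated lists disagrees with m, matching property (2) above. This simple version runs in O(kn^2) time; the O(kn log n) bound cited above requires a more careful implementation.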
A rigorous treatment of local Kemeny optimality and local Kemenization is given in Appendix A, where we also show that the local Kemenization of an aggregation is unique. On the strength of these results we suggest the following general approach to rank aggregation:
(1) Given t1, ..., tk, use your favorite aggregation method to obtain a full list m.

(2) Output the (unique) local Kemenization of m with respect to t1, ..., tk.
Full lists. Given full lists t1, ..., tk, for each candidate c in S and list ti, Borda's method first assigns a score Bi(c) = the number of candidates ranked below c in ti, and the total Borda score B(c) is defined as Σi Bi(c). The candidates are then sorted in decreasing order of total Borda score.
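As a sketch (representations and names are ours), Borda's method for full lists is a few lines:

```python
from collections import defaultdict

def borda(lists):
    """Borda's method for full lists over the same universe:
    B_i(c) = number of candidates ranked below c in list t_i;
    candidates are sorted by decreasing total score sum_i B_i(c)."""
    n = len(lists[0])
    score = defaultdict(int)
    for t in lists:
        for rank, c in enumerate(t):      # rank 0 is the top position
            score[c] += n - 1 - rank      # candidates below c in t
    return sorted(score, key=lambda c: -score[c])
```

Ties in total score are broken arbitrarily here; a production implementation would pick an explicit tie-breaking rule.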
We remark that Borda's method can be thought of as assigning a k-element position vector to each candidate (the positions of the candidate in the k lists), and sorting the candidates by the L1 norm of these vectors. Of course, there are plenty of other possibilities with such position vectors: sorting by Lp norms for p > 1, sorting by the median of the k values, sorting by the geometric mean of the k values, etc. This intuition leads us to several Markov chain based approaches, described in Section 4.3.
Partial lists. It has been proposed (e.g., in a recent article that appeared in The Economist [Saari 2000]) that the right way to extend Borda to partial lists is by apportioning all the excess scores equally among all unranked candidates. This idea stems from the goal of being unbiased; however, it is easy to show that for any method of assigning scores to unranked candidates, there are partial information cases in which undesirable outcomes occur.
Full lists. Footrule optimal aggregation is related to the median of the values in a position vector:
Proposition 3. Given full lists t1, ..., tk, if the median positions of the candidates in the lists form a permutation, then this permutation is a footrule optimal aggregation.

Now we obtain an algorithm for footrule optimal aggregation via the following proposition:

Proposition 4. Footrule optimal aggregation of full lists can be computed in polynomial time, specifically, the time to find a minimum cost perfect matching in a bipartite graph.

Proof (Sketch): Let the union of t1, ..., tk be S with n elements. Now, we define a weighted complete bipartite graph (C, P, W) as follows. The first set of nodes C = {1, ..., n} denotes the set of elements to be ranked (pages). The second set of nodes P = {1, ..., n} denotes the n available positions. The weight W(c, p) is the total footrule distance (from the ti's) of a ranking that places element c at position p, given by W(c, p) = Σi |ti(c) - p|. It can be shown that a permutation minimizing the total footrule distance to the ti's is given by a minimum cost perfect matching in the bipartite graph.

Partial lists. The computation of a footrule optimal aggregation for partial lists is more problematic. In fact, it can be shown to be equivalent to the NP-hard problem of computing the minimum number of edges to delete to convert a directed graph into a DAG.
Keeping in mind that footrule optimal aggregation for full lists can be recast as a minimum cost bipartite matching problem, we now describe a method that retains the computational advantages of the full list case, and is reasonably close to it in spirit. We define the bipartite graph as before, except that the weights are defined differently. The weight W(c, p) is the scaled footrule distance (from the ti's) of a ranking that places element c at position p, given by W(c, p) = Σi |(ti(c)/|ti|) - (p/n)|. As before, we can solve the minimum cost maximum matching problem on this bipartite graph to obtain a footrule aggregation algorithm for partial lists. We call this method scaled footrule aggregation (SFO).
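The construction can be sketched as follows (representations and names are ours). For clarity this sketch finds the minimum cost assignment by brute force over the n! candidate-to-position assignments; a real implementation would instead solve the minimum cost bipartite matching in polynomial time, e.g. with the Hungarian algorithm:

```python
from itertools import permutations

def scaled_footrule_aggregate(lists):
    """Scaled footrule aggregation (SFO) sketch. Build the weight
    W(c, p) = sum over the lists ranking c of |t(c)/|t| - p/n|, then
    assign candidates to positions so as to minimize the total weight
    (here by brute force; in practice via min cost bipartite
    matching)."""
    universe = sorted(set().union(*map(set, lists)))
    n = len(universe)

    def weight(c, p):                      # p is a 1-based position
        return sum(abs((t.index(c) + 1) / len(t) - p / n)
                   for t in lists if c in t)

    best = min(permutations(range(1, n + 1)),
               key=lambda assign: sum(weight(c, p)
                                      for c, p in zip(universe, assign)))
    ranking = [None] * n
    for c, p in zip(universe, best):
        ranking[p - 1] = c
    return ranking
```

For instance, aggregating the partial lists [1, 2] and [2, 3] yields [1, 2, 3], since that assignment minimizes the total scaled footrule weight.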
Handling partial lists and top d lists

Rather than require every pair of pages (candidates) i and j to be compared by every search engine (voter), we may now use the available comparisons between i and j to determine the transition probability between i and j, and exploit the connectivity of the chain to (transitively) "infer" comparison outcomes between pairs that were not explicitly ranked by any of the search engines. The intuition is that Markov chains provide a more holistic viewpoint of comparing all n candidates against each other --- significantly more meaningful than ad hoc and local inferences like "if a majority prefer A to B and a majority prefer B to C, then A should be better than C."
Handling uneven comparisons

If a Web page P appears in the bottom half of about 70% of the lists, and is ranked Number 1 by the other 30%, how important is the quality of the pages that appear on the latter 30% of the lists? If these pages all appear near the bottom on the first set of 70% of the lists and the winners in these lists were not known to the other 30% of the search engines that ranked P Number 1, then perhaps we shouldn't consider P too seriously. In other words, if we view each list as a tournament within a league, we should take into account the strength of the schedule of matches played by each player. The Markov chain solutions we discuss are similar in spirit to the approaches considered in the mathematical community for this problem (eigenvectors of linear maps, fixed points of nonlinear maps, etc.).
Enhancements of other heuristics

Heuristics for combining rankings are motivated by some underlying principle. For example, Borda's method is based on the idea "more wins is better." This gives some figure of merit for each candidate. It is natural to extend this and say "more wins against good players is even better," and so on, and iteratively refine the ordering produced by a heuristic. In the context of Web searching, the HITS algorithm of Kleinberg [Kleinberg 1998, 1999] and the PageRank algorithm of Brin and Page [Brin and Page 1998] are motivated by similar considerations. As we will see, some of the chains we propose are natural extensions (in a precise sense) of Borda's method, sorting by geometric mean, and sorting by majority.
Computational efficiency

In general, setting up one of these Markov chains and determining its stationary probability distribution takes about Θ(n^2 k + n^3) time. However, in practice, if we explicitly compute the transition matrix in O(n^2 k) time, a few iterations of the power method will allow us to compute the stationary distribution. In fact, we suggest an even faster method for practical purposes. For all of the chains that we propose, with about O(nk) (linear in input size) time for preprocessing, it is usually possible to simulate one step of the chain in O(k) time; thus by simulating the Markov chain for about O(n) steps, we should be able to sample from the stationary distribution pretty effectively. This is usually sufficient to identify the top few candidates in the stationary distribution in O(nk) time, perhaps considerably faster in practice.
We now propose some specific Markov chains, denoted MC1, MC2, MC3, and MC4. For each of these chains, we specify the transition matrix and give some intuition as to why such a definition is reasonable. In all cases, the state space is the union of the sets of pages ranked by various search engines.
MC1
If the current state is page P, then the next state is chosen uniformly from the multiset of all pages that were ranked higher than (or equal to) P by some search engine that ranked P, that is, from the multiset of all pages Q such that ti(Q) at most ti(P). The main idea is that in each step, we move from the current page to a better page, allowing about 1/j probability of staying in the same page, where j is roughly the average rank of the current page.
MC2
If the current state is page P, then the next state is chosen by first picking a ranking t uniformly from all the partial lists t1, ..., tk containing P, then picking a page Q uniformly from the set of all pages Q such that t(Q) is at most t(P).
This chain takes into account the fact that we have several lists of rankings, not just a collection of pairwise comparisons among the pages. As a consequence, MC2 is arguably the most representative of minority viewpoints of sufficient statistical significance; it also protects specialist views. In fact, MC2 generalizes the geometric mean analogue of Borda's method. For full lists, if the initial state is chosen uniformly at random, after one step of MC2, the distribution induced on its states produces a ranking of the pages such that P is ranked higher than (preferred to) Q iff the geometric mean of the ranks of P is lower than the geometric mean of the ranks of Q.
MC3
If the current state is page P, then the next state is chosen as follows: first pick a ranking t uniformly from all the partial lists t1, ..., tk containing P, then uniformly pick a page Q that was ranked by t. If t(Q) < t(P) then go to Q, else stay in P.
This chain is a generalization of Borda's method. For full lists, if the initial state is chosen uniformly at random, after one step of MC3, the distribution induced on its states produces a ranking of the pages such that P is ranked higher than Q iff the Borda score of P is higher than the Borda score of Q. This is natural, considering that in any state P, the probability of staying in P is roughly the fraction of pairwise contests (with all other pages) that P won, which is a very Borda-like measure.
MC4

If the current state is page P, then the next state is chosen as follows: first pick a page Q uniformly from the union of all pages ranked by the search engines. If t(Q) < t(P) for a majority of the lists t that ranked both P and Q, then go to Q, else stay in P.

This chain generalizes Copeland's suggestion of sorting the candidates by the number of pairwise majority contests they have won [Copeland 1951].

There are examples that differentiate the behavior of these chains. One can also show that the Markov ordering implied by these chains need not satisfy the extended Condorcet principle.
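To make the construction concrete, here is a minimal sketch of MC4 (representations and names are ours). It builds the transition matrix from pairwise majority comparisons on partial lists, approximates the stationary distribution by power iteration, and orders pages by decreasing stationary probability; the sink-component refinement discussed earlier is omitted for brevity:

```python
def mc4_ranking(lists, steps=100):
    """MC4 sketch: from page P, pick a page Q uniformly from the union
    of all ranked pages; move to Q iff a strict majority of the lists
    ranking both P and Q rank Q above P, else stay at P. Pages are
    then sorted by an approximate stationary distribution obtained by
    power iteration."""
    pages = sorted(set().union(*map(set, lists)))
    n = len(pages)

    def beats(q, p):
        """Strict majority of the lists ranking both prefers q to p."""
        wins = sum(1 for t in lists
                   if p in t and q in t and t.index(q) < t.index(p))
        both = sum(1 for t in lists if p in t and q in t)
        return 2 * wins > both

    # Row-stochastic transition matrix of MC4.
    M = [[0.0] * n for _ in range(n)]
    for i, p in enumerate(pages):
        for j, q in enumerate(pages):
            if i != j and beats(q, p):
                M[i][j] = 1.0 / n
        M[i][i] = 1.0 - sum(M[i])

    # Power iteration from the uniform distribution.
    x = [1.0 / n] * n
    for _ in range(steps):
        x = [sum(x[i] * M[i][j] for i in range(n)) for j in range(n)]
    return sorted(pages, key=lambda p: -x[pages.index(p)])
```

On the inputs [1, 2, 3], [1, 2, 3], [2, 1, 3], page 1 is the Condorcet winner, the chain's mass concentrates on it, and the output ranks it first. The other chains differ only in how the transition probabilities are chosen.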
The actual success of a meta-search engine directly depends on the aggregation technique underlying it. Since the techniques proposed in Section 4 work on partial lists and top d lists, they can be applied to build a meta-search engine. The idea is simple: given a query, obtain the top (say) 100 results from many search engines, apply the rank aggregation function with the universe being the union of pages returned by the search engines, and return the top (say) 100 results of the aggregation. We illustrate this scheme in Section 6.2.1 and examine the performance of our methods.
One common technique for indexing is to construct a ranking function. With respect to a query, a ranking function can operate in two ways: (i) it can give an absolute score to a document indicating the relevance of the document to the query (score-based) or (ii) it can take two documents and rank order them with respect to the query (comparison-based). Based on the underlying methodology used, many competing ranking functions can be obtained. For instance, term-counting yields a simple ranking function. Another ranking function might be the consequence of applying the vector-space model and an appropriate distance measure to the document collection. Yet other ranking functions might be the ones implied by PageRank [Brin and Page 1998] and Clever [Kleinberg 1998, 1999], [Chakrabarti et al. 1998]. It is important to note that if the ranking function is score-based, the ordering implied by the scores makes more sense than the actual scores themselves, which are often either meaningless or inaccurate. And, for a particular ranking function and a query, it is often easier to return the top few documents relevant to the query than to rank the entire document base.
Given many ranking functions for a single document base, we have the case of top d lists, where d is the number of documents returned by each of the ranking functions. Our techniques can be applied to obtain a good aggregation of these ranking functions. Notice that we give equal weight to all the ranking functions, but this could be easily modified if necessary.
Such rank aggregation may be useful in other domains as well: many airline reservation systems suffer from lack of ability to express preferences. If the system is flexible enough to let the user specify various preference criteria (travel dates/times, window/aisle seating, number of stops, frequent-flier preferences, refundable/non-refundable nature of ticket purchase, and of course, price), it can rank the available flight plans based on each of the criteria, and apply rank aggregation methods to give better quality results to the user. Similarly, in the choice of restaurants from a restaurant database, users might rank restaurants based on several different criteria (cuisine, driving distance, ambiance, star-rating, dollar-rating, etc.). In both examples, users might be willing to compromise one or more of these criteria, provided there is a clear benefit with respect to the others. In fact, very often there is not even a clear order of importance among the criteria. A good aggregation function is a very effective way to make a selection in such cases.
Many of these tasks can be accomplished by a complicated Boolean query (via advanced query), but we feel that it is unreasonable to expect the average Web user to formulate such queries. Note also that simply asking for documents that contain as many of the keywords as possible is not necessarily a good solution: the best document on the topic might contain only three of the keywords, while a spam document might well contain four. As a specific motivating example, consider searching for the job of a software engineer in an on-line job database. The user lists a number of skills and a number of potential keywords in the job description, for example, "Silicon Valley C++ Java CORBA TCP-IP algorithms start-up pre-IPO stock options". It is clear that the "AND" rule might produce no document, and the "OR" rule is equally disastrous.
We propose a word association scheme to handle these situations. Given a set of query words w1, ..., wl, we construct several (say, k) sub-queries that are subsets of the original query words. We query the search engine with these k sub-queries (using AND semantics) and obtain the top d (say, d = 100) results for each sub-query. Then we use the methods of Sections 3 and 4 to obtain a locally Kemenized aggregation of the k top d lists and output it as the final answer to the multi-word query. We illustrate this application with examples in Section 6.2.3.
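A minimal sketch of the sub-query construction step, assuming fixed-size subsets taken in combination order (the subset size, the cap k, and the space-joined AND-query format are illustrative choices, not prescribed by the text):

```python
from itertools import combinations

def make_subqueries(words, subset_size=3, k=10):
    """Form up to k AND sub-queries as size-`subset_size` subsets of
    the query words (here: the first k in combination order)."""
    subs = []
    for combo in combinations(words, subset_size):
        subs.append(" ".join(combo))  # one AND query per subset
        if len(subs) == k:
            break
    return subs
```

Each resulting string would then be submitted as a separate AND query, and the k top d result lists aggregated as described above.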
Where do the words come from? One way to obtain such a set of query words is to prompt the user to associate as many terms as possible with the desired response; this might be too taxing for a typical user. A less demanding way is to let the user highlight some words in a current document; the search terms are then extracted from the "anchor text," i.e., the words around the selected words.
We use the following seven search engines: Altavista (AV), Alltheweb (AW), Excite (EX), Google (GG), Hotbot (HB), Lycos (LY), and Northernlight (NL). For each of the search engines, we considered only the top 100 results. Our distance measurements are with respect to the union of the top 100 results from these search engines.
For measuring the performance of our methods (first experiment), we selected the following 38 general queries (these queries are a superset of the 28 queries used in several earlier papers [Bharat and Henzinger 1998], [Chakrabarti et al. 1998]). For the second experiment, we picked some queries that were spammed in popular search engines. For the third experiment, we picked multi-word queries that perform poorly on existing search engines. Our notion of two URLs being identical is purely syntactic (up to some canonical form); we do not use the content of a page to determine whether two URLs are identical.
affirmative action, alcoholism, amusement parks, architecture, bicycling, blues, cheese, citrus groves, classical guitar, computer vision, cruises, Death Valley, field hockey, gardening, graphic design, Gulf war, HIV, java, Lipari, lyme disease, mutual funds, National parks, parallel architecture, Penelope Fitzgerald, recycling cans, rock climbing, San Francisco, Shakespeare, stamp collecting, sushi, table tennis, telecommuting, Thailand tourism, vintage cars, volcano, zen buddhism, and Zener.

The overlap among the search engines is summarized in Table 1, which gives the number of pages as a function of the number of search engines in which they are present. For instance, the fourth column of the table means that 27.231 pages (on average) were present in exactly three of the search engine results. The second column indicates that around 284 pages were present in only one search engine, while the last column indicates that fewer than 2 pages were present in all the search engines.
In general, local Kemenization improves the distance measures by around 1-3%. It can be shown formally that local Kemenization never hurts, in the sense that the Kendall distance never deteriorates after local Kemenization. Interestingly, this appears to hold for the footrule and scaled footrule distances as well (although we do not know whether it holds always). We conclude that the local Kemenization procedure is always worth applying: either the improvement is large, or, if not, the time spent is small.
Examining the results in Section 6.2.2, we see that SFO and MC4 are quite effective in combating spam. While we do not claim that our methods completely eliminate spam, our study shows that they reduce spam in general.
The results in Section 6.2.3 show that our technique of word association combined with rank aggregation methods can improve the quality of search results for multi-word queries. In each of the three examples presented, Google typically produced a total of only around 10--15 pages, and the top 5 results were often very poor (a direct consequence of the AND semantics applied to a long query). In sharp contrast, the URLs produced by the rank aggregation methods turned out to contain a wealth of information about the topic of the query.
Further work involves trying to obtain a qualitative understanding of why the Markov chain methods perform very well. Also, it will be interesting to measure the efficacy of our methods on a document base with several competing ranking functions. Finally, this work originated in conversations with Helen Nissenbaum on bias in searching. A formal treatment of bias seems difficult but alluring.
Definition 5
A permutation p is a locally Kemeny optimal aggregation of partial lists t1, t2, ..., tk if there is no permutation p' that can be obtained from p by a single transposition of an adjacent pair of elements and for which K(p', t1, t2, ..., tk) < K(p, t1, t2, ..., tk). In other words, it is impossible to reduce the total distance to the t's by flipping an adjacent pair.
Note that this definition is not equivalent to requiring that no flipping of any (not necessarily adjacent) pair will decrease the sum of the distances to the t's.
Example 1
Let p = (1,2,3), t1 = (1,2), t2 = (2,3), t3 = t4 = t5 = (3,1). Then p satisfies Definition 5 and K(p, t1, ..., t5) = 3, but transposing 1 and 3 decreases the sum to 2.
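The distances in Example 1 can be verified directly. A small helper (illustrative; permutations and partial lists are tuples ranked best-first) computes the total Kendall distance of a permutation to a set of partial lists:

```python
from itertools import combinations

def kendall_to_lists(p, lists):
    """Sum, over the partial lists, of the number of pairs that a list
    orders one way but the permutation p orders the other way."""
    pos = {x: i for i, x in enumerate(p)}
    total = 0
    for t in lists:
        for x, y in combinations(t, 2):  # t ranks x above y
            if pos[x] > pos[y]:
                total += 1
    return total
```

With the lists of Example 1, (1,2,3) has total distance 3 and (3,2,1) has total distance 2, while both adjacent transpositions of (1,2,3) give distance 4, confirming that (1,2,3) is locally Kemeny optimal but not Kemeny optimal.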
Every Kemeny optimal permutation is also locally Kemeny optimal, but the converse does not hold (cf. Example 1). Furthermore, a locally Kemeny optimal permutation is not necessarily a good approximation for the optimal. For example, if the t's are as in Example 1, the number of (3,1) partial lists is very large, and there is only one occurrence of each of the partial lists (1,2) and (2,3), then (1,2,3) is still locally Kemeny optimal, but the ratio (of the K-distances) to the optimal may be arbitrarily large. Nevertheless, the important observations, proved next, are that a locally Kemeny optimal aggregation satisfies the extended Condorcet property and can be computed efficiently.
Convention. Recall our convention that p ranks x above y (i.e., prefers x to y) whenever p(x) < p(y).
Lemma 6
Let p, a permutation on alternatives {1, ..., n}, be a locally Kemeny optimal aggregation for partial lists t1, t2, ..., tk. Then p satisfies the extended Condorcet criterion with respect to t1, t2, ..., tk.
Proof
The set t1, t2, ..., tk of partial lists defines a directed majority graph G on the n alternatives, with an edge (x,y) from x to y if a majority of the t's that contain both x and y rank x above y.
If the lemma is false then there exist partial lists t1, t2, ..., tk, a locally Kemeny optimal aggregation p, and a partition (T,U) of the alternatives where for all a in T and b in U the majority among t1, t2, ..., tk prefers a to b, but there are c in T and d in U such that p(d) < p(c). Let (d,c) be a closest (in p) such pair. Consider the immediate successor of d in p, call it e. If e=c then c is adjacent to d in p and transposing this adjacent pair of alternatives produces a p' such that K(p', t1, t2, ..., tk) < K(p, t1, t2, ..., tk), contradicting the assumption that p is a locally Kemeny optimal aggregation of the t's. If e does not equal c, then either e is in T, in which case the pair (d,e) is a closer pair in p than (d,c) and also violates the extended Condorcet condition, or e is in U, in which case (e,c) is a closer pair than (d,c) that violates the extended Condorcet condition. Both cases contradict the choice of (d,c).
Lemma 7
Locally Kemeny optimal aggregations of k lists can be computed in O( kn log n) time.
Proof
It is not surprising that locally Kemeny optimal aggregations can be found in polynomial time, because they are only local minima. A straightforward approach requires O(n2) time; we describe a technique requiring only O(kn log n) time (generally, we are interested in the case in which k is much smaller than n). Consider the majority graph T for t1, t2, ..., tk, with anti-parallel edges in the case of a tie. The problem of finding a locally Kemeny optimal aggregation of t1, t2, ..., tk is now equivalent to finding a Hamiltonian path in this graph. Due to the density of the edges, it is possible to find such a path in T in O(n log n) probes to the edges of T using, for instance, a mergesort-like algorithm (the advantage of mergesort is that the issue of inconsistent answers never arises, which simplifies the execution of the algorithm). Note that T need not be constructed explicitly. The cost of each probe is k accesses to the partial lists (to determine whether there is a majority), so the resulting complexity is O(kn log n).

We next turn to the details of the local Kemenization procedure. Recall that the value of local Kemenization is that, given an aggregation m of several rankings, it produces a ranking p that achieves the best of both worlds: p satisfies the extended Condorcet criterion, and p is maximally consistent with m. We begin by formalizing our notion of consistency.
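A sketch of the mergesort-based procedure from the proof of Lemma 7 (illustrative Python, not the authors' code; partial lists are sequences ranked best-first, and the linear `index` probes could be replaced by precomputed position tables). Merging guarantees that every adjacent pair in the output was compared, so the output is a Hamiltonian path of the majority graph and hence locally Kemeny optimal:

```python
def majority_prefers(x, y, lists):
    """True iff, among the lists ranking both x and y, more place
    x above y than y above x (ties yield False)."""
    up = down = 0
    for t in lists:
        if x in t and y in t:
            if t.index(x) < t.index(y):
                up += 1
            else:
                down += 1
    return up > down

def locally_kemeny(items, lists):
    """Mergesort with the majority relation as comparator; each probe
    costs one pass over the k lists, giving O(k n log n) total."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    a = locally_kemeny(items[:mid], lists)
    b = locally_kemeny(items[mid:], lists)
    out = []
    while a and b:
        if majority_prefers(b[0], a[0], lists):
            out.append(b.pop(0))
        else:
            out.append(a.pop(0))
    return out + a + b
```

On the lists of Example 1 this returns (1,2,3), a locally Kemeny optimal aggregation, as expected.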
Definition 8
Given partial lists t1, t2, ..., tk and a total order m, we say that p is consistent with m and t1, t2, ..., tk if p(i) < p(j) implies that either
(a) m(i) < m(j), or
(b) a majority of t1, t2, ..., tk prefer i to j (more prefer i over j than j over i, but not necessarily an absolute majority).
In other words, the order of two elements differs between m and p only if a majority of the t's support the change (however, consistency does not mandate a switch).
Note that if p is consistent with m and t1, t2, ..., tk, then
K(p, t1, t2, ..., tk) <= K(m, t1, t2, ..., tk), since the only allowed changes decrease the distance to the t's.
The proof of the next lemma is straightforward from Definition 8.
As we will see, for any partial lists t1, t2, ..., tk and order m, there is a permutation p that is (i) locally Kemeny optimal and (ii) consistent with m. (Such a p is not necessarily unique.) We will focus particularly on m-consistent locally Kemeny optimal aggregations that, when restricted to subsets S of the most highly ranked elements in m, retain their local Kemeny optimality (Definition 10 below). This is desirable whenever we are more sure of the significance of the top results in m than of the bottom ones. In this case the solution is unique (Theorem 12).

Lemma 9
If p is consistent with m and t1, t2, ..., tk, then for any 1 <= l <= n, if S is the set of the l alternatives ranked most highly by m, the projection of p onto S is consistent with the projections of m and t1, t2, ..., tk onto S.
Definition 10
Given partial lists t1, t2, ..., tk and a total order m on alternatives {1, 2, ..., n}, we say that p is a local Kemenization of m with respect to t1, t2, ..., tk if (1) p is consistent with m, and (2) for every 1 <= l <= n, if S is the set of the l most highly ranked alternatives in m, then the projection of p onto S is a locally Kemeny optimal aggregation of the projections of t1, t2, ..., tk onto S.
Theorem 12
For any partial lists t1, t2, ..., tk and order m on alternatives {1, ..., n}, there exists a unique local Kemenization of m with respect to t1, t2, ..., tk.
Proof
We prove the theorem by induction on n, the number of alternatives. The base case n=1 is trivial. Assume the statement inductively for n-1; we prove it for n. Let x be the last (lowest-ranked) element in m and let S = {1, ..., n}-{x}. Since S is of size n-1, we have by induction that there is a unique permutation pn-1 on the elements of S satisfying the conditions of the theorem. Now insert the removed element x into the lowest-ranked "permissible" position in pn-1: just below the lowest-ranked element y such that (a) no majority among the (original) t's prefers x to y and (b) for all successors z of y (i.e., pn-1(y) < pn-1(z)) there is a majority that prefers x to z. Clearly no two elements of m were switched unnecessarily, and the solution p is locally Kemeny optimal by the local Kemeny optimality of pn-1 and the majority properties. Note that the consistency condition requires that x be as low in p as local Kemeny optimality permits, so given pn-1 there is only one place in which to insert x.

Suppose now that m and t1, t2, ..., tk contradict uniqueness: there are two different local Kemenizations of m with respect to t1, t2, ..., tk; call them p and p'. If we drop the last element x in m and let S be as above, then (by property (2) of local Kemenization) the resulting permutations pn-1 and p'n-1 must each be local Kemenizations of the restrictions of the t's to S, and (by property (1) and Lemma 9) they must be consistent with the restriction of m to S. By the induction hypothesis, pn-1 = p'n-1. As argued above, there is only one place to insert x into this list.

The algorithm suggested by this proof may take O(n2 k) time in the worst case (say, a transitive tournament where m is the anti-transitive order). However, in general it requires time proportional to the Kendall distance between m and the solution. We do not expect m to be uncorrelated with the solution, and therefore anticipate better performance in practice.
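The insertion procedure from this proof can be sketched as follows (an illustrative Python sketch, not the authors' code; partial lists are sequences ranked best-first, and `majority_prefers` is the obvious pairwise-majority test). Each element of m is inserted, in order, into the lowest position that local Kemeny optimality permits:

```python
def majority_prefers(x, y, lists):
    """More lists (among those ranking both) place x above y than y above x."""
    up = down = 0
    for t in lists:
        if x in t and y in t:
            if t.index(x) < t.index(y):
                up += 1
            else:
                down += 1
    return up > down

def local_kemenize(m, lists):
    """Build the local Kemenization of m: insert each element of m into
    the partial result, scanning up from the bottom while a majority
    prefers it to the element immediately above."""
    p = []
    for x in m:
        i = len(p)
        while i > 0 and majority_prefers(x, p[i - 1], lists):
            i -= 1  # a majority supports moving x above p[i-1]
        p.insert(i, x)
    return p
```

Consistent with the remark above, the number of swaps performed equals the Kendall distance between m and the output, so the running time is proportional to that distance (times k per comparison).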
We remark that computing a Kemeny optimal permutation for two lists is trivial --- simply output one of the input lists. The complexity of computing a Kemeny optimal permutation for three full lists is open; we show later in this section that this problem is reducible to the problem of finding minimum feedback edge sets on tournament graphs, which, as far as we know, is open as well.
Computing a Kemeny optimal permutation for an unbounded number of partial lists is easily seen to be NP-hard by a straightforward encoding of the feedback edge set problem: for each edge (i,j), create a partial list of two elements: i followed by j.
Theorem 11
The problem of computing a Kemeny optimal permutation for a given collection of k full lists, for even integers k >= 4, is NP-hard. The corresponding decision problem is NP-complete.
Proof
The reduction is from the feedback edge set problem. Given a directed graph G = (V,E) and an integer L >= 0, the question is whether there exists a subset F of E such that |F| <= L and (V, E-F) is acyclic. Let n = |V| and m = |E|. Given G, we first produce a graph G' = (V', E') by "splitting" each edge of G into two edges; formally, let V' denote the union of V and the set {ve : e in E}, and let E' = {(i, vi,j), (vi,j, j) : (i,j) in E}. The easy fact that we will use later is that G has a feedback edge set of size L if and only if G' does.

Arbitrarily order all the vertices of G' so that the vertices in V receive the numbers 1, ..., n (and the vertices of the form ve receive numbers n+1, ..., n+m); whenever we wish to refer to this ordering, we will denote it by Z. For a vertex i in V, let out(i) denote a listing of the out-neighbors of i in G' in the order prescribed by Z; similarly, let in(i) denote the in-neighbors of i in G' in the order prescribed by Z. Note that none of the lists out(i) or in(i) contains any vertex from the original graph G. We now define four full lists on the set V'. For a list L, the notation Lr denotes the reversal of the list.
t1 = 1, out(1), 2, out(2), ..., n, out(n)
t2 = n, out(n)r, n-1, out(n-1)r, ..., 1, out(1)r
t3 = 1, in(1), 2, in(2), ..., n, in(n)
t4 = n, in(n)r, n-1, in(n-1)r, ..., 1, in(1)r

The idea is that in t1, each vertex in V precedes all its out-neighbors in G', but the ordering of the out-neighbors of a vertex, as well as the ordering of the vertex-neighbor groups, is arbitrary (according to Z). The list t2 "cancels" the effect of this arbitrariness in ordering the neighbors of a vertex and the vertex-neighbor groups, while "reinforcing" the ordering of each vertex in V above its out-neighbors in G'. Similarly, in t3 and t4, each vertex of the original vertex set V is preceded by its in-neighbors in G', with suitably arranged cancellations of the artificial orderings among the other pairs.

The main claim is that G has a feedback edge set of size L if and only if there is a permutation p such that Sum_r K(p, tr) <= L', where L' = 2L + 2(n(n-1)/2 + m(m-1)/2 + m).
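For concreteness, the four lists can be constructed mechanically from a digraph. A sketch (illustrative, assuming Z orders the original vertices 1..n first and orders neighbors by vertex number, and representing each edge vertex vi,j as the tuple (i, j)):

```python
def reduction_lists(n, edges):
    """Build the four full lists of the reduction from a digraph on
    vertices 1..n with edge set `edges` (a list of (i, j) pairs)."""
    out = {i: sorted(j for (a, j) in edges if a == i) for i in range(1, n + 1)}
    inn = {i: sorted(a for (a, j) in edges if j == i) for i in range(1, n + 1)}
    t1, t2, t3, t4 = [], [], [], []
    for i in range(1, n + 1):           # each vertex above its out-neighbors
        t1 += [i] + [(i, j) for j in out[i]]
        t3 += [i] + [(a, i) for a in inn[i]]  # preceded by its in-neighbors
    for i in range(n, 0, -1):           # reversed groups cancel arbitrariness
        t2 += [i] + [(i, j) for j in reversed(out[i])]
        t4 += [i] + [(a, i) for a in reversed(inn[i])]
    return t1, t2, t3, t4
```

For the 3-cycle on vertices 1, 2, 3, for example, t1 places each vertex immediately above its (unique) out-edge vertex, and t2 lists the same groups in reverse order with reversed neighbor lists, exactly as in the definitions above.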
First suppose that G has a feedback edge set F of size L. It is easy to see that the set F' = {(i, vi,j) : (i,j) in F} is a feedback edge set of G' and |F'| = L. The graph (V', E'-F') is acyclic, so by topologically sorting its vertices we obtain an ordering p of the vertices in V' such that for every (i,j) in E'-F', i is placed before j in p. We claim that p satisfies Sum_r K(p, tr) <= L'.
Note that regardless of how p was obtained, the last three terms are inevitable:
(1) for each pair i,j in V, exactly one of t1 and t2 places i above j and the other places j above i, so there is a contribution of 1 to K(p, t1) + K(p, t2); similarly, there is a contribution of 1 to K(p, t3) + K(p, t4). This accounts for the term 2n(n-1)/2.
(2) a similar argument holds for pairs ve, ve', and there are m(m-1)/2 such pairs, accounting for the term 2m(m-1)/2.
(3) a similar argument holds for pairs vi,j, j with respect to t1 and t2, and for pairs i, vi,j, with respect to t3 and t4. The total number of such pairs is 2m.
The only remaining contribution to the total distance of p from the t's comes from the i, vi,j pairs with respect to t1 and t2 (where i precedes vi,j in both lists), and the vi,j, j pairs with respect to t3 and t4 (where vi,j precedes j in both lists). Of these, a pair contributes 2 to the total Kemeny distance Sr K(p, tr) precisely if it occurs as a "back edge" with respect to the topological ordering p of the vertices of G'; since (V', E'-F') is acyclic, the total number of such back edges is at most |F'| = L.
Conversely, suppose that there exists a permutation p that achieves a total Kemeny distance of at most L' = 2L + 2(n(n-1)/2 + m(m-1)/2 + m). We have already argued (in items (1), (2), and (3) above) that p must incur a distance of 2(n(n-1)/2 + m(m-1)/2 + m) with respect to the t's, so the only extra distance between p and the t's comes from pairs of the form i, vi,j in t1 and t2, and of the form vi,j, j in t3 and t4. Once again, each such pair contributes either 0 or 2 to the total distance. Consider the pairs that contribute 2 to the distance, and let the corresponding set of edges in E' be denoted F'; since the extra distance is at most 2L, we have |F'| <= L. Now, (V', E'-F') is acyclic, since every edge that remains in E'-F', by definition, respects the ordering in p. Thus F' is a feedback edge set of G' of size at most L, and the set F = {(i,j) : (i, vi,j) in F' or (vi,j, j) in F'} is a feedback edge set of G of size at most L.
This completes the proof that computing a Kemeny optimal permutation is NP-hard even when the input consists of four full lists. The proof for the case of even k, k > 4, is a simple extension: first produce four lists as above, then add (k-4)/2 pairs of lists s, sr, where s is an arbitrary permutation. This addition clearly preserves Kemeny optimal solutions; the distance parameter is increased by an additive (k-4) (n+m)(n+m-1)/4 term.