Finding Authorities and Hubs From Link Structures on the World Wide Web

Jeffrey S. Rosenthal, Department of Statistics, University of Toronto, Toronto, Ontario, Canada M5S 3G3. Supported in part by NSERC of Canada. E-mail: [email protected] Web: http://markov.utstat.toronto.edu/jeff

Abstract

Recently, there have been a number of algorithms proposed for analyzing hypertext link structure so as to determine the best "authorities" for a given topic or query. While such analysis is usually combined with content analysis, there is a sense in which some algorithms are deemed to be "more balanced" and others "more focused". We undertake a comparative study of hypertext link analysis algorithms. Guided by some experimental queries, we propose some formal criteria for evaluating and comparing link analysis algorithms.

Keywords: link analysis, web searching, hubs, authorities, SALSA, Kleinberg's algorithm, threshold, Bayesian

Introduction

In recent years, a number of papers [3,8,11,9,4] have considered the use of hypertext links to determine the value of different web pages. In particular, these papers consider the extent to which hypertext links between World Wide Web documents can be used to determine the relative authority values of these documents for various search queries.

We consider some of the previously published algorithms as well as introducing some new alternatives. One of our new algorithms is based on a Bayesian statistical approach as opposed to the more common algebraic/graph theoretic approach. While link analysis by itself cannot be expected to always provide reliable rankings, it is interesting to study various link analysis strategies in an attempt to understand inherent limitations, basic properties and "similarities" between the various methods. To this end, we offer definitions for several intuitive concepts relating to (link analysis) ranking algorithms and begin a study of these concepts.

We also provide some new (comparative) experimental studies of the performance of the various ranking algorithms. It can be seen that no method is completely safe from "topic drift", but some methods do seem to be more resistant than others. We shall see that certain methods have surprisingly similar rankings as observed in our experimental studies, however they cannot be said to be similar with regard to our formalization.

Previous Algorithms

The PageRank Algorithm

One of the earliest and most commercially successful of the efforts to use hypertext link structures in web searching is the PageRank algorithm used by Brin and Page [3] in the Google search engine [7].

The page rank of a given web page

, denoted

, is defined recursively according to the equation

where the sum is taken over all pages

which have a link to page

is the total number of links originating from page

is a number between 0 and 1, and

is a probability distribution (e.g. uniform) over all web pages.

Brin and Page [3] note that the value of

is equivalent to the limiting fraction of time spent on page

by a random walk which proceeds at each step as follows: With probability

it jumps to a sample from the distribution

, and with probability

it jumps uniformly at random to one of the pages linked from the current page. This idea is also used by Rafiei and Mendelzon [11] for computing the "reputation" of a page. Intuitively, the value of

is a measure of the importance or authority of the web page

. This ranking is used as one component of the Google search engine, to help determine how to order the pages returned by a web search query.

Kleinberg's Algorithm

Independent of Brin and Page, Kleinberg [8] proposed a more refined notion for the importance of web pages. He suggested that web page importance should depend on the search query being performed. Furthermore, each page should have a separate "authority" rating (based on the links going to the page) and "hub" rating (based on the links going from the page). Kleinberg proposed first using a text-based web search engine (such as AltaVista [1]) to get a "Root Set" consisting of a short list of web pages relevant to a given query. Second, the Root Set is augmented by pages which link to pages in the Root Set, and also pages which are linked to pages in the Root Set, to obtain a larger `Base Set" of web pages. If

is the number of pages in the final Base Set, then the data for Kleinberg's algorithm consists of an

adjacency matrix

, where $A_{ij}=1$ if there are one or more hypertext links from page

to page

, otherwise $A_{ij}=0$ .

Kleinberg's algorithm assigns to each page

an authority weight

and a hub weight

. Let

denote the vector of all authority weights, and

the vector of all hub weights. Initially both authority and hub vectors are set to

. At each iteration the operations

("in") and

("out") are performed. The operation

sets the authority vector to

. The operation

sets the hub vector to

. A normalization step is then applied, so that the vectors

and

become unit vectors in some norm. Kleinberg proves that after a sufficient number of iterations the vectors

and

converge to the principal eigenvectors of the matrices

and

, respectively. The above normalization step may be performed in various ways. Indeed, ratios such as

will converge to the same value no matter how (or if) normalization is performed.

Kleinberg's Algorithm (and some of the other algorithms we are considering) converge naturally to their principal eigenvector, i.e. to the eigenvector of their transition matrix which corresponds to the largest eigenvalue. Kleinberg [8] makes an interesting (though non-precise) claim that the subsequent non-principal eigenvectors (or their positive and negative components) are sometimes representative of "sub-communities" of web pages. It is easy to construct simple examples which show that subsequent eigenvectors sometimes are, but sometimes are not, indicative of sub-communities; we present a few indicative such examples in the full version of this paper. The significance of non-principal eigenvectors is an important topic that we intend to pursue further.

The SALSA Algorithm

An alternative algorithm, SALSA, was proposed by Lempel and Moran [9]. Like Kleinberg's algorithm, SALSA starts with a similarly constructed Base Set. It then performs a random walk by alternately (a) going uniformly to one of the pages which links to the current page, and (b) going uniformly to one of the pages linked to by the current page. The authority weights are defined to be the stationary distribution of the two-step chain doing first step (a) and then (b), while the hub weights are defined to be the stationary distribution of the two-step chain doing first step (b) and then (a).

Formally, let

denote the set of all nodes that point to

, that is, the nodes we can reach from

by following a link backwards, and let

denote the set of all nodes that we can reach from

by following a forward link. The Markov Chain for the authorities has transition probabilities

Assume for a moment that the Markov Chain is irreducible, that is, the underlying graph consists of a single connected component. The authors prove that the stationary distribution

of the Markov Chain satisfies

, where

is the set of all (backward) links. Similarly, the Markov Chain for the hubs has transition probabilities

and the stationary distribution

satisfies

, where

is the set of all (forward) links.

SALSA does not really have the same "mutually reinforcing structure" that Kleinberg's algorithm does. Indeed, since

, the relative authority of site

within a connected component is determined from local links, not from the structure of the component. (See also the discussion of locality in Section 7.) We also note that in the special case of a single component, SALSA can be viewed as a one-step truncated version of Kleinberg's algorithm. That is, in the first iteration of Kleinberg's algorithm, if we perform the

operation first, the authority weights are set to

, where

is the vector of all ones. If we normalize in the

norm, then $a_i = frac{vert B(i)vert}{vert Bvert}$ , which is the stationary distribution of the SALSA algorithm. A similar observation can be made for the hub weights.

If the underlying graph of the Base Set consists of more than one component, then the SALSA algorithm selects a starting point uniformly at random, and performs a random walk within the connected component that contains that node. Formally, let

be a component that contains node

, let

denote the number of nodes in the component, and

the set of (backward) links in component

. Then, the authority weight of node

Led astray by the simplifying assumption of a single component, we considered a simplified version of the SALSA algorithm where the authority weight of a node is the ratio

. This corresponds to the case that the starting point for the random walk is chosen with probability proportional to the "popularity" of the node, that is, the number of links that point to this node. We will refer to this variation of the SALSA algorithm as pSALSA (popularity SALSA)

. We will consider the original SALSA algorithm as defined in [9] in the full version of our paper.

An interesting generalization of the SALSA algorithm is considered by Rafiei and Mendelzon [11]. They propose an algorithm for computing reputations that is a hybrid of the SALSA algorithm, and the PageRank algorithm. At each step, with probability

, the Rafiei and Mendelzon algorithm jumps to a page of the collection chosen uniformly at random, and with probability

it performs a SALSA step.

The PHITS Algorithm

Cohn and Chang [4] propose a statistical hubs and authorities algorithm, which they call the PHITS Algorithm. They propose a probabilistic model in which a citation

of a document

is caused by a latent "factor" or "topic",

. It is postulated that there are conditional distributions

of a citation

given a factor

, and also conditional distributions

of a factor

given a document

. In terms of these conditional distributions, they produce a likelihood function.

Cohn and Chang then propose using the EM Algorithm of Dempster et al. [5] to assign the unknown conditional probabilities so as to maximize this likelihood function

, and thus best "explain" the proposed data. Their algorithm requires specifying in advance the number of factors

to be considered. Furthermore, it is possible that their EM Algorithm could get "stuck" in a local maximum, without converging to the true global maximum.

Random Walks and the Kleinberg Algorithm

The fact that the output of the first (half) step of the Kleinberg algorithm can be seen as the stationary distribution of a certain random walk on the underlying graph, poses the natural question of whether other intermediary results of Kleinberg's algorithm (and ultimately the output of the algorithm itself) can also be seen as the stationary distribution of a random walk. We show that this is indeed the case.

$begin{proof}% latex2html id marker 106We first introduce the following nota...... to $M_{(BF)^n}$, and $M^h_n$ to$M_{(FB)^n}$ concludes the proof.end{proof}$

Some modifications to the Kleinberg and SALSA Algorithms

While Kleinberg's algorithm has some very desirable properties, it also has its limitations. One potential problem is the possibility of severe "topic drift". Roughly, Kleinberg's algorithm converges to the most "tightly-knit" community within the Base Set. It is possible that this tightly-knit community will have little or nothing to do with the proposed query topic.

A striking example of this phenomenon is provided by Cohn and Chang ([4], p. 6). They use Kleinberg's Algorithm with the search term "jaguar" (an example query suggested by Kleinberg [8]), and converge to a collection of sites about the city of Cincinnati! They determine that the cause of this is a large number of on-line newspaper articles in the Cincinnati Enquirer which discuss the Jacksonville Jaguars football team, and all link to the same standard Cincinnati Enquirer service pages. Interestingly, in a preliminary experiment with the query term "abortion" (another example query suggested by Kleinberg [8]), we also found the Kleinberg Algorithm converging to a collection of web pages about the city of Cincinnati!

Now, in both these cases, we believe it is possible to eliminate such errant behavior through more careful selection of the Base Set, and more careful elimination of intra-domain hypertext links. Nevertheless, we do feel that these examples point to a certain "instability" of Kleinberg's Algorithm.

The Hub-Averaging-Kleinberg Algorithm

We propose here a small modification of Kleinberg's algorithm to help remedy the above-mentioned instability. For motivation, consider the following. Suppose there are

authority pages, and

hub pages, with

large. The first

hubs link only to the first authority, while the final hub links to all

authorities. In such a set-up, we would expect the first authority to be considered much more authoritative than all the others, and Kleinberg's algorithm does indeed do this. On the other hand, it seems that the final hub should be worse than the others, since in addition to linking to a good authority (the first authority), it also links to many bad authorities. However, according to Kleinberg's algorithm, it is the best hub, 5pt because linking to more things can only improve your hub rating.

Inspired by such considerations, we propose an algorithm which is a "hybrid" of the Kleinberg and SALSA algorithms. Namely, it does the authority rating updates

just like Kleinberg (i.e., giving each authority a rating equal to the sum of the hub ratings of all the pages that link to it), but does the hub rating updates

by instead giving each hub a rating equal to the average of the authority ratings of all the pages that it links to. With this modified "Hub-Averaging" algorithm, a hub is better if it links to only good authorities, rather than linking to both good and bad authorities.

The Threshold-Kleinberg Algorithms

We propose two different "threshold" modifications to Kleinberg's Algorithm. The first modification, Hub-Threshold, is applied to the in-step

. When computing the authority weight of the $i^{rm th}$ page, this algorithm does not take into account all hubs that point to page

. It only counts those hubs whose hub weight is at least the average hub weight over all the hubs that point to page

, computed using the current hub weights for the nodes. This corresponds to saying that a site should not be considered a good authority simply because a lot of very poor hubs point to it.

The second modification, Authority-Threshold, is applied to the out-step

. When computing the hub weight of the $i^{rm th}$ page, this algorithm does not take into account all authorities pointed to by page

. It only counts those authorities which are among the top

authorities, judging by current authority values. The value of

is passed as a parameter to the algorithm. This corresponds to saying that a site should not be considered a good hub simply because it points to a number of "acceptable" authorities; rather, to be considered a good hub the site must point to some of the best authorities. This is inspired partially by the fact that, in most web searches, a user only visits the top few authorities.

We also consider a Full-Threshold algorithm, which makes both the Hub-Threshold and Authority-Threshold modifications to Kleinberg's Algorithm.

The Breadth-First-Search Algorithm: A Normalized

-step Variant

When the pSALSA algorithm computes the authority weight of a page, it takes into account only the popularity of this page within its immediate neighborhood, disregarding the rest of the graph. On the other hand, the Kleinberg algorithm considers the whole graph, taking into account more the structure of the graph around the node, than the popularity of that node in the graph. Specifically, after

steps, the authority weight of the $i^{rm th}$ authority is

, where

is the number of

paths that leave node

. Another way to think of this is that the contribution of a node

to the weight of

is equal to the number of

paths that go from

. Therefore, if a small bipartite component intercepts the path between node

and

, the contribution of node

will increase exponentially fast. This may not always be desirable, especially if the bipartite component is not representative of the query.

We propose the Breadth-First-Search (BFS) algorithm, as a generalization of the pSALSA algorithm, and a restriction of the Kleinberg algorithm. The BFS algorithm extends the idea of popularity that appears in pSALSA from a one link neighborhood to an

-link neighborhood. However, instead of considering the number of

paths that leave

, it considers the number of

neighbors of node

. We let

denote the set of nodes that can be reached from

by following a

path. The contribution of node

to the weight of node

depends on the distance of the node

from

. We adopt an exponentially decreasing weighting scheme. Therefore, the weight of node

is determined as follows:

The algorithm starts from node

, and visits its neighbors in BFS order. At each iteration it takes a Backward or a Forward step (depending on whether it is an odd, or an even iteration), and it includes the new nodes it encounters. The weight factors are updated accordingly. Note that each node is considered only once, when it is first encountered by the algorithm.

A Bayesian Algorithm

A different type of algorithm is given by a fully Bayesian statistical approach to authorities and hubs. Suppose there are

hubs and

authorities (which could be the same set). We suppose that each hub

has an (unknown) real parameter

, corresponding to its "general tendency to have hypertext links", and also an (unknown) non-negative parameter

, corresponding to its "tendency to have intelligent hypertext links to authoritative sites". We further suppose that each authority

has an (unknown) non-negative parameter

, corresponding to its level of authority.

Our statistical model is as follows. The a priori probability of a link from hub

to authority

is given by

This reflects the idea that a link is more likely if

is large (in which case hub

has large tendency to link to any site), or if both

and

are large (in which case

is an intelligent hub, and

is a high-quality authority).

To complete the specification of the statistical model from a Bayesian point of view (see, e.g., Bernardo and Smith [2]), we must assign prior distributions to the

unknown parameters

, and

. (These priors should be general and uninformative, and should not depend on the observed data. For large graphs, the choice of priors should have only a small impact on the results.) To do this, we let

and

be fixed parameters, and let each

have prior distribution

, a normal distribution with mean

and variance

. We further let each

and

have prior distribution

(since they have to be non-negative), meaning that for

, ${bf P}(h_ige x) = {bf P}(a_j ge x) = exp(-x)$ .

The (standard) Bayesian inference method then proceeds from this fully-specified statistical model, by conditioning on the observed data, which in this case is the matrix

of actual observed hypertext links in the Base Set. Specifically, when we condition on the data

we obtain a posterior density $pi: {bf R}^{2M+N} to [0,infty)$ for the parameters

. This density is defined so that

for any (measurable) function $g : {bf R}^{2M+N} to {bf R}$ . An easy computation gives the following.

Our Bayesian algorithm then reports the conditional means of the

parameters, according to the posterior density

. That is, it reports final values $widehat{a}_j$ , $widehat{h}_i$ , and $widehat{e}_i$ , where, for example

To actually compute these conditional means is non-trivial. To accomplish this, we used a Metropolis Algorithm. (The Metropolis algorithm is an example of a Markov chain Monte Carlo Algorithm; for background see, e.g., Smith and Roberts [13]; Tierney [14]; Gilks et al. [6]; Roberts and Rosenthal [12]).

There is, of course, some arbitrariness in the specification of this Bayesian algorithm, e.g., in the form of the prior distributions and in the precise formula for the probability of a link from

. However, the model appears to work well in practice, as our experiments show. We note that it is possible that the priors for a new search query could instead depend on the performance of hub

on different previous searches, though we do not pursue that here.

This Bayesian algorithm is similar in spirit to the PHITS algorithm of Cohn and Chang [4] described earlier, in that both use statistical modeling, and both use an iterative algorithm to converge to an answer. However, the algorithms differ substantially in their details. Firstly, they use substantially different statistical models. Secondly, the PHITS algorithm uses a non-Bayesian (i.e. "classical" or "frequentist") statistical framework, as opposed to the Bayesian framework adopted here.

A Simplified Bayesian Algorithm

It is possible to simplify the above Bayesian model, by replacing equation (2) with ${bf P}(i to j) = ( a_j h_i) / (1+ a_j h_i )$ and correspondingly replace equation (3) with ${bf P}(i notto j) = 1 / (1+ a_j h_i )$ . This eliminates the parameters

entirely, so that we no longer need the prior values

and

This leads to a slightly modified posterior density

, now given by $pi: {bf R}^{M+N} to {bf R}^{ge 0}$ where

This Simplified Bayesian algorithm was designed to be to similar to the original Bayesian algorithm. Surprisingly, we will see that experimentally it often performs very similarly to the pSALSA algorithm.

Experimental Results

We have implemented the algorithms presented here on various queries. Because of space limitations we only report here (see Appendix A) a representative subset of results; all of our results (including the queries "death penalty", "computational complexity" and "gun control" which are not reported here) can be obtained at
http://www.cs.toronto.edu/tsap/experiments. The reader may find it easier to follow the discussion in the next section by accessing the full set of results. For the generation of the Base Set of pages, we follow the specifications of Kleinberg [8] described earlier. For each of the queries, we begin by generating a Root Set that consists of the first 200 pages returned by AltaVista on the same query. The Root Set is then expanded to the Base Set by including nodes that point to, or are pointed to, by the pages in the Root Set. In order to keep the size of the Base Set manageable, for every page in the Root Set, we only include the first 50 pages returned from AltaVista that point to this page. We then construct the graph induced by nodes in the Base Set, by discovering all links among the pages in the Base Set, eliminating those that are between pages of the same domain

For each query, we tested nine different algorithms on the same Base Set. We present the top ten authority sites returned by each of the algorithms. For evaluation purposes, we also include a list of the URL and title (possibly abbreviated) of each site which appears in the top five of one or more of the algorithms. For each page we also note the popularity of the page (denoted pop in the tables), that is, the number of different algorithms that rank it in the top ten sites. The pages that seem (to us) to be generally unrelated with the topic in hand appear bold-faced. We also present an "intersection table" which provides, for each pair of algorithms, the number of sites which were in the top ten according to both algorithms (maximum 10, minimum 0).

In the tables, SBayesian denotes the Simplified Bayesian algorithm, HubAvg denotes the Hub-Averaging Kleinberg algorithm, AThresh denotes the Authority-Threshold algorithm, HThresh denotes the Hub-Threshold algorithm, and FThresh denotes the Full-Threshold algorithm. For the
Authority-Threshold and Full-Threshold algorithms, we (arbitrarily) set the threshold

Discussion of Experimental Results

We observe from the experiments that different algorithms emerge as the "best" for different queries, while there are queries for which no algorithm seems to perform well. One prominent such case is the query on "net censorship" (also on "computational complexity") where only a few of the top ten pages returned by any of the algorithms can possibly be considered as authoritative on the subject. One possible explanation is that in these cases the topic is not well represented on the web, or there is no strong interconnected community. This reinforces a common belief that any commercial search engine cannot rely solely on link information, but rather must also examine the text content of sites to prevent such difficulties as "topic drift". On the other hand, in cases such as "death penalty" (not shown here), all algorithms converge to almost the same top ten pages, which are both relevant and authoritative. In these cases the community is well represented, and strongly interconnected.

The experiments also indicate the difference between the behavior of the Kleinberg algorithm and pSALSA, first observed for the SALSA algorithm in the original paper of Lempel and Moran [9]. Specifically, when computing the top authorities, the Kleinberg algorithm tends to concentrate on a "tightly knit community" of nodes (the TKC effect), while pSALSA, like SALSA, tends to mix the authorities of different communities in the top authorities. The TKC effect becomes clear in the "genetic" query, where the Kleinberg algorithm only reports pages on biology in the top ten while pSALSA mixes these pages with pages on genetic algorithms. It also becomes poignantly clear in the "movies" query (and also in the "gun control" and the "abortion" query), where the top ten pages reported by the Kleinberg algorithm are dominated by an irrelevant cluster of nodes from the about.com community. A more elaborate algorithm for detecting intra-domain links could help alleviate this problem. However, these examples seem indicative of the topic drift potential of the principal eigenvector in the Kleinberg algorithm.

On the other hand, the limitations of the pSALSA algorithm become obvious in the "computational geometry" query, where three out of the top ten pages belong to the unrelated w3.com community. They appear in the top positions because they are pointed to by a large collection of pages by ACM, which point to nothing else. A similar phenomenon explains the appearance of the "Yahoo!" page in the "genetic" query. We thus see that the simple heuristic of counting the in-degree as the authority weight is also imperfect.

We identify two types of characteristic behavior: the Kleinberg behavior, and the pSALSA behavior. The former ranks the authorities based on the structure of the entire graph, and tends to favor the authorities of tightly knit communities. The latter ranks the authorities based on their popularity in their immediate neighborhood, and favors various authorities from different communities. To see how the rest of the algorithms fit within these two types of behaviors, we compare the behavior of algorithms on a pairwise basis, using the number of intersections in their respective top ten authorities as an indication of agreement.

The first striking observation is that the Simplified Bayesian algorithm is almost identical to the pSALSA algorithm. The pSALSA algorithm and the Simplified Bayesian have at least

overlap on all queries. One possible explanation for this is that both algorithms place great importance on the in-degree of a node when determining the authority weight of a node. For the pSALSA algorithm we know that it is "local" in nature, that is, the authority weight assigned to a node depends only on the links that point to this node, and not on the structure of the whole graph. The Simplified Bayesian seems to possess a similar, yet weaker property; we explore the locality issue further in the next section. On the other hand, the Bayesian algorithm appears to resemble both the Kleinberg and the pSALSA behavior, leaning more towards the first. Indeed, although the Bayesian algorithm avoids the severe topic drift in the "movies" and the "gun control" queries (but not in the "abortion" case), it usually has higher intersection numbers with Kleinberg than with pSALSA. One possible explanation for this observation is the presence of the

parameters in the Bayesian algorithm (but not the Simplified Bayesian algorithm), which "absorb" some of the effect of many links pointing to a node, thus causing the authority weight of a node to be less dependent on its in-degree.

Another algorithm that seems to combine characteristics of both the pSALSA and the Kleinberg behavior is the Hub-Averaging algorithm. The Hub-Averaging algorithm is by construction a hybrid of the two since it alternates between one step of each algorithm. It shares certain behavior characteristics with the Kleinberg algorithm: if we consider a full bipartite graph, then the weights of the authorities increase exponentially fast for Hub-Averaging (the rate of increase, however, is the square root of that of the Kleinberg algorithm). However, if the component becomes infiltrated, by making one of the hubs point to a node outside the component, then the weights of the authorities in the component drop. This prevents the Hub-Averaging algorithm from completely following the drifting behavior of the Kleinberg algorithm in the "movies" query. Nevertheless, in the "genetic" query, Hub-Averaging agrees strongly with Kleinberg, focusing on sites of a single community, instead of mixing communities as does pSALSA

. On the other hand, Hub-Averaging and pSALSA share a common characteristic, since the Hub-Averaging algorithm tends to favor nodes with high in-degree: if we consider an isolated component of one authority with high in-degree, the authority weight of this node will increase exponentially fast. This explains the fact that the top three authorities for "computational geometry" are the w3.com pages that are also ranked highly by pSALSA (with Hub-Averaging giving a very high weight to all three authorities).

For the threshold algorithms, since they are modifications of the Kleinberg Algorithm, they are usually closer to the Kleinberg behavior. This is especially true for the Hub-Threshold algorithm. However, the benefit of eliminating unimportant hubs when computing authorities becomes obvious in the "abortion" query. If one looks further than the first ten pages returned by Kleinberg, one observes that after the first nine pages, which belong to the amazon.com community, the rest of the pages are on topic. The Hub-Threshold algorithm escapes this cluster, and moves directly to the relevant pages. The intersection between the top ten pages of Hub-Threshold, and the set of pages in the positions 10 to 20 in the Kleinberg algorithm is

The Authority-Threshold often appears to be most similar with the Hub-Averaging algorithm. This makes sense since these two algorithms have a similar underlying motivation. The best moment for Authority-Threshold is the "movies" query, where it reports the most relevant top ten pages among all algorithms. The Full-Threshold algorithm combines elements of both the Threshold algorithms; however, usually it reports in the top ten a mixture of the results of the two algorithms, rather than the best of the two.

Finally, the BFS algorithm is designed to be a generalization of the pSALSA algorithm, that combines some elements of the Kleinberg algorithm. Its behavior resembles both pSALSA and Kleinberg, with a tendency to favor pSALSA. In the "genetic" and "abortion" queries it demonstrates some mixing, but to a lesser extent than that of pSALSA. The most successful moments for BFS are the "abortion" and the "gun control" queries where it reports a set of top ten pages that are all on topic. An interesting question to investigate is how the behavior of the BFS algorithm is altered if we change the weighting scheme of the neighbors.

Theoretical Analysis

The experimental results of the previous section suggest that certain algorithms seem to share similar properties and ranking behavior. In this section, we initiate a (preliminary) formal study of fundamental properties and comparisons between ranking algorithms. For the purpose of following analysis we need some basic definitions and notation. Let ${cal G}_N$ be a collection of graphs of size

. One special case is to let ${cal G}_N$ be the set of all graphs of size

, hereafter denoted ${overline{cal G}}_N$ . We define a link analysis algorithm

as a function that maps a graph $G in {cal G}_N$ to an

-dimensional vector. We call the vector

the weight vector of algorithm

on graph

. The value of the entry

of vector

denotes the authority weight assigned by the algorithm

to the page

We can normalize the weight vector

under some chosen norm. The choice of normalization affects the definition of some of the properties of the algorithms, so we discriminate between algorithms that use different norms. For any norm

, we define the

-algorithm

to be the algorithm

, where the weight vector of

is normalized under

. We also examine unnormalized algorithms, where no normalization is performed at any stage of the algorithm. For example, the unnormalized pSALSA algorithm assigns a weight to page

equal to the in-degree of page

. For the following discussion, when not stated explicitly, we will assume that the weight vectors of the algorithms are normalized under the

norm (i.e. each weight is divided by the maximum weight); this gives weight 1 to the top authority, with other weights given as a fraction of the top weight. Due to space constraints, many proofs have been omitted in the following sections.

Monotonicity

Monotonicity appears to be a "reasonable" property that any sensible link-analysis algorithm should satisfy.

Similarity

Let

and

be two algorithms on ${cal G}_N$ . We shall consider the distance

between the weight vectors of

and

, for $G in {cal G}_N$ , where $d : {bf R}^n times {bf R}^n rightarrow {bf R}$ is some function that maps the weight vectors

and

to a real number

. We shall consider the Manhattan distance

, that is, the

norm of the difference of the weight vectors, given by $d_1(w_1,w_2) = sum_{i =1}^{N} vert w_1(i) -w_2(i)vert$ .

For this distance function, we now define the similarity between two

-algorithms as follows

$begin{definition}Two $L_infty$-algorithms $A_1$ and $A_2$ are similar on$......cal G}_N} d_1(A_1(G),A_2(G)) = o(N) .end{displaymath}end{definition}$

We also consider another distance function that attempts to capture the similarity between the ordinal rankings of two algorithms. The motivation behind this definition is that the ordinal ranking is the usual end-product seen by the user. Let

and

be the weight vectors of two algorithms

and

. We define the indicator function ${bf I}_{w_1w_2}(i,j)$ as follows

We note that ${bf I}_{w_1w_2}(i,j) = 0$ if and only if

. ${bf I}_{w_1w_2}(i,j)$ becomes one for each pair of nodes that are ranked differently. We define the "ranking distance" function

as follows.

Note that, unlike

, the distance between two weight vectors under

does not depend upon the choice of normalization.

$begin{definition}Two algorithms, $A_1$ and $A_2$, are rank matching on ${{c......begin{displaymath}d_r(A_1(G),A_2(G)) = 0 .end{displaymath}end{definition}$

$begin{proof}Consider a graph $G$ on $N = 3r$ nodes that consists of twodis......K,w_H) = Theta(N)$.Thus, the two algorithms are not rank matching.end{proof}$

Let ${cal G}_N^{nest}$ be the set of all size-

nested graphs. (Of course, ${cal G}_N^{nest}$ is a rather restricted set of size-

graphs.)

Stability and Locality

In the previous section we examined the similarity of two different algorithms on the same graph

. In this section we are interested on how the output of a fixed algorithm changes, as we alter the graph. We would like small changes in the graph to have a small effect on the weight vector of the algorithm. We capture this requirement by the definition of stability. For the following, let

denote the set of all edges (i.e. links) in the graph

. We assume that

, otherwise all properties that we discuss below are trivial. The following definition applies to unnormalized, and

-algorithms

$begin{definition}An algorithm $A$ is stable on ${{cal G}_N}$ if for every...... the graph $G$ with theedges $ell_1,ldots,ell_k$ removed.end{definition}$

$begin{definition}An algorithm $A$ is rank stable on ${{cal G}_N}$ if for ......us {ell_1, ldots, ell_k})) = o(N).end{displaymath}parend{definition}$

Stability may be a desirable property. Indeed, the algorithms all act on a base set which is generated using some other search engine (e.g. AltaVista [1]) and the associated hypertext links. Presumably with a "very good" base set, all the algorithms would perform well. However, if an algorithm is not stable, then slight changes in the base set (or its link structure) may lead to large changes in the rankings given by the algorithm. Thus, stability may provide "protection" from poor base sets. We note that the parameter

used in the definition of stability allows for an arbitrary scaling of the second weight vector, thus eliminating instability which is caused solely by different normalization factors.

We now introduce the idea of "locality". The idea behind locality is that a change in the in-links of a node should have only a small effect on the weights of the rest of the nodes.

$begin{definition}An algorithm $A$ is local if for graph $G$, and every link $......p^*}$, where $p^*$ is the page pointed to bythe link $ell$.end{definition}$

$begin{definition}An algorithm $A$ is pairwise local if for every graph $G$, a......}$, where $p^*$ is the documentlinked to by the link $ell$.end{definition}$

$begin{definition}An algorithm $A$ is rank local if for every graph $G$, and e......}$, where $p^*$ is the document linked to by thelink $ell$.end{definition}$

We note that locality depends on the normalization used, but pairwise locality and rank locality do not. The following lemmas are direct consequences of the definitions.

$begin{proof}Let $Gin {cal G}_n$ be a graph, and let $G' =Gsetminus{ell......Omega(1)$, $d_1(w, gamma' w') = o(N)$,therefore $A$ is stable.end{proof}$

We originally thought that the Bayesian and Simplified Bayesian Algorithms were also local. However, it turns out that they are neither local nor pairwise local. Indeed, it is true that conditional on the values of

, and

, the conditional distribution of

for

is unchanged upon removing a link from

. However, the unconditional marginal distribution of

, and hence also its posterior mean $widehat{a}_k$ (or even ratios $widehat{a}_k / widehat{a}_q$ for

), may still be changed upon removing a link from

. (Indeed, we have computed experimentally that $widehat{a}_3 / widehat{a}_4$ may change upon removing a link from 1 to 2, even for a simple example with just four nodes.) Hence, neither the Bayesian nor the Simplified Bayesian Algorithm is local or pairwise local.

Finally, we use locality and "label-independence" to prove a uniqueness property of the pSALSA algorithm.

All of our algorithms are clearly label-independent, and we would expect this property of any reasonable algorithm. We have the following.

$begin{proof}Let $G$ be any graph, and let $i$ be a node in $G$. Let $w_A= ......, or unnormalized variant of $A$ isalso rank matching with pSALSA.end{proof}$

Symmetry

Summary

We have considered a number of known and some new algorithms which use the hypertext link structure of World Wide Web pages to extract information about the relative ranking of these pages. In particular, we have introduced two algorithms based on Bayesian statistical approach as well as a number of algorithms which are modifications of Kleinberg's seminal hubs and authority algorithm. Based on 8 different queries (5 presented here), we discuss some observed properties of each algorithm as well as relationships between the algorithms. We found (experimentally) that certain algorithms appear to be more "balanced", while others more "focused". The latter tend to be sensitive to the existence of tightly interconnected clusters, which may cause them to drift. The intersections between the lists of the top-ten results of the algorithms suggest that certain algorithms exhibit similar behavior and properties.

Motivated by the experimental observations, we introduced a theoretical framework for the study and comparison of link-analysis ranking algorithms. We formally defined (and gave some preliminary results for) the concepts of monotonicity and locality, as well as various concepts of distance and similarity between ranking algorithms.

Our work leaves a number of interesting open questions. The two Bayesian algorithms open the door to the use of other statistical and machine learning techniques for ranking of hyper-linked documents. Furthermore, the framework we defined suggests a number of interesting directions for the theoretical study of ranking algorithms, which we have just begun to explore in this work.

Bibliography

Experiments

Query: abortion (Base Set size = 2293)

	Kleinberg	pSALSA	HubAvg	AThresh	HThresh	FThresh	BFS	SBayesian	Bayesian
1.	P-1165	P-717	P-1165	P-1165	P-717	P-1165	P-717	P-717	P-717
2.	P-1184	P-1461	P-1184	P-1184	P-1461	P-1184	P-719	P-1461	P-1192
3.	P-1193	P-719	P-1193	P-1193	P-1769	P-1193	P-1769	P-1191	P-1165
4.	P-1187	P-1165	P-1187	P-1187	P-719	P-1461	P-1461	P-719	P-1191
5.	P-1188	P-1184	P-1188	P-1188	P-1	P-719	P-962	P-1184	P-1193
6.	P-1189	P-1193	P-1189	P-1189	P-718	P-717	P-0	P-1165	P-1184
7.	P-1190	P-1187	P-1190	P-1190	P-0	P-1	P-2	P-1192	P-1189
8.	P-1191	P-1188	P-1191	P-1191	P-115	P-0	P-718	P-1193	P-1188
9.	P-1192	P-1189	P-1192	P-1192	P-2515	P-115	P-1325	P-1188	P-1187
10.	P-717	P-1190	P-1948	P-1948	P-962	P-607	P-1522	P-1187	P-1190

Index	pop	URL	Title
P-0	3	www.gynpages.com	Abortion Clinics OnLine
P-1	2	www.prochoice.org	NAF - The Voice of Abortion Providers
P-2	1	www.cais.com/agm/main	The Abortion Rights Activist Home Page
P-115	2	www.ms4c.org	Medical Students for Choice
P-607	1	www.feministcampus.org	Feminist Campus Activism Online: Welcome
P-717	7	www.nrlc.org	National Right to Life Organization
P-718	2	www.hli.org	Human Life International (HLI)
P-719	5	www.naral.org	NARAL: Abortion and Reproductive Rights: ...
P-962	2	www.prolife.org/ultimate	Empty title field
P-1165	7	www5.dimeclicks.com	DimeClicks.com - Web and Marketing Solutions
P-1184	7	www.amazon.com/...../youdebatecom	Amazon.com-Earth's Biggest Selection
P-1187	6	www.amazon.com/...../top-sellers.html	Amazon.com-Earth's Biggest Selection
P-1188	6	www.amazon.com/.../software/home.html	Amazon.com Software
P-1189	5	www.amazon.com/.../hot-100-music.html	Amazon.com-Earth's Biggest Selection
P-1190	5	www.amazon.com/.../gifts.html	Amazon.com-Earth's Biggest Selection
P-1191	5	www.amazon.com/.....top-100-dvd.html	Amazon.com-Earth's Biggest Selection
P-1192	5	www.amazon.com/...top-100-video.html	Amazon.com-Earth's Biggest Selection
P-1193	7	rd1.hitbox.com/.......	HitBox.com - hitbox web site .......
P-1325	1	www.serve.com/fem4life	Feminists For Life of America
P-1461	5	www.plannedparenthood.org	Planned Parenthood Federation of America
P-1522	1	www.naralny.org	NARAL/NY
P-1769	2	www.priestsforlife.org	Priests for Life Index
P-1948	2	www.politics1.com/issues.htm	Politics1: Hot Political Debates & Issues
P-2515	1	www.ohiolife.org	Ohio Right To Life

	Kleinberg	pSALSA	HubAvg	AThresh	HThresh	FThresh	BFS	SBayesian	Bayesian
Kleinberg	10	8	9	9	1	4	1	8	10
pSALSA	8	10	7	7	3	6	3	8	8
HubAvg	9	7	10	10	0	3	0	7	9
AThresh	9	7	10	10	0	3	0	7	9
HThresh	1	3	0	0	10	6	7	3	1
FThresh	4	6	3	3	6	10	4	6	4
BFS	1	3	0	0	7	4	10	3	1
SBayesian	8	8	7	7	3	6	3	10	8
Bayesian	10	8	9	9	1	4	1	8	10

Query: +net +censorship (Base Set size = 2947)

	Kleinberg	pSALSA	HubAvg	AThresh	HThresh	FThresh	BFS	SBayesian	Bayesian
1.	P-375	P-371	P-371	P-371	P-375	P-375	P-371	P-371	P-375
2.	P-3163	P-1440	P-2874	P-375	P-1344	P-371	P-375	P-1440	P-371
3.	P-3180	P-375	P-2871	P-2871	P-3132	P-1344	P-1299	P-375	P-3180
4.	P-3177	P-2874	P-2873	P-2874	P-3163	P-3130	P-1440	P-2874	P-3177
5.	P-3173	P-2871	P-2659	P-3536	P-3166	P-3131	P-2871	P-1299	P-3163
6.	P-3172	P-1299	P-375	P-2873	P-3167	P-3132	P-2874	P-2871	P-3173
7.	P-3132	P-3536	P-3536	P-2659	P-3168	P-3133	P-3536	P-3536	P-3166
8.	P-3193	P-1712	P-2639	P-2639	P-3170	P-3135	P-1802	P-1712	P-3193
9.	P-3170	P-268	P-1440	P-1440	P-3171	P-3161	P-2639	P-268	P-3168
10.	P-3166	P-1445	P-2867	P-3180	P-3172	P-3162	P-452	P-1445	P-3132

Index	pop	URL	Title
P-268	2	www.epic.org	Electronic Privacy Information Center
P-371	7	www.yahoo.com	Yahoo!
P-375	9	www.cnn.com	CNN.com
P-452	1	www.mediachannel.org	MediaChannel.org - A Global Network ........
P-1299	3	www.eff.org/blueribbon.html	EFF Blue Ribbon Campaign
P-1344	2	www.igc.apc.org/peacenet	PeaceNet Home
P-1440	5	www.eff.org	EFF ... - the Electronic Frontier Foundation
P-1445	2	www.cdt.org	The Center for Democracy and Technology
P-1712	2	www.aclu.org	ACLU: American Civil Liberties Union
P-1802	1	ukonlineshop.about.com	Online Shopping: UK
P-2639	3	www.imdb.com	The Internet Movie Database (IMDb).
P-2659	2	www.altavista.com	AltaVista - Welcome
P-2867	1	home.netscape.com	Empty title field
P-2871	5	www.excite.com	My Excite Start Page
P-2873	2	www.mckinley.com	Welcome to Magellan!
P-2874	5	www.lycos.com	Lycos
P-3130	1	www.city.net/countries/kyrgyzstan	Excite Travel
P-3131	1	www.bishkek.su/krg/Contry.html	ElCat. 404: Not Found.
P-3132	4	www.pitt.edu/~cjp/rees.html	REESWeb: Programs:
P-3133	1	www.ripn.net	RIPN
P-3135	1	www.yahoo.com/.../Kyrgyzstan	Yahoo! Regional Countries Kyrgyzstan
P-3161	1	151.121.3.140/fas/fas-publications/...	Error 404 Redirector
P-3162	1	www.rferl.org/BD/KY	RFE/RL Kyrgyz Service : News
P-3163	3	www.usa.ft.com	Empty title field
P-3166	3	www.pathfinder.com/time/daily	TIME.COM
P-3167	1	travel.state.gov	US State Department - Services - Consular Affairs
P-3168	2	www.yahoo.com/News	Yahoo! News and Media
P-3170	2	www.financenet.gov	...FinanceNet is the government's official home...
P-3171	1	www.securities.com	ISI Emerging Markets
P-3172	2	www.oecd.org	OECD Online
P-3173	2	www.worldbank.org	The World Bank Group
P-3177	2	www.envirolink.org	EnviroLink Network
P-3180	3	www.lib.utexas.edu/.../Map_collection	PCL Map Collection
P-3193	2	www.wiesenthal.com	Simon Wiesenthal Center
P-3536	5	www.shareware.com	CNET.com - Shareware.com

	Kleinberg	pSALSA	HubAvg	AThresh	HThresh	FThresh	BFS	SBayesian	Bayesian
Kleinberg	10	1	1	2	6	2	1	1	8
pSALSA	1	10	6	6	1	2	7	10	2
HubAvg	1	6	10	9	1	2	7	6	2
AThresh	2	6	9	10	1	2	7	6	3
HThresh	6	1	1	1	10	3	1	1	5
FThresh	2	2	2	2	3	10	2	2	3
BFS	1	7	7	7	1	2	10	7	2
SBayesian	1	10	6	6	1	2	7	10	2
Bayesian	8	2	2	3	5	3	2	2	10

Query: Movies (Base Set size = 5757)

	Kleinberg	pSALSA	HubAvg	AThresh	HThresh	FThresh	BFS	SBayesian	Bayesian
1.	P-678	P-999	P-999	P-999	P-678	P-999	P-999	P-999	P-999
2.	P-2268	P-2832	P-2832	P-2832	P-2266	P-2832	P-2832	P-2832	P-2832
3.	P-2304	P-6359	P-803	P-6359	P-2268	P-1989	P-2827	P-6359	P-2827
4.	P-2305	P-2827	P-1539	P-2827	P-999	P-1911	P-2803	P-2827	P-6359
5.	P-2306	P-2120	P-2101	P-2838	P-2832	P-1980	P-5470	P-1374	P-678
6.	P-2308	P-1374	P-1178	P-6446	P-2263	P-1983	P-2120	P-2120	P-2838
7.	P-2310	P-803	P-6359	P-5	P-2264	P-1984	P-4577	P-803	P-2266
8.	P-2266	P-1539	P-1082	P-2803	P-2265	P-1986	P-5	P-6446	P-2268
9.	P-2325	P-6446	P-2827	P-2839	P-2280	P-1987	P-2838	P-1539	P-2308
10.	P-2299	P-2838	P-6446	P-2840	P-2304	P-1993	P-4534	P-2838	P-2330

Index	pop	Title	URL
P-5	2	www.movies.com	Movies.com
P-678	3	chatting.about.com	Empty title field
P-803	3	www.google.com	Google
P-999	8	www.moviedatabase.com	The Internet Movie Database (IMDb).
P-1082	1	www.amazon.com/	Amazon.com-Earth's Biggest Selection
P-1178	1	www.booksfordummies.com	Empty title field
P-1374	2	www.onwisconsin.com	On Wisconsin
P-1539	3	206.132.25.51	Washingtonpost.com - News Front
P-1911	1	people2people.com/...nytoday	People2People.com - Search
P-1980	1	newyork.urbanbaby.com/nytoday	Kids & Family
P-1983	1	tunerc1.va.everstream.com/nytoday/	Empty title field
P-1984	1	nytoday.opentable.com/	OpenTable
P-1986	1	www.nytimes.com/.../jobmarket	The New York Times: Job Market
P-1987	1	www.cars.com/nytimes	New York Today cars.com - new and used car ...
P-1989	1	www.nytodayshopping.com	New York Today Shopping - Shop for computers, ...
P-1993	1	www.nytimes.com/.../nytodaymediakit	New York Today - Online Media Kit
P-2101	1	www2.ebay.com/aw/announce.shtml	eBay Announcement Board
P-2120	3	www.mylifesaver.com	welcome to mylifesaver.com
P-2263	1	clicks.about.com/...nationalinterbank	Banking Center
P-2264	1	clicks.about.com/	Credit Report, Free Trial Offer
P-2265	1	membership.about.com/...	Member Center
P-2266	3	home.about.com/movies	About - Movies
P-2268	3	a-zlist.about.com	About.com A-Z
P-2280	1	sprinks.about.com	Sprinks : About Sprinks
P-2299	1	home.about.com/aboutaus	About Australia
P-2304	2	home.about.com/arts	About - Arts/Humanities
P-2305	1	home.about.com/autos	About - Autos
P-2306	1	home.about.com/citiestowns	About - Cities/Towns
P-2308	2	home.about.com/compute	About - Computing/Technology
P-2310	1	home.about.com/education	About - Education
P-2325	1	home.about.com/musicperform	About - Music/Performance
P-2330	1	home.about.com/recreation	About - Recreation/Outdoors
P-2803	2	www.allmovie.com	All Movie Guide
P-2827	6	www.film.com	Film.com Movie Reviews, News, Trailers...
P-2832	8	www.hollywood.com	Hollywood.com - Your entertainment source...
P-2838	5	www.mca.com	Universal Studios
P-2839	1	www.mgmua.com	MGM - Home Page
P-2840	1	www.miramax.com	Welcome to the Miramax Cafe
P-4534	1	www.aint-it-cool-news.com	Ain't It Cool News
P-4577	1	go.com	GO.com
P-5470	1	www.doubleaction.net	Double Action - Stand. Point. Laugh.
P-6359	5	www.paramount.com	Paramount Pictures - Home Page
P-6446	4	www.disney.com	Disney.com - Where the Magic Lives Online!

	Kleinberg	pSALSA	HubAvg	AThresh	HThresh	FThresh	BFS	SBayesian	Bayesian
Kleinberg	10	0	0	0	4	0	0	0	4
pSALSA	0	10	7	6	2	2	5	10	5
HubAvg	0	7	10	5	2	2	3	7	4
AThresh	0	6	5	10	2	2	6	6	5
HThresh	4	2	2	2	10	2	2	2	5
FThresh	0	2	2	2	2	10	2	2	2
BFS	0	5	3	6	2	2	10	5	4
SBayesian	0	10	7	6	2	2	5	10	5
Bayesian	4	5	4	5	5	2	4	5	10

Query: genetic (Base Set size = 3468)

	Kleinberg	pSALSA	HubAvg	AThresh	HThresh	FThresh	BFS	SBayesian	Bayesian
1.	P-2187	P-2187	P-2187	P-2187	P-2187	P-2187	P-2187	P-2187	P-2187
2.	P-1057	P-258	P-1057	P-1057	P-1057	P-1057	P-3932	P-258	P-1057
3.	P-2168	P-1057	P-3932	P-2168	P-2168	P-2168	P-1538	P-1057	P-2168
4.	P-2200	P-3932	P-2095	P-2200	P-2200	P-2200	P-1057	P-3932	P-2095
5.	P-2219	P-2095	P-2168	P-2219	P-2219	P-2219	P-2095	P-2095	P-2200
6.	P-2199	P-1538	P-2186	P-2095	P-2199	P-2095	P-258	P-1538	P-2219
7.	P-2095	P-2	P-941	P-3932	P-2186	P-3932	P-2168	P-2	P-3932
8.	P-2186	P-2168	P-0	P-2199	P-2095	P-2199	P-2200	P-2168	P-2199
9.	P-2193	P-941	P-2200	P-2186	P-2193	P-2186	P-2	P-941	P-2186
10.	P-3932	P-23	P-2199	P-2193	P-3932	P-2193	P-2199	P-2200	P-2193

Index	pop	URL	Title
P-0	1	www.geneticalliance.org	Genetic Alliance, Washington, DC
P-2	3	www.genetic-programming.org	genetic-programming.org-Home-Page
P-23	1	www.geneticprogramming.com	The Genetic Programming Notebook
P-258	3	www.aic.nrl.navy.mil/galist	The Genetic Algorithms Archive
P-941	3	www3.ncbi.nlm.nih.gov/Omim	OMIM Home Page - Online Mendelian Inheritance in Man
P-1057	9	gdbwww.gdb.org	The Genome Database
P-1538	3	www.yahoo.com	Yahoo!
P-2095	9	www.nhgri.nih.gov	National Human Genome Research Institute (NHGRI)
P-2168	9	www-genome.wi.mit.edu	Welcome To the ..... Center for Genome Research
P-2186	6	www.ebi.ac.uk	EBI, the European Bioinformatics Institute ........
P-2187	9	www.ncbi.nlm.nih.gov	NCBI HomePage
P-2193	5	www.genome.ad.jp	GenomeNet WWW server
P-2199	7	www.hgmp.mrc.ac.uk	UK MRC HGMP-RC
P-2200	8	www.tigr.org	The Institute for Genomic Research
P-2219	5	www.sanger.ac.uk	The Sanger Centre Web Server
P-3932	9	www.nih.gov	National Institutes of Health (NIH)

	Kleinberg	pSALSA	HubAvg	AThresh	HThresh	FThresh	BFS	SBayesian	Bayesian
Kleinberg	10	5	8	10	10	10	7	6	10
pSALSA	5	10	6	5	5	5	8	9	5
HubAvg	8	6	10	8	8	8	7	7	8
AThresh	10	5	8	10	10	10	7	6	10
HThresh	10	5	8	10	10	10	7	6	10
FThresh	10	5	8	10	10	10	7	6	10
BFS	7	8	7	7	7	7	10	9	7
SBayesian	6	9	7	6	6	6	9	10	6
Bayesian	10	5	8	10	10	10	7	6	10

Query: +computational +geometry (Base Set size = 1226)

	Kleinberg	pSALSA	HubAvg	AThresh	HThresh	FThresh	BFS	SBayesian	Bayesian
1.	P-161	P-161	P-634	P-161	P-0	P-161	P-0	P-161	P-161
2.	P-0	P-1	P-632	P-0	P-161	P-0	P-161	P-1	P-0
3.	P-1	P-0	P-633	P-1	P-1	P-1	P-1	P-0	P-1
4.	P-162	P-3	P-161	P-162	P-162	P-162	P-300	P-3	P-162
5.	P-3	P-280	P-1	P-3	P-280	P-280	P-299	P-280	P-3
6.	P-280	P-634	P-1406	P-280	P-3	P-3	P-162	P-634	P-280
7.	P-275	P-162	P-0	P-275	P-275	P-275	P-3	P-162	P-275
8.	P-299	P-2	P-3	P-299	P-299	P-299	P-280	P-2	P-299
9.	P-300	P-632	P-162	P-300	P-300	P-848	P-375	P-633	P-300
10.	P-848	P-633	P-280	P-848	P-848	P-300	P-551	P-632	P-848

Index	pop	URL	Title
P-0	9	www.geom.umn.edu/software/cglist	Directory of Computational Geometry Software
P-1	9	www.cs.uu.nl/CGAL	The former CGAL home page
P-2	2	link.springer.de/link/service/journals/00454	LINK: Peak-time overload
P-3	9	www.scs.carleton.ca/~csgs/resources/cg.html	Computational Geometry Resources
P-161	9	www.ics.uci.edu/~eppstein/geom.html	Geometry in Action
P-162	9	www.ics.uci.edu/~eppstein/junkyard	The Geometry Junkyard
P-275	5	www.ics.uci.edu/~eppstein	David Eppstein
P-280	9	www.geom.umn.edu	The Geometry Center Welcome Page
P-299	6	www.mpi-sb.mpg.de/LEDA/leda.html	LEDA - Main Page of LEDA Research
P-300	6	www.cs.sunysb.edu/~algorith	The Stony Brook Algorithm Repository
P-375	1	graphics.lcs.mit.edu/~seth	Seth Teller
P-551	1	www.cs.sunysb.edu/~skiena	Steven Skiena
P-632	3	www.w3.org/Style/CSS/Buttons	CSS button
P-633	3	jigsaw.w3.org/css-validator	W3C CSS Validation Service
P-634	3	validator.w3.org	W3C HTML Validation Service
P-848	5	www.inria.fr/prisme/....../cgt	CG Tribune
P-1406	1	www.informatik.rwth-aachen.de/.....	Department of Computer Science, Aachen

	Kleinberg	pSALSA	HubAvg	AThresh	HThresh	FThresh	BFS	SBayesian	Bayesian
Kleinberg	10	6	6	10	10	10	8	6	10
pSALSA	6	10	9	6	6	6	6	10	6
HubAvg	6	9	10	6	6	6	6	9	6
AThresh	10	6	6	10	10	10	8	6	10
HThresh	10	6	6	10	10	10	8	6	10
FThresh	10	6	6	10	10	10	8	6	10
BFS	8	6	6	8	8	8	10	6	8
SBayesian	6	10	9	6	6	6	6	10	6
Bayesian	10	6	6	10	10	10	8	6	10

Finding Authorities and Hubs From Link Structures on the World Wide Web

Abstract

Introduction

Previous Algorithms

The PageRank Algorithm

Kleinberg's Algorithm

The SALSA Algorithm

The PHITS Algorithm

Random Walks and the Kleinberg Algorithm

Some modifications to the Kleinberg and SALSA Algorithms

The Hub-Averaging-Kleinberg Algorithm

The Threshold-Kleinberg Algorithms

The Breadth-First-Search Algorithm: A Normalized -step Variant

A Bayesian Algorithm

A Simplified Bayesian Algorithm

Experimental Results

Discussion of Experimental Results

Theoretical Analysis

Monotonicity

Similarity

Stability and Locality

Symmetry

Summary

Bibliography

Experiments

Query: `abortion (Base Set size = 2293)`

Query: `+net +censorship (Base Set size = 2947)`

Query: `Movies (Base Set size = 5757)`

Query: `genetic (Base Set size = 3468)`

Query: `+computational +geometry (Base Set size = 1226)`

Footnotes

$displaystyle {{bf P}Big( (e_1,ldots,e_M,h_1,ldots,h_M,a_1,ldots,a_N)in S Bigmvert {A_{ij}} Big)}$

			(3)

Finding Authorities and Hubs From Link Structures on the World Wide Web

Abstract

Theoretical Analysis

Query: abortion (Base Set size = 2293)

Query: +net +censorship (Base Set size = 2947)

Query: Movies (Base Set size = 5757)

Query: genetic (Base Set size = 3468)

Query: +computational +geometry (Base Set size = 1226)

Footnotes

Query: `abortion (Base Set size = 2293)`

Query: `+net +censorship (Base Set size = 2947)`

Query: `Movies (Base Set size = 5757)`

Query: `genetic (Base Set size = 3468)`

Query: `+computational +geometry (Base Set size = 1226)`