TREC Web Tracks

An effort to evaluate web search is underway at TREC under the web track [7]. The track addresses some of the differences between web search and the TREC ad-hoc task, and uses an evaluation framework based on web data: a collection of web pages from an early 1997 crawl of the web done by the Internet Archive [5]. The queries are either selected from a web search engine's query log [5,7] or are the queries provided by the NIST assessors for the regular TREC ad-hoc task [6,7]. The evaluation measure is precision in the top twenty pages retrieved (i.e., the proportion of relevant pages in the top 20) [5,6,7].
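As a minimal sketch of this measure, the following Python function computes precision at rank 20 from a ranked list and a set of relevance judgments. The function name, the page identifiers, and the judgments are hypothetical; dividing by k even when fewer than k pages are retrieved is an assumption about the convention, not something stated in [5,6,7].

```python
def precision_at_k(ranked_pages, relevant_pages, k=20):
    """Fraction of the top-k ranks that hold a relevant page.

    Assumption: the denominator is always k, so a list shorter than k
    is penalized for the ranks it leaves empty.
    """
    top_k = ranked_pages[:k]
    hits = sum(1 for page in top_k if page in relevant_pages)
    return hits / k


# Hypothetical ranked list of 20 page identifiers, 5 of them judged relevant.
ranking = [f"page{i}" for i in range(1, 21)]
relevant = {"page1", "page2", "page5", "page11", "page20"}
print(precision_at_k(ranking, relevant))  # 5 relevant pages in 20 ranks
```

Because the judgments are binary, every relevant page in the top twenty contributes equally, regardless of its rank within those twenty.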

One of the main aims of the TREC web track has been to determine whether link-based methods are better than keyword-based methods for web search. Most results coming out of the web track indicate that (as measured under TREC) link-based methods have no advantage over a state-of-the-art keyword-based TREC ad-hoc algorithm. For example, according to Hawking et al. in [6]:

...results are presented for an effectiveness comparison of six TREC systems ...against five well-known Web search systems .... These (results) suggest that the standard rankings produced by the public web search engines is by no means state-of-the-art.

...all five (public web) search engines performed below the median P@20 for (short) title-only (TREC) VLC2 submissions ...

Also, in [7], Hawking et al. say:

...Little benefit was derived from the use of link-based methods for standard TREC measures on the WT2g collection. ...One group investigated the use of PageRank scores and found no benefit on standard TREC measures. ...

On a similar note, Savoy and Picard in [17] say that as implemented in their study:

...Hyper-links do not result in any significant improvement ...

Overall, the sentiment in [5,6,7,17] is that when applied to web search, state-of-the-art keyword-based techniques used in TREC ad-hoc systems are as effective as link-based methods. Hawking et al. do note several shortcomings of the TREC environment that might be causing these counter-intuitive results. For example, in [7] they say:

...The number of inter-server links within WT2g may have been too small or it may be that link-based methods would have worked better with different types of queries and/or with different types of relevance judgments. ...

These caveats to the results presented in [5,6,7] are the main focus of this study. We observe the following shortcomings of the evaluations done in the TREC web track, and design a new evaluation that aims to remove them in order to re-examine the effectiveness of link-based vs. keyword-based search algorithms:

The queries used in the TREC environment are mostly topical, i.e., they are aimed at finding relevant pages on various topics. In a real web search environment, by contrast, users pose many different kinds of queries, e.g., find a particular site, find high quality sites on a topic, or find merchants that sell something cheap.
The relevance judgments used in [5,6,7] are made on a per-page basis and not on a per-site⁽⁺⁾ basis. Even though the evaluation measure used, precision at rank 20, rightly reflects the precision-oriented nature of web search users, the page-based judgments ignore the (site-based) browsing aspect of the web.
For example, in an in-house pilot study, we found that for the query ``new york city subway'' (posed by one of our users) our TREC ad-hoc algorithm retrieved eighteen of the top twenty pages from the site www.nycsubway.org, and all were judged relevant by our user. Most commercial search engines realize that this is not very desirable from the users' perspective: once on the site www.nycsubway.org, users like browsing the pages on that site themselves. Therefore, most commercial search engines group the results by site. Page-based precision measurement tends to favor TREC ad-hoc algorithms that can retrieve twenty pages, all relevant, from a single site. On the other hand, the site-based grouping done by most commercial web search engines artificially depresses the precision value for these engines (as measured under TREC) because it groups several relevant pages under one item and fills the remaining ranks with other, possibly non-relevant, sites.
The problem that not all relevant documents are pertinent to a user is a long-standing one in retrieval evaluation [3]. Since pertinence is hard to quantify, most retrieval evaluations just use document relevance as the evaluation criterion. The web search engines take the view, rightly so in our opinion, that multiple pages from the same site, even though relevant, are less pertinent than relevant pages from different sites. The TREC evaluations ignore this aspect.
The web collection used in TREC evaluations is a 100 gigabyte collection of 18.5 million pages based on an early 1997 crawl done by the Internet Archive [5]. This collection is quite outdated with respect to the link structure of the current web. For example, the average number of cross-host out-links in the TREC collection is 1.56 per page, whereas in a recent crawl of the web we observe an average of 4.53 cross-host out-links per page, almost three times as many. This indicates that there is a lot more linkage to be exploited in the current web than in the web data used in TREC. The observation holds across all the comparative measurements we did. For example, the average number of in-host out-links is 5.57 per page for the TREC data, but it is much higher, 11.63 per page, for our recent crawl. Similarly, the average number of cross-host in-links is 0.12 per page for the TREC data and 2.08 per page for our crawl. (The average numbers of out-links and in-links per page differ because the out-links include links pointing to pages not in our collection.) Also, the average number of in-host in-links per page is 3.71 for the TREC data and 6.31 for our crawl. (In all these measurements we only count links that have some valid anchor text attached to them.)
In a binary relevance model, as used in TREC, there is no notion of a relevant page being more or less relevant than another relevant page. However, on the web, there are clearly good and not so good pages on every topic. The quality of a web page is a subjective issue and the search engines tend to capture it numerically by (for example) the number of outside pages that point to a given page (assuming linkage as a form of recommendation of quality), and other such heuristics. This aspect is not captured in TREC evaluations.
Concentrating on the particular results presented in [5,6], which show that the commercial web search engines are notably worse than modern TREC ad-hoc algorithms, we want to emphasize that in [5,6] the precision results for TREC algorithms are obtained on the TREC web data, whereas these results are compared to commercial web search engines running on a completely different, much larger, and more recent web crawl. Hawking et al. do acknowledge that this difference (in the underlying web collections) is a potential source of inconsistency in their results.
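The effect of site-based grouping on precision at rank 20, described above for the query ``new york city subway'', can be illustrated with a small Python sketch. Only the eighteen relevant pages from www.nycsubway.org come from the pilot study; the other hosts (under example.com), their relevance, and the rule that a page is relevant exactly when its host is are hypothetical simplifications chosen for illustration.

```python
from urllib.parse import urlparse


def precision_at_k(items, is_relevant, k=20):
    """Fraction of the top-k ranks holding a relevant item (denominator k)."""
    return sum(1 for item in items[:k] if is_relevant(item)) / k


def group_by_site(ranked_urls):
    """Collapse a page ranking to one entry per host, in first-seen order."""
    seen = set()
    sites = []
    for url in ranked_urls:
        host = urlparse(url).netloc
        if host not in seen:
            seen.add(host)
            sites.append(host)
    return sites


# Eighteen pages from one site at the top, followed by pages from
# 21 hypothetical other hosts.
ranked_pages = [f"http://www.nycsubway.org/page{i}.html" for i in range(18)]
ranked_pages += [f"http://site{i}.example.com/subway.html" for i in range(21)]

# Hypothetical judgments: a page is relevant if and only if its host is.
relevant_hosts = {"www.nycsubway.org", "site3.example.com", "site7.example.com"}

# Page-based view: 18 relevant pages plus 2 non-relevant ones in the top 20.
page_p20 = precision_at_k(
    ranked_pages, lambda url: urlparse(url).netloc in relevant_hosts
)

# Site-grouped view: one slot for www.nycsubway.org, 19 slots for other hosts.
site_p20 = precision_at_k(
    group_by_site(ranked_pages), lambda host: host in relevant_hosts
)

print(page_p20, site_p20)
```

Under these assumptions the same underlying ranking scores far lower once it is grouped by site, because eighteen relevant pages collapse into a single rank and the freed ranks are filled mostly by non-relevant hosts.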

These shortcomings of previous work do not give us confidence that the results from these studies will hold in a realistic, more recent web search environment. In the following section, we describe our experimental environment which is aimed at removing these shortcomings and evaluating the effectiveness of current link-based web search engines vs. a state-of-the-art keyword-based TREC ad-hoc algorithm.

Amit Singhal 2001-02-18