
Collection of Pages

To eliminate the problems associated with the collection of web pages used in previous studies (see problems 3 and 5 in Section 2), we run our TREC ad-hoc algorithm over 217.5 gigabytes of freshly crawled web data (crawled between October 14-17, 2000) containing 17.8 million web pages. The assumption is that the commercial web search engines index an equally fresh copy of these pages. The objective is to ensure that the underlying collection available to our TREC algorithm is similar to the collection used by the commercial engines.

Since just 217.5 gigabytes of web data will not contain all the pages indexed by the commercial search engines, the TREC algorithm might be at a disadvantage because of the poor coverage of our crawl. To eliminate this problem, for every query in our test set, we add to our crawl all missing pages that are retrieved in the top ten ranks by any of the commercial search engines. We ran these queries on the commercial engines on October 17, 2000 and gathered the first ten results for each. We then fetched the pages that were not in our crawl and added them to our collection. This inclusion ensures that the TREC algorithm has access to every page retrieved by a commercial engine and is not at a disadvantage due to our small crawl. Conversely, though quite unlikely, we might have crawled pages that are not indexed by the commercial engines; this gives a slight advantage to the TREC ad-hoc algorithm in its ability to find such pages.
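To make the augmentation step concrete, the following minimal sketch (in Python) shows how the top-ten results from each commercial engine could be merged into the crawled collection. The helper names query_search_engine and fetch_page, and the representation of the collection as a URL-to-page mapping, are assumptions made only for illustration; the paper does not describe the engines' interfaces or our crawl store.

    # Sketch of the collection-augmentation step described above.
    # Assumptions (not from the original setup): query_search_engine()
    # returns the top-ranked result URLs for a query from one engine,
    # and fetch_page() downloads a page's contents; both are hypothetical.

    def query_search_engine(engine, query, num_results=10):
        # Placeholder: would call the engine's search interface and
        # parse out the result URLs.
        raise NotImplementedError

    def fetch_page(url):
        # Placeholder: would download the page over HTTP.
        raise NotImplementedError

    def augment_collection(collection, queries, engines, depth=10):
        """Add to `collection` (a dict mapping URL -> page text) every
        top-`depth` result from any engine that the crawl is missing."""
        for query in queries:
            for engine in engines:
                for url in query_search_engine(engine, query, depth):
                    if url not in collection:   # page missing from our crawl
                        collection[url] = fetch_page(url)
        return collection

Under these assumptions, the augmented collection is a superset of everything any commercial engine returned in its top ten for the test queries, which is exactly the guarantee the experiment needs.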

