Next: Conclusions Up: Results and Discussion Previous: Use of RealNames

Discussion

Analyzing some of the queries for which the TREC algorithms fail, we find that the most common cause of failure is that the query words occur with high frequency in non-relevant pages. For example, for the query ``laguardia airport'', the top ranked page (for the one-pass algorithm) is the flight schedule page for Tompkins County Airport (in Ithaca, NY, USA). This flight schedule contains the query word ``laguardia'' some ten times and therefore gets a very high tf×idf based score. Similarly, for the query ``american Kennel club'', the top ranked page is a list of dog clubs on the site doghobbyist.com, many of which have the query words in their names. This is a well-known weakness of keyword-based ranking systems, and we do see it hurting the results from our TREC algorithms.
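The effect can be reproduced with a toy scorer. The sketch below is not the ranking function used in our experiments; it assumes a raw term frequency times a simple logarithmic idf, with no document length normalization, and the three documents and the helper `tfidf_score` are invented for illustration.

```python
import math
from collections import Counter

def tfidf_score(query, doc, docs):
    """Toy score: sum over query terms of raw tf * idf (no length normalization)."""
    n = len(docs)
    tf = Counter(doc.split())
    score = 0.0
    for term in query.split():
        # Document frequency: number of documents containing the term.
        df = sum(1 for d in docs if term in d.split())
        if df == 0:
            continue
        idf = math.log((n + 1) / df)  # assumed idf variant, for illustration only
        score += tf[term] * idf
    return score

# Hypothetical mini-collection: a home page, a schedule that repeats
# "laguardia" ten times, and an unrelated page.
docs = {
    "airport_home": "laguardia airport official site",
    "flight_schedule": "flights to laguardia " * 10 + "from ithaca airport",
    "unrelated": "dog clubs and kennel listings",
}
corpus = list(docs.values())
query = "laguardia airport"

scores = {name: tfidf_score(query, text, corpus) for name, text in docs.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

With these assumed documents, the schedule page outscores the home page purely because it repeats a query word, which is the failure described above.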

A closer examination shows why the more expensive two-pass system is worse than the one-pass system. Consider, for example, the query ``horizon blue cross blue shield''. The one-pass system retrieves the relevant page, www.bcbsnj.com, as the top ranked page. However, the first pass also retrieves many health insurance and health care related pages among the top ten. In the query expansion step, terms from these pages are added to the query, which loses its focus on ``horizon blue cross blue shield'' and instead becomes a general health insurance query. The second pass then fails to retrieve the desired page. We observe this loss of focus for many other queries in our set.
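The loss of focus can be illustrated with a minimal sketch of pseudo-relevance feedback. It is only a sketch under stated assumptions: the expansion rule (add the most frequent new terms from the first-pass results, weighted by their mean frequency), the dot-product scorer, the helper names `expand_query` and `score`, and all page texts are hypothetical and far simpler than the two-pass system evaluated here.

```python
from collections import Counter

def score(weights, doc):
    """Dot product of query term weights with raw term frequencies in doc."""
    tf = Counter(doc.split())
    return sum(w * tf[t] for t, w in weights.items())

def expand_query(query, feedback_docs, n_new=2):
    """Add the n_new most frequent non-query terms from the feedback documents,
    each weighted by its mean frequency per feedback document (toy rule)."""
    weights = {t: float(c) for t, c in Counter(query.split()).items()}
    pooled = Counter()
    for d in feedback_docs:
        pooled.update(d.split())
    new_terms = [(t, c) for t, c in pooled.most_common() if t not in weights]
    for t, c in new_terms[:n_new]:
        weights[t] = c / len(feedback_docs)
    return weights

query = "horizon blue cross blue shield"
# Hypothetical target page and a generic page from the collection.
target = "horizon blue cross blue shield new jersey health insurance"
generic = ("health insurance quotes health insurance plans "
           "health insurance rates health insurance health care")
# Hypothetical first-pass top results: the target plus on-topic pages.
feedback_docs = [
    target,
    "health insurance plans compare health care coverage",
    "health care providers and insurance quotes",
    "affordable health insurance for families",
]

original = {t: float(c) for t, c in Counter(query.split()).items()}
expanded = expand_query(query, feedback_docs)

first_pass = {"target": score(original, target), "generic": score(original, generic)}
second_pass = {"target": score(expanded, target), "generic": score(expanded, generic)}
print(sorted(set(expanded) - set(original)), first_pass, second_pass)
```

In this toy setting the added terms are ``health'' and ``insurance'', and the generic page overtakes the target page in the second pass even though it contains none of the original query words.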

It is worth noting that many pages retrieved by the TREC algorithms are quite relevant to the topic at hand. They are simply not the specific page the user was looking for in our experiments. Under the TREC criteria for judging relevance, many of these pages are ``on topic'' and would be judged relevant. This would explain why, under the TREC measurements, the commercial engines do not show any advantage over the TREC algorithms.


Amit Singhal 2001-02-18