org.apache.nutch.util
Class ScoreStats

java.lang.Object
  extended by org.apache.nutch.util.ScoreStats

public class ScoreStats
extends Object

When we generate a fetchlist, we need to choose a "cutoff" score such that any page scoring above the cutoff is included in the fetchlist and any page scoring below it is not. (It is too hard to do the obvious thing, which is to sort the list of all pages by score and pick the top K.) We need a good way to choose that cutoff. ScoreStats is used during LinkAnalysis to track the distribution of the scores that we compute.

We bucketize the score space into 2000 buckets. The first 1000 are equally spaced counts for the range 0..1.0 (non-inclusive). The second 1000 are logarithmically spaced between 1 and Float.MAX_VALUE.

If the score is < 1, choose a bucket by taking the integer floor of (score * 1000) and increment the resulting slot. If the score is >= 1, take the base-10 log and then the integer floor. This should be an int no greater than 9. This is the hundreds-place digit for the index. (A '1' is in the thousands place.) Next, find where log(score) falls in the range between floor(log(score)) and ceiling(log(score)). The percentage of the distance between these two values is reflected in the final two digits of the index.
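The bucketing rule described above can be sketched as follows. This is a minimal illustration, not the actual ScoreStats source: the class BucketIndexSketch and the method bucketIndex are hypothetical names, scores are assumed to be non-negative, and the linear slot is assumed to be floor(score * 1000), which is what 1000 equal-width buckets over the range 0..1.0 imply.

```java
public class BucketIndexSketch {
    // Number of equal-width buckets covering scores in [0, 1).
    static final int LINEAR_BUCKETS = 1000;

    /** Hypothetical helper: maps a non-negative score to a bucket index. */
    static int bucketIndex(float score) {
        if (score < 1.0f) {
            // Linear region: 1000 equal-width buckets over [0, 1).
            return (int) Math.floor(score * LINEAR_BUCKETS);
        }
        // Logarithmic region: the index has the form 1HFF.
        double log = Math.log10(score);
        // H: integer floor of the base-10 log (hundreds-place digit).
        int exponent = (int) Math.floor(log);
        // FF: how far log(score) lies between floor and ceiling, as a percentage.
        int fraction = (int) Math.floor((log - exponent) * 100);
        return LINEAR_BUCKETS + exponent * 100 + fraction;
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex(0.25f)); // linear region
        System.out.println(bucketIndex(1.0f));  // first logarithmic bucket
        System.out.println(bucketIndex(50.0f)); // log10(50) is about 1.699
    }
}
```

Note that under this scheme a score of 1e10 or more produces an index of 2000 or higher, so a caller backed by a 2000-slot array would have to clamp or reject such scores. The sketch leaves that choice open.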

Author:
Mike Cafarella

Constructor Summary
ScoreStats()
           
 
Method Summary
 void addScore(float score)
          Increment the counter in the right place.
 void emitDistribution(PrintStream pout)
          Print out the distribution, with greater specificity for percentiles 90th - 100th.
static void main(String[] argv)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ScoreStats

public ScoreStats()
Method Detail

addScore

public void addScore(float score)
Increment the counter in the right place. We keep 2000 different buckets. Half of them cover scores < 1, and half cover scores >= 1. Dies when it tries to fill bucket "1132".


emitDistribution

public void emitDistribution(PrintStream pout)
Print out the distribution, with greater specificity for percentiles 90th - 100th.
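The source does not say how the output is laid out. As one plausible reading, the sketch below reports the bucket reached at every tenth percentile up to the 80th, then at every single percentile from the 90th to the 100th. The class DistributionSketch, its methods, and the output format are all assumptions made for illustration, not the actual emitDistribution implementation.

```java
import java.io.PrintStream;

public class DistributionSketch {
    /**
     * Hypothetical helper: returns the lowest bucket index at which the
     * cumulative count reaches the given percentile, or -1 if all buckets are empty.
     */
    static int bucketAtPercentile(long[] buckets, int percentile) {
        long total = 0;
        for (long count : buckets) {
            total += count;
        }
        if (total == 0) {
            return -1;
        }
        // Integer ceiling of total * percentile / 100, avoiding floating point.
        long target = (total * percentile + 99) / 100;
        long seen = 0;
        for (int i = 0; i < buckets.length; i++) {
            seen += buckets[i];
            if (seen >= target) {
                return i;
            }
        }
        return buckets.length - 1;
    }

    /** Assumed layout: coarse steps below the 90th percentile, single steps from 90 to 100. */
    static void emit(long[] buckets, PrintStream out) {
        for (int p = 10; p < 90; p += 10) {
            out.println(p + "th percentile: bucket " + bucketAtPercentile(buckets, p));
        }
        for (int p = 90; p <= 100; p++) {
            out.println(p + "th percentile: bucket " + bucketAtPercentile(buckets, p));
        }
    }

    public static void main(String[] args) {
        long[] buckets = new long[2000];
        buckets[10] = 90;   // 90 low-scoring pages
        buckets[1500] = 10; // 10 high-scoring pages
        emit(buckets, System.out);
    }
}
```

A cutoff score for the fetchlist could then be read off the bucket boundary at the chosen percentile, which avoids sorting every page by score.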


main

public static void main(String[] argv)
                 throws IOException
Throws:
IOException


Copyright © 2006 The Apache Software Foundation