org.apache.nutch.tools
Class DistributedAnalysisTool

java.lang.Object
  extended by org.apache.nutch.tools.DistributedAnalysisTool

public class DistributedAnalysisTool
extends Object

DistributedAnalysisTool performs link analysis by reading exclusively from an IWebDBReader and writing to an IWebDBWriter. This tool can be used in phases via the command line to compute the LinkAnalysis score across many machines.

For a single iteration of LinkAnalysis, you must have:

1) An "initRound" step that writes down how the work should be divided. It requires the input "db" directory and outputs a "dist" directory, which must be made available to later steps.

2) As many simultaneous "computeRound" steps as you like, but this number must be fixed in step 1. The steps may run on different machines, on the same machine, or any mix of the two. Each requires the "db" and "dist" directories (or copies) as inputs, and each run outputs an "instructions file".

3) A "completeRound" step, which integrates the results of all the "computeRound" steps. It writes to a "db" directory. It assumes that all the instructions files have been gathered into a single "dist" input directory. If you're running everything on a single filesystem, this happens without extra work. If not, you will have to gather the files by hand (or with a script).

For more iterations, repeat steps 1 - 3.
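The ordering of the three steps can be sketched in a few lines of Java. This is only an illustration: RoundSteps is a hypothetical stand-in interface that mirrors the three documented method signatures, the process ids are assumed to run from 0 to numProcesses - 1, and a false return from initRound is assumed to mean the round could not be set up (this page does not say what the boolean means). The compute steps run sequentially here, whereas in a real deployment they would run as separate processes.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class RoundDriverSketch {

    /** Hypothetical stand-in that mirrors the three documented method signatures. */
    interface RoundSteps {
        boolean initRound(int numProcesses, File distDir) throws IOException;
        void computeRound(int processId, File distDir) throws IOException;
        void completeRound(File distDir, File scoreFile) throws IOException;
    }

    /** Runs one iteration: init once, compute once per process, complete once. */
    static boolean runRound(RoundSteps tool, int numProcesses, File distDir, File scoreFile)
            throws IOException {
        // Step 1: a single process decides how the work is divided.
        // Assumption: a false return means the round could not be set up.
        if (!tool.initRound(numProcesses, distDir)) {
            return false;
        }
        // Step 2: one compute step per process, sequential in this sketch.
        // Assumption: process ids are 0 .. numProcesses - 1.
        for (int processId = 0; processId < numProcesses; processId++) {
            tool.computeRound(processId, distDir);
        }
        // Step 3: a single process integrates all the results. It runs last.
        tool.completeRound(distDir, scoreFile);
        return true;
    }

    /** Records the call order using a fake implementation that does no real work. */
    static List<String> traceRound(int numProcesses) {
        final List<String> calls = new ArrayList<>();
        RoundSteps recorder = new RoundSteps() {
            @Override public boolean initRound(int n, File distDir) {
                calls.add("initRound(" + n + ")");
                return true;
            }
            @Override public void computeRound(int processId, File distDir) {
                calls.add("computeRound(" + processId + ")");
            }
            @Override public void completeRound(File distDir, File scoreFile) {
                calls.add("completeRound");
            }
        };
        try {
            runRound(recorder, numProcesses, new File("dist"), new File("scores"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(traceRound(3));
    }
}
```

For several iterations, runRound would simply be called again once the previous round has completed.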

Author:
Mike Cafarella

Field Summary
static Logger LOG
           
static long OUTLINK_LIMIT
           
 
Constructor Summary
DistributedAnalysisTool(NutchFileSystem nfs, File dbDir)
          Give the pagedb and linkdb files and their cache sizes
 
Method Summary
 void completeRound(File distDir, File scoreFile)
          This method collates and executes all the instructions computed by the many executors of computeRound().
 void computeRound(int processId, File distDir)
          This method is invoked by one of the many processes involved in LinkAnalysis.
 boolean initRound(int numProcesses, File distDir)
          This method prepares the ground for a set of processes to distribute a round of LinkAnalysis work.
static void main(String[] argv)
          Kick off the link analysis.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final Logger LOG

OUTLINK_LIMIT

public static final long OUTLINK_LIMIT
See Also:
Constant Field Values
Constructor Detail

DistributedAnalysisTool

public DistributedAnalysisTool(NutchFileSystem nfs,
                               File dbDir)
                        throws IOException,
                               FileNotFoundException
Give the pagedb and linkdb files and their cache sizes

Method Detail

initRound

public boolean initRound(int numProcesses,
                         File distDir)
                  throws IOException
This method prepares the ground for a set of processes to distribute a round of LinkAnalysis work. It writes out the "assignments" to a directory. This directory must be made accessible to all the processes. (It may be mounted by all of them, or copied to all of them.) This is run by a single process, and it is run first.

Throws:
IOException
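As an illustration of what writing out "assignments" could look like, the sketch below splits a range of items into one contiguous region per process and writes one line per process into the shared directory. Everything here is assumed for the example: the file name assignments.txt, the tab-separated format, and the even-split strategy are hypothetical and are not taken from the Nutch implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class AssignmentSketch {

    /** Splits [0, totalItems) into numProcesses contiguous [start, end) regions. */
    static long[][] splitRegions(long totalItems, int numProcesses) {
        if (numProcesses <= 0 || totalItems < 0) {
            throw new IllegalArgumentException("need numProcesses > 0 and totalItems >= 0");
        }
        long[][] regions = new long[numProcesses][2];
        long base = totalItems / numProcesses;
        long extra = totalItems % numProcesses; // the first 'extra' regions get one more item
        long start = 0;
        for (int id = 0; id < numProcesses; id++) {
            long size = base + (id < extra ? 1 : 0);
            regions[id][0] = start;
            regions[id][1] = start + size;
            start += size;
        }
        return regions;
    }

    /** Writes one "id<TAB>start<TAB>end" line per process (hypothetical format). */
    static Path writeAssignments(Path distDir, long[][] regions) {
        List<String> lines = new ArrayList<>();
        for (int id = 0; id < regions.length; id++) {
            lines.add(id + "\t" + regions[id][0] + "\t" + regions[id][1]);
        }
        try {
            Files.createDirectories(distDir);
            return Files.write(distDir.resolve("assignments.txt"), lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (long[] region : splitRegions(10, 3)) {
            System.out.println(region[0] + " .. " + region[1]);
        }
    }
}
```

Because the number of regions is decided here, the number of compute steps is fixed before any of them starts, which matches the requirement that the count be determined during the init step.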

computeRound

public void computeRound(int processId,
                         File distDir)
                  throws IOException
This method is invoked by one of the many processes involved in LinkAnalysis. Many of these will be running at the same time, which is fine because no locking is needed between them. It computes the LinkAnalysis score for a given region of the database, then writes its ID, the region params, and the scores-to-be-written into a flat file. This file is labelled according to its processId and is placed inside distDir.

Throws:
IOException
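The reason no locking is needed is that each process writes only its own file. A minimal sketch of that idea follows, assuming a hypothetical file name pattern (instructions-N.txt) and a hypothetical tab-separated layout; the actual on-disk format used by Nutch is not described on this page, and the scores are passed in rather than computed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ComputeOutputSketch {

    /** Hypothetical naming scheme: one file per process id. */
    static String fileNameFor(int processId) {
        return "instructions-" + processId + ".txt";
    }

    /** Writes the process id, the region bounds, and the scores into a flat file. */
    static Path writeInstructions(Path distDir, int processId,
                                  long regionStart, long regionEnd,
                                  Map<String, Float> scores) {
        List<String> lines = new ArrayList<>();
        lines.add("processId\t" + processId);
        lines.add("region\t" + regionStart + "\t" + regionEnd);
        // Sort by key so the output is deterministic.
        for (Map.Entry<String, Float> entry : new TreeMap<>(scores).entrySet()) {
            lines.add("score\t" + entry.getKey() + "\t" + entry.getValue());
        }
        try {
            Files.createDirectories(distDir);
            // Each process touches only its own file, so writers never collide.
            return Files.write(distDir.resolve(fileNameFor(processId)), lines,
                    StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = java.nio.file.Paths.get(System.getProperty("java.io.tmpdir"), "dat-sketch");
        Map<String, Float> scores = new TreeMap<>();
        scores.put("http://example.com/", 0.5f);
        System.out.println(writeInstructions(dir, 0, 0, 100, scores));
    }
}
```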

completeRound

public void completeRound(File distDir,
                          File scoreFile)
                   throws IOException
This method collates and executes all the instructions computed by the many executors of computeRound(). It figures out what to write by looking at all the flat files found in the distDir. These files are labelled according to the processes that filled them. This method will check to make sure all those files are present before starting work. If the processors are distributed, you might have to copy all the instruction files to a single distDir before starting this method. Of course, this method is executed on only one process. It is run last.

Throws:
IOException
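The presence check described above can be sketched as follows. The file naming scheme (instructions-N.txt) and the 0-based process ids are assumptions made for this example only; the check simply reports which per-process files are still missing from the directory before any integration work begins.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class CompletenessCheckSketch {

    /** Hypothetical naming scheme: one file per process id. */
    static String fileNameFor(int processId) {
        return "instructions-" + processId + ".txt";
    }

    /** Returns the ids (0 .. numProcesses - 1) whose files are not in distDir. */
    static List<Integer> missingProcessIds(Path distDir, int numProcesses) {
        List<Integer> missing = new ArrayList<>();
        for (int id = 0; id < numProcesses; id++) {
            if (!Files.isRegularFile(distDir.resolve(fileNameFor(id)))) {
                missing.add(id);
            }
        }
        return missing;
    }

    /** Fails fast if any file is absent, so no partial result is integrated. */
    static void requireAllPresent(Path distDir, int numProcesses) {
        List<Integer> missing = missingProcessIds(distDir, numProcesses);
        if (!missing.isEmpty()) {
            throw new IllegalStateException(
                    "Missing instruction files for process ids " + missing);
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "dat-sketch");
        System.out.println("missing: " + missingProcessIds(dir, 3));
    }
}
```

When the compute steps ran on different machines, a check like this would only pass after their output files have been copied into the single directory.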

main

public static void main(String[] argv)
                 throws IOException
Kick off the link analysis. Submit the locations of the Webdb and the number of iterations.

DAT -initRound
DAT -computeRound
DAT -completeRound

Throws:
IOException


Copyright © 2006 The Apache Software Foundation