org.apache.nutch.tools
Class SegmentMergeTool

java.lang.Object
  extended by org.apache.nutch.tools.SegmentMergeTool
All Implemented Interfaces:
Runnable

public class SegmentMergeTool
extends Object
implements Runnable

This class cleans up accumulated segment data and merges it into a single output segment (or, optionally, several), with duplicates removed.

The only prerequisite for correct operation is a set of already fetched segments; they do not need to contain parsed content, because only fetcher output is required. This tool does not use DeleteDuplicates. Instead, it creates its own "master" index of all pages in all segments, then walks sequentially through this index and keeps only the most recent version of each page for every unique value of URL or hash.
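The selection step described above can be sketched with plain Java collections. This is a minimal illustration of one reading of "most recent version for every unique value of URL or hash", not the tool's actual implementation: `DedupSketch`, `PageEntry`, and its fields are hypothetical names, and the in-memory list stands in for the temporary "master" index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: DedupSketch and PageEntry are hypothetical, not Nutch classes. */
final class DedupSketch {

    /** One row of the hypothetical "master" index. */
    static final class PageEntry {
        final String url;
        final String hash;     // content hash of the fetched page
        final long fetchTime;  // larger value means a more recent fetch

        PageEntry(String url, String hash, long fetchTime) {
            this.url = url;
            this.hash = hash;
            this.fetchTime = fetchTime;
        }
    }

    /** Keeps the most recent entry for each unique URL and each unique hash. */
    static List<PageEntry> dedup(List<PageEntry> all) {
        List<PageEntry> sorted = new ArrayList<>(all);
        // Newest first, so the first entry seen for a URL or hash is the one to keep.
        sorted.sort(Comparator.comparingLong((PageEntry e) -> e.fetchTime).reversed());

        Set<String> seenUrls = new HashSet<>();
        Set<String> seenHashes = new HashSet<>();
        List<PageEntry> kept = new ArrayList<>();
        for (PageEntry e : sorted) {
            if (seenUrls.contains(e.url) || seenHashes.contains(e.hash)) {
                continue; // a copy at least as recent was already kept
            }
            seenUrls.add(e.url);
            seenHashes.add(e.hash);
            kept.add(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<PageEntry> kept = dedup(Arrays.asList(
                new PageEntry("http://a.example/", "h1", 10L),
                new PageEntry("http://a.example/", "h2", 20L),
                new PageEntry("http://b.example/", "h2", 5L)));
        for (PageEntry e : kept) {
            System.out.println(e.url + " " + e.hash + " " + e.fetchTime);
        }
    }
}
```

Sorting newest-first keeps the loop simple: any later entry that shares a URL or hash with a kept entry is by construction older (or equally old) and can be dropped.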

If some of the input segments are corrupted, this tool attempts to repair them using the SegmentReader.fixSegment(NutchFileSystem, File, boolean, boolean, boolean, boolean) method.

The output can optionally be split on the fly into several segments, each holding a fixed maximum number of records.
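As a rough sketch of the splitting behaviour, the following partitions a list of records into chunks of at most maxCount entries, treating 0 as "no limit" in line with the maxCount constructor parameter documented on this page. `SplitSketch` and its methods are hypothetical helpers, not part of Nutch, and rejecting negative values is an assumption of this sketch only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: SplitSketch is a hypothetical helper, not a Nutch class. */
final class SplitSketch {

    /** Mirrors the documented rule: a maxCount of 0 means Long.MAX_VALUE. */
    static long effectiveMaxCount(long maxCount) {
        if (maxCount < 0) {
            // Assumption of this sketch; the source does not define negative values.
            throw new IllegalArgumentException("maxCount must not be negative");
        }
        return maxCount == 0 ? Long.MAX_VALUE : maxCount;
    }

    /** Splits records, in order, into segments of at most maxCount records each. */
    static <T> List<List<T>> split(List<T> records, long maxCount) {
        long limit = effectiveMaxCount(maxCount);
        List<List<T>> segments = new ArrayList<>();
        List<T> current = new ArrayList<>();
        for (T record : records) {
            if (current.size() >= limit) {
                // The current segment is full: close it and start a new one.
                segments.add(current);
                current = new ArrayList<>();
            }
            current.add(record);
        }
        if (!current.isEmpty()) {
            segments.add(current);
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(split(Arrays.asList(1, 2, 3, 4, 5), 2));
    }
}
```

Only the last segment can be shorter than the limit, which matches the idea of splitting "on the fly" while records are written sequentially.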

The newly created segment(s) can then optionally be indexed, so that they can either be merged with more new segments later or used for searching as they are.

Old segments may optionally be removed, because all needed data has already been copied to the new merged segment(s). NOTE: this tool will also remove all corrupted input segments, which are not usable anyway. However, this option can be dangerous if you inadvertently include non-segment directories as input.

Instead of following the manual procedures, you may want to run SegmentMergeTool with all options turned on, i.e. to merge the segments into the output segment(s), index them, and then delete the original segment data.

Author:
Andrzej Bialecki

Nested Class Summary
static class SegmentMergeTool.SegmentMergeStatus
           
 
Field Summary
static int INDEX_MERGE_FACTOR
           
static int INDEX_MIN_MERGE_DOCS
           
static int INDEX_SIZE
          Temporary de-dup index size.
static Logger LOG
           
static int LOG_STEP
          Log progress update every LOG_STEP items.
 
Constructor Summary
SegmentMergeTool(NutchFileSystem nfs, File[] segments, File output, long maxCount, boolean runIndexer, boolean delSegs)
          Create a SegmentMergeTool.
 
Method Summary
 SegmentMergeTool.SegmentMergeStatus getStatus()
           
static void main(String[] args)
           
 void run()
          Run the tool, periodically reporting progress.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final Logger LOG

LOG_STEP

public static int LOG_STEP
Log progress update every LOG_STEP items.


INDEX_SIZE

public static int INDEX_SIZE
Temporary de-dup index size. Larger indexes tend to slow down indexing, while too many indexes slow down the subsequent index merging, so this value is a tradeoff.


INDEX_MERGE_FACTOR

public static int INDEX_MERGE_FACTOR

INDEX_MIN_MERGE_DOCS

public static int INDEX_MIN_MERGE_DOCS

Constructor Detail

SegmentMergeTool

public SegmentMergeTool(NutchFileSystem nfs,
                        File[] segments,
                        File output,
                        long maxCount,
                        boolean runIndexer,
                        boolean delSegs)
                 throws Exception
Create a SegmentMergeTool.

Parameters:
nfs - filesystem
segments - list of input segments
output - output directory, where output segments will be created
maxCount - maximum number of records per output segment. If this value is 0, then the default value Long.MAX_VALUE is used.
runIndexer - run indexer on output segment(s)
delSegs - delete input segments when finished
Throws:
Exception

Method Detail

getStatus

public SegmentMergeTool.SegmentMergeStatus getStatus()

run

public void run()
Run the tool, periodically reporting progress.

Specified by:
run in interface Runnable
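Because the class implements Runnable, it can be started on its own thread while the caller inspects progress. The sketch below shows that pattern with a stand-in class, since the fields of SegmentMergeTool.SegmentMergeStatus are not documented here. `ProgressSketch`, its accessors, and the LOG_STEP value of 1000 are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative stand-in for a long-running Runnable; not a Nutch class. */
final class ProgressSketch implements Runnable {

    // Arbitrary value for the sketch; the real LOG_STEP default is not given here.
    static final int LOG_STEP = 1000;

    private final long totalItems;
    private final AtomicLong processed = new AtomicLong();
    private final AtomicLong reports = new AtomicLong();

    ProgressSketch(long totalItems) {
        this.totalItems = totalItems;
    }

    @Override
    public void run() {
        for (long i = 1; i <= totalItems; i++) {
            // Per-item work would go here.
            processed.set(i);
            if (i % LOG_STEP == 0) {
                // Report progress once every LOG_STEP items.
                reports.incrementAndGet();
                System.out.println("Processed " + i + " of " + totalItems + " items");
            }
        }
    }

    /** Number of items handled so far; safe to read from another thread. */
    long getProcessed() {
        return processed.get();
    }

    /** Number of progress reports emitted so far. */
    long getReports() {
        return reports.get();
    }

    public static void main(String[] args) throws InterruptedException {
        ProgressSketch sketch = new ProgressSketch(2500L);
        Thread worker = new Thread(sketch, "merge-sketch");
        worker.start();
        worker.join(); // wait for the run to finish before reading final counts
        System.out.println("Done: " + sketch.getProcessed() + " items");
    }
}
```

With the real class, the same shape would apply: construct the tool, hand it to a Thread, and call getStatus() from the calling thread while it runs.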

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception


Copyright © 2006 The Apache Software Foundation