java.lang.Object
  org.apache.nutch.tools.SegmentMergeTool
This class cleans up accumulated segment data and merges it into a single segment (or, optionally, multiple segments) containing no duplicates.
The only prerequisite for correct operation is a set of already fetched segments. They do not have to contain parsed content; only fetcher output is required. This tool does not use DeleteDuplicates, but creates its own "master" index of all pages in all segments. It then walks sequentially through this index and picks only the most recent version of each page for every unique value of URL or hash.
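The selection step described above can be illustrated with a minimal, self-contained sketch. It is not the tool's actual implementation: the `PageEntry` class, its fields, and the `keepMostRecent` method are hypothetical names, and the way the URL and hash checks are combined (newest entry per URL first, then newest per content hash among the survivors) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names come from Nutch.
class LatestVersionDedup {

    /** Stand-in for one page entry in the "master" index. */
    static final class PageEntry {
        final String url;
        final String contentHash;
        final long fetchTime;

        PageEntry(String url, String contentHash, long fetchTime) {
            this.url = url;
            this.contentHash = contentHash;
            this.fetchTime = fetchTime;
        }
    }

    /**
     * Keeps the newest entry per URL, then drops any survivor whose
     * content hash already belongs to a newer kept entry (assumed rule).
     */
    static List<PageEntry> keepMostRecent(List<PageEntry> entries) {
        // Sort newest first so the first entry seen for a key wins.
        List<PageEntry> sorted = new ArrayList<>(entries);
        sorted.sort((a, b) -> Long.compare(b.fetchTime, a.fetchTime));

        Set<String> seenUrls = new HashSet<>();
        Set<String> seenHashes = new HashSet<>();
        List<PageEntry> kept = new ArrayList<>();
        for (PageEntry e : sorted) {
            if (!seenUrls.add(e.url)) {
                continue; // an older version of a URL already visited
            }
            if (!seenHashes.add(e.contentHash)) {
                continue; // same content as a newer kept page
            }
            kept.add(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<PageEntry> kept = keepMostRecent(Arrays.asList(
                new PageEntry("http://example.com/a", "h1", 100L),
                new PageEntry("http://example.com/a", "h2", 200L),
                new PageEntry("http://example.com/b", "h2", 150L)));
        for (PageEntry e : kept) {
            System.out.println(e.url + " @ " + e.fetchTime);
        }
    }
}
```

Sorting once and walking the list sequentially mirrors the "walk through the index" description; a real index would presumably be disk-backed rather than an in-memory list.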
If some of the input segments are corrupted, this tool will attempt to repair them using the SegmentReader.fixSegment(NutchFileSystem, File, boolean, boolean, boolean, boolean) method.
The output segment can optionally be split on the fly into several segments of fixed length.
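As a rough illustration of fixed-length splitting, the sketch below chunks a list of records so that no chunk exceeds a record limit. It follows the rule documented for the constructor's maxCount parameter (a value of 0 means Long.MAX_VALUE, i.e. no split). The class and method names are hypothetical, the rejection of negative values is an assumption, and the real tool writes segments to the filesystem rather than building lists in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not part of Nutch.
class SegmentSplitSketch {

    /** Documented rule: a maxCount of 0 falls back to Long.MAX_VALUE. */
    static long effectiveMaxCount(long maxCount) {
        if (maxCount < 0) {
            // Assumption: negative limits are treated as an error here.
            throw new IllegalArgumentException("maxCount must be >= 0");
        }
        return maxCount == 0 ? Long.MAX_VALUE : maxCount;
    }

    /** Splits records into consecutive chunks of at most maxCount records. */
    static <T> List<List<T>> split(List<T> records, long maxCount) {
        long limit = effectiveMaxCount(maxCount);
        List<List<T>> segments = new ArrayList<>();
        List<T> current = new ArrayList<>();
        for (T record : records) {
            if (current.size() >= limit) {
                // Current chunk is full: close it and start a new one.
                segments.add(current);
                current = new ArrayList<>();
            }
            current.add(record);
        }
        if (!current.isEmpty()) {
            segments.add(current);
        }
        return segments;
    }

    public static void main(String[] args) {
        List<Integer> records = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(split(records, 2).size() + " segment(s) with maxCount=2");
        System.out.println(split(records, 0).size() + " segment(s) with maxCount=0");
    }
}
```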
The newly created segment(s) can then optionally be indexed, so that they can either be merged with more new segments or used for searching as they are.
Old segments may optionally be removed, because all needed data has already been copied to the new merged segment. NOTE: this tool will also remove all corrupted input segments, which are not usable anyway. However, this option may be dangerous if you inadvertently included non-segment directories as input.
Instead of following the manual procedures, you may want to run SegmentMergeTool with all options turned on, i.e. to merge segments into the output segment(s), index them, and then delete the original segment data.
Nested Class Summary
static class | SegmentMergeTool.SegmentMergeStatus
Field Summary
static int | INDEX_MERGE_FACTOR
static int | INDEX_MIN_MERGE_DOCS
static int | INDEX_SIZE | Temporary de-dup index size.
static Logger | LOG
static int | LOG_STEP | Log progress update every LOG_STEP items.
Constructor Summary
SegmentMergeTool(NutchFileSystem nfs, File[] segments, File output, long maxCount, boolean runIndexer, boolean delSegs) | Create a SegmentMergeTool.
Method Summary
SegmentMergeTool.SegmentMergeStatus | getStatus()
static void | main(String[] args)
void | run() | Run the tool, periodically reporting progress.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
public static final Logger LOG
public static int LOG_STEP
public static int INDEX_SIZE
public static int INDEX_MERGE_FACTOR
public static int INDEX_MIN_MERGE_DOCS
Constructor Detail
public SegmentMergeTool(NutchFileSystem nfs, File[] segments, File output, long maxCount, boolean runIndexer, boolean delSegs) throws Exception
Parameters:
nfs - filesystem
segments - list of input segments
output - output directory, where output segments will be created
maxCount - maximum number of records per output segment. If this value is 0, then the default value Long.MAX_VALUE is used.
runIndexer - run indexer on output segment(s)
delSegs - delete input segments when finished
Throws:
Exception
Method Detail
public SegmentMergeTool.SegmentMergeStatus getStatus()
public void run()
Specified by:
run in interface Runnable
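Because run() is specified by Runnable, an instance can be handed to a Thread while the caller waits for it or inspects its status. The sketch below shows that pattern with a hypothetical stand-in class instead of SegmentMergeTool itself, so it compiles without Nutch. The logStep parameter plays the role of LOG_STEP, whose actual value is not stated here, and the progress messages and accessor names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a long-running, progress-reporting Runnable.
class ProgressRunnableSketch implements Runnable {
    private final long totalItems;
    private final int logStep;
    private final List<String> progressLog = new ArrayList<>();
    private volatile long processed;

    ProgressRunnableSketch(long totalItems, int logStep) {
        if (logStep <= 0) {
            throw new IllegalArgumentException("logStep must be positive");
        }
        this.totalItems = totalItems;
        this.logStep = logStep;
    }

    @Override
    public void run() {
        for (long i = 1; i <= totalItems; i++) {
            processed = i; // stand-in for the real per-record work
            if (i % logStep == 0) {
                // Report progress once every logStep items.
                synchronized (progressLog) {
                    progressLog.add("Processed " + i + " records");
                }
            }
        }
    }

    /** Number of items handled so far; safe to poll from another thread. */
    long getProcessed() {
        return processed;
    }

    /** Snapshot of the progress messages recorded so far. */
    List<String> getProgressLog() {
        synchronized (progressLog) {
            return new ArrayList<>(progressLog);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ProgressRunnableSketch task = new ProgressRunnableSketch(25, 10);
        Thread worker = new Thread(task);
        worker.start();
        worker.join(); // wait for completion before reading the final state
        System.out.println(task.getProgressLog());
    }
}
```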
public static void main(String[] args) throws Exception
Throws:
Exception