org.apache.nutch.segment
Class SegmentSlicer

java.lang.Object
  extended by org.apache.nutch.segment.SegmentSlicer
All Implemented Interfaces:
Runnable

public class SegmentSlicer
extends Object
implements Runnable

This class reads data from one or more input segments, and outputs it to one or more output segments, optionally deleting the input segments when it's finished.

Data is read sequentially from the input segments and appended to the current output segment until that segment reaches the target number of entries, at which point the next output segment is created, and so on.
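
The sequential read-and-append behaviour can be illustrated with a minimal, self-contained sketch. It is not the SegmentSlicer implementation: segments are modelled as plain in-memory lists, and the class name SliceSketch and the method slice are hypothetical names chosen for illustration. Treating a maxCount of 0 or less as "no limit" is an assumption inferred from the maxCount parameter description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the Nutch API.
public class SliceSketch {

    /**
     * Reads entries from each input "segment" in order and appends them to the
     * current output slice, starting a new slice once maxCount entries have
     * been written to it. A maxCount of 0 or less is assumed to mean "no limit".
     */
    public static <T> List<List<T>> slice(List<List<T>> inputs, long maxCount) {
        List<List<T>> outputs = new ArrayList<>();
        List<T> current = null;
        for (List<T> segment : inputs) {
            for (T entry : segment) {
                // Open a slice lazily so that no empty output slice is created.
                if (current == null || (maxCount > 0 && current.size() >= maxCount)) {
                    current = new ArrayList<>();
                    outputs.add(current);
                }
                current.add(entry);
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<List<String>> inputs = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"));
        // Two inputs with 3 and 2 entries, sliced at 2 entries per output.
        System.out.println(slice(inputs, 2)); // prints [[a, b], [c, d], [e]]
    }
}
```

Note that an output slice may span an input boundary ("c" and "d" above), and that nothing in this loop removes duplicate entries, which is consistent with NOTE 1.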

NOTE 1: this tool does NOT de-duplicate data - use SegmentMergeTool for that.

NOTE 2: this tool does NOT copy indexes. It is currently impossible to slice Lucene indexes. The proper procedure is first to create slices, and then to index them.

NOTE 3: if one or more input segments are in non-parsed format, the output segments will also use non-parsed format. This means that any parseData and parseText data from input segments will NOT be copied to the output segments.
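
The rule in NOTE 3 amounts to a logical AND over the input segments: the output is in parsed format only if every input is. The sketch below shows that rule in isolation. The class OutputFormatSketch and the method outputIsParsed are hypothetical names, and how SegmentSlicer actually detects a segment's format is not described here.

```java
// Hypothetical illustration only; not part of the Nutch API.
public class OutputFormatSketch {

    /**
     * Returns true only if every input segment is in parsed format.
     * A single non-parsed input forces non-parsed output, in which case
     * parseData and parseText would not be copied.
     * (For an empty input array this sketch returns true.)
     */
    public static boolean outputIsParsed(boolean[] inputIsParsed) {
        for (boolean parsed : inputIsParsed) {
            if (!parsed) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The second input is non-parsed, so the output is non-parsed too.
        System.out.println(outputIsParsed(new boolean[] {true, false, true})); // prints false
    }
}
```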

Author:
Andrzej Bialecki

Field Summary
static Logger LOG
           
static int LOG_STEP
           
 
Constructor Summary
SegmentSlicer(NutchFileSystem nfs, File[] input, File output, boolean withContent, boolean withParseText, boolean withParseData, boolean autoFix, long maxCount, boolean plusSign, org.apache.oro.text.regex.Pattern pattern)
          Create new SegmentSlicer.
 
Method Summary
static void main(String[] args)
          Command-line wrapper.
 void run()
          Run the slicer.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final Logger LOG

LOG_STEP

public static int LOG_STEP
Constructor Detail

SegmentSlicer

public SegmentSlicer(NutchFileSystem nfs,
                     File[] input,
                     File output,
                     boolean withContent,
                     boolean withParseText,
                     boolean withParseData,
                     boolean autoFix,
                     long maxCount,
                     boolean plusSign,
                     org.apache.oro.text.regex.Pattern pattern)
Create new SegmentSlicer.

Parameters:
nfs - filesystem
input - list of input segments
output - output directory, created if it does not exist. Output segments will be created inside this directory
withContent - if true, read content, otherwise ignore it
withParseText - if true, read parse_text, otherwise ignore it
withParseData - if true, read parse_data, otherwise ignore it
autoFix - if true, attempt to fix corrupt segments
maxCount - if greater than 0, the maximum number of entries per output segment. Additional output segments will be created as needed.
Method Detail

run

public void run()
Run the slicer.

Specified by:
run in interface Runnable

main

public static void main(String[] args)
                 throws Exception
Command-line wrapper. Run without arguments to see usage help.

Throws:
Exception


Copyright © 2006 The Apache Software Foundation