java.lang.Object
  org.apache.nutch.segment.SegmentSlicer
This class reads data from one or more input segments, and outputs it to one or more output segments, optionally deleting the input segments when it's finished.
Data is read sequentially from the input segments and appended to the current output segment until that segment reaches the target number of entries, at which point the next output segment is created, and so on.
NOTE 1: this tool does NOT de-duplicate data - use SegmentMergeTool for that.
NOTE 2: this tool does NOT copy indexes. It is currently impossible to slice Lucene indexes. The proper procedure is first to create slices, and then to index them.
NOTE 3: if one or more input segments are in non-parsed format, the output segments will also use non-parsed format. This means that any parseData and parseText data from input segments will NOT be copied to the output segments.
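The sequential slicing behavior described above can be illustrated with a minimal, self-contained sketch. It does not use any Nutch classes: `SliceSketch` and its `slice` method are hypothetical names, segments are modeled as plain in-memory lists, and treating a `maxCount` of 0 or less as "no limit" is an assumption inferred from the `maxCount` parameter description rather than something this page states outright.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the Nutch API.
public class SliceSketch {

    // Reads entries from the input "segments" in order and appends them to
    // output "segments", starting a new output once the current one holds
    // maxCount entries. Assumption: maxCount <= 0 means a single unlimited output.
    static <T> List<List<T>> slice(List<List<T>> inputs, long maxCount) {
        List<List<T>> outputs = new ArrayList<>();
        List<T> current = null;
        for (List<T> segment : inputs) {
            for (T entry : segment) {
                boolean full = maxCount > 0 && current != null && current.size() >= maxCount;
                if (current == null || full) {
                    // Create the next output segment only when an entry needs it.
                    current = new ArrayList<>();
                    outputs.add(current);
                }
                current.add(entry);
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<List<String>> inputs = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"));
        // Output boundaries ignore input boundaries; duplicates are not removed.
        System.out.println(slice(inputs, 2)); // prints [[a, b], [c, d], [e]]
    }
}
```

As in NOTE 1, nothing in this sketch removes duplicates: an entry that appears in two inputs appears twice in the outputs.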
Field Summary
static Logger LOG
static int LOG_STEP
Constructor Summary
SegmentSlicer(NutchFileSystem nfs, File[] input, File output, boolean withContent, boolean withParseText, boolean withParseData, boolean autoFix, long maxCount, boolean plusSign, org.apache.oro.text.regex.Pattern pattern)
Create new SegmentSlicer.
Method Summary
static void main(String[] args)
Command-line wrapper.
void run()
Run the slicer.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
public static final Logger LOG
public static int LOG_STEP
Constructor Detail
public SegmentSlicer(NutchFileSystem nfs, File[] input, File output, boolean withContent, boolean withParseText, boolean withParseData, boolean autoFix, long maxCount, boolean plusSign, org.apache.oro.text.regex.Pattern pattern)
Parameters:
nfs - filesystem
input - list of input segments
output - output directory, created if it does not exist. Output segments will be created inside this directory
withContent - if true, read content, otherwise ignore it
withParseText - if true, read parse_text, otherwise ignore it
withParseData - if true, read parse_data, otherwise ignore it
autoFix - if true, attempt to fix corrupt segments
maxCount - if greater than 0, determines the maximum number of entries per output segment. Additional output segments will be created as needed.
Method Detail
public void run()
Specified by: run in interface Runnable
public static void main(String[] args) throws Exception
Throws: Exception