SegmentSlicer (Nutch 0.7.2 API)

org.apache.nutch.segment
Class SegmentSlicer

java.lang.Object
  org.apache.nutch.segment.SegmentSlicer

All Implemented Interfaces: Runnable

public class SegmentSlicer
extends Object
implements Runnable

This class reads data from one or more input segments and writes it to one or more output segments, optionally deleting the input segments when it has finished.

Data is read sequentially from the input segments and appended to the current output segment until that segment reaches the target number of entries, at which point the next output segment is created, and so on.
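The slicing behaviour described above can be sketched as follows. This is only an illustration under stated assumptions, not the Nutch implementation: the class `SliceSketch` and its `slice` method are hypothetical, in-memory lists stand in for segments, and a `maxCount` of 0 or less is assumed to mean "no limit" (a single output slice).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; none of these names belong to the Nutch API.
public class SliceSketch {

    // Reads every input list in order and appends entries to the current
    // output slice, starting a new slice once maxCount entries are reached.
    // Assumption: maxCount <= 0 means the slice size is unlimited.
    static <T> List<List<T>> slice(List<List<T>> inputs, long maxCount) {
        List<List<T>> outputs = new ArrayList<>();
        List<T> current = null;
        for (List<T> input : inputs) {
            for (T entry : input) {
                boolean full = maxCount > 0 && current != null
                        && current.size() >= maxCount;
                if (current == null || full) {
                    current = new ArrayList<>();   // open the next output slice
                    outputs.add(current);
                }
                current.add(entry);                // no de-duplication is done
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<List<String>> inputs = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"));
        // prints [[a, b], [c, d], [e]]
        System.out.println(slice(inputs, 2));
    }
}
```

Note that an output slice may contain entries from more than one input, because the slice boundary depends only on the running entry count.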

NOTE 1: this tool does NOT de-duplicate data - use SegmentMergeTool for that.

NOTE 2: this tool does NOT copy indexes. It is currently impossible to slice Lucene indexes. The proper procedure is first to create slices, and then to index them.

NOTE 3: if one or more input segments are in non-parsed format, the output segments will also use non-parsed format. This means that any parseData and parseText data from input segments will NOT be copied to the output segments.
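The rule in NOTE 3 amounts to "the output is parsed only if every input is parsed". A minimal sketch of that decision, assuming a hypothetical `FormatSketch` class in which each input segment is reduced to a single boolean flag (this is not how Nutch represents segments):

```java
// Hypothetical illustration; not part of the Nutch API.
public class FormatSketch {

    // Each element says whether the corresponding input segment is in
    // parsed format. The output can be parsed only if all inputs are.
    static boolean outputIsParsed(boolean[] inputParsed) {
        for (boolean parsed : inputParsed) {
            if (!parsed) {
                return false;   // one non-parsed input forces non-parsed output
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(outputIsParsed(new boolean[] {true, true}));   // prints true
        System.out.println(outputIsParsed(new boolean[] {true, false}));  // prints false
    }
}
```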

Author: Andrzej Bialecki

Field Summary
`static Logger`	`LOG`
`static int`	`LOG_STEP`

Constructor Summary
`SegmentSlicer(NutchFileSystem nfs, File[] input, File output, boolean withContent, boolean withParseText, boolean withParseData, boolean autoFix, long maxCount, boolean plusSign, org.apache.oro.text.regex.Pattern pattern)` Create new SegmentSlicer.

Method Summary
`static void`	`main(String[] args)` Command-line wrapper.
`void`	`run()` Run the slicer.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

LOG

public static final Logger LOG

LOG_STEP

public static int LOG_STEP

Constructor Detail

SegmentSlicer

public SegmentSlicer(NutchFileSystem nfs,
                     File[] input,
                     File output,
                     boolean withContent,
                     boolean withParseText,
                     boolean withParseData,
                     boolean autoFix,
                     long maxCount,
                     boolean plusSign,
                     org.apache.oro.text.regex.Pattern pattern)

Create new SegmentSlicer.
Parameters:
    nfs - filesystem
    input - list of input segments
    output - output directory, created if it does not exist. Output segments will be created inside this directory
    withContent - if true, read content, otherwise ignore it
    withParseText - if true, read parse_text, otherwise ignore it
    withParseData - if true, read parse_data, otherwise ignore it
    autoFix - if true, attempt to fix corrupt segments
    maxCount - if greater than 0, determines the maximum number of entries per output segment. Multiple output segments will be created as needed

Method Detail

run

public void run()

Run the slicer.

Specified by: run in interface Runnable

main

public static void main(String[] args)
                 throws Exception

Command-line wrapper. Run without arguments to see usage help.

Throws: Exception