org.apache.nutch.segment
Class SegmentReader

java.lang.Object
  extended by org.apache.nutch.segment.SegmentReader

public class SegmentReader
extends Object

This class holds together all data readers for an existing segment. It also provides convenience methods to read from the segment and to reposition the current pointer.

Author:
Andrzej Bialecki

Field Summary
 ArrayFile.Reader contentReader
           
 ArrayFile.Reader fetcherReader
           
 long finished
          The time when fetching of this segment finished, as recorded in fetcher output data.
 boolean isParsed
           
static Logger LOG
           
 NutchFileSystem nfs
           
 ArrayFile.Reader parseDataReader
           
 ArrayFile.Reader parseTextReader
           
 File segmentDir
           
 long size
           
 long started
          The time when fetching of this segment started, as recorded in fetcher output data.
 
Constructor Summary
SegmentReader(File dir)
          Open a segment for reading.
SegmentReader(File dir, boolean autoFix)
          Open a segment for reading.
SegmentReader(NutchFileSystem nfs, File dir)
          Open a segment for reading.
SegmentReader(NutchFileSystem nfs, File dir, boolean autoFix)
          Open a segment for reading.
SegmentReader(NutchFileSystem nfs, File dir, boolean withContent, boolean withParseText, boolean withParseData, boolean autoFix)
          Open a segment for reading.
 
Method Summary
 void close()
          Close all readers.
 void dump(boolean sorted, PrintStream output)
          Dump the segment's content in human-readable format.
static boolean fixSegment(NutchFileSystem nfs, File dir, boolean withContent, boolean withParseText, boolean withParseData, boolean dryrun)
          Attempt to fix a partially corrupted segment.
 boolean get(long n, FetcherOutput fo, Content co, ParseText pt, ParseData pd)
          Get a specified entry from the segment.
static boolean isParsedSegment(NutchFileSystem nfs, File segdir)
           
 long key()
          Return the current key position.
static void main(String[] args)
          Command-line wrapper.
 boolean next(FetcherOutput fo, Content co, ParseText pt, ParseData pd)
          Read values from all open readers.
 void reset()
          Reset all readers.
 void seek(long n)
          Seek to a position in all readers.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final Logger LOG

fetcherReader

public ArrayFile.Reader fetcherReader

contentReader

public ArrayFile.Reader contentReader

parseTextReader

public ArrayFile.Reader parseTextReader

parseDataReader

public ArrayFile.Reader parseDataReader

isParsed

public boolean isParsed

started

public long started
The time when fetching of this segment started, as recorded in fetcher output data.


finished

public long finished
The time when fetching of this segment finished, as recorded in fetcher output data.


size

public long size

segmentDir

public File segmentDir

nfs

public NutchFileSystem nfs
Constructor Detail

SegmentReader

public SegmentReader(File dir)
              throws Exception
Open a segment for reading. If the segment is corrupted, do not attempt to fix it.

Parameters:
dir - directory containing segment data
Throws:
Exception

SegmentReader

public SegmentReader(NutchFileSystem nfs,
                     File dir)
              throws Exception
Open a segment for reading. If the segment is corrupted, do not attempt to fix it.

Parameters:
nfs - filesystem
dir - directory containing segment data
Throws:
Exception

SegmentReader

public SegmentReader(File dir,
                     boolean autoFix)
              throws Exception
Open a segment for reading.

Parameters:
dir - directory containing segment data
autoFix - if true and the segment is corrupted, attempt to fix the errors and open it again. If the segment is corrupted and either autoFix is false or the errors could not be corrected, an Exception is thrown.
Throws:
Exception

SegmentReader

public SegmentReader(NutchFileSystem nfs,
                     File dir,
                     boolean autoFix)
              throws Exception
Open a segment for reading.

Parameters:
nfs - filesystem
dir - directory containing segment data
autoFix - if true and the segment is corrupted, attempt to fix the errors and open it again. If the segment is corrupted and either autoFix is false or the errors could not be corrected, an Exception is thrown.
Throws:
Exception

SegmentReader

public SegmentReader(NutchFileSystem nfs,
                     File dir,
                     boolean withContent,
                     boolean withParseText,
                     boolean withParseData,
                     boolean autoFix)
              throws Exception
Open a segment for reading. When a segment is opened, its total size is checked and cached in this class. However, the exact number of valid, non-corrupt entries can be determined only by actually reading them.

If the segment was created with the no-parse option (see FetcherOutput.DIR_NAME_NP), withParseText and withParseData are automatically forced to false.

Parameters:
nfs - NutchFileSystem to use
dir - directory containing segment data
withContent - if true, read Content, otherwise ignore it
withParseText - if true, read ParseText, otherwise ignore it
withParseData - if true, read ParseData, otherwise ignore it
autoFix - if true and the segment is corrupt, try to fix it automatically. If this parameter is false and the segment is corrupt, or if fixing was unsuccessful, an Exception is thrown.
Throws:
Exception
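
The autoFix contract described for these constructors can be pictured as an open, fix, reopen sequence. The sketch below is not Nutch code: AutoFixSketch, openWithAutoFix, and the opener and fixer callbacks are hypothetical stand-ins used only to illustrate when an Exception reaches the caller.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.function.BooleanSupplier;

/** Illustrative model of the autoFix contract; none of these names exist in Nutch. */
class AutoFixSketch {

    /**
     * Try to open; if that fails and autoFix is true, run the fixer and try once more.
     * Any remaining failure is propagated to the caller as an Exception.
     */
    static <T> T openWithAutoFix(Callable<T> opener, BooleanSupplier fixer, boolean autoFix)
            throws Exception {
        try {
            return opener.call();
        } catch (Exception firstFailure) {
            if (!autoFix) {
                throw firstFailure;              // corrupted, and autoFix is false
            }
            if (!fixer.getAsBoolean()) {
                throw new Exception("segment could not be fixed", firstFailure);
            }
            return opener.call();                // second attempt after a successful fix
        }
    }

    public static void main(String[] args) throws Exception {
        boolean[] repaired = {false};            // simulated on-disk state
        Callable<String> opener = () -> {
            if (!repaired[0]) throw new IOException("corrupt index");
            return "opened";
        };
        BooleanSupplier fixer = () -> { repaired[0] = true; return true; };

        System.out.println(openWithAutoFix(opener, fixer, true));
    }
}
```

With autoFix set to false, the same opener would fail immediately and the first exception would be rethrown unchanged.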
Method Detail

isParsedSegment

public static boolean isParsedSegment(NutchFileSystem nfs,
                                      File segdir)
                               throws Exception
Throws:
Exception

fixSegment

public static boolean fixSegment(NutchFileSystem nfs,
                                 File dir,
                                 boolean withContent,
                                 boolean withParseText,
                                 boolean withParseData,
                                 boolean dryrun)
Attempt to fix a partially corrupted segment. Currently this means just fixing broken MapFiles, using the MapFile.fix(NutchFileSystem, File, Class, Class, boolean) method.

Parameters:
nfs - filesystem
dir - segment directory
withContent - if true, fix content, otherwise ignore it
withParseText - if true, fix parse_text, otherwise ignore it
withParseData - if true, fix parse_data, otherwise ignore it
dryrun - if true, only show what would be done without performing any actions
Returns:
true if the segment was fixed successfully, false otherwise.
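
How the with* flags and dryrun interact can be modelled in a few lines. This is a sketch under stated assumptions, not the Nutch implementation: FixPlanSketch, partsToFix, fix, and the repair callback are hypothetical, the "fetcher" part name and its unconditional inclusion are assumptions, and so is the choice to return true from a dry run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative model of a dry-run repair pass; not the Nutch implementation. */
class FixPlanSketch {

    /** Select the segment parts a repair pass would touch, based on the with* flags. */
    static List<String> partsToFix(boolean withContent, boolean withParseText,
                                   boolean withParseData) {
        List<String> parts = new ArrayList<>();
        parts.add("fetcher");                    // assumption: fetcher output is always checked
        if (withContent)   parts.add("content");
        if (withParseText) parts.add("parse_text");
        if (withParseData) parts.add("parse_data");
        return parts;
    }

    /**
     * Walk the selected parts. With dryrun set, only record what would be done;
     * otherwise call the hypothetical repair action. Returns true if no repair failed.
     */
    static boolean fix(List<String> parts, boolean dryrun,
                       Predicate<String> repair, List<String> log) {
        boolean ok = true;
        for (String part : parts) {
            if (dryrun) {
                log.add("would fix " + part);
            } else {
                boolean fixed = repair.test(part);
                log.add((fixed ? "fixed " : "failed ") + part);
                ok &= fixed;
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        boolean ok = fix(partsToFix(true, false, true), true, part -> true, log);
        log.forEach(System.out::println);
        System.out.println("result: " + ok);
    }
}
```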

get

public boolean get(long n,
                   FetcherOutput fo,
                   Content co,
                   ParseText pt,
                   ParseData pd)
            throws IOException
Get a specified entry from the segment. Note: even if some of the storage objects are null, a seek(n) operation is still performed on every reader that is open, to ensure that the whole entry is valid.

Parameters:
n - position of the entry
fo - storage for FetcherOutput data. Must not be null.
co - storage for Content data, or null.
pt - storage for ParseText data, or null.
pd - storage for ParseData data, or null.
Returns:
true if all requested data was read successfully, false otherwise
Throws:
IOException

next

public boolean next(FetcherOutput fo,
                    Content co,
                    ParseText pt,
                    ParseData pd)
             throws IOException
Read values from all open readers. Note: even if some of the storage objects are null, an underlying next() operation is still performed on every reader that is open, to ensure that the whole entry is valid.

Throws:
IOException
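
The note about null storage objects means that readers advance in lockstep whether or not the caller wants their data. The sketch below models that with in-memory lists. LockstepSketch and ListReader are hypothetical stand-ins, not ArrayFile.Reader, and only two of the four readers are modelled.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative model of readers advanced in lockstep; not Nutch code. */
class LockstepSketch {

    /** A trivial sequential reader over an in-memory list. */
    static class ListReader {
        private final List<String> values;
        private int pos = 0;

        ListReader(List<String> values) { this.values = values; }

        /** Advance by one entry; copy the value into out[0] only when out is non-null. */
        boolean next(String[] out) {
            if (pos >= values.size()) return false;
            String value = values.get(pos++);
            if (out != null) out[0] = value;
            return true;
        }

        int position() { return pos; }
    }

    /**
     * Advance every open reader, even when the caller passed null storage for it,
     * so that all readers stay aligned on the same entry.
     */
    static boolean next(ListReader fetcher, ListReader content, String[] fo, String[] co) {
        boolean ok = fetcher.next(fo);
        if (content != null) {
            ok &= content.next(co);              // still advanced when co is null
        }
        return ok;
    }

    public static void main(String[] args) {
        ListReader fetcher = new ListReader(Arrays.asList("f0", "f1"));
        ListReader content = new ListReader(Arrays.asList("c0", "c1"));
        String[] fo = new String[1];
        String[] co = new String[1];

        next(fetcher, content, fo, null);        // content entry 0 is consumed but not stored
        next(fetcher, content, fo, co);          // both readers now deliver entry 1
        System.out.println(fo[0] + " " + co[0]);
    }
}
```

If the content reader were skipped whenever co is null, the second call would pair fetcher entry 1 with content entry 0, which is the misalignment the note guards against.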

seek

public void seek(long n)
          throws IOException
Seek to a position in all readers.

Throws:
IOException

key

public long key()
Return the current key position.


reset

public void reset()
           throws IOException
Reset all readers.

Throws:
IOException

close

public void close()
Close all readers.


dump

public void dump(boolean sorted,
                 PrintStream output)
          throws Exception
Dump the segment's content in human-readable format.

Parameters:
sorted - if true, sort segment entries by URL (ascending). If false, output entries in the order they occur in the segment.
output - where to dump to
Throws:
Exception
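
The effect of the sorted flag on output order can be illustrated with plain strings. This sketch assumes that ascending URL order behaves like natural String ordering, which may differ from the comparison Nutch actually uses. DumpOrderSketch and dumpOrder are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative model of the two dump orders; not the Nutch implementation. */
class DumpOrderSketch {

    /** Return URLs in segment order, or sorted ascending when sorted is true. */
    static List<String> dumpOrder(List<String> urlsInSegmentOrder, boolean sorted) {
        List<String> out = new ArrayList<>(urlsInSegmentOrder);  // leave the input untouched
        if (sorted) {
            Collections.sort(out);               // assumption: natural String ordering
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.org/b", "http://example.com/", "http://example.org/a");
        System.out.println(dumpOrder(urls, false));
        System.out.println(dumpOrder(urls, true));
    }
}
```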

main

public static void main(String[] args)
                 throws Exception
Command-line wrapper. Run without arguments to see usage help.

Throws:
Exception


Copyright © 2006 The Apache Software Foundation