org.apache.nutch.tools
Class ParseSegment

java.lang.Object
  extended by org.apache.nutch.tools.ParseSegment

public class ParseSegment
extends Object

Parse contents in one segment.

It assumes that ./fetcher_output/ exists under the given segment. This directory is typically generated by a non-parsing fetcher run (i.e., the fetcher is started with option -noParsing).

Contents in one segment are parsed and saved in these steps:

  • (1) ./fetcher_output/ and ./content/ are iterated over together (possibly by multiple ParserThreads), and the content of each entry is parsed. The entry number and the resulting ParserOutput are saved in ./parser.unsorted.
  • (2) ./parser.unsorted is sorted by entry number, and the result is saved as ./parser.sorted.
  • (3) ./parser.sorted and ./fetcher_output/ are iterated over together. For each entry, ParserOutput is split into ParseData and ParseText, which are saved in ./parse_data/ and ./parse_text/ respectively. FetcherOutput is also updated with the parsing status and saved in ./fetcher/.
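
The three steps above can be illustrated with a minimal, in-memory sketch. It does not use any Nutch classes: ParseStepsSketch, Parsed, and the list-based "storage" are hypothetical stand-ins chosen for illustration, and the "parsing" is just a trim. The point is only the shape of the pipeline: worker threads emit results in arbitrary order, a sort by entry number restores the original order, and a final pass splits each record into two aligned outputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParseStepsSketch {

    // Hypothetical stand-in for one ParserOutput record, keyed by entry number.
    static final class Parsed {
        final int entry;
        final String data; // stand-in for ParseData
        final String text; // stand-in for ParseText

        Parsed(int entry, String data, String text) {
            this.entry = entry;
            this.data = data;
            this.text = text;
        }
    }

    // Step 1: parse every entry on a pool of worker threads.
    // Results are appended as they finish, so their order is arbitrary.
    static List<Parsed> parse(List<String> contents, int threadCount)
            throws InterruptedException {
        List<Parsed> unsorted = Collections.synchronizedList(new ArrayList<Parsed>());
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        for (int i = 0; i < contents.size(); i++) {
            final int entry = i;
            pool.execute(() -> {
                String raw = contents.get(entry);
                unsorted.add(new Parsed(entry, "length=" + raw.length(), raw.trim()));
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return unsorted;
    }

    // Step 2: restore the original entry order.
    static List<Parsed> sort(List<Parsed> unsorted) {
        List<Parsed> sorted = new ArrayList<Parsed>(unsorted);
        sorted.sort(Comparator.comparingInt((Parsed p) -> p.entry));
        return sorted;
    }

    // Step 3: split each record into two outputs that stay aligned by position.
    static void save(List<Parsed> sorted, List<String> parseData, List<String> parseText) {
        for (Parsed p : sorted) {
            parseData.add(p.data);
            parseText.add(p.text);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> contents = Arrays.asList(" a ", "bb", " ccc");
        List<String> parseData = new ArrayList<String>();
        List<String> parseText = new ArrayList<String>();
        save(sort(parse(contents, 2)), parseData, parseText);
        System.out.println(parseData);
        System.out.println(parseText);
    }
}
```

Sorting by entry number is what lets step 3 walk the parsed records and the fetcher records side by side, since both are then in the same order.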

    In the end, ./fetcher/ should be identical to the one produced by a fetcher run WITHOUT option -noParsing.

    By default, the intermediates ./parser.unsorted and ./parser.sorted are removed at the end, unless option -noClean is used. However, ./fetcher_output/ is kept intact.
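
The cleanup rule can be sketched as follows. This is an illustration only, under two assumptions that the text above does not confirm: that the segment lives on the local file system (the real class works through NutchFileSystem) and that the two intermediates are plain files. CleanIntermediatesSketch and finish are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CleanIntermediatesSketch {

    // Removes the two intermediates when clean is true (the default behaviour
    // described above) and leaves everything else, including fetcher_output,
    // untouched. With clean == false (as with -noClean) nothing is removed.
    static void finish(Path segmentDir, boolean clean) throws IOException {
        if (!clean) {
            return;
        }
        // Assumption: both intermediates are plain files in the segment directory.
        Files.deleteIfExists(segmentDir.resolve("parser.unsorted"));
        Files.deleteIfExists(segmentDir.resolve("parser.sorted"));
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway segment layout in a temporary directory.
        Path segment = Files.createTempDirectory("segment");
        Files.createFile(segment.resolve("parser.unsorted"));
        Files.createFile(segment.resolve("parser.sorted"));
        Files.createDirectory(segment.resolve("fetcher_output"));

        finish(segment, true);

        System.out.println(Files.exists(segment.resolve("parser.unsorted")));
        System.out.println(Files.exists(segment.resolve("fetcher_output")));
    }
}
```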

    Check Fetcher.java and FetcherOutput.java for further discussion.

    Author:
    John Xing

    Field Summary
    static Logger LOG
               
     
    Constructor Summary
    ParseSegment(NutchFileSystem nfs, String directory, boolean dryRun)
              ParseSegment constructor
     
    Method Summary
    static void main(String[] args)
              main method
     void parse()
              Parse contents by multiple threads and save as unsorted ParserOutput
     void save()
              Split sorted ParserOutput into ParseData and ParseText, and generate new FetcherOutput with updated status
     void setClean(boolean clean)
              Set whether intermediates are removed at the end.
    static void setLogLevel(Level level)
              Set the logging level.
     void setThreadCount(int threadCount)
              Set thread count
     void sort()
              Sort ParserOutput
     void status()
              Display the status of the parser run.
     
    Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
     

    Field Detail

    LOG

    public static final Logger LOG
    Constructor Detail

    ParseSegment

    public ParseSegment(NutchFileSystem nfs,
                        String directory,
                        boolean dryRun)
                 throws IOException
    ParseSegment constructor

    Method Detail

    setThreadCount

    public void setThreadCount(int threadCount)
    Set thread count


    setLogLevel

    public static void setLogLevel(Level level)
    Set the logging level.


    setClean

    public void setClean(boolean clean)
    Set whether intermediates are removed at the end.


    status

    public void status()
    Display the status of the parser run.


    parse

    public void parse()
               throws IOException,
                      InterruptedException
    Parse contents by multiple threads and save as unsorted ParserOutput

    Throws:
    IOException
    InterruptedException

    sort

    public void sort()
              throws IOException
    Sort ParserOutput

    Throws:
    IOException

    save

    public void save()
              throws IOException
    Split sorted ParserOutput into ParseData and ParseText, and generate new FetcherOutput with updated status

    Throws:
    IOException

    main

    public static void main(String[] args)
                     throws Exception
    main method

    Throws:
    Exception


    Copyright © 2006 The Apache Software Foundation