org.apache.nutch.analysis
Class NutchDocumentTokenizer

java.lang.Object
  extended by org.apache.lucene.analysis.TokenStream
      extended by org.apache.lucene.analysis.Tokenizer
          extended by org.apache.nutch.analysis.NutchDocumentTokenizer
All Implemented Interfaces:
NutchAnalysisConstants

public final class NutchDocumentTokenizer
extends Tokenizer
implements NutchAnalysisConstants

The tokenizer used for Nutch document text. Implemented in terms of our JavaCC-generated lexical analyzer, NutchAnalysisTokenManager, shared with the query parser.


Field Summary
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Fields inherited from interface org.apache.nutch.analysis.NutchAnalysisConstants
ACRONYM, APOSTROPHE, ATSIGN, C_PLUS_PLUS, C_SHARP, CJK, COLON, DEFAULT, DIGIT, DOT, EOF, IRREGULAR_WORD, LETTER, MINUS, PLUS, QUOTE, SIGRAM, SLASH, tokenImage, WHITE, WORD, WORD_PUNCT
 
Constructor Summary
NutchDocumentTokenizer(Reader reader)
          Construct a tokenizer for the text in a Reader.
 
Method Summary
static void main(String[] args)
          For debugging.
 Token next()
          Returns the next token in the stream, or null at EOF.
 
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

NutchDocumentTokenizer

public NutchDocumentTokenizer(Reader reader)
Construct a tokenizer for the text in a Reader.

Method Detail

next

public final Token next()
                 throws IOException
Returns the next token in the stream, or null at EOF.

Throws:
IOException
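
The null-at-EOF contract of next() implies a simple consumption loop: keep calling next() until it returns null. The following is a minimal sketch of that loop, not code from Nutch. To stay runnable without the Nutch and Lucene jars, it uses a hypothetical stand-in class, WhitespaceTokenizerStandIn, which only splits on whitespace and returns String rather than Token; the class name, the collectTerms helper, and the splitting rule are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class NextLoopSketch {

    /**
     * Hypothetical stand-in for a tokenizer whose next() returns null at EOF.
     * It is NOT NutchDocumentTokenizer: it only splits on whitespace.
     */
    static final class WhitespaceTokenizerStandIn {
        private final Reader input;

        WhitespaceTokenizerStandIn(Reader reader) {
            this.input = reader;
        }

        /** Returns the next whitespace-delimited term, or null at EOF. */
        String next() throws IOException {
            StringBuilder term = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    // A separator ends the current term, if one has started.
                    if (term.length() > 0) {
                        return term.toString();
                    }
                } else {
                    term.append((char) c);
                }
            }
            // End of input: emit a trailing term once, then null afterwards.
            return term.length() > 0 ? term.toString() : null;
        }

        void close() throws IOException {
            input.close();
        }
    }

    /** Drains the stand-in tokenizer by calling next() until it returns null. */
    static List<String> collectTerms(String text) {
        List<String> terms = new ArrayList<>();
        WhitespaceTokenizerStandIn tokenizer =
                new WhitespaceTokenizerStandIn(new StringReader(text));
        try {
            for (String term = tokenizer.next(); term != null; term = tokenizer.next()) {
                terms.add(term);
            }
            tokenizer.close();
        } catch (IOException e) {
            // next() may throw IOException, so the caller has to handle it.
            throw new UncheckedIOException(e);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints: [Nutch, document, text]
        System.out.println(collectTerms("Nutch document text"));
    }
}
```

With the actual class, the loop would presumably keep the same shape, with a NutchDocumentTokenizer built from a Reader and a Token variable in place of the String; the tokens produced would follow the Nutch lexical analyzer's rules, not whitespace splitting.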

main

public static void main(String[] args)
                 throws Exception
For debugging.

Throws:
Exception


Copyright © 2006 The Apache Software Foundation