org.apache.nutch.analysis
Class NutchDocumentTokenizer
java.lang.Object
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.nutch.analysis.NutchDocumentTokenizer
- All Implemented Interfaces:
- NutchAnalysisConstants
- public final class NutchDocumentTokenizer
- extends Tokenizer
- implements NutchAnalysisConstants
The tokenizer used for Nutch document text. Implemented in terms of our
JavaCC-generated lexical analyzer, NutchAnalysisTokenManager, shared
with the query parser.
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
Fields inherited from interface org.apache.nutch.analysis.NutchAnalysisConstants
ACRONYM, APOSTROPHE, ATSIGN, C_PLUS_PLUS, C_SHARP, CJK, COLON, DEFAULT, DIGIT, DOT, EOF, IRREGULAR_WORD, LETTER, MINUS, PLUS, QUOTE, SIGRAM, SLASH, tokenImage, WHITE, WORD, WORD_PUNCT
Method Summary
static void main(String[] args)
For debugging.
Token next()
Returns the next token in the stream, or null at EOF.
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
NutchDocumentTokenizer
public NutchDocumentTokenizer(Reader reader)
- Construct a tokenizer for the text in a Reader.
next
public final Token next()
throws IOException
- Returns the next token in the stream, or null at EOF.
- Throws:
IOException
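The contract of next() (return a token until the stream is exhausted, then return null) implies a simple read loop on the caller's side. The sketch below illustrates that loop under stated assumptions: StandInTokenizer, collectTokens, and TokenLoopSketch are hypothetical names introduced only for illustration, and the stand-in splits on whitespace and returns plain strings so that the example runs without Nutch or Lucene on the classpath. It does not reproduce the JavaCC grammar or the Token type of the real class; with the real class, the same loop shape would presumably apply to a NutchDocumentTokenizer built from a Reader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TokenLoopSketch {

    /**
     * Hypothetical stand-in for a tokenizer with the same calling contract:
     * constructed from a Reader, next() returns null at EOF.
     * It splits on whitespace only; the real lexical rules are not modeled.
     */
    static final class StandInTokenizer {
        private final Reader input;

        StandInTokenizer(Reader reader) {
            this.input = reader;
        }

        /** Returns the next whitespace-delimited token, or null at EOF. */
        String next() throws IOException {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) {
                        return sb.toString(); // a token just ended
                    }
                    // otherwise skip leading or repeated whitespace
                } else {
                    sb.append((char) c);
                }
            }
            // End of input: emit a trailing token if one is pending, else null.
            return sb.length() > 0 ? sb.toString() : null;
        }

        void close() throws IOException {
            input.close();
        }
    }

    /** Drains the tokenizer with the loop-until-null pattern. */
    static List<String> collectTokens(Reader reader) throws IOException {
        StandInTokenizer tokenizer = new StandInTokenizer(reader);
        List<String> tokens = new ArrayList<>();
        try {
            for (String t = tokenizer.next(); t != null; t = tokenizer.next()) {
                tokens.add(t);
            }
        } finally {
            tokenizer.close(); // releases the underlying Reader
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // prints [apache, nutch, search]
        System.out.println(collectTokens(new StringReader("apache nutch search")));
    }
}
```

Closing the tokenizer in a finally block mirrors the close method inherited from Tokenizer, which is listed above; whether a caller needs it depends on who owns the Reader.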
main
public static void main(String[] args)
throws Exception
- For debugging.
- Throws:
Exception
Copyright © 2006 The Apache Software Foundation