org.apache.nutch.parse.js
Class JSParseFilter

java.lang.Object
  extended by org.apache.nutch.parse.js.JSParseFilter
All Implemented Interfaces:
HtmlParseFilter, Parser

public class JSParseFilter
extends Object
implements HtmlParseFilter, Parser

This class is a heuristic link extractor for JavaScript files and code snippets. The general idea of two-pass regex matching comes from Heritrix. Parts of the code come from OutlinkExtractor.java by Stephan Strittmatter.
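
The two-pass idea can be illustrated with a minimal sketch. This is not the JSParseFilter source: the class name JsLinkSketch, the method extractLinks, and both regular expressions are hypothetical and chosen only for illustration. The sketch assumes that pass one collects every quoted string literal in the script, and that pass two keeps only the literals that look like a URL or a path, resolving them against a base URL.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of two-pass link extraction; not Nutch code. */
final class JsLinkSketch {

    // Pass 1: any single- or double-quoted string literal on one line.
    // Group 2 holds the raw text between the quotes (escapes left as-is).
    private static final Pattern STRING_LITERAL =
        Pattern.compile("([\"'])((?:\\\\.|(?!\\1)[^\\\\\\r\\n])*)\\1");

    // Pass 2: a literal is kept if it is an absolute http/https/ftp URL,
    // a whitespace-free string containing '/', or a file name with a
    // common page extension. This is a heuristic: a literal such as
    // "text/javascript" also passes.
    private static final Pattern URI_LIKE = Pattern.compile(
        "(?i)(?:(?:https?|ftp)://\\S+"
        + "|[\\w.~%-]*/\\S*"
        + "|[\\w.~%-]+\\.(?:html?|php|jsp|asp)(?:[?#]\\S*)?)");

    private JsLinkSketch() {
    }

    /** Returns the absolute links found in {@code js}, in order, without duplicates. */
    static List<String> extractLinks(String js, String baseUrl) {
        URI base = URI.create(baseUrl);
        Set<String> links = new LinkedHashSet<>();
        Matcher literals = STRING_LITERAL.matcher(js);
        while (literals.find()) {
            // Undo the common "\/" escape before testing the candidate.
            String candidate = literals.group(2).replace("\\/", "/");
            if (!URI_LIKE.matcher(candidate).matches()) {
                continue; // pass 2 rejected this literal
            }
            try {
                links.add(base.resolve(candidate).toString());
            } catch (IllegalArgumentException e) {
                // Not a syntactically valid URI reference; skip it.
            }
        }
        return new ArrayList<>(links);
    }

    public static void main(String[] args) {
        String js = "var a = \"http://example.com/page.html\";"
            + " window.open('/docs/index.html');"
            + " var s = \"hello world\";"
            + " location.href = \"next.php?id=1\";";
        for (String link : extractLinks(js, "http://www.example.org/dir/script.js")) {
            System.out.println(link);
        }
    }
}
```

In this sketch the literal "hello world" is dropped in the second pass, while the other three literals are resolved against the base URL. Splitting the work into two simple patterns keeps each one readable, at the cost of some false positives.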

Author:
Andrzej Bialecki

Field Summary
static Logger LOG
           
 
Fields inherited from interface org.apache.nutch.parse.HtmlParseFilter
X_POINT_ID
 
Fields inherited from interface org.apache.nutch.parse.Parser
X_POINT_ID
 
Constructor Summary
JSParseFilter()
           
 
Method Summary
 Parse filter(Content content, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)
          Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
 Parse getParse(Content c)
          Creates the parse for some content.
static void main(String[] args)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final Logger LOG

Constructor Detail

JSParseFilter

public JSParseFilter()

Method Detail

filter

public Parse filter(Content content,
                    Parse parse,
                    HTMLMetaTags metaTags,
                    DocumentFragment doc)
Description copied from interface: HtmlParseFilter
Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.

Specified by:
filter in interface HtmlParseFilter

getParse

public Parse getParse(Content c)
Description copied from interface: Parser
Creates the parse for some content.

Specified by:
getParse in interface Parser

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception


Copyright © 2006 The Apache Software Foundation