java.lang.Object
  org.apache.nutch.parse.html.DOMContentUtils
A collection of methods for extracting content from DOM trees. This class holds a few utility methods for pulling content out of DOM nodes, such as getOutlinks, getText, etc.
Nested Class Summary | |
static class |
DOMContentUtils.LinkParams
|
Field Summary | |
static HashMap |
linkParams
|
Constructor Summary | |
DOMContentUtils()
|
Method Summary | |
static URL |
getBase(Node node)
If Node contains a BASE tag then its HREF is returned. |
static void |
getOutlinks(URL base,
ArrayList outlinks,
Node node)
This method finds all anchors below the supplied DOM node, creates an appropriate Outlink
record for each (relative to the supplied base
URL), and adds them to the outlinks ArrayList. |
static void |
getText(StringBuffer sb,
Node node)
This is a convenience method, equivalent to getText(sb, node, false). |
static boolean |
getText(StringBuffer sb,
Node node,
boolean abortOnNestedAnchors)
This method takes a StringBuffer and a DOM Node,
and will append all the content text found beneath the DOM node to
the StringBuffer. |
static boolean |
getTitle(StringBuffer sb,
Node node)
This method takes a StringBuffer and a DOM Node,
and will append the content text found beneath the first
title node to the StringBuffer. |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
public static HashMap linkParams
Constructor Detail |
public DOMContentUtils()
Method Detail |
public static final boolean getText(StringBuffer sb, Node node, boolean abortOnNestedAnchors)
This method takes a StringBuffer and a DOM Node, and will append all the content text found beneath the DOM node to the StringBuffer.
If abortOnNestedAnchors is true, DOM traversal will be aborted and the StringBuffer will not contain any text encountered after a nested anchor is found.
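The traversal described above can be sketched with the JDK's own DOM classes. This is an illustrative re-implementation, not the Nutch source: the class TextSketch and its methods parse and collectText are hypothetical names, the whitespace handling is a simplification, and the meaning of the boolean result (true when traversal was aborted) is an assumption, since this page does not state what the return value signifies.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; not part of Nutch.
final class TextSketch {

    /** Parses a well-formed XHTML snippet into a DOM document. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse snippet", e);
        }
    }

    /**
     * Appends the text found beneath node to sb.
     * Assumption: returns true if traversal stopped at a nested anchor.
     */
    static boolean collectText(StringBuffer sb, Node node, boolean abortOnNestedAnchors) {
        return walk(sb, node, abortOnNestedAnchors, 0);
    }

    private static boolean walk(StringBuffer sb, Node node,
                                boolean abortOnNestedAnchors, int anchorDepth) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' '); // separate text runs with a single space
                }
                sb.append(text);
            }
            return false;
        }
        if ("a".equalsIgnoreCase(node.getNodeName())) {
            anchorDepth++;
            if (abortOnNestedAnchors && anchorDepth > 1) {
                return true; // anchor inside an anchor: stop here
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (walk(sb, child, abortOnNestedAnchors, anchorDepth)) {
                return true; // propagate the abort to the caller
            }
        }
        return false;
    }
}
```

In this sketch, text that precedes the nested anchor stays in the buffer and everything from the nested anchor onward is omitted, which matches the behaviour described for abortOnNestedAnchors.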
public static final void getText(StringBuffer sb, Node node)
This is a convenience method, equivalent to getText(sb, node, false).
public static final boolean getTitle(StringBuffer sb, Node node)
This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer.
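A minimal sketch of the "first title node" lookup, using only JDK DOM classes. The names TitleSketch, appendTitle and findFirst are hypothetical, and returning true when a title was found is an assumption; this page does not document the return value.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; not part of Nutch.
final class TitleSketch {

    /** Parses a well-formed XHTML snippet into a DOM document. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse snippet", e);
        }
    }

    /**
     * Appends the text of the first title element beneath node to sb.
     * Assumption: returns true if a title element was found.
     */
    static boolean appendTitle(StringBuffer sb, Node node) {
        Node title = findFirst(node, "title");
        if (title == null) {
            return false;
        }
        sb.append(title.getTextContent().trim());
        return true;
    }

    /** Depth-first search for the first element with the given name, ignoring case. */
    static Node findFirst(Node node, String name) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && name.equalsIgnoreCase(node.getNodeName())) {
            return node;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            Node hit = findFirst(child, name);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }
}
```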
public static final URL getBase(Node node)
If Node contains a BASE tag then its HREF is returned.
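As a rough illustration of locating a BASE tag and reading its HREF, the following sketch walks the tree with JDK DOM classes. BaseSketch and findBase are hypothetical names, and returning null when there is no usable BASE tag is an assumption rather than documented behaviour.

```java
import java.io.StringReader;
import java.net.URI;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; not part of Nutch.
final class BaseSketch {

    /** Parses a well-formed XHTML snippet into a DOM document. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse snippet", e);
        }
    }

    /**
     * Returns the HREF of the first BASE element beneath node.
     * Assumption: null is returned when no BASE element exists
     * or its HREF is missing, relative, or malformed.
     */
    static URL findBase(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "base".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            try {
                return new URI(href).toURL();
            } catch (Exception e) {
                return null; // first BASE tag has no usable absolute HREF
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            URL hit = findBase(child);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }
}
```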
public static final void getOutlinks(URL base, ArrayList outlinks, Node node)
This method finds all anchors below the supplied DOM node, creates an appropriate Outlink record for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
Links without inner structure (tags, text, etc.) are discarded, as are links which contain only single nested links and empty text nodes (this is a common DOM-fixup artifact, at least with nekohtml).
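The anchor collection and discard rules can be approximated as follows. This is a sketch under stated assumptions, not the Nutch implementation: OutlinkSketch, Link, collectLinks and shouldDiscard are hypothetical names, Link is a stand-in for Nutch's Outlink record, the base is passed as a java.net.URI rather than a URL, and the discard check is one plausible reading of the rule described above.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; not part of Nutch.
final class OutlinkSketch {

    /** Hypothetical stand-in for Nutch's Outlink: target URL plus anchor text. */
    static final class Link {
        final String url;
        final String anchor;

        Link(String url, String anchor) {
            this.url = url;
            this.anchor = anchor;
        }
    }

    /** Parses a well-formed XHTML snippet into a DOM document. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse snippet", e);
        }
    }

    /** Adds one Link per kept anchor beneath node, resolved against base. */
    static void collectLinks(URI base, List<Link> out, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            Element a = (Element) node;
            String href = a.getAttribute("href");
            if (!href.isEmpty() && !shouldDiscard(a)) {
                try {
                    out.add(new Link(base.resolve(href).toString(),
                            a.getTextContent().trim()));
                } catch (IllegalArgumentException e) {
                    // href is not a valid URI reference; skip this anchor
                }
            }
        }
        // Keep descending so that anchors nested in a discarded anchor are still seen.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectLinks(base, out, child);
        }
    }

    /** True for anchors with no children, or only one nested anchor plus blank text. */
    static boolean shouldDiscard(Element a) {
        if (!a.hasChildNodes()) {
            return true;
        }
        int nestedAnchors = 0;
        for (Node c = a.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                if (!c.getNodeValue().trim().isEmpty()) {
                    return false; // real anchor text
                }
            } else if (c.getNodeType() == Node.ELEMENT_NODE
                    && "a".equalsIgnoreCase(c.getNodeName())) {
                nestedAnchors++;
            } else {
                return false; // some other inner structure
            }
        }
        return nestedAnchors == 1;
    }
}
```

A caller would pass an empty list and the page's base, for example collectLinks(URI.create("http://example.com/dir/page.html"), links, doc), and then read the resolved targets from the list.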