DOMContentUtils (Nutch 0.7.2 API)


org.apache.nutch.parse.html
Class DOMContentUtils

java.lang.Object
  org.apache.nutch.parse.html.DOMContentUtils

public class DOMContentUtils
extends Object

A collection of methods for extracting content from DOM trees. This class holds a few utility methods for pulling content out of DOM nodes, such as getOutlinks, getText, etc.

Nested Class Summary
`static class`	`DOMContentUtils.LinkParams`

Field Summary
`static HashMap`	`linkParams`

Constructor Summary
`DOMContentUtils()`

Method Summary
`static URL`	`getBase(Node node)` If the Node contains a BASE tag, its HREF is returned.
`static void`	`getOutlinks(URL base, ArrayList outlinks, Node node)` This method finds all anchors below the supplied DOM `node`, and creates appropriate `Outlink` records for each (relative to the supplied `base` URL), and adds them to the `outlinks` `ArrayList`.
`static void`	`getText(StringBuffer sb, Node node)` This is a convenience method, equivalent to `getText(sb, node, false)`.
`static boolean`	`getText(StringBuffer sb, Node node, boolean abortOnNestedAnchors)` This method takes a `StringBuffer` and a DOM `Node`, and will append all the content text found beneath the DOM node to the `StringBuffer`.
`static boolean`	`getTitle(StringBuffer sb, Node node)` This method takes a `StringBuffer` and a DOM `Node`, and will append the content text found beneath the first `title` node to the `StringBuffer`.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

linkParams

public static HashMap linkParams

Constructor Detail

DOMContentUtils

public DOMContentUtils()

Method Detail

getText

public static final boolean getText(StringBuffer sb,
                                    Node node,
                                    boolean abortOnNestedAnchors)

This method takes a StringBuffer and a DOM Node, and will append all the content text found beneath the DOM node to the StringBuffer.

If abortOnNestedAnchors is true, DOM traversal will be aborted and the StringBuffer will not contain any text encountered after a nested anchor is found.

Returns: true if nested anchors were found
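The traversal described above can be approximated with the JDK's own DOM classes. The following is a minimal sketch, not the Nutch implementation: the class `GetTextSketch` and its helpers `parse` and `appendText` are hypothetical names, the whitespace handling is simplified, and it assumes that the starting node counts as the first anchor level and that nested anchors are reported even when traversal is not aborted.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only; this is NOT the Nutch source for getText.
class GetTextSketch {

    // Hypothetical helper: parse well-formed markup into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // Appends the text beneath node to sb.
    // Returns true if an anchor nested inside another anchor was seen.
    static boolean appendText(StringBuffer sb, Node node, boolean abortOnNestedAnchors) {
        return walk(sb, node, abortOnNestedAnchors, 0);
    }

    private static boolean walk(StringBuffer sb, Node node, boolean abort, int anchorDepth) {
        boolean nested = false;
        if ("a".equalsIgnoreCase(node.getNodeName())) {
            anchorDepth++;
            if (anchorDepth > 1) {
                if (abort) {
                    return true; // stop before appending any further text
                }
                nested = true;
            }
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (walk(sb, child, abort, anchorDepth)) {
                nested = true;
                if (abort) {
                    return true; // propagate the abort upwards
                }
            }
        }
        return nested;
    }

    public static void main(String[] args) {
        Node anchor = parse("<a href=\"x\">one <a href=\"y\">two</a> three</a>")
                .getDocumentElement();
        StringBuffer sb = new StringBuffer();
        boolean nested = appendText(sb, anchor, true);
        System.out.println(nested + " / " + sb); // prints "true / one"
    }
}
```

With Nutch on the classpath, the corresponding call would be `DOMContentUtils.getText(sb, node, true)`. In the sketch, the text after the nested anchor ("two", "three") is left out when aborting, and all three words are appended when `abortOnNestedAnchors` is false.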

getText

public static final void getText(StringBuffer sb,
                                 Node node)

This is a convenience method, equivalent to getText(sb, node, false).

getTitle

public static final boolean getTitle(StringBuffer sb,
                                     Node node)

This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer.

Returns: true if a title node was found, false otherwise
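As an illustration of the contract above, here is a small JDK-only sketch that looks for the first `title` element in document order and appends its text. It is not the Nutch implementation; `GetTitleSketch`, `parse`, `appendTitle`, and `findFirst` are hypothetical names, and trimming the title text is an assumption of the sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only; this is NOT the Nutch source for getTitle.
class GetTitleSketch {

    // Hypothetical helper: parse well-formed markup into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // Appends the text of the first title element beneath node to sb.
    // Returns true if a title element was found, false otherwise.
    static boolean appendTitle(StringBuffer sb, Node node) {
        Node title = findFirst(node, "title");
        if (title == null) {
            return false;
        }
        sb.append(title.getTextContent().trim());
        return true;
    }

    // Depth-first, pre-order search for the first element with the given name.
    private static Node findFirst(Node node, String name) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && name.equalsIgnoreCase(node.getNodeName())) {
            return node;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            Node hit = findFirst(child, name);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><title> Nutch </title></head><body/></html>");
        StringBuffer sb = new StringBuffer();
        boolean found = appendTitle(sb, doc);
        System.out.println(found + " / " + sb); // prints "true / Nutch"
    }
}
```

With Nutch on the classpath, the corresponding call would be `DOMContentUtils.getTitle(sb, node)`.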

getBase

public static final URL getBase(Node node)

If the Node contains a BASE tag, its HREF is returned.
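The lookup can be pictured with the JDK-only sketch below. It is not the Nutch implementation: `GetBaseSketch`, `parse`, and `findBase` are hypothetical names, and returning `null` when no usable BASE tag exists is an assumption, since the description above does not say what is returned in that case.

```java
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only; this is NOT the Nutch source for getBase.
class GetBaseSketch {

    // Hypothetical helper: parse well-formed markup into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // Returns the href of the first base element beneath node as a URL,
    // or null (an assumption of this sketch) if none is found or it is not
    // a valid absolute URL.
    static URL findBase(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "base".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            try {
                return new URI(href).toURL();
            } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
                return null; // missing, malformed, or relative href
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            URL found = findBase(child);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><head><base href=\"http://example.com/docs/\"/></head><body/></html>");
        System.out.println(findBase(doc)); // prints "http://example.com/docs/"
    }
}
```

With Nutch on the classpath, the corresponding call would be `DOMContentUtils.getBase(node)`.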

getOutlinks

public static final void getOutlinks(URL base,
                                     ArrayList outlinks,
                                     Node node)

This method finds all anchors below the supplied DOM node, and creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.

Links without inner structure (tags, text, etc.) are discarded, as are links which contain only a single nested link and empty text nodes (this is a common DOM-fixup artifact, at least with nekohtml).
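The collection and filtering rules above can be sketched with JDK classes alone. This is an approximation, not the Nutch implementation: `GetOutlinksSketch`, `parse`, `collectLinks`, and `isDiscarded` are hypothetical names, plain URL strings stand in for Nutch `Outlink` records (the anchor text is left out), `java.net.URI` is used for the base instead of `URL`, and only `a` elements are considered.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only; this is NOT the Nutch source for getOutlinks.
class GetOutlinksSketch {

    // Hypothetical helper: parse well-formed markup into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // Adds the resolved target of every kept anchor beneath node to links.
    static void collectLinks(URI base, List<String> links, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())
                && !isDiscarded(node)) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                try {
                    links.add(base.resolve(href).toString());
                } catch (IllegalArgumentException e) {
                    // href is not a valid URI reference; skip it
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectLinks(base, links, child);
        }
    }

    // True for anchors with no children at all, and for anchors whose only
    // children are one nested anchor plus blank text nodes.
    private static boolean isDiscarded(Node anchor) {
        if (!anchor.hasChildNodes()) {
            return true;
        }
        int nestedAnchors = 0;
        int otherChildren = 0;
        for (Node c = anchor.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE && c.getNodeValue().trim().isEmpty()) {
                continue; // blank text node
            }
            if ("a".equalsIgnoreCase(c.getNodeName())) {
                nestedAnchors++;
            } else {
                otherChildren++;
            }
        }
        return nestedAnchors == 1 && otherChildren == 0;
    }

    public static void main(String[] args) {
        Document doc = parse("<div><a href=\"/a\">A</a><a href=\"empty.html\"></a>"
                + "<a href=\"outer.html\"> <a href=\"inner.html\">In</a> </a></div>");
        List<String> links = new ArrayList<>();
        collectLinks(URI.create("http://example.com/dir/page.html"), links, doc);
        System.out.println(links);
    }
}
```

In this example the empty anchor and the outer wrapper anchor are dropped, while `/a` and the nested `inner.html` are resolved against the base and kept. With Nutch on the classpath, the corresponding call would be `DOMContentUtils.getOutlinks(base, outlinks, node)`.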
