org.apache.nutch.analysis.lang
Class HTMLLanguageParser

java.lang.Object
  extended by org.apache.nutch.analysis.lang.HTMLLanguageParser
All Implemented Interfaces:
HtmlParseFilter

public class HTMLLanguageParser
extends Object
implements HtmlParseFilter

An HtmlParseFilter that looks for indications of the content language of an HTML document. If an indication is found, it is stored under the META_LANG_NAME attribute of the ParseData metadata.

Author:
Sami Siren, Jérôme Charron

Field Summary
static String META_LANG_NAME
          The name of the metadata attribute that holds the language
 
Fields inherited from interface org.apache.nutch.parse.HtmlParseFilter
X_POINT_ID
 
Constructor Summary
HTMLLanguageParser()
           
 
Method Summary
 Parse filter(Content content, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)
          Scans the HTML document for possible indications of the content language.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

META_LANG_NAME

public static final String META_LANG_NAME
The name of the metadata attribute that holds the language

See Also:
Constant Field Values
Constructor Detail

HTMLLanguageParser

public HTMLLanguageParser()
Method Detail

filter

public Parse filter(Content content,
                    Parse parse,
                    HTMLMetaTags metaTags,
                    DocumentFragment doc)
Scans the HTML document for possible indications of the content language:
  1. html lang attribute (http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1),
  2. meta dc.language (http://dublincore.org/documents/2000/07/16/usageguide/qualified-html.shtml#language),
  3. meta http-equiv (content-language) (http://www.w3.org/TR/REC-html40/struct/global.html#h-7.4.4.2).
Only the first occurrence of a language indication is stored.
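
The lookup described above can be illustrated with a small standalone sketch. It is not the Nutch implementation: the class LanguageHintSketch and its methods are hypothetical names, it uses only the JDK's org.w3c.dom API, and it assumes that the numbered order is a precedence order (lang attribute first, then dc.language, then content-language) and that the first value seen for each source wins. The input is parsed with the JDK XML parser, so it must be well-formed markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Nutch.
class LanguageHintSketch {

    /** Returns the first language hint found under root, or null if there is none. */
    static String firstLanguageHint(Node root) {
        // Slots: [0] lang attribute, [1] meta dc.language, [2] meta http-equiv content-language.
        String[] found = new String[3];
        collect(root, found);
        for (String hint : found) {
            if (hint != null) {
                return hint; // assumed precedence: lowest slot index wins
            }
        }
        return null;
    }

    /** Depth-first walk that records the first value seen for each hint source. */
    private static void collect(Node node, String[] found) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String lang = el.getAttribute("lang"); // "" when the attribute is absent
            if (found[0] == null && !lang.isEmpty()) {
                found[0] = lang;
            }
            if ("meta".equalsIgnoreCase(el.getNodeName())) {
                String content = el.getAttribute("content");
                if (!content.isEmpty()) {
                    if (found[1] == null
                            && "dc.language".equalsIgnoreCase(el.getAttribute("name"))) {
                        found[1] = content;
                    }
                    if (found[2] == null
                            && "content-language".equalsIgnoreCase(el.getAttribute("http-equiv"))) {
                        found[2] = content;
                    }
                }
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), found);
        }
    }

    /** Convenience helper: parses well-formed markup and returns its first language hint. */
    static String hintFromMarkup(String markup) {
        try {
            Node doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return firstLanguageHint(doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html lang=\"fi\"><head>"
                + "<meta name=\"dc.language\" content=\"en\"/>"
                + "</head><body/></html>";
        System.out.println(hintFromMarkup(page)); // prints "fi"
    }
}
```

Because firstLanguageHint accepts any org.w3c.dom.Node, the same walk could be applied to a DocumentFragment such as the doc argument of filter. Storing the result under META_LANG_NAME is left out, since that step depends on the Nutch metadata API.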

Specified by:
filter in interface HtmlParseFilter


Copyright © 2006 The Apache Software Foundation