org.apache.nutch.analysis.lang
Class HTMLLanguageParser
java.lang.Object
org.apache.nutch.analysis.lang.HTMLLanguageParser
- All Implemented Interfaces:
- HtmlParseFilter
- public class HTMLLanguageParser
- extends Object
- implements HtmlParseFilter
An HtmlParseFilter that looks for possible indications of content language.
If an indication is found, it is stored under the META_LANG_NAME attribute
of the ParseData metadata.
- Author:
- Sami Siren, Jérôme Charron
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
META_LANG_NAME
public static final String META_LANG_NAME
- The name of the language metadata attribute.
- See Also:
- Constant Field Values
HTMLLanguageParser
public HTMLLanguageParser()
filter
public Parse filter(Content content,
Parse parse,
HTMLMetaTags metaTags,
DocumentFragment doc)
- Scan the HTML document looking for possible indications of content language:
- html lang attribute (http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1),
- meta dc.language (http://dublincore.org/documents/2000/07/16/usageguide/qualified-html.shtml#language),
- meta http-equiv (content-language) (http://www.w3.org/TR/REC-html40/struct/global.html#h-7.4.4.2).
Only the first occurrence of a language indication is stored.
- Specified by:
filter
in interface HtmlParseFilter
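
The lookup described for filter can be illustrated with a minimal, standalone sketch. It is not the Nutch implementation: the class name LanguageHintSketch and the method firstLanguageHint are hypothetical, only the JDK's DOM API is used, and the input must be well-formed XML. The sketch assumes that the three sources are consulted in the order listed above and that the lang attribute is read from the root html element; the real filter may differ on both points.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of org.apache.nutch.analysis.lang.
public class LanguageHintSketch {

    /** Returns the first language hint found in the markup, or null if none. */
    static String firstLanguageHint(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }

        // 1. lang attribute on the root element (assumed to be <html>).
        //    Element.getAttribute returns "" when the attribute is absent.
        String lang = doc.getDocumentElement().getAttribute("lang");
        if (!lang.isEmpty()) {
            return lang;
        }

        NodeList metas = doc.getElementsByTagName("meta");

        // 2. <meta name="dc.language" content="...">
        String dc = metaContent(metas, "name", "dc.language");
        if (dc != null) {
            return dc;
        }

        // 3. <meta http-equiv="content-language" content="...">
        return metaContent(metas, "http-equiv", "content-language");
    }

    /** Content of the first meta element whose given attribute matches the key. */
    private static String metaContent(NodeList metas, String attr, String key) {
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            if (key.equalsIgnoreCase(meta.getAttribute(attr))) {
                String content = meta.getAttribute("content");
                if (!content.isEmpty()) {
                    return content;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String page = "<html lang=\"en\"><head>"
                + "<meta name=\"dc.language\" content=\"fr\"/>"
                + "</head><body/></html>";
        // The lang attribute is checked first, so the dc.language value is ignored.
        System.out.println(firstLanguageHint(page)); // prints "en"
    }
}
```

Returning as soon as one source yields a value mirrors the note that only the first occurrence is stored; later hints in the same document are never consulted.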
Copyright © 2006 The Apache Software Foundation