org.apache.nutch.net
Class RegexUrlNormalizer

java.lang.Object
  extended by org.apache.nutch.net.BasicUrlNormalizer
      extended by org.apache.nutch.net.RegexUrlNormalizer
All Implemented Interfaces:
UrlNormalizer

public class RegexUrlNormalizer
extends BasicUrlNormalizer
implements UrlNormalizer

Applies user-defined regex substitutions to the URLs that Nutch encounters, which is useful, for example, for stripping session IDs from URLs.

To use this class, specify it as the URL normalizer in nutch-site.xml or nutch-default.xml by setting the urlnormalizer.class property to org.apache.nutch.net.RegexUrlNormalizer. Also set the urlnormalizer.regex.file property to the name of an XML file that contains the patterns and substitutions to apply to encountered URLs.

Author:
Luke Baker

Field Summary
 
Fields inherited from class org.apache.nutch.net.BasicUrlNormalizer
LOG
 
Constructor Summary
RegexUrlNormalizer()
          Default constructor which gets the file name from either nutch-site.xml or nutch-default.xml and reads that configuration file.
RegexUrlNormalizer(String filename)
          Constructor that takes the file name directly instead of looking it up in the configuration files.
 
Method Summary
static void main(String[] args)
          Prints the patterns and substitutions that are in the configuration file.
 String normalize(String urlString)
          Normalizes a URL by applying the basic normalization inherited from BasicUrlNormalizer and the regex substitutions of regexNormalize().
 String regexNormalize(String urlString)
          Performs the replacements by iterating through all the regex patterns.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

RegexUrlNormalizer

public RegexUrlNormalizer()
                   throws IOException,
                          org.apache.oro.text.regex.MalformedPatternException
Default constructor which gets the file name from either nutch-site.xml or nutch-default.xml and reads that configuration file. It stores the regex patterns and corresponding substitutions in a List. The file should be in the CLASSPATH.
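
The loading step described above can be sketched roughly as follows. This is only an illustration, not the class's actual code: the element names (regex, pattern, substitution), the class name RuleFileSketch, and the use of java.util.regex in place of the Jakarta ORO classes named in the throws clause are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: reads pattern/substitution pairs from an XML stream into a List.
public class RuleFileSketch {

    /** One rule: a compiled pattern and the text that replaces each match. */
    public static final class Rule {
        public final Pattern pattern;
        public final String substitution;

        public Rule(Pattern pattern, String substitution) {
            this.pattern = pattern;
            this.substitution = substitution;
        }
    }

    /** Parses every <regex> element (assumed layout) into a Rule, in document order. */
    public static List<Rule> readRules(InputStream in) {
        List<Rule> rules = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
            NodeList regexNodes = doc.getElementsByTagName("regex");
            for (int i = 0; i < regexNodes.getLength(); i++) {
                Element regex = (Element) regexNodes.item(i);
                String pattern = childText(regex, "pattern");
                String substitution = childText(regex, "substitution");
                rules.add(new Rule(Pattern.compile(pattern), substitution));
            }
        } catch (Exception e) {
            // Covers I/O errors, malformed XML, and malformed patterns alike.
            throw new IllegalArgumentException("Could not read rule file", e);
        }
        return rules;
    }

    /** Returns the text of the first child element with the given tag, or "" if absent. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        // Inline stand-in for a rule file; the root element name is arbitrary here.
        String xml = "<rules><regex><pattern>;jsessionid=[^?#]*</pattern>"
                   + "<substitution></substitution></regex></rules>";
        List<Rule> rules = readRules(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(rules.size() + " rule(s) loaded");
    }
}
```

Because the file is expected on the CLASSPATH, a caller of such a sketch would typically obtain the stream with ClassLoader.getResourceAsStream rather than opening a file path.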


RegexUrlNormalizer

public RegexUrlNormalizer(String filename)
                   throws IOException,
                          org.apache.oro.text.regex.MalformedPatternException
Constructor that takes the file name directly instead of looking it up in the configuration files.

Method Detail

regexNormalize

public String regexNormalize(String urlString)
Performs the replacements by iterating through all the regex patterns. It takes a URL string as input and returns the altered string.
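
The iteration described above might look roughly like the following sketch. It is not the class's actual implementation: it uses java.util.regex rather than Jakarta ORO, and the class name RegexNormalizeSketch, the Rule type, and the session-ID pattern are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: applies a list of pattern/substitution rules to a URL string.
public class RegexNormalizeSketch {

    /** One rule: a compiled pattern plus the text that replaces each match. */
    public static final class Rule {
        public final Pattern pattern;
        public final String substitution;

        public Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    /** Applies every rule in list order; each rule sees the previous rule's output. */
    public static String regexNormalize(List<Rule> rules, String urlString) {
        String result = urlString;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        // Assumed example rule: drop a ";jsessionid=..." path parameter.
        rules.add(new Rule(";jsessionid=[^?#]*", ""));
        System.out.println(
            regexNormalize(rules, "http://example.com/a.jsp;jsessionid=ABC123?x=1"));
        // prints "http://example.com/a.jsp?x=1"
    }
}
```

Since rules run in list order, a later rule can match text produced by an earlier one, so the order of entries in the rule file matters.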


normalize

public String normalize(String urlString)
                 throws MalformedURLException
Normalizes a URL by applying the basic normalization inherited from BasicUrlNormalizer and the regex substitutions of regexNormalize(). This is the method that the rest of Nutch calls.

Specified by:
normalize in interface UrlNormalizer
Overrides:
normalize in class BasicUrlNormalizer
Throws:
MalformedURLException

main

public static void main(String[] args)
                 throws org.apache.oro.text.regex.MalformedPatternException,
                        IOException
Prints the patterns and substitutions that are in the configuration file.

Throws:
org.apache.oro.text.regex.MalformedPatternException
IOException


Copyright © 2006 The Apache Software Foundation