org.apache.nutch.db
Class WebDBInjector

java.lang.Object
  extended by org.apache.nutch.db.WebDBInjector

public class WebDBInjector
extends Object

This class takes a flat file of URLs, or a structured DMOZ file, and adds the URLs as entries to a pagedb. Useful for bootstrapping the system.

Author:
Mike Cafarella, Doug Cutting

Field Summary
static Logger LOG
           
 
Constructor Summary
WebDBInjector(IWebDBWriter dbWriter)
          WebDBInjector takes a reference to the IWebDBWriter that it should add pages to.
 
Method Summary
 boolean addPage(String url)
          Add one page to WebDB.
 void close()
          Close dbWriter and save changes.
 void injectDmozFile(File dmozFile, int subsetDenom, boolean includeAdult, boolean includeDmozDesc, int skew, Pattern topicPattern)
          Iterate through all the items in this structured DMOZ file.
 void injectURLFile(File urlList)
          Iterate through all the items in this flat text file and add them to the db.
static void main(String[] argv)
          Command-line access.
 void printStatus()
          Utility to present performance stats.
 void printStatusBar(int small, int big)
          Utility to present a small status bar.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final Logger LOG
Constructor Detail

WebDBInjector

public WebDBInjector(IWebDBWriter dbWriter)
WebDBInjector takes a reference to the IWebDBWriter that it should add pages to.

Method Detail

close

public void close()
           throws IOException
Close dbWriter and save changes.

Throws:
IOException

printStatusBar

public void printStatusBar(int small,
                           int big)
Utility to present a small status bar.


printStatus

public void printStatus()
Utility to present performance stats.


injectURLFile

public void injectURLFile(File urlList)
                   throws IOException
Iterate through all the items in this flat text file and add them to the db.

Throws:
IOException
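
The loop this method describes can be sketched as follows. This is a minimal illustration, not the Nutch implementation: it assumes the flat file holds one URL per line and that blank lines are skipped, and the PageSink interface and UrlFileInjectSketch class are hypothetical stand-ins for the injector's addPage(String) call.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

/** Hypothetical stand-in for the injector's addPage(String); not part of Nutch. */
interface PageSink {
    boolean addPage(String url) throws IOException;
}

class UrlFileInjectSketch {
    /**
     * Reads the source line by line and offers each non-blank line to the sink.
     * Returns the number of URLs the sink accepted.
     */
    static int injectUrls(Reader source, PageSink sink) throws IOException {
        int added = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String url = line.trim();
                if (url.isEmpty()) {
                    continue; // assumption: blank lines carry no URL
                }
                if (sink.addPage(url)) {
                    added++;
                }
            }
        }
        return added;
    }
}
```

Passing a java.io.FileReader for the URL list and a sink that forwards to a real injector would tie the sketch back to this method; changes would still only be saved once close() is called.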

injectDmozFile

public void injectDmozFile(File dmozFile,
                           int subsetDenom,
                           boolean includeAdult,
                           boolean includeDmozDesc,
                           int skew,
                           Pattern topicPattern)
                    throws IOException,
                           SAXException,
                           ParserConfigurationException
Iterate through all the items in this structured DMOZ file. Add each URL to the web db.

Throws:
IOException
SAXException
ParserConfigurationException
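
This page does not say how subsetDenom and skew are applied. One plausible reading, sketched below purely as an assumption, is that roughly one URL in every subsetDenom is kept by hashing the URL, and that skew shifts which URLs land in the kept subset. The DmozSubsetSketch class and the use of String.hashCode() are illustrative choices, not the Nutch code.

```java
class DmozSubsetSketch {
    /**
     * Decides whether a URL falls in the sampled subset.
     * Assumption: a URL is kept when (hash XOR skew) is divisible by subsetDenom,
     * so subsetDenom = 1 keeps everything. String.hashCode() is a stand-in hash.
     */
    static boolean inSubset(String url, int subsetDenom, int skew) {
        if (subsetDenom < 1) {
            throw new IllegalArgumentException("subsetDenom must be >= 1");
        }
        int hash = url.hashCode() ^ skew;
        // floorMod avoids negative remainders for negative hash values.
        return Math.floorMod(hash, subsetDenom) == 0;
    }
}
```

Because the decision depends only on the URL and the two integers, the same inputs always select the same subset, which makes repeated runs over the same DMOZ file reproducible under this reading.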

addPage

public boolean addPage(String url)
                throws IOException
Add one page to WebDB. Changes are not saved to the db until close() is invoked. URLs are checked with the URLFilter, and only those that pass are added.

Parameters:
url - URL to be added
Returns:
true on success, false otherwise (e.g. filtered out or bad URL syntax).
Throws:
IOException
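
The contract described above (filter first, buffer the page, save only on close()) can be illustrated with a small self-contained sketch. Everything in it is a stand-in: AddPageSketch is hypothetical, a java.util.function.Predicate plays the role of the URLFilter, and in-memory lists play the role of the db. The syntax check via java.net.URI is likewise an assumption about what "bad URL syntax" covers.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class AddPageSketch {
    private final Predicate<String> filter;                  // stand-in for the URL filter
    private final List<String> pending = new ArrayList<>();  // accepted but not yet saved
    private final List<String> saved = new ArrayList<>();    // stand-in for the db

    AddPageSketch(Predicate<String> filter) {
        this.filter = filter;
    }

    /** Returns true if the URL was accepted, false if it was malformed or filtered out. */
    boolean addPage(String url) {
        try {
            URI uri = new URI(url);
            if (uri.getScheme() == null || uri.getHost() == null) {
                return false; // assumption: only absolute URLs with a host are valid
            }
        } catch (URISyntaxException e) {
            return false; // bad URL syntax
        }
        if (!filter.test(url)) {
            return false; // rejected by the filter
        }
        pending.add(url);
        return true;
    }

    /** Moves all pending pages into the saved list, mirroring "save on close". */
    void close() {
        saved.addAll(pending);
        pending.clear();
    }

    List<String> saved() {
        return saved;
    }
}
```

A caller that ignores the boolean result cannot tell accepted URLs from rejected ones, so counting the true returns is a simple way to report how many pages were actually queued.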

main

public static void main(String[] argv)
                 throws Exception
Command-line access. Users may add URLs from a flat text file or from a structured DMOZ file. By default, adult material (as categorized by DMOZ) is ignored.

Throws:
Exception


Copyright © 2006 The Apache Software Foundation