org.apache.nutch.db
Interface IWebDBReader

All Known Implementing Classes:
DistributedWebDBReader, WebDBReader

public interface IWebDBReader

IWebDBReader is an interface to the consolidated page/link database. It permits all kinds of read-only operations. The database may be implemented in several different ways, which this interface hides from its users.

Author:
Mike Cafarella

Method Summary
 void close()
          Done reading.
 Link[] getLinks(MD5Hash md5)
          Return all the Link objects that originate from a document with the given MD5 checksum.
 Link[] getLinks(UTF8 url)
          Return any Link objects that point to the given URL.
 Page getPage(String url)
          Return a Page object with the given URL, if any.
 Page[] getPages(MD5Hash md5)
          Return any Pages with the given MD5 checksum.
 Enumeration links()
          Obtain an Enumeration of all Link objects, sorted by target URL.
 long numLinks()
          Simple count of all Link objects in db.
 long numPages()
          Simple count of all Page objects in db.
 boolean pageExists(MD5Hash md5)
          Returns whether a Page with the given MD5 checksum is in the db.
 Enumeration pages()
          Obtain an Enumeration of all Page objects, sorted by URL.
 Enumeration pagesByMD5()
          Obtain an Enumeration of all Page objects, sorted by MD5.
 

Method Detail

close

public void close()
           throws IOException
Done reading. Release a handle on the db.

Throws:
IOException

getPage

public Page getPage(String url)
             throws IOException
Return a Page object with the given URL, if any. Pages are guaranteed to be unique by URL, so at most one object is returned.

Throws:
IOException
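
A minimal sketch of a lookup-then-close pattern, assuming a reader is already open. The real Nutch types are not used here: `PageLookup`, `InMemoryPageLookup`, and `GetPageSketch` are hypothetical stand-ins, a `String` stands in for `Page`, and returning `null` for a missing URL is an assumption, since the text above only says "if any".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in that mirrors only two methods of IWebDBReader;
// a String stands in for the real Page type.
interface PageLookup {
    String getPage(String url) throws IOException;
    void close() throws IOException;
}

// Toy in-memory implementation, present only so the sketch can run.
class InMemoryPageLookup implements PageLookup {
    private final Map<String, String> pagesByUrl = new HashMap<>();
    boolean closed = false;

    InMemoryPageLookup put(String url, String page) {
        pagesByUrl.put(url, page);
        return this;
    }

    @Override
    public String getPage(String url) throws IOException {
        if (closed) {
            throw new IOException("reader already closed");
        }
        return pagesByUrl.get(url); // assumption: null when no page matches
    }

    @Override
    public void close() {
        closed = true;
    }
}

class GetPageSketch {
    // Looks up one URL and releases the handle even if the lookup fails.
    static String lookupAndClose(PageLookup reader, String url) {
        try {
            try {
                return reader.getPage(url);
            } finally {
                reader.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because pages are unique by URL, the caller handles a single result rather than a collection. The `finally` block is the part worth copying: it keeps the db handle from leaking when `getPage` throws.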

getPages

public Page[] getPages(MD5Hash md5)
                throws IOException
Return any Pages with the given MD5 checksum. Pages with different URLs often have identical checksums; this can happen if the content has been copied, or a site is available under several different URLs.

Throws:
IOException
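
To illustrate why one checksum can map to several pages, the sketch below groups URLs by the MD5 of their content. It uses the JDK's `MessageDigest` rather than Nutch's `MD5Hash`; `DuplicateContentSketch` and the sample URLs are hypothetical, and hashing the raw content string is an assumption about what the checksum covers.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DuplicateContentSketch {
    // Hex-encoded MD5 of the content bytes (a stand-in for Nutch's MD5Hash).
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Groups URLs by the checksum of their content, preserving insertion order.
    static Map<String, List<String>> groupByChecksum(Map<String, String> contentByUrl) {
        Map<String, List<String>> urlsByMd5 = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : contentByUrl.entrySet()) {
            urlsByMd5.computeIfAbsent(md5Hex(e.getValue()), k -> new ArrayList<>())
                     .add(e.getKey());
        }
        return urlsByMd5;
    }
}
```

Two URLs serving byte-identical content land in the same group, which is the situation in which `getPages` would return more than one Page.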

pageExists

public boolean pageExists(MD5Hash md5)
                   throws IOException
Returns whether a Page with the given MD5 checksum is in the db.

Throws:
IOException

pages

public Enumeration pages()
                  throws IOException
Obtain an Enumeration of all Page objects, sorted by URL.

Throws:
IOException
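
A short sketch of consuming an `Enumeration` such as the one `pages()` returns, reading only a bounded number of elements. `EnumerationWalkSketch` and `firstN` are hypothetical names; the elements are treated as plain `Object`s because the method returns an untyped `Enumeration`.

```java
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

class EnumerationWalkSketch {
    // Reads at most `limit` elements from an Enumeration and leaves the rest unread.
    static List<Object> firstN(Enumeration<?> e, int limit) {
        List<Object> out = new ArrayList<>();
        while (out.size() < limit && e.hasMoreElements()) {
            out.add(e.nextElement());
        }
        return out;
    }
}
```

With a real reader, each element would presumably be cast to `Page` before use. Bounding the read avoids holding every page in memory at once.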

pagesByMD5

public Enumeration pagesByMD5()
                       throws IOException
Obtain an Enumeration of all Page objects, sorted by MD5.

Throws:
IOException

numPages

public long numPages()
Simple count of all Page objects in db.


getLinks

public Link[] getLinks(UTF8 url)
                throws IOException
Return any Link objects that point to the given URL. This array can be very large if the given URL has many incoming Links, large enough that this method call will probably kill the process for certain URLs.

Throws:
IOException
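
When only a count of incoming links is needed, one possible alternative to materializing the array is to stream an Enumeration that is sorted by target URL, as `links()` is described to be. The sketch below is an illustration under assumptions, not the library's method: `InlinkCountSketch` and `SimpleLink` are hypothetical, and the sort order is assumed to agree with `String.compareTo`.

```java
import java.util.Enumeration;

class InlinkCountSketch {
    // Hypothetical stand-in for Link: only a source id and a target URL.
    static final class SimpleLink {
        final String source;
        final String toUrl;

        SimpleLink(String source, String toUrl) {
            this.source = source;
            this.toUrl = toUrl;
        }
    }

    // Counts links pointing at targetUrl using constant memory.
    // Assumes the input is sorted by target URL in String.compareTo order.
    static long countInlinks(Enumeration<SimpleLink> sortedByTarget, String targetUrl) {
        long count = 0;
        while (sortedByTarget.hasMoreElements()) {
            int cmp = sortedByTarget.nextElement().toUrl.compareTo(targetUrl);
            if (cmp == 0) {
                count++;
            } else if (cmp > 0) {
                break; // sorted input: no later element can match
            }
        }
        return count;
    }
}
```

The trade-off is a linear scan from the start of the enumeration, which is slower than a direct lookup but keeps memory use independent of how many links point at the URL.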

getLinks

public Link[] getLinks(MD5Hash md5)
                throws IOException
Return all the Link objects that originate from a document with the given MD5 checksum. These will be the outlinks for the page of content described.

Throws:
IOException

links

public Enumeration links()
                  throws IOException
Obtain an Enumeration of all Link objects, sorted by target URL.

Throws:
IOException

numLinks

public long numLinks()
Simple count of all Link objects in db.



Copyright © 2006 The Apache Software Foundation