org.apache.nutch.db
Class DistributedWebDBReader

java.lang.Object
  extended by org.apache.nutch.db.DistributedWebDBReader
All Implemented Interfaces:
IWebDBReader

public class DistributedWebDBReader
extends Object
implements IWebDBReader

This reader implements all the read-only operations on our web database. The write operations can be found in WebDBWriter.

Author:
Mike Cafarella

Constructor Summary
DistributedWebDBReader(NutchFileSystem nfs, File root)
          Open a web db reader for the named directory.
 
Method Summary
 void close()
          Shut down the reader.
 Link[] getLinks(MD5Hash md5)
          Get all the links from the content with the given MD5 hash.
 Link[] getLinks(UTF8 url)
          Get all the hyperlinks that link TO the indicated URL.
 Page getPage(String url)
          Get Page from the pagedb with the given URL.
 Page[] getPages(MD5Hash md5)
          Get all the Pages according to their content hash.
 Enumeration links()
          Return all the links, by target URL.
static void main(String[] argv)
          DistributedWebDBReader.main() provides some handy utilities for looking through the contents of the webdb.
 long numLinks()
          Return the number of links in our db.
 int numMachines()
          How many sections (machines) there are in this distributed db.
 long numPages()
          Return the number of pages in our db.
 boolean pageExists(MD5Hash md5)
          Test whether a certain piece of content is in the database, without returning the Page(s) themselves.
 Enumeration pages()
          Iterate through all the Pages, sorted by URL.
 Enumeration pagesByMD5()
          Iterate through all the Pages, sorted by MD5.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DistributedWebDBReader

public DistributedWebDBReader(NutchFileSystem nfs,
                              File root)
                       throws IOException,
                              FileNotFoundException
Open a web db reader for the named directory.

Method Detail

close

public void close()
           throws IOException
Shut down the reader.

Specified by:
close in interface IWebDBReader
Throws:
IOException

numMachines

public int numMachines()
How many sections (machines) there are in this distributed db.


numPages

public long numPages()
Return the number of pages in our db.

Specified by:
numPages in interface IWebDBReader

numLinks

public long numLinks()
Return the number of links in our db.

Specified by:
numLinks in interface IWebDBReader

getPage

public Page getPage(String url)
             throws IOException
Get Page from the pagedb with the given URL.

Specified by:
getPage in interface IWebDBReader
Throws:
IOException

getPages

public Page[] getPages(MD5Hash md5)
                throws IOException
Get all the Pages according to their content hash. Because the items in the pagesByMD5 DBSectionReader array are sorted by ascending blocks of the content hash, the results come back in sorted order.

Specified by:
getPages in interface IWebDBReader
Throws:
IOException

pageExists

public boolean pageExists(MD5Hash md5)
                   throws IOException
Test whether a certain piece of content is in the database, without returning the Page(s) themselves. We test each DBSectionReader in pagesByMD5 in turn until we find a positive or reach the end.

Specified by:
pageExists in interface IWebDBReader
Throws:
IOException
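
The search described for pageExists can be sketched in isolation. The block below is only an illustration, not the Nutch source: the Section class, its lookups counter, and the use of plain strings in place of MD5Hash are all hypothetical stand-ins for the per-section DBSectionReader objects.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PageExistsSketch {
    // Hypothetical stand-in for one section's MD5-keyed reader.
    static class Section {
        private final Set<String> hashes;
        int lookups = 0; // counts how often this section was probed

        Section(String... hashes) {
            this.hashes = new HashSet<>(Arrays.asList(hashes));
        }

        boolean pageExists(String md5) {
            lookups++;
            return hashes.contains(md5);
        }
    }

    // Probe each section in order; stop at the first positive answer.
    static boolean pageExists(List<Section> sections, String md5) {
        for (Section section : sections) {
            if (section.pageExists(md5)) {
                return true;
            }
        }
        // Every section was probed and none held the hash.
        return false;
    }
}
```

A hit returns as soon as one section reports the hash, so later sections are never probed. Only a miss costs one lookup per section.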

pages

public Enumeration pages()
                  throws IOException
Iterate through all the Pages, sorted by URL. We need to enumerate all the Enumerations given to us via a call to pages() for each DBSectionReader.

Specified by:
pages in interface IWebDBReader
Throws:
IOException
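
Enumerating "all the Enumerations" amounts to chaining one Enumeration per section into a single one. The following is a minimal sketch of that idea under stated assumptions, not the class's actual code: ChainedEnumerationSketch, drain, and demo are hypothetical names, and strings stand in for Page objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper: presents several per-section Enumerations as one.
class ChainedEnumerationSketch<T> implements Enumeration<T> {
    private final Iterator<Enumeration<T>> sections;
    private Enumeration<T> current = Collections.emptyEnumeration();

    ChainedEnumerationSketch(List<Enumeration<T>> perSection) {
        this.sections = perSection.iterator();
    }

    @Override
    public boolean hasMoreElements() {
        // Advance past exhausted or empty sections.
        while (!current.hasMoreElements() && sections.hasNext()) {
            current = sections.next();
        }
        return current.hasMoreElements();
    }

    @Override
    public T nextElement() {
        if (!hasMoreElements()) {
            throw new NoSuchElementException();
        }
        return current.nextElement();
    }

    // Demo helper: drain an Enumeration into a List.
    static <T> List<T> drain(Enumeration<T> e) {
        List<T> out = new ArrayList<>();
        while (e.hasMoreElements()) {
            out.add(e.nextElement());
        }
        return out;
    }

    // Three made-up sections, the middle one empty.
    static List<String> demo() {
        List<Enumeration<String>> perSection = new ArrayList<>();
        perSection.add(Collections.enumeration(
                Arrays.asList("http://a.example/", "http://b.example/")));
        perSection.add(Collections.enumeration(Collections.<String>emptyList()));
        perSection.add(Collections.enumeration(
                Arrays.asList("http://c.example/")));
        Enumeration<String> all = new ChainedEnumerationSketch<>(perSection);
        return drain(all);
    }
}
```

Chaining preserves the order within each section and the order of the sections themselves. The combined sequence is globally sorted only if the sections hold ascending, non-overlapping key ranges, which this sketch assumes rather than checks.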

pagesByMD5

public Enumeration pagesByMD5()
                       throws IOException
Iterate through all the Pages, sorted by MD5. We enumerate all the DBSectionReader Enumerations, just as pages() does.

Specified by:
pagesByMD5 in interface IWebDBReader
Throws:
IOException

getLinks

public Link[] getLinks(UTF8 url)
                throws IOException
Get all the hyperlinks that link TO the indicated URL.

Specified by:
getLinks in interface IWebDBReader
Throws:
IOException

getLinks

public Link[] getLinks(MD5Hash md5)
                throws IOException
Get all the links from the content with the given MD5 hash.

Specified by:
getLinks in interface IWebDBReader
Throws:
IOException

links

public Enumeration links()
                  throws IOException
Return all the links, by target URL.

Specified by:
links in interface IWebDBReader
Throws:
IOException

main

public static void main(String[] argv)
                 throws FileNotFoundException,
                        IOException
DistributedWebDBReader.main() provides some handy utilities for looking through the contents of the webdb. Note that this only works for a completely-NFS deployment.

Throws:
FileNotFoundException
IOException


Copyright © 2006 The Apache Software Foundation