org.apache.nutch.indexer
Class DeleteDuplicates

java.lang.Object
  extended by org.apache.nutch.indexer.DeleteDuplicates

public class DeleteDuplicates
extends Object

Deletes duplicate documents in a set of Lucene indexes. Two documents are duplicates if they have either the same content (as determined by MD5 hash) or the same URL.
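
To illustrate what "same contents (via MD5 hash)" can mean in practice, here is a minimal sketch that fingerprints text with the JDK's MessageDigest. The class ContentHashSketch and its methods md5Hex and sameContent are hypothetical names chosen for this example; they are not part of Nutch, and the way Nutch itself computes and stores the hash may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: an MD5 digest used as a content fingerprint.
class ContentHashSketch {

    /** Returns the MD5 digest of the text as a 32-character lowercase hex string. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so negative bytes format as two hex digits
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Assumption: the running JDK provides MD5; treated as unexpected otherwise
            throw new IllegalStateException(e);
        }
    }

    /** Two texts count as duplicates in this sketch when their digests match. */
    static boolean sameContent(String a, String b) {
        return md5Hex(a).equals(md5Hex(b));
    }

    public static void main(String[] args) {
        System.out.println(sameContent("hello world", "hello world"));
        System.out.println(sameContent("hello world", "hello, world"));
    }
}
```

Comparing fixed-size digests instead of full page text keeps the comparison key small, which matters when the keys have to be sorted or grouped across several indexes.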

Author:
Doug Cutting, Mike Cafarella

Nested Class Summary
static class DeleteDuplicates.IndexedDoc
          The key used in sorting for duplicates.
 
Constructor Summary
DeleteDuplicates(IndexReader[] readers, File workingDir)
          Constructs a duplicate detector for the provided indexes.
 
Method Summary
 void close()
          Closes the indexes, saving changes.
 void deleteContentDuplicates()
          Deletes pages with duplicate content hashes.
 void deleteUrlDuplicates()
          Deletes pages with duplicate URLs.
static void main(String[] args)
          Deletes duplicates in the indexes in the named directory.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DeleteDuplicates

public DeleteDuplicates(IndexReader[] readers,
                        File workingDir)
                 throws IOException
Constructs a duplicate detector for the provided indexes.

Throws:
IOException

Method Detail

close

public void close()
           throws IOException
Closes the indexes, saving changes.

Throws:
IOException

deleteContentDuplicates

public void deleteContentDuplicates()
                             throws IOException
Deletes pages with duplicate content hashes. Among pages that share a content hash, only the page with the highest score is kept.

Throws:
IOException
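
The "keep the highest score" rule can be sketched without Lucene. The example below is an illustration only: the class ContentDedupSketch, the Doc type, and its fields are hypothetical, and it groups documents with an in-memory map, which is not necessarily how this class does it. Keeping the first-seen document on a score tie is also an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the Nutch implementation.
class ContentDedupSketch {

    /** Hypothetical stand-in for an indexed page: id, content hash, score. */
    static final class Doc {
        final int id;
        final String hash;
        final float score;

        Doc(int id, String hash, float score) {
            this.id = id;
            this.hash = hash;
            this.score = score;
        }
    }

    /** Returns the ids to delete: every document except the top scorer per hash. */
    static List<Integer> idsToDelete(List<Doc> docs) {
        Map<String, Doc> best = new HashMap<>();
        List<Integer> deleted = new ArrayList<>();
        for (Doc d : docs) {
            Doc kept = best.get(d.hash);
            if (kept == null) {
                best.put(d.hash, d);          // first document seen with this hash
            } else if (d.score > kept.score) {
                deleted.add(kept.id);         // the newcomer outranks the current keeper
                best.put(d.hash, d);
            } else {
                deleted.add(d.id);            // ties keep the earlier document (assumption)
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc(1, "a", 0.5f), new Doc(2, "a", 0.9f),
                new Doc(3, "b", 0.1f), new Doc(4, "a", 0.7f));
        System.out.println(idsToDelete(docs));
    }
}
```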

deleteUrlDuplicates

public void deleteUrlDuplicates()
                         throws IOException
Deletes pages with duplicate URLs. Among pages that share a URL, only the most recently fetched page is kept.

Throws:
IOException
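
The "keep the most recently fetched page" rule lends itself to a sort-based pass: order the pages by URL, newest fetch first, then delete everything after the first entry of each run. The sketch below shows that idea under stated assumptions. The class UrlDedupSketch, the Page type, and the fetchTime field (a timestamp in arbitrary units) are hypothetical and do not reflect this class's internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not the Nutch implementation.
class UrlDedupSketch {

    /** Hypothetical stand-in for an indexed page: id, URL, fetch timestamp. */
    static final class Page {
        final int id;
        final String url;
        final long fetchTime;

        Page(int id, String url, long fetchTime) {
            this.id = id;
            this.url = url;
            this.fetchTime = fetchTime;
        }
    }

    /** Returns the ids to delete: every page except the newest fetch per URL. */
    static List<Integer> idsToDelete(List<Page> pages) {
        List<Page> sorted = new ArrayList<>(pages);
        // Sort by URL, then by fetch time descending, so the keeper leads each run
        sorted.sort(Comparator.comparing((Page p) -> p.url)
                .thenComparing(Comparator.comparingLong((Page p) -> p.fetchTime).reversed()));

        List<Integer> deleted = new ArrayList<>();
        String previousUrl = null;
        for (Page p : sorted) {
            if (p.url.equals(previousUrl)) {
                deleted.add(p.id);      // an older fetch of a URL already seen
            } else {
                previousUrl = p.url;    // first, and therefore newest, entry for this URL
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
                new Page(1, "http://a/", 100L), new Page(2, "http://b/", 50L),
                new Page(3, "http://a/", 300L), new Page(4, "http://a/", 200L));
        System.out.println(idsToDelete(pages));
    }
}
```

Sorting first means each group of duplicates is contiguous, so a single linear scan with one remembered key is enough to decide what to delete.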

main

public static void main(String[] args)
                 throws Exception
Deletes duplicates in the indexes in the named directory.

Throws:
Exception


Copyright © 2006 The Apache Software Foundation