14.2. Building Indexes

14.2.1. Creating a New Index

Index creation and updating are implemented in both the Zend_Search_Lucene module and Java Lucene. You can use either of them to build an index.

The PHP code listing below shows how to index a file using the Zend_Search_Lucene indexing API:

<?php

// Setting the second argument to TRUE creates a new index
$index = new Zend_Search_Lucene('/data/my-index', true);

$doc = new Zend_Search_Lucene_Document();

// Store the document URL so the document can be identified in search results.
$doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));

// Index document content
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docContent));

// Add document to the index.
$index->addDocument($doc);

// Write changes to the index.
$index->commit();
?>

Newly added documents can be retrieved from the index only after a commit operation.

Zend_Search_Lucene::commit() is automatically called at the end of script execution and before any search request.
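As a minimal sketch of retrieval after a commit, the listing below opens the index and reads back the stored 'url' field of each hit. The query string 'contents:lucene' is a hypothetical example, and the sketch assumes that Zend Framework is on the include path and that documents were indexed with the 'url' and 'contents' fields shown above:

<?php
// Assumption: Zend Framework is available on the PHP include_path.
require_once 'Zend/Search/Lucene.php';

// Open the existing index (no second argument).
$index = new Zend_Search_Lucene('/data/my-index');

// Hypothetical query: documents whose 'contents' field contains "lucene".
$hits = $index->find('contents:lucene');

foreach ($hits as $hit) {
    // 'url' was added as a stored field, so it can be read back from the hit.
    echo $hit->url, "\n";
}
?>

The 'contents' field was added with UnStored(), so it is searchable but its text cannot be read back from a hit.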

Each commit() call generates a new index segment. [6] It should therefore be called as rarely as possible. On the other hand, committing a large number of documents in one step needs more memory.

Automatic segment management is planned as a future Zend_Search_Lucene enhancement.
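One way to balance the number of segments against memory use is to commit in batches. The listing below is only a sketch of that idea: the $documents array and the $batchSize value are hypothetical, and a suitable batch size depends on document size and available memory.

<?php
// Assumption: Zend Framework is available on the PHP include_path.
require_once 'Zend/Search/Lucene.php';

// Hypothetical input data; in practice this would come from files or a database.
$documents = array(
    array('url' => 'http://example.com/a', 'contents' => 'Text of the first document'),
    array('url' => 'http://example.com/b', 'contents' => 'Text of the second document'),
);

// Hypothetical batch size: one segment is written per $batchSize documents.
$batchSize = 100;

$index   = new Zend_Search_Lucene('/data/my-index', true);
$pending = 0;

foreach ($documents as $data) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $data['url']));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $data['contents']));
    $index->addDocument($doc);

    // Commit once per batch rather than once per document.
    if (++$pending >= $batchSize) {
        $index->commit();
        $pending = 0;
    }
}

// Commit whatever is left in the last, partial batch.
if ($pending > 0) {
    $index->commit();
}
?>

A larger $batchSize produces fewer, larger segments at the cost of more memory per commit; a smaller one does the opposite.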

14.2.2. Updating Index

The same procedure is used to update an existing index. The only difference is that the index should be opened without the second parameter:

<?php

// Open existing index
$index = new Zend_Search_Lucene('/data/my-index');

$doc = new Zend_Search_Lucene_Document();
// Store the document URL so the document can be identified in search results.
$doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));
// Index document content
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docContent));

// Add document to the index.
$index->addDocument($doc);

// Write changes to the index.
$index->commit();
?>

Each commit() call (explicit or implicit) generates a new index segment.

Zend_Search_Lucene doesn't manage segments automatically, so you need to manage segment size yourself. On the one hand, a large segment is more efficient; on the other hand, it needs more memory while it is being created.

Lucene Java and Luke (Lucene Index Toolbox - http://www.getopt.org/luke/) can be used to optimize an index created with this version of Zend_Search_Lucene.

14.2.3. Updating Documents

The Lucene index file format doesn't support updating a document in place. To update a document, remove it from the index and add it again.

The Zend_Search_Lucene::delete() method takes an internal index document id, which can be retrieved from a query hit through its 'id' property:

<?php
$removePath = ...;
$hits = $index->find('path:' . $removePath);
foreach ($hits as $hit) {
    $index->delete($hit->id);
}
$index->commit();
?>
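The complete remove-and-re-add cycle can be sketched as follows. This is an illustration rather than a definitive implementation: the 'path' field, the $docPath value, and the $newContent value are hypothetical, and the sketch assumes that documents were originally indexed with a 'path' field. $docPath is deliberately a single lowercase word; whether a value containing punctuation or digits matches as one term depends on the analyzer and query parser in use.

<?php
// Assumption: Zend Framework is available on the PHP include_path.
require_once 'Zend/Search/Lucene.php';

// Hypothetical values identifying the document and its revised text.
$docPath    = 'intro';
$newContent = 'Revised text of the document';

$index = new Zend_Search_Lucene('/data/my-index');

// Step 1: delete every existing copy of the document.
$hits = $index->find('path:' . $docPath);
foreach ($hits as $hit) {
    $index->delete($hit->id);
}

// Step 2: add the new version of the document.
$doc = new Zend_Search_Lucene_Document();
// Keyword() stores and indexes the value as a single, untokenized term.
$doc->addField(Zend_Search_Lucene_Field::Keyword('path', $docPath));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $newContent));
$index->addDocument($doc);

// Step 3: write both the deletion and the addition to the index.
$index->commit();
?>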


[6] Lucene index segment files cannot be updated in place; changing a segment requires rebuilding it completely. See the Lucene index file formats documentation for details (http://lucene.apache.org/java/docs/fileformats.html). A growing number of segments degrades index quality, and index optimization restores it. Optimization consists of merging several segments into one. It does not update existing segments either: it generates a new, larger segment, writes a new segment list (the 'segments.new' file) that references the new optimized segment in place of the old ones, and then renames 'segments.new' to 'segments'.

Optimization is an iterative process. Very small segments (for example, those generated by adding a single document) are merged into larger ones, which are then merged again, and so on. Optimization can process segments as streams and does not need much memory. As a result, it consumes few resources and does not lock the index for searching, updating, or merging other segments.