12.3. Searching an Index

12.3.1. Building Queries

There are two ways to search the index. The first uses the query parser to construct a query from a string. The second lets you build your own queries through the Zend_Search_Lucene API.

Before choosing to use the provided Query Parser, please consider the following:

  1. If you are programmatically generating a query string and then parsing it with the query parser, you should seriously consider building your queries directly with the query API. In other words, the query parser is designed for human-entered text, not for program-generated text.
  2. Untokenized fields are best added directly to queries, and not through the query parser. If a field's values are generated programmatically by the application, then so should query clauses for this field. An analyzer, which the query parser uses, is designed to convert human-entered text to terms. Program-generated values, like dates, keywords, etc., should be consistently program-generated.
  3. In a query form, fields that hold general text should use the query parser. All others, such as date ranges, keywords, etc., are better added directly through the query API. A field with a limited set of values, which can be specified with a pull-down menu, should not be added to a query string that is subsequently parsed, but rather added as a TermQuery clause.
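
As an illustration of point 3, a value chosen from a pull-down menu can be turned into a term query without passing through the query parser. The following is a minimal sketch rather than a definitive implementation: the field name 'category', the value 'books' and the helper buildCategoryQuery() are hypothetical, and it assumes that your Zend Framework release provides the Zend_Search_Lucene_Index_Term and Zend_Search_Lucene_Search_Query_Term classes at the paths shown.

```php
<?php

require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Index/Term.php';
require_once 'Zend/Search/Lucene/Search/Query/Term.php';

/**
 * Hypothetical helper: build a single-term query for a program-generated
 * value. The field name 'category' is an assumption made for this example.
 */
function buildCategoryQuery($category)
{
    // The value is used verbatim as a term; no analyzer is applied to it.
    $term = new Zend_Search_Lucene_Index_Term($category, 'category');

    return new Zend_Search_Lucene_Search_Query_Term($term);
}

// 'books' stands in for a value selected from a pull-down menu.
$query = buildCategoryQuery('books');
```

The resulting $query object can then be passed to find() in place of a query string.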

Both ways use the same API method to search through the index:

<?php

require_once('Zend/Search/Lucene.php');

$index = new Zend_Search_Lucene('/data/my_index');

$index->find($query);

?>

The Zend_Search_Lucene::find() method determines the input type automatically. If it is given a string, it uses the query parser to construct the appropriate Zend_Search_Lucene_Search_Query object from it.

Note that find() is case sensitive. By default, LuceneIndexCreation.jar normalizes all documents to lowercase. This can be turned off with a command-line switch (run LuceneIndexCreation.jar with no arguments for help). The case of the text passed to find() must match that of the index: if the index is normalized to lowercase, all text passed to find() must first go through strtolower(), or it may not match.
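
When the index is normalized to lowercase, the lowercasing step can be kept in one place and applied before every search. The following is a minimal sketch under that assumption. The helper name normalizeForIndex() is hypothetical, and because it lowercases the whole string with strtolower(), it suits plain search terms rather than strings in which letter case carries meaning.

```php
<?php

/**
 * Hypothetical helper: prepare user-entered text for an index whose
 * documents were normalized to lowercase at indexing time.
 */
function normalizeForIndex($text)
{
    // Strip surrounding whitespace, then lowercase to match the index.
    return strtolower(trim($text));
}

$userInput = '  Zend Framework ';
$queryText = normalizeForIndex($userInput);

echo $queryText, "\n";
```

The normalized $queryText is what would be passed to find(). If the index was built with lowercasing turned off, this step should be skipped.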

12.3.2. Search Results

The search result is an array of Zend_Search_Lucene_Search_QueryHit objects. Each of these has two properties: $hit->id is the document number within the index and $hit->score is the score of the hit in the search result. Results are ordered by score, with the highest scores first.

The Zend_Search_Lucene_Search_QueryHit object also exposes each field of the Zend_Search_Lucene_Document found by the hit as a property of the hit. In this example, a hit is returned and the corresponding document has two fields: title and author.

<?php

require_once('Zend/Search/Lucene.php');

$index = new Zend_Search_Lucene('/data/my_index');

$hits = $index->find($query);

foreach ($hits as $hit) {
    echo $hit->id;
    echo $hit->score;

    echo $hit->title;
    echo $hit->author;
}

?>

Optionally, the original Zend_Search_Lucene_Document object can be retrieved from the Zend_Search_Lucene_Search_QueryHit. Call the hit's getDocument() method to obtain the document, then read the stored parts of the document with its getFieldValue() method:

<?php

require_once('Zend/Search/Lucene.php');

$index = new Zend_Search_Lucene('/data/my_index');

$hits = $index->find($query);
foreach ($hits as $hit) {
    // get the Zend_Search_Lucene_Document object for this hit
    // (an object, so it is assigned rather than echoed)
    $document = $hit->getDocument();

    // get a Zend_Search_Lucene_Field object
    // from the Zend_Search_Lucene_Document
    $field = $document->getField('title');

    // return the string value of the Zend_Search_Lucene_Field object
    echo $document->getFieldValue('title');

    // same as getFieldValue()
    echo $document->title;
}

?>

The fields available from the Zend_Search_Lucene_Document object are determined at the time of indexing. The document fields are either indexed, or indexed and stored, in the document by the indexing application (e.g. LuceneIndexCreation.jar).

Note that the document identity ('path' in our example) is also stored in the index and must be retrieved from it.

12.3.3. Results Scoring

Zend_Search_Lucene uses the same scoring algorithms as Java Lucene. Search results are ordered by score, and hits with a greater score come first.

A difference in score means that one document matches the query better than another.

Roughly speaking, hits that contain the searched term or phrase more frequently have a greater score.

The score can be retrieved through the score property of the hit:

<?php
$hits = $index->find($query);

foreach ($hits as $hit) {
    echo $hit->id;
    echo $hit->score;
}

?>
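
Since a score mainly indicates how well one hit matches compared with another, an application may choose to display scores relative to the best hit. The following is a minimal sketch of that idea and not part of Zend_Search_Lucene: the helper relativeScores() is hypothetical, and plain objects with id and score properties stand in for real Zend_Search_Lucene_Search_QueryHit objects so that the example runs without an index.

```php
<?php

/**
 * Hypothetical helper: express each hit's score as a fraction of the
 * top score. Expects hits ordered by score, highest first.
 */
function relativeScores(array $hits)
{
    if (count($hits) === 0) {
        return array();
    }

    // The first hit carries the highest score.
    $first = reset($hits);
    $top   = $first->score;

    $result = array();
    foreach ($hits as $hit) {
        // Guard against division by zero if the top score is not positive.
        $result[$hit->id] = ($top > 0) ? $hit->score / $top : 0.0;
    }

    return $result;
}

// Stand-in objects; real hits would come from $index->find($query).
$hits = array(
    (object) array('id' => 7, 'score' => 2.0),
    (object) array('id' => 3, 'score' => 0.5),
);

foreach (relativeScores($hits) as $id => $fraction) {
    printf("document %d: %.0f%%\n", $id, $fraction * 100);
}
```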

The Zend_Search_Lucene_Search_Similarity class is used to calculate the score. See the Scoring Algorithms section under Extensibility for details.