Chapter 14. Zend_Search

Table of Contents

14.1. Overview
14.1.1. Introduction
14.1.2. Document and Field Objects
14.1.3. Understanding Field Types
14.2. Building Indexes
14.2.1. Creating a New Index
14.2.2. Updating Index
14.3. Searching an Index
14.3.1. Building Queries
14.3.2. Search Results
14.3.3. Results Scoring
14.4. Query Types
14.4.1. Term Query
14.4.2. Multi-Term Query
14.4.3. Phrase Query
14.5. Character set.
14.5.1. UTF-8 and single-byte character sets support.
14.6. Extensibility
14.6.1. Text Analysis
14.6.2. Scoring Algorithms
14.6.3. Storage Containers
14.7. Interoperating with Java Lucene
14.7.1. File Formats
14.7.2. Index Directory
14.7.3. Java Source Code
14.7.4. Using LuceneIndexCreation.jar

14.1. Overview

14.1.1. Introduction

Zend_Search_Lucene is a general-purpose text search engine written entirely in PHP 5. Because it stores its index on the filesystem and does not require a database server, it can add search capabilities to almost any PHP-driven website. Zend_Search_Lucene supports the following features:

  • Ranked searching - best results returned first

  • Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more [5]

  • Search by specific field (e.g., title, author, contents)

Zend_Search_Lucene was derived from the Apache Lucene project. For more information on Lucene, visit http://lucene.apache.org/java/docs/.

14.1.2. Document and Field Objects

Zend_Search_Lucene treats documents as the atomic units of indexing. A document is divided into named fields, and fields have content that can be searched.

A document is represented by the Zend_Search_Lucene_Document object, and this object contains Zend_Search_Lucene_Field objects that represent the fields.

It is important to note that any kind of information can be added to the index. Application-specific information or metadata can be stored in the document fields, and later retrieved with the document during search.

It is the responsibility of your application to control the indexer. This means that data can be indexed from any source that is accessible by your application. For example, this could be the filesystem, a database, an HTML form, etc.

The Zend_Search_Lucene_Field class provides several static methods to create fields with different characteristics:

<?php
$doc = new Zend_Search_Lucene_Document();

// Field is not tokenized, but is indexed and stored within the index.
// Stored fields can be retrieved from the index.
$doc->addField(Zend_Search_Lucene_Field::Keyword('doctype', 
                                                 'autogenerated'));

// Field is neither tokenized nor indexed, but is stored in the index.
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', 
                                                   time()));

// Binary string field that is neither tokenized nor indexed,
// but is stored in the index.
// $iconData is assumed to already hold the raw bytes as a string.
$doc->addField(Zend_Search_Lucene_Field::Binary('icon', 
                                                $iconData));

// Field is tokenized and indexed, and is stored in the index.
$doc->addField(Zend_Search_Lucene_Field::Text('annotation', 
                                              'Document annotation text'));

// Field is tokenized and indexed, but is not stored in the index.
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', 
                                                  'My document content'));

?>

Field names are yours to choose. Searches use the "contents" field by default, so it is a good idea to put the main document data in a field with that name.
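Because the application controls the indexer, a record from any source can be mapped onto fields by hand. The following is only a minimal sketch under stated assumptions: the helper buildDocumentFromRow() and the row keys "row_id", "title" and "body" are hypothetical, and the library is assumed to be on the include path and loadable with require_once 'Zend/Search/Lucene.php'. Only the field factory methods from the listing above are used.

```php
<?php
// Assumption: Zend Framework is on the include path.
require_once 'Zend/Search/Lucene.php';

/**
 * Hypothetical helper: maps one database row onto a document.
 * The keys 'row_id', 'title' and 'body' are illustrative only;
 * they are not defined by the library.
 */
function buildDocumentFromRow(array $row)
{
    $doc = new Zend_Search_Lucene_Document();

    // Primary key: stored so the row can be looked up later,
    // but neither tokenized nor indexed.
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('row_id',
                                                       $row['row_id']));

    // Title: tokenized, indexed and stored.
    $doc->addField(Zend_Search_Lucene_Field::Text('title',
                                                  $row['title']));

    // Main text goes into "contents", the field searched by default.
    // It is tokenized and indexed, but not stored in the index.
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                      $row['body']));

    return $doc;
}

// Hypothetical row, standing in for the result of a database query.
$doc = buildDocumentFromRow(array(
    'row_id' => 42,
    'title'  => 'Indexing with PHP',
    'body'   => 'Full article text goes here.',
));

?>
```

The same mapping would work for a file or a submitted form; only the code that produces the array changes.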

14.1.3. Understanding Field Types

  • Keyword fields are stored and indexed, meaning they can be searched as well as displayed in search results. They are not split into separate words by tokenization. Enumerated database fields usually translate well to Keyword fields in Zend_Search_Lucene.

  • UnIndexed fields are not searchable, but they are returned with search hits. Database timestamps, primary keys, file system paths, and other external identifiers are good candidates for UnIndexed fields.

  • Binary fields are not tokenized or indexed, but are stored for retrieval with search hits. They can be used to store any data encoded as a binary string, such as an image icon.

  • Text fields are stored, indexed, and tokenized. Text fields are appropriate for storing information like subjects and titles that need to be searchable as well as returned with search results.

  • UnStored fields are tokenized and indexed, but not stored in the index. Large amounts of text are best indexed using this type of field. Storing data creates a larger index on disk, so if you need to search but not redisplay the data, use an UnStored field. UnStored fields are practical when using a Zend_Search_Lucene index in combination with a relational database. You can index large data fields with UnStored fields for searching, and retrieve them from your relational database using a separate field as an identifier.

    Table 14.1. Zend_Search_Lucene_Field Types

    Field Type   Stored   Indexed   Tokenized   Binary
    Keyword      Yes      Yes       No          No
    UnIndexed    Yes      No        No          No
    Binary       Yes      No        No          Yes
    Text         Yes      Yes       Yes         No
    UnStored     No       Yes       Yes         No
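
The table can also be read in the other direction, from the properties you need to the field type that provides them. The sketch below is illustrative only: chooseFieldType() is a hypothetical helper, not part of Zend_Search_Lucene, and it simply encodes the rows of the table as plain PHP data.

```php
<?php
/**
 * Hypothetical helper (not part of Zend_Search_Lucene): returns the name
 * of the Zend_Search_Lucene_Field factory method whose flags match the
 * field types table, or null when no field type has that combination.
 */
function chooseFieldType($stored, $indexed, $tokenized, $binary = false)
{
    // Rows copied from the table: stored, indexed, tokenized, binary.
    $types = array(
        'Keyword'   => array(true,  true,  false, false),
        'UnIndexed' => array(true,  false, false, false),
        'Binary'    => array(true,  false, false, true),
        'Text'      => array(true,  true,  true,  false),
        'UnStored'  => array(false, true,  true,  false),
    );

    $wanted = array((bool) $stored, (bool) $indexed,
                    (bool) $tokenized, (bool) $binary);

    foreach ($types as $name => $flags) {
        if ($flags === $wanted) {
            return $name;
        }
    }

    // No field type offers this combination.
    return null;
}

// A large article body: searchable word by word, not kept in the index.
$bodyType = chooseFieldType(false, true, true);

// A database primary key: returned with hits, never searched.
$keyType = chooseFieldType(true, false, false);

// Stored and tokenized but not indexed: no row in the table matches.
$noType = chooseFieldType(true, false, true);

?>
```

Here $bodyType is 'UnStored', $keyType is 'UnIndexed', and $noType is null.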


[5] Only term and multi-term queries are supported at this time.