15.7. Interoperating with Java Lucene

15.7.1. File Formats

Zend_Search_Lucene index file formats are binary compatible with Java Lucene version 1.4 and above.

A detailed description of this format is available here: http://lucene.apache.org/java/docs/fileformats.html.

15.7.2. Index Directory

After index creation, the index directory will contain several files:

  • The segments file is a list of index segments.

  • The *.cfs files contain index segments. Note that an optimized index always has exactly one segment.

  • The deletable file is a list of files that are no longer used by the index, but which could not be deleted.
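
The roles listed above can be checked against a real index directory with a few lines of Java. The following is a minimal sketch that uses only the standard library and does not call Lucene; the class name IndexDirInspector and its classify method are hypothetical names chosen for illustration, and /data/my_index is only a placeholder default path.

```java
import java.io.File;

// Hypothetical helper for illustration; not part of Lucene or Zend_Search_Lucene.
class IndexDirInspector {

    // Map a file name to the role described in the list above.
    static String classify(String name) {
        if (name.equals("segments")) {
            return "list of index segments";
        }
        if (name.endsWith(".cfs")) {
            return "index segment";
        }
        if (name.equals("deletable")) {
            return "list of unused files that could not be deleted";
        }
        return "other";
    }

    public static void main(String[] args) {
        // Placeholder default path; pass the real index directory as the first argument.
        File dir = new File(args.length > 0 ? args[0] : "/data/my_index");
        File[] files = dir.listFiles();
        if (files == null) {
            System.err.println("Not a readable directory: " + dir);
            return;
        }
        for (File f : files) {
            System.out.println(f.getName() + ": " + classify(f.getName()));
        }
    }
}
```

Run against an optimized index, such a listing would be expected to show a single *.cfs entry, in line with the note above.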

15.7.3. Java Source Code

The Java program listing below provides an example of how to index a file using Java Lucene:

/**
* Index creation:
*/
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.document.*;

import java.io.*;

...

IndexWriter indexWriter = new IndexWriter("/data/my_index", 
                                          new SimpleAnalyzer(), true);

...

String filename = "/path/to/file-to-index.txt";
File f = new File(filename);

// Build one Lucene document describing the file
Document doc = new Document();
doc.add(Field.Text("path", filename));
doc.add(Field.Keyword("modified", DateField.timeToString(f.lastModified())));
doc.add(Field.Text("author", "unknown"));

// The file body is passed as a Reader
FileInputStream is = new FileInputStream(f);
Reader reader = new BufferedReader(new InputStreamReader(is));
doc.add(Field.Text("contents", reader));

// Add the document to the index
indexWriter.addDocument(doc);
        

15.7.4. Using LuceneIndexCreation.jar

To help you get started with Zend_Search_Lucene quickly, a JAR file (a Java archive) has been created that generates an index from the command line. For more information on JAR files, please visit: http://java.sun.com/docs/books/tutorial/jar/basics/index.html.

LuceneIndexCreation.jar consumes text files and builds an index from them. Usage:

    java -jar LuceneIndexCreation.jar [-c] [-s] <document_dir> <index_dir>
    -c   - force index to be case sensitive
    -s   - store content in the index
    

This command consumes the directory <document_dir>, including all of its subdirectories, and produces a Lucene index. The index is a set of files that will be stored in a separate directory that is specified by <index_dir>.

For each document to be indexed, LuceneIndexCreation creates a document object with three fields: a contents field containing the contents (body) of the document, a modified field containing the file modification time, and the full path and filename in a path field.
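
To make the per-document structure concrete, the following sketch walks a directory tree and collects the three values described above for each file. It is an illustration using only the Java standard library, not the source of LuceneIndexCreation.jar, and it does not build a Lucene index. The class name DocumentFieldSketch, its method names, the millisecond timestamp format, and the UTF-8 encoding are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical illustration; the real tool's implementation is not shown in this manual.
class DocumentFieldSketch {

    // Collect the three field values for a single file.
    static Map<String, String> fieldsFor(Path file) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("path", file.toAbsolutePath().toString());
        // Assumption: modification time kept as milliseconds since the epoch.
        fields.put("modified", Long.toString(Files.getLastModifiedTime(file).toMillis()));
        // Assumption: documents are UTF-8 encoded text.
        fields.put("contents", new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        return fields;
    }

    // Visit documentDir and all of its subdirectories, one entry per regular file.
    static List<Map<String, String>> collect(Path documentDir) throws IOException {
        List<Map<String, String>> docs = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(documentDir)) {
            for (Path p : (Iterable<Path>) paths.filter(Files::isRegularFile)::iterator) {
                docs.add(fieldsFor(p));
            }
        }
        return docs;
    }
}
```

In an actual indexer, each of these maps would correspond to one Lucene Document whose fields are added with the Field factory methods shown in the Java listing earlier in this section.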

If -c is specified, the index is forced to be case sensitive. Otherwise, all terms are converted to lower case before being added to the index.

If -s is specified, the document's content is also stored in the index and can be retrieved along with the path and modified fields.

Otherwise, only the path and modified fields are stored, and the contents field is only indexed. In this case, the document content must be retrieved from the original source by its path.

Be careful: using the -s option increases the index size roughly fivefold.