12.5. Character set.

12.5.1. UTF-8 and single-byte character set support.

Zend_Search_Lucene is designed to work with the UTF-8 charset. Index files store Unicode data in Java's "modified UTF-8 encoding". The Zend_Search_Lucene core fully supports it, with one exception. [6]

However, the text analyzers and the query parser use ctype_alpha() to tokenize text and queries. ctype_alpha() does not support UTF-8 and needs to be replaced in the near future.
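The effect of byte-wise classification can be illustrated with a minimal sketch. The function tokenizeBytewise() below is hypothetical and is not the actual analyzer code; it only assumes that a tokenizer tests one byte at a time with ctype_alpha(). The locale is set to "C" explicitly so that the classification does not depend on the environment.

```php
<?php
// Hypothetical sketch, not the Zend_Search_Lucene analyzer itself.
// Force the "C" locale so ctype_alpha() accepts only A-Z and a-z.
setlocale(LC_CTYPE, 'C');

/**
 * Split text into tokens made of bytes that ctype_alpha() accepts.
 * Any other byte ends the current token.
 */
function tokenizeBytewise(string $text): array
{
    $tokens  = [];
    $current = '';
    $length  = strlen($text);

    for ($i = 0; $i < $length; $i++) {
        $byte = $text[$i];
        if (ctype_alpha($byte)) {
            $current .= $byte;
        } elseif ($current !== '') {
            $tokens[] = $current;
            $current  = '';
        }
    }
    if ($current !== '') {
        $tokens[] = $current;
    }

    return $tokens;
}

// "café au lait" encoded as UTF-8: the two bytes of "é" are not alphabetic,
// so the first word is cut short.
print_r(tokenizeBytewise("caf\xC3\xA9 au lait")); // tokens: caf, au, lait

// The ASCII form keeps the word intact.
print_r(tokenizeBytewise('cafe au lait'));        // tokens: cafe, au, lait
```

A document indexed as "caf" would not match a query for "cafe", which is why both sides need the same ASCII representation.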

Until then, we strongly recommend converting your data to an ASCII representation [7] (both when storing source documents and when querying):

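<?php
$doc = new Zend_Search_Lucene_Document();
// ...
// Transliterate the document text to ASCII before indexing it.
$docText = iconv('ISO-8859-1', 'ASCII//TRANSLIT', $docText);
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docText));

// ...

// Apply the same transliteration to the query string before searching.
$query = iconv('', 'ASCII//TRANSLIT', $query);
$hits = $index->find($query);
?>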
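Because the same conversion has to be applied to documents and to queries, it may help to keep it in one place. The following is a minimal sketch under stated assumptions: toAsciiText() is a hypothetical helper name, and the fallback that drops non-ASCII bytes when iconv() fails is an assumption of this example, not behaviour of Zend_Search_Lucene. The result of //TRANSLIT depends on the iconv implementation and the locale, so no particular output is shown.

```php
<?php
// Hypothetical helper; the name and the fallback strategy are assumptions.

/**
 * Return an ASCII-only version of $text, which is encoded in $sourceCharset.
 */
function toAsciiText(string $text, string $sourceCharset): string
{
    // Pure ASCII input needs no conversion.
    if (preg_match('/^[\x00-\x7F]*\z/', $text) === 1) {
        return $text;
    }

    // Transliterate where possible; iconv() returns false on failure.
    $converted = function_exists('iconv')
        ? @iconv($sourceCharset, 'ASCII//TRANSLIT', $text)
        : false;

    if ($converted === false) {
        // Fallback (assumption): drop every byte outside the ASCII range.
        $converted = (string) preg_replace('/[^\x00-\x7F]/', '', $text);
    }

    return $converted;
}

// Use the same helper for indexed text and for queries.
$docText = toAsciiText("Caf\xE9 society", 'ISO-8859-1'); // Latin-1 input
$query   = toAsciiText("caf\xC3\xA9", 'UTF-8');          // UTF-8 input
```

Passing the source charset explicitly keeps the conversion independent of the server's locale settings.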


[6] Zend_Search_Lucene supports only Basic Multilingual Plane (BMP) characters (code points 0x0000 to 0xFFFF) and does not support "supplementary characters" (characters whose code points are greater than 0xFFFF).

Java 2 represents these characters as a pair of char (16-bit) values, the first from the high-surrogates range (0xD800-0xDBFF), the second from the low-surrogates range (0xDC00-0xDFFF). Each surrogate is then encoded as an ordinary three-byte UTF-8 sequence, giving six bytes in total. The standard UTF-8 representation uses four bytes for a supplementary character.
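The difference between the two encodings can be worked through for a single code point. The sketch below is illustrative only: the function names are hypothetical and are not part of Zend_Search_Lucene, and U+1F600 is chosen arbitrarily as a supplementary character.

```php
<?php
// Illustrative helpers; the names are hypothetical.

/** Encode one 16-bit value (0x0800-0xFFFF) as a three-byte UTF-8 sequence. */
function encodeThreeBytes(int $unit): string
{
    return chr(0xE0 | ($unit >> 12))
         . chr(0x80 | (($unit >> 6) & 0x3F))
         . chr(0x80 | ($unit & 0x3F));
}

/** Modified UTF-8: split into a surrogate pair, then encode each half. */
function modifiedUtf8Supplementary(int $codePoint): string
{
    $offset = $codePoint - 0x10000;
    $high   = 0xD800 + ($offset >> 10);   // high surrogate, 0xD800-0xDBFF
    $low    = 0xDC00 + ($offset & 0x3FF); // low surrogate, 0xDC00-0xDFFF

    return encodeThreeBytes($high) . encodeThreeBytes($low);
}

/** Standard UTF-8: one four-byte sequence for the whole code point. */
function standardUtf8Supplementary(int $codePoint): string
{
    return chr(0xF0 | ($codePoint >> 18))
         . chr(0x80 | (($codePoint >> 12) & 0x3F))
         . chr(0x80 | (($codePoint >> 6) & 0x3F))
         . chr(0x80 | ($codePoint & 0x3F));
}

$codePoint = 0x1F600; // surrogate pair 0xD83D, 0xDE00

echo bin2hex(modifiedUtf8Supplementary($codePoint)), "\n"; // eda0bdedb880 (six bytes)
echo bin2hex(standardUtf8Supplementary($codePoint)), "\n"; // f09f9880 (four bytes)
```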

[7] That is, if the data may contain non-ASCII characters or arrives in UTF-8.