Zend_Search_Lucene is designed to work with UTF-8 charset. Index files store unicode data in Java's "modified UTF-8 encoding". Zend_Search_Lucene core completely supports it with one exception. [6]
However, text analyzers and query parser text analyzer and query parser use ctype_alpha() for tokenizing text and queries. ctype_alpha() doesn't support UTF-8 and needs to be replaced by something else in nearest future.
Before that we are strongly recomend to convert your data into ASCII representation [7] (both for storing source documents, and for querying):
<?php $doc = new Zend_Search_Lucene_Document(); ... $docText = iconv('ISO-8859-1', 'ASCII//TRANSLIT', $docText); $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docText)); ... $query = iconv('', 'ASCII//TRANSLIT', $query); $hits = $index->find($query); ?>
[6] Zend_Search_Lucene supports only Basic Multilingual Plane (BMP) characters (from 0x0000 to 0xFFFF) and doesn't support "supplementary characters" (characters whose code points are greater than 0xFFFF)
Java 2 represents these characters as a pair of char (16-bit) values, the first from the high-surrogates range (0xD800-0xDBFF), the second from the low-surrogates range (0xDC00-0xDFFF). Then they are encoded as usual UTF-8 characters in six bytes. Standard UTF-8 representation uses four bytes for supplementary characters.
[7] If data could contain non-ascii character or come in UTF-8.