Chapter 5. Hibernate Search: Apache Lucene™ Integration

Apache Lucene is a high-performance Java search engine library available at the Apache Software Foundation. Hibernate Annotations includes a package of annotations that allows you to mark any domain model object as indexable and have Hibernate maintain a Lucene index of any instances persisted via Hibernate. Apache Lucene is also integrated with the Hibernate query facility.

Hibernate Search is a work in progress and new features are still being developed in this area, so expect some compatibility changes in subsequent versions.

5.1. Architecture

Hibernate Search is made of an indexing engine and an index search engine. Both are backed by Apache Lucene.

When an entity is inserted into, updated in, or removed from the database, Hibernate Search™ keeps track of this event (through the Hibernate event system) and schedules an index update. Outside a transaction, the update is executed right after the actual database operation. It is however recommended, for both your database and Hibernate Search, to execute your operations in a transaction (whether JDBC or JTA). Inside a transaction, the index update is scheduled for the transaction commit (and discarded if the transaction rolls back). You can think of this as the regular (infamous) autocommit versus transactional behavior. From a performance perspective, the in-transaction mode is recommended. All index updates are handled for you, without you having to use the Apache Lucene APIs.
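
The commit/rollback behavior described above can be pictured with a small, self-contained sketch. It is only an illustration of the idea, not how Hibernate Search is implemented: the class IndexWorkQueue and its methods are hypothetical, and a plain list stands in for the Lucene index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not a Hibernate Search class.
class IndexWorkQueue {
    private final List<String> pending = new ArrayList<>(); // scheduled, not yet applied
    private final List<String> applied = new ArrayList<>(); // stands in for the Lucene index

    // Called when an entity change is observed inside a transaction.
    void schedule(String work) {
        pending.add(work);
    }

    // On commit, every scheduled update is applied to the index.
    void commit() {
        applied.addAll(pending);
        pending.clear();
    }

    // On rollback, scheduled updates are discarded and the index is untouched.
    void rollback() {
        pending.clear();
    }

    List<String> applied() {
        return applied;
    }

    public static void main(String[] args) {
        IndexWorkQueue queue = new IndexWorkQueue();
        queue.schedule("add Essay#1");
        queue.rollback();               // Essay#1 never reaches the index
        queue.schedule("add Essay#2");
        queue.commit();                 // Essay#2 is applied at commit time
        System.out.println(queue.applied());
    }
}
```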

To interact with Apache Lucene indexes, Hibernate Search has the notion of DirectoryProvider. A directory provider will manage a given Lucene Directory type. You can configure directory providers to adjust the directory target.

Hibernate Search™ can also use a Lucene index to search for an entity and return a (list of) managed entity, saving you from the tedious Object / Lucene Document mapping and the low-level Lucene APIs. The application code uses the unified org.hibernate.Query API exactly the way an HQL or native query would.

5.2. Configuration

5.2.1. Directory configuration

Apache Lucene has a notion of Directory where the index is stored. The Directory implementation can be customized, but Lucene comes bundled with a file system and a full memory implementation. Hibernate Search™ has the notion of DirectoryProvider, which handles the configuration and the initialization of the Lucene Directory.

Table 5.1. List of built-in Directory Providers

Class: org.hibernate.search.store.FSDirectoryProvider
Description: File system based directory. The directory used will be <indexBase>/<@Indexed.name>
Properties: indexBase: Base directory

Class: org.hibernate.search.store.RAMDirectoryProvider
Description: Memory based directory. The directory will be uniquely identified by the @Indexed.name element
Properties: none

If the built-in directory providers do not fit your needs, you can write your own directory provider by implementing the org.hibernate.search.store.DirectoryProvider interface.

Each indexed entity is associated with a Lucene index (an index can be shared by several entities, but this is not usually the case). You can configure the index through properties prefixed by hibernate.search.indexname. Default properties inherited by all indexes can be defined using the prefix hibernate.search.default.

To define the directory provider of a given index, use the hibernate.search.indexname.directory_provider property:

hibernate.search.default.directory_provider=org.hibernate.search.store.FSDirectoryProvider
hibernate.search.default.indexDir=/usr/lucene/indexes

hibernate.search.Rules.directory_provider=org.hibernate.search.store.RAMDirectoryProvider

applied on

@Indexed(name="Status")
public class Status { ... }

@Indexed(name="Rules")
public class Rule { ... }

will create a file system directory in /usr/lucene/indexes/Status where the Status entities will be indexed, and use an in memory directory named Rules where Rule entities will be indexed.

So you can easily define common rules like the directory provider and base directory, and override those defaults later on a per-index basis.

When writing your own DirectoryProvider, you can benefit from this configuration mechanism too.
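
The default/override convention can be sketched with plain java.util.Properties. This is a minimal illustration of the lookup order described above, not the code Hibernate Search uses: the class IndexPropertyLookup and its resolve method are hypothetical names introduced for this example.

```java
import java.util.Properties;

// Hypothetical helper mirroring the documented prefix convention; not Hibernate Search code.
class IndexPropertyLookup {
    static final String PREFIX = "hibernate.search.";

    // Returns the index-specific value if present, otherwise the default one (or null).
    static String resolve(Properties config, String indexName, String key) {
        String specific = config.getProperty(PREFIX + indexName + "." + key);
        return specific != null ? specific : config.getProperty(PREFIX + "default." + key);
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("hibernate.search.default.directory_provider",
                "org.hibernate.search.store.FSDirectoryProvider");
        config.setProperty("hibernate.search.Rules.directory_provider",
                "org.hibernate.search.store.RAMDirectoryProvider");

        // Status has no specific entry, so it falls back to the default provider.
        System.out.println(resolve(config, "Status", "directory_provider"));
        // Rules overrides the default with its own provider.
        System.out.println(resolve(config, "Rules", "directory_provider"));
    }
}
```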

5.2.2. Enabling automatic indexing

Finally, we enable the FullTextIndexEventListener for the three Hibernate events that occur after changes are applied to the database.

<hibernate-configuration>
    ...
    <event type="post-update">
        <listener class="org.hibernate.search.event.FullTextIndexEventListener"/>
    </event>
    <event type="post-insert">
        <listener class="org.hibernate.search.event.FullTextIndexEventListener"/>
    </event>
    <event type="post-delete">
        <listener class="org.hibernate.search.event.FullTextIndexEventListener"/>
    </event>
</hibernate-configuration>

5.3. Mapping entities to the index structure

All the metadata information related to indexed entities is described through Java annotations. There is no need for XML mapping files or a list of indexed entities. The list is discovered at startup time by scanning the Hibernate mapped entities.

First, we must declare a persistent class as indexable. This is done by annotating the class with @Indexed (all entities not annotated with @Indexed will be ignored by the indexing process):

@Entity
@Indexed(index="indexes/essays")
public class Essay {
    ...
}

The index attribute tells Hibernate what the Lucene directory name is (usually a directory on your file system). If you wish to define a base directory for all Lucene indexes, you can use the hibernate.search.default.indexDir property in your configuration file. Each entity instance will be represented by a Lucene Document inside the given index (aka Directory).

For each property (or attribute) of your entity, you can describe how it will be indexed. The default (i.e. no annotation) means that the property is completely ignored by the indexing process. @Field declares a property as indexed. When indexing an element to a Lucene document, you can specify how it is indexed:

  • name: describes under which name the property should be stored in the Lucene Document. The default value is the property name (following the JavaBeans convention)

  • store: describes whether or not the property is stored in the Lucene index. You can store the value with Store.YES (consuming more space in the index), store it in a compressed way with Store.COMPRESS (this consumes more CPU), or avoid any storage with Store.NO (this is the default value). When a property is stored, you can retrieve it from the Lucene Document (note that this is not related to whether the element is indexed or not).

  • index: describes how the element is indexed (i.e. the process used to index the property and the type of information stored). The different values are Index.NO (no indexing, i.e. cannot be found by a query), Index.TOKENIZED (use an analyzer to process the property), Index.UN_TOKENIZED (no analyzer pre-processing), Index.NO_NORM (do not store the normalization data).

These attributes are part of the @Field annotation.

Whether or not you want to store the data depends on how you wish to use the index query result. As of today, for pure Hibernate Search™ usage, storing is not necessary. Whether or not you want to tokenize a property depends on whether you wish to search the element as is, or only normalized parts of it. It makes sense to tokenize a text field, but not a date field (or an id field).
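
To make the tokenized versus untokenized distinction concrete, here is a deliberately naive sketch in plain Java. It does not use Lucene: the tokenized method is only a rough stand-in for what an analyzer might do (lower-casing and splitting on punctuation and spaces), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; real analyzers are considerably more elaborate.
class TokenizationSketch {

    // Naive stand-in for an analyzer: lower-cases and splits on non letter/digit runs.
    static List<String> tokenized(String value) {
        List<String> terms = new ArrayList<>();
        for (String term : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Untokenized: the whole value is kept as one single searchable term.
    static List<String> untokenized(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        // A text field is searchable word by word once tokenized.
        System.out.println(tokenized("Hibernate Search, in action"));
        // An id or date field should stay in one piece.
        System.out.println(untokenized("20061107"));
    }
}
```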

Finally, the id property of an entity is a special property used by Hibernate Search™ to ensure index uniqueness of a given entity. By design, an id has to be stored and must not be tokenized. To mark a property as index id, use the @DocumentId annotation.

@Entity
@Indexed(index="indexes/essays")
public class Essay {
    ...

    @Id
    @DocumentId
    public Long getId() { return id; }
    
    @Field(name="Abstract", index=Index.TOKENIZED, store=Store.YES)
    public String getSummary() { return summary; }
    
    @Lob
    @Field(index=Index.TOKENIZED)
    public String getText() { return text; }
    
}

These annotations define an index with three fields: id, Abstract and text. Note that by default the field name is decapitalized, following the JavaBean specification.

Note: you must specify @DocumentId on the identifier property of your entity class.

Lucene has the notion of boost factor. It is a way to give more weight to a field or to an indexed element over another during the indexing process. You can use @Boost at the field or the class level.

@Entity
@Indexed(index="indexes/essays")
@Boost(2)
public class Essay {
    ...

    @Id
    @DocumentId
    public Long getId() { return id; }
    
    @Field(name="Abstract", index=Index.TOKENIZED, store=Store.YES)
    @Boost(2.5f)
    public String getSummary() { return summary; }
    
    @Lob
    @Field(index=Index.TOKENIZED)
    public String getText() { return text; }
    
}

In our example, Essay's probability to reach the top of the search list will be multiplied by 2 and the summary field will be 2.5 times more important than the text field. Note that this explanation is strictly speaking inaccurate, but it is simple and close enough to reality. For the details, please check the Lucene documentation or the excellent Lucene In Action by Otis Gospodnetic and Erik Hatcher.

The analyzer class used to index the elements is configurable through the hibernate.search.analyzer property. If none is defined, org.apache.lucene.analysis.standard.StandardAnalyzer is used as the default.

5.4. Property/Field Bridge

All fields of a full text index in Lucene have to be represented as Strings. Hence, Java properties have to be indexed in String form. For most of your properties, Hibernate Search™ does the translation job for you thanks to a built-in set of bridges. In some cases, though, you need fine-grained control over the translation process.

5.4.1. Built-in bridges

Hibernate Search comes bundled with a set of built-in bridges between a Java property type and its full text representation.

null

Null elements are not indexed. Lucene does not support null elements, and indexing them would not make much sense either.

java.lang.String

Strings are indexed as is.

short, Short, int, Integer, long, Long, float, Float, double, Double, BigInteger, BigDecimal

Numbers are converted into their String representation. Note that numbers cannot be compared by Lucene (i.e. used in range queries) out of the box: they have to be padded [1]

java.util.Date

Dates are stored as yyyyMMddHHmmssSSS in GMT time (20061107210300012 for Nov 7th of 2006 4:03PM and 12ms EST). You do not need to concern yourself with the internal format. What is important is that, when using a DateRange Query, the dates have to be expressed in GMT time.

Usually, storing the date up to the millisecond is not necessary. @DateBridge defines the resolution you want to store in the index (for example @DateBridge(resolution=Resolution.DAY)). The date pattern is then truncated accordingly.

@Entity @Indexed 
public class Meeting {
    @Field(index=Index.UN_TOKENIZED)
    @DateBridge(resolution=Resolution.MINUTE)
    private Date date;
    ...
}

Warning

A Date whose resolution is lower than MILLISECOND cannot be a @DocumentId
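
The GMT representation of dates can be reproduced with the JDK alone. The sketch below is an illustration, not the bridge shipped with Hibernate Search: the class name GmtDateFormatSketch is hypothetical, and it assumes that truncating to a coarser resolution amounts to formatting with a shorter pattern.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical illustration; not the bridge shipped with Hibernate Search.
class GmtDateFormatSketch {

    // Formats a date in GMT using the given pattern.
    static String format(Date date, String pattern) {
        SimpleDateFormat formatter = new SimpleDateFormat(pattern, Locale.ROOT);
        formatter.setTimeZone(TimeZone.getTimeZone("GMT"));
        return formatter.format(date);
    }

    public static void main(String[] args) {
        // Nov 7th 2006, 4:03:00 PM and 12 ms, in a fixed GMT-5 (EST) zone.
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("GMT-05:00"), Locale.ROOT);
        cal.clear();
        cal.set(2006, Calendar.NOVEMBER, 7, 16, 3, 0);
        cal.set(Calendar.MILLISECOND, 12);
        Date date = cal.getTime();

        System.out.println(format(date, "yyyyMMddHHmmssSSS")); // 20061107210300012
        System.out.println(format(date, "yyyyMMdd"));          // 20061107 (day resolution)
    }
}
```

Because the formatted strings have a fixed width, their lexicographic order matches chronological order, which is what makes range queries on them meaningful.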

5.4.2. Custom Bridge

It can happen that the built-in bridges of Hibernate Search do not cover some of your property types, or that the String representation used is not what you expect.

5.4.2.1. StringBridge

The simplest custom solution is to give Hibernate Search™ an implementation of your expected object to String bridge. To do so, you need to implement the org.hibernate.search.bridge.StringBridge interface:

/**
 * Padding Integer bridge.
 * All numbers will be padded with 0 to match 5 digits
 *
 * @author Emmanuel Bernard
 */
public class PaddedIntegerBridge implements StringBridge {

    private static final int PADDING = 5;

    public String objectToString(Object object) {
        String rawInteger = ( (Integer) object ).toString();
        if (rawInteger.length() > PADDING) throw new IllegalArgumentException( "Try to pad on a number too big" );
        StringBuilder paddedInteger = new StringBuilder( );
        for ( int padIndex = rawInteger.length() ; padIndex < PADDING ; padIndex++ ) {
            paddedInteger.append('0');
        }
        return paddedInteger.append( rawInteger ).toString();
    }
}

Then any property or field can use this bridge thanks to the @FieldBridge annotation

@FieldBridge(impl = PaddedIntegerBridge.class)
private Integer length;
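
Why padding matters for comparisons can be checked without Lucene, since terms are compared as strings. The following sketch uses a hypothetical pad helper (mirroring the loop in the bridge above) and assumes non-negative numbers only.

```java
// Hypothetical illustration of string ordering with and without padding.
class PaddingOrderSketch {

    // Left-pads a non-negative integer with zeros up to the given width.
    static String pad(int value, int width) {
        String raw = Integer.toString(value);
        StringBuilder padded = new StringBuilder();
        for (int i = raw.length(); i < width; i++) {
            padded.append('0');
        }
        return padded.append(raw).toString();
    }

    public static void main(String[] args) {
        // Unpadded: "9" sorts after "10" because '9' is greater than '1'.
        System.out.println("9".compareTo("10") > 0);               // true
        // Padded: "00009" sorts before "00010", matching numeric order.
        System.out.println(pad(9, 5).compareTo(pad(10, 5)) < 0);   // true
    }
}
```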

Parameters can be passed to the Bridge implementation making it more flexible. The Bridge implementation implements a ParameterizedBridge interface, and the parameters are passed through the @FieldBridge annotation.

public class PaddedIntegerBridge implements StringBridge, ParameterizedBridge {

    public static String PADDING_PROPERTY = "padding";
    private int padding = 5; //default

    public void setParameterValues(Map parameters) {
        Object padding = parameters.get( PADDING_PROPERTY );
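        // the value is declared as a string in @Parameter, so parse it instead of casting
        if (padding != null) this.padding = Integer.parseInt( padding.toString() );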
    }

    public String objectToString(Object object) {
        String rawInteger = ( (Integer) object ).toString();
        if (rawInteger.length() > padding) throw new IllegalArgumentException( "Try to pad on a number too big" );
        StringBuilder paddedInteger = new StringBuilder( );
        for ( int padIndex = rawInteger.length() ; padIndex < padding ; padIndex++ ) {
            paddedInteger.append('0');
        }
        return paddedInteger.append( rawInteger ).toString();
    }
}


//property
@FieldBridge(impl = PaddedIntegerBridge.class, 
        params = @Parameter(name="padding", value="10") )
private Integer length;

The ParameterizedBridge interface can be implemented by StringBridge, TwoWayStringBridge and FieldBridge implementations (see below).

If you expect to use your bridge implementation for an id property (i.e. annotated with @DocumentId), you need to use a slightly extended version of StringBridge named TwoWayStringBridge. Hibernate Search needs to read the string representation of the identifier and generate the object out of it. There is no difference in the way the @FieldBridge annotation is used.

public class PaddedIntegerBridge implements TwoWayStringBridge, ParameterizedBridge {

    public static String PADDING_PROPERTY = "padding";
    private int padding = 5; //default

    public void setParameterValues(Map parameters) {
        Object padding = parameters.get( PADDING_PROPERTY );
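        // the value is declared as a string in @Parameter, so parse it instead of casting
        if (padding != null) this.padding = Integer.parseInt( padding.toString() );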
    }

    public String objectToString(Object object) {
        String rawInteger = ( (Integer) object ).toString();
        if (rawInteger.length() > padding) throw new IllegalArgumentException( "Try to pad on a number too big" );
        StringBuilder paddedInteger = new StringBuilder( );
        for ( int padIndex = rawInteger.length() ; padIndex < padding ; padIndex++ ) {
            paddedInteger.append('0');
        }
        return paddedInteger.append( rawInteger ).toString();
    }

    public Object stringToObject(String stringValue) {
        return Integer.valueOf( stringValue );
    }
}


//id property
@DocumentId
@FieldBridge(impl = PaddedIntegerBridge.class,
            params = @Parameter(name="padding", value="10") )
private Integer id;

It is critically important for the two-way process to be idempotent, i.e. converting an object to a string and back must yield an equal object (object = stringToObject( objectToString( object ) ) ).
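
A quick way to gain confidence in a two-way bridge is to check the round trip directly. The sketch below is standalone and does not implement the Hibernate Search interfaces: RoundTripSketch and its static methods are hypothetical stand-ins for objectToString and stringToObject, and String.format is used here as a compact substitute for the padding loop.

```java
import java.util.Locale;

// Hypothetical stand-in for a two-way bridge, used only to check the round trip.
class RoundTripSketch {

    // Zero-pads a non-negative integer to the given width.
    static String objectToString(Integer value, int padding) {
        return String.format(Locale.ROOT, "%0" + padding + "d", value);
    }

    // Parses the padded form back; leading zeros are accepted by Integer.valueOf.
    static Integer stringToObject(String stringValue) {
        return Integer.valueOf(stringValue);
    }

    public static void main(String[] args) {
        Integer id = 42;
        String indexed = objectToString(id, 10);
        System.out.println(indexed);                            // 0000000042
        System.out.println(id.equals(stringToObject(indexed))); // true
    }
}
```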

5.4.2.2. FieldBridge

Some use cases require more than a simple object to string translation when mapping a property to a Lucene index. To give you the most flexibility, you can also implement a bridge as a FieldBridge. This interface gives you a property value and lets you map it the way you want in your Lucene Document. This interface is very similar in concept to the Hibernate UserType.

You can, for example, store a given property in several different document fields:

/**
 * Store the date in 3 different fields (year, month, day)
 * to ease Range Query per year, month or day
 * (eg get all the elements of december for the last 5 years)
 *
 * @author Emmanuel Bernard
 */
public class DateSplitBridge implements FieldBridge {
    private final static TimeZone GMT = TimeZone.getTimeZone("GMT");

    public void set(String name, Object value, Document document, Field.Store store, Field.Index index, Float boost) {
        Date date = (Date) value;
        Calendar cal = GregorianCalendar.getInstance( GMT );
        cal.setTime( date );
        int year = cal.get( Calendar.YEAR );
        int month = cal.get( Calendar.MONTH ) + 1;
        int day = cal.get( Calendar.DAY_OF_MONTH );
        //set year
        Field field = new Field( name + ".year", String.valueOf(year), store, index );
        if ( boost != null ) field.setBoost( boost );
        document.add( field );
        //set month and pad it if needed
        field = new Field( name + ".month", ( month < 10 ? "0" : "" ) + String.valueOf(month), store, index );
        if ( boost != null ) field.setBoost( boost );
        document.add( field );
        //set day and pad it if needed
        field = new Field( name + ".day", ( day < 10 ? "0" : "" ) + String.valueOf(day), store, index );
        if ( boost != null ) field.setBoost( boost );
        document.add( field );
    }
}


//property
@FieldBridge(impl = DateSplitBridge.class)
private Date date;
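
The splitting and padding logic can be exercised in isolation, without Lucene. The sketch below is a stripped-down illustration with hypothetical names (DateSplitSketch, split, pad2). Note the parentheses around the conditional expression: without them, + binds tighter than ?: and a single-digit value would be reduced to just "0".

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical, Lucene-free illustration of splitting a date into padded parts.
class DateSplitSketch {

    // Zero-pads a value below 10 to two digits.
    static String pad2(int value) {
        return (value < 10 ? "0" : "") + value;
    }

    // Returns {year, month, day} in GMT, month and day padded to two digits.
    static String[] split(Date date) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("GMT"), Locale.ROOT);
        cal.setTime(date);
        int year = cal.get(Calendar.YEAR);
        int month = cal.get(Calendar.MONTH) + 1; // Calendar months are zero-based
        int day = cal.get(Calendar.DAY_OF_MONTH);
        return new String[] { String.valueOf(year), pad2(month), pad2(day) };
    }

    public static void main(String[] args) {
        String[] parts = split(new Date(0L)); // the epoch, Jan 1st 1970 in GMT
        System.out.println(parts[0] + " " + parts[1] + " " + parts[2]); // 1970 01 01
    }
}
```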

5.5. Querying

The second most important capability of Hibernate Search™ is the ability to execute a Lucene query and retrieve entities managed by a Hibernate session, providing the power of Lucene without leaving the Hibernate paradigm, and giving another dimension to the classic Hibernate search mechanisms (HQL, Criteria query, native SQL query).

To access the Hibernate Search™ querying facilities, you have to use a Hibernate FullTextSession. A FullTextSession wraps a regular org.hibernate.Session to provide query and indexing capabilities.

Session session = sessionFactory.openSession();
...
FullTextSession fullTextSession = Search.createFullTextSession(session);

The search facility is built on native Lucene queries.

org.apache.lucene.queryParser.QueryParser parser = new QueryParser("title", new StopAnalyzer() );

org.apache.lucene.search.Query luceneQuery = parser.parse( "summary:Festina OR brand:Seiko" );
org.hibernate.Query fullTextQuery = fullTextSession.createFullTextQuery( luceneQuery );

List result = fullTextQuery.list(); //return a list of managed objects

The Hibernate query built on top of the Lucene query is a regular org.hibernate.Query, so you stay in the same paradigm as the other Hibernate query facilities (HQL, Native or Criteria). The regular list(), uniqueResult(), iterate() and scroll() methods can be used.

If you expect a reasonable number of results and expect to work on all of them, list() or uniqueResult() are recommended. list() works best if the entity batch-size is set up properly. Note that Hibernate Search has to process all Lucene Hits elements when using list(), uniqueResult() and iterate(). If you wish to minimize Lucene document loading, scroll() is more appropriate. Don't forget to close the ScrollableResults object when you're done, since it holds Lucene resources.

An efficient way to work with queries is to use pagination. The pagination API is exactly the one available in org.hibernate.Query:

org.hibernate.Query fullTextQuery = fullTextSession.createFullTextQuery( luceneQuery );
fullTextQuery.setFirstResult(30);
fullTextQuery.setMaxResults(20);
fullTextQuery.list(); //returns at most 20 elements, skipping the first 30

Only the relevant Lucene Documents are accessed.
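
The first-result / max-results arithmetic can be simulated on an in-memory list. This is only a sketch of the windowing semantics, assuming the zero-based offset used by org.hibernate.Query.setFirstResult; PaginationSketch, firstResult and window are hypothetical names, and no query is actually executed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory simulation of first-result / max-results windowing.
class PaginationSketch {

    // Zero-based offset of the first element of a zero-based page.
    static int firstResult(int page, int pageSize) {
        return page * pageSize;
    }

    // Returns at most max elements, skipping the first `first` elements.
    static <T> List<T> window(List<T> all, int first, int max) {
        int from = Math.min(first, all.size());
        int to = Math.min(from + max, all.size());
        return new ArrayList<>(all.subList(from, to));
    }

    public static void main(String[] args) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            hits.add(i);
        }
        List<Integer> page = window(hits, 30, 20);
        // 20 elements, from index 30 up to and including index 49
        System.out.println(page.size() + " " + page.get(0) + " " + page.get(page.size() - 1));
    }
}
```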

5.6. Indexing

It is sometimes useful to index an object even if this object is neither inserted into nor updated in the database. This is especially true when you want to build your index for the first time. You can achieve that goal using the FullTextSession.

FullTextSession fullTextSession = Search.createFullTextSession(session);
Transaction tx = fullTextSession.beginTransaction();
for (Customer customer : customers) {
    fullTextSession.index(customer);
}
tx.commit(); //index updates are written at commit time

For maximum efficiency, Hibernate Search batches index operations and executes them at commit time (Note: you don't need to use org.hibernate.Transaction in a JTA environment).



[1] Using a Range query is debatable and has drawbacks; an alternative approach is to use a Filter query, which filters the query result to the appropriate range.

Hibernate Search™ will support a padding mechanism.