The only required configuration for a Compass instance (using the CompassConfiguration) is its connection. The connection controls where the index is saved, in other words, the storage location of the index. This chapter reviews the index storage options that come with Compass and expands on some important aspects of using each one (such as clustering support).
By far the simplest and most popular storage option is storing the index on the file system. Here is an example of a simple file system based connection configuration that stores the index in the target/test-index path:
<compass name="default">
  <connection>
    <file path="target/test-index"/>
  </connection>
</compass>
Another option for file system based storage is to use the Java 1.4 NIO feature. NIO may perform better than the default file based store under certain environments and loads. We recommend running performance tests (preferably in a setup as close to your production configuration as possible) to check which one performs better. Here is an example of an NIO based connection configuration (using the mmap element) that stores the index in the target/test-index path:
<compass name="default">
  <connection>
    <mmap path="target/test-index"/>
  </connection>
</compass>
When using file system based index storage, locking (for transaction support) is done using lock files. The existence of a lock file means that a certain sub index is locked. The default lock file directory is taken from the java.io.tmpdir system property.
Clustering support for file system based storage usually means sharing the file system between different machines (running different Compass instances). The current locking mechanism requires the lock directory to be set to a location on the shared file system. Here is an example of how to set it:
<compass name="default">
  <connection>
    <mmap path="/shared/index-data"/>
  </connection>
  <transaction lockDir="/shared/index-lock" />
</compass>
Another important note regarding shared file system based index storage: do not use NFS. For best performance, a SAN based solution is recommended.
Using the RAM based index store, the index data is stored in memory. This is useful for fast indexing and searching, at the expense of having no long lived storage. Here is an example of how it can be configured:
<compass name="default">
  <connection>
    <ram path="/index"/>
  </connection>
</compass>
The Jdbc store connection type allows the index data to be stored within a database. The schema used for storing the index actually simulates a file system based tree, with each row in a sub index table representing a "file" with its binary data.
The Compass implementation, JdbcDirectory, is built on top of the Lucene Directory abstraction. It is completely decoupled from the rest of Compass and can be used with pure Lucene applications. For more information, please read Appendix B, Lucene Jdbc Directory. Naturally, using it within Compass allows for simpler configuration, especially in terms of transaction management and Jdbc DataSource management.
Here is a simple example of using Jdbc to store the index. The example configuration assumes a standalone configuration, with no data source pooling.
<compass name="default">
  <connection>
    <jdbc>
      <dataSourceProvider>
        <driverManager url="jdbc:hsqldb:mem:test"
                       username="sa" password=""
                       driverClass="org.hsqldb.jdbcDriver" />
      </dataSourceProvider>
    </jdbc>
  </connection>
</compass>
The above configuration does not define a dialect attribute on the jdbc element. Compass will try to auto-detect the database dialect based on the database meta-data. If it fails to find one, the dialect can be set explicitly; in our case it should be dialect="org.apache.lucene.store.jdbc.dialect.HSQLDialect".
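As a minimal sketch of that fallback, the dialect can be named directly on the jdbc element. The snippet reuses the HSQL driver manager settings from the example above; the only addition is the dialect attribute, with the value quoted in the text.

```xml
<compass name="default">
  <connection>
    <!-- dialect set explicitly; only needed when auto-detection fails -->
    <jdbc dialect="org.apache.lucene.store.jdbc.dialect.HSQLDialect">
      <dataSourceProvider>
        <driverManager url="jdbc:hsqldb:mem:test"
                       username="sa" password=""
                       driverClass="org.hsqldb.jdbcDriver" />
      </dataSourceProvider>
    </jdbc>
  </connection>
</compass>
```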
When using Jdbc index storage, it is important to understand whether Compass is working within a "managed" environment or not. A managed environment is one where Compass is not in control of transaction management (for example, when Compass is configured with JTA or Spring transaction management). If Compass is in control of the transaction, i.e. it uses the Local transaction factory, the environment is not considered managed.
When working in a non managed environment, Compass wraps the data source with a TransactionAwareDataSourceProxy and commits or rolls back the Jdbc connection itself. When working within a managed environment, no wrapping is performed, and Compass lets the external transaction manager commit or roll back the connection.
Usually, but not always, when working in a managed environment, the Jdbc data source comes from an external system or configuration. Most of the time it will be either a JNDI data source or an external data source provider (like Spring). For more information about the different data source providers, read the next section.
By default, Compass works as if within a non managed environment. When running in a managed environment, the managed attribute on the jdbc element should be set to true.
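The following sketch shows what a managed setup might look like, assuming the data source is looked up through JNDI under the illustrative name testds. Only the managed attribute comes from the text above. The configuration of the external transaction manager itself (JTA or Spring) is deliberately left out.

```xml
<compass name="default">
  <connection>
    <!-- managed="true": Compass does not wrap the data source and leaves
         commit/rollback to the external transaction manager -->
    <jdbc managed="true">
      <dataSourceProvider>
        <!-- "testds" is an illustrative JNDI name, not a required value -->
        <jndi lookup="testds" />
      </dataSourceProvider>
    </jdbc>
  </connection>
</compass>
```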
Compass allows for different Jdbc DataSource providers. A DataSourceProvider implementation is responsible for configuring and providing a Jdbc DataSource instance. A data source implementation is very important when it comes to performance, especially in terms of pooling features.
All the data source providers supported by Compass allow the autoCommit flag to be configured. Three values are allowed for autoCommit: false, true, and external (do not set autoCommit explicitly; assume it is configured elsewhere). The autoCommit mode defaults to false, which is the recommended value (external can also be used, but make sure the actual data source has auto commit set to false).
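As a hedged illustration, the sketch below sets autoCommit to external on a JNDI data source provider, where the pool is typically configured outside of Compass. It assumes that autoCommit is written as an attribute on the provider element itself; the text above does not state where the flag lives, so verify the placement against the configuration schema. The lookup name testds is illustrative.

```xml
<compass name="default">
  <connection>
    <jdbc>
      <dataSourceProvider>
        <!-- ASSUMPTION: autoCommit is an attribute of the provider element.
             "external" means Compass does not touch the flag, so the
             container-managed data source must have auto commit disabled. -->
        <jndi lookup="testds" autoCommit="external" />
      </dataSourceProvider>
    </jdbc>
  </connection>
</compass>
```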
The driver manager provider is the simplest of all providers. It does not require any external libraries or systems. Its main drawback is performance, since it performs no pooling of any kind. The first sample of a Jdbc configuration earlier in this chapter used the driver manager as its data source provider.
Compass can be configured to use Jakarta Commons DBCP as a data source provider. It is preferred over the driver manager provider for performance reasons (whether to use it or c3p0, explained later in this section, is up to you). Here is an example of using it:
<compass name="default">
  <connection>
    <jdbc>
      <dataSourceProvider>
        <dbcp url="jdbc:hsqldb:mem:test"
              username="sa" password=""
              driverClass="org.hsqldb.jdbcDriver"
              maxActive="10" maxWait="5" maxIdle="2"
              initialSize="3" minIdle="4"
              poolPreparedStatements="true" />
      </dataSourceProvider>
    </jdbc>
  </connection>
</compass>
The configuration shows the different settings that can be used on the dbcp data source provider. They are by no means the recommended values for a typical system. For more information, please consult the Jakarta Commons DBCP documentation.
Compass can be configured to use c3p0 as a data source provider. It is preferred over the driver manager provider for performance reasons (whether to use it or Jakarta Commons DBCP, explained previously in this section, is up to you). Here is an example of using it:
<compass name="default">
  <connection>
    <jdbc>
      <dataSourceProvider>
        <c3p0 url="jdbc:hsqldb:mem:test"
              username="testusername" password="testpassword"
              driverClass="org.hsqldb.jdbcDriver" />
      </dataSourceProvider>
    </jdbc>
  </connection>
</compass>
The c3p0 data source provider uses the c3p0 ComboPooledDataSource. Additional settings can be set using a c3p0.properties file stored as a top-level resource in the same CLASSPATH / classloader that loads c3p0's jar file. Please consult the c3p0 documentation for additional settings.
Compass can be configured to look up the data source using JNDI. Here is an example of using it:
<compass name="default">
  <connection>
    <jdbc>
      <dataSourceProvider>
        <jndi lookup="testds" username="testusername" password="testpassword" />
      </dataSourceProvider>
    </jdbc>
  </connection>
</compass>
The JNDI lookup environment, including the java.naming.factory.initial and java.naming.provider.url JNDI settings, can be configured in the other jndi element, the one directly under the compass element. Note that the username and password are used for the DataSource and are completely optional.
Compass can be configured to use an external data source using the ExternalDataSourceProvider. It uses a Java thread local to store the DataSource for later use by the data source provider. The data source is set using the static method setDataSource(DataSource dataSource) on ExternalDataSourceProvider. Here is an example of how it can be configured:
<compass name="default">
  <connection>
    <jdbc>
      <dataSourceProvider>
        <external username="testusername" password="testpassword"/>
      </dataSourceProvider>
    </jdbc>
  </connection>
</compass>
Note that the username and password are used for the DataSource and are completely optional.
Configuring the Jdbc store with Compass also allows defining FileEntryHandler settings for different file entries in the database. FileEntryHandlers are explained in Appendix B, Lucene Jdbc Directory (and require some Lucene knowledge). The Lucene Jdbc Directory implementation already comes with sensible defaults, but they can be changed using Compass configuration.
One of the things that comes for free with Compass is automatic use of the more performant FetchPerTransactionJdbcIndexInput when possible (based on the dialect). Special care needs to be taken when using this index input, and Compass takes care of it automatically.
Each file entry configuration is associated with a name. The name can be __default__, which is used for all unmapped files, the full name of the stored file, or the suffix of the file (its last 3 characters).
Here is an example of the most common configuration of file entries, changing their buffer size for both index input (used for reading data) and index output (used for writing data):
<compass name="default">
  <connection>
    <jdbc>
      <dataSourceProvider>
        <external username="testusername" password="testpassword"/>
      </dataSourceProvider>
      <fileEntries>
        <fileEntry name="__default__">
          <indexInput bufferSize="4096" />
          <indexOutput bufferSize="4096" />
        </fileEntry>
      </fileEntries>
    </jdbc>
  </connection>
</compass>
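The naming rules also allow a file entry to target a single kind of file by its suffix. The sketch below keeps the __default__ entry and adds a second entry for files ending in fdt (the suffix Lucene uses for stored field data). The choice of fdt and the buffer sizes are illustrative assumptions, not recommended values.

```xml
<compass name="default">
  <connection>
    <jdbc>
      <dataSourceProvider>
        <external username="testusername" password="testpassword"/>
      </dataSourceProvider>
      <fileEntries>
        <!-- applies to every file without a more specific entry -->
        <fileEntry name="__default__">
          <indexInput bufferSize="4096" />
          <indexOutput bufferSize="4096" />
        </fileEntry>
        <!-- matched by suffix (last 3 characters); values are illustrative -->
        <fileEntry name="fdt">
          <indexInput bufferSize="8192" />
          <indexOutput bufferSize="8192" />
        </fileEntry>
      </fileEntries>
    </jdbc>
  </connection>
</compass>
```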
By default, Compass can create the database schema, and it has defaults for the column names, types, sizes and so on. The schema definition is configurable as well. Here is an example of how to configure it:
<compass name="default">
  <connection>
    <jdbc>
      <dataSourceProvider>
        <external username="testusername" password="testpassword"/>
      </dataSourceProvider>
      <ddl>
        <nameColumn name="myname" length="70" />
        <sizeColumn name="mysize" />
      </ddl>
    </jdbc>
  </connection>
</compass>
Compass by default will drop the tables when deleting the index, and create them when creating the index. If performing schema based operations is not allowed, the disableSchemaOperations flag can be set to true. This will cause Compass not to perform any schema based operations.
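A possible way to express this in the XML configuration is sketched below. It assumes that disableSchemaOperations is written as an attribute on the jdbc element; the text above only names the flag, so check the exact placement against the configuration schema before relying on it.

```xml
<compass name="default">
  <connection>
    <!-- ASSUMPTION: the flag is an attribute of the jdbc element.
         With it set, the tables must already exist, since Compass
         will neither create nor drop them. -->
    <jdbc disableSchemaOperations="true">
      <dataSourceProvider>
        <external username="testusername" password="testpassword"/>
      </dataSourceProvider>
    </jdbc>
  </connection>
</compass>
```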
Lucene allows different LockFactory implementations to be used to control how locking is performed. By default, each directory comes with its own default lock factory, but the lock factory can be overridden within the Compass configuration. Here is an example of how this can be done:
<compass name="default">
  <connection>
    <file path="target/test-index" />
    <lockFactory type="nativefs" path="test/#subindex#" />
  </connection>
</compass>
The lock factory type can have the following values: simplefs, nativefs (both file system based locks), nolock, and singleinstance. The fully qualified class name of a LockFactory or LockFactoryProvider implementation can also be provided.
The path attribute provides the path parameter for the file system based locks. It is optional and defaults to the sub index location. The special keyword #subindex# is replaced with the actual sub index name.
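Since the path is optional, a lock factory can also be declared with only its type. The sketch below switches to the simplefs type and omits the path, so the lock files would end up in each sub index location, per the default described above.

```xml
<compass name="default">
  <connection>
    <file path="target/test-index" />
    <!-- no path attribute: locks default to the sub index location -->
    <lockFactory type="simplefs" />
  </connection>
</compass>
```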
Compass supports a local directory cache implementation, which allows a local cache to be defined per sub index or globally for all sub indexes (those that do not have a local cache specifically defined for them). A local cache can be really useful when a certain sub index is heavily accessed and a local in memory cache is required to improve its performance. Another example is using a local file system based cache when working with a Jdbc directory.
Local Cache fully supports several Compass instances running against the same directory (unlike the directory wrappers explained in the next section) and periodically synchronizes its local cache state with external changes.
Here is an example of configuring a memory based local cache for a sub index called a:
<compass name="default">
  <connection>
    <file path="target/test-index" />
    <localCache subIndex="a" connection="ram://" />
  </connection>
</compass>
And here is an example of how it can be configured to use local file system cache for all different sub indexes (using the special __default__ keyword):
<compass name="default">
  <connection>
    <file path="target/test-index" />
    <localCache subIndex="__default__" connection="file://tmp/cache" />
  </connection>
</compass>
Other than using a faster local cache directory implementation, Compass also improves compound file structure performance by performing the compound operation on the local cache and flushing only the already compounded index structure.
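The two local cache examples can plausibly be combined, so that one heavily accessed sub index is cached in memory while every other sub index falls back to a file system cache. The sketch below assumes that several localCache elements may appear under the same connection, which the per sub index and global description suggests but does not state outright. The connection strings are copied from the examples above.

```xml
<compass name="default">
  <connection>
    <file path="target/test-index" />
    <!-- sub index "a" gets its own in memory cache -->
    <localCache subIndex="a" connection="ram://" />
    <!-- every sub index without a specific cache uses this one -->
    <localCache subIndex="__default__" connection="file://tmp/cache" />
  </connection>
</compass>
```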
All the different connection options end up as an instance of a Lucene Directory per sub index. Compass provides the ability to wrap the actual Directory (think of it as a Directory aspect). In order to configure a wrapper, a DirectoryWrapperProvider implementation must be provided. The implementation must implement Directory wrap(String subIndex, Directory dir), which accepts the actual directory and the sub index it is associated with, and returns a wrapped Directory implementation.
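A custom provider would be registered through the directoryWrapperProvider element by its fully qualified class name. In the sketch below, com.example.MyDirectoryWrapperProvider is a hypothetical class standing in for your own DirectoryWrapperProvider implementation, and the name attribute value is arbitrary.

```xml
<compass name="default">
  <connection>
    <file path="target/test-index"/>
    <!-- HYPOTHETICAL class name: replace with your own implementation
         of DirectoryWrapperProvider -->
    <directoryWrapperProvider name="custom"
        type="com.example.MyDirectoryWrapperProvider">
    </directoryWrapperProvider>
  </connection>
</compass>
```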
Compass comes with several built in directory wrappers:
SyncMemoryMirrorDirectoryWrapperProvider wraps the given Lucene directory with SyncMemoryMirrorDirectoryWrapper (which is also provided by Compass). The wrapper wraps the directory with an in memory directory that mirrors it synchronously.
The original directory is read into memory when the wrapper is constructed. All read related operations are performed against the in memory directory. All write related operations are performed both against the in memory directory and the original directory. Locking is performed using the in memory directory.
The wrapper provides the performance gains that come with an in memory index (for read/search operations), while still maintaining a synchronized actual directory, which usually uses a more persistent store than memory (e.g. the file system).
This wrapper will only work in cases when either the index is read only (i.e. only search operations are performed against it), or when there is a single instance which updates the directory.
Here is an example of how to configure the synchronous memory mirror directory wrapper:
<compass name="default">
  <connection>
    <file path="target/test-index"/>
    <directoryWrapperProvider name="test"
        type="org.compass.core.lucene.engine.store.wrapper.SyncMemoryMirrorDirectoryWrapperProvider">
    </directoryWrapperProvider>
  </connection>
</compass>
AsyncMemoryMirrorDirectoryWrapperProvider wraps the given Lucene directory with AsyncMemoryMirrorDirectoryWrapper (which is also provided by Compass). The wrapper wraps the directory with an in memory directory that mirrors it asynchronously.
The original directory is read into memory when the wrapper is constructed. All read related operations are performed against the in memory directory. All write related operations are performed against the in memory directory and are scheduled to be performed against the original directory (in a separate thread). Locking is performed using the in memory directory.
The wrapper provides the performance gains that come with an in memory index (for read/search operations), while still maintaining an asynchronously updated actual directory, which usually uses a more persistent store than memory (e.g. the file system).
This wrapper will only work in cases when either the index is read only (i.e. only search operations are performed against it), or when there is a single instance which updates the directory.
Here is an example of how to configure the asynchronous memory mirror directory wrapper:
<compass name="default">
  <connection>
    <file path="target/test-index"/>
    <directoryWrapperProvider name="test"
        type="org.compass.core.lucene.engine.store.wrapper.AsyncMemoryMirrorDirectoryWrapperProvider">
      <setting name="awaitTermination">10</setting>
      <setting name="sharedThread">true</setting>
    </directoryWrapperProvider>
  </connection>
</compass>
awaitTermination controls how long the wrapper waits for the async write tasks to finish. When Compass is closed, there might still be async tasks pending to be written to the actual directory, and this setting controls how long (in seconds) Compass waits for those tasks to be executed against the actual directory. sharedThread controls whether the sub indexes share a thread for pending "write" operations. If it is set to false, each sub index has its own thread. If it is set to true, a single thread is shared among all the sub indexes.
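For comparison, here is a sketch of the opposite trade-off: each sub index gets its own writer thread, and Compass waits longer on shutdown for pending writes. The value of 30 seconds is an arbitrary illustration, not a recommendation.

```xml
<compass name="default">
  <connection>
    <file path="target/test-index"/>
    <directoryWrapperProvider name="test"
        type="org.compass.core.lucene.engine.store.wrapper.AsyncMemoryMirrorDirectoryWrapperProvider">
      <!-- wait up to 30 seconds for pending writes when closing (illustrative) -->
      <setting name="awaitTermination">30</setting>
      <!-- false: one writer thread per sub index instead of a shared one -->
      <setting name="sharedThread">false</setting>
    </directoryWrapperProvider>
  </connection>
</compass>
```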