A Jdbc based implementation of the Lucene Directory, allowing a Lucene index to be stored within a database. It enables existing or new Lucene based applications to store the Lucene index in a database with no or minimal change to typical Lucene code fragments.
The JdbcDirectory is highly configurable, using the optional JdbcDirectorySettings. All the settings are described in the javadoc, and most of them will be made clear during the next sections.
There are several ways to instantiate a Jdbc directory:
Table B.1. Jdbc Directory Constructors
Parameters | Description |
---|---|
DataSource, Dialect, tableName | Creates a new JdbcDirectory using the given data source and dialect. JdbcTable and JdbcDirectorySettings are created based on default values. |
DataSource, Dialect, JdbcDirectorySettings, tableName | Creates a new JdbcDirectory using the given data source, dialect, and JdbcDirectorySettings. The JdbcTable is created internally. |
DataSource, JdbcTable | Creates a new JdbcDirectory using the given data source and JdbcTable. Creating a new JdbcTable requires a Dialect and JdbcDirectorySettings. |
The Jdbc directory works against a single table (where the table name must be provided when the directory is created). The table schema is described in the following table:
Table B.2. Jdbc Directory Table Schema
Column Name | Column Type | Default Column Name | Description |
---|---|---|---|
Name | VARCHAR | name_ | The file entry name. Similar to a file name within a file system directory. The column size is configurable and defaults to 50. |
Value | BLOB | value_ | A binary column where the content of the file is stored. Based on Jdbc Blob type. Can have a configurable size where appropriate for the database type. |
Size | NUMBER | size_ | The size of the current saved data in the Value column. Similar to the size of a file in a file system. |
Last Modified | TIMESTAMP | lf_ | The time that file was last modified. Similar to the last modified time of a file within a file system. |
Deleted | BIT | deleted_ | If the file is deleted or not. Only used for some of the file types based on the Jdbc directory. More is explained in later sections. |
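To make the schema above concrete, the following is a minimal sketch that assembles a generic CREATE TABLE statement from the default column names and the column types listed in the table. The class and method names are hypothetical, and the NOT NULL and PRIMARY KEY clauses are assumptions; the actual statement is produced by the configured Dialect and will differ between databases.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; the real DDL is generated by the Dialect. */
final class JdbcTableDdlSketch {

    /** Builds a generic CREATE TABLE statement using the default column names. */
    static String createTableSql(String tableName, int nameColumnLength) {
        // LinkedHashMap keeps the columns in the order they are listed in Table B.2.
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("name_", "VARCHAR(" + nameColumnLength + ") NOT NULL"); // NOT NULL is an assumption
        columns.put("value_", "BLOB");
        columns.put("size_", "NUMBER");
        columns.put("lf_", "TIMESTAMP");
        columns.put("deleted_", "BIT");

        StringBuilder sql = new StringBuilder("CREATE TABLE ").append(tableName).append(" (");
        for (Map.Entry<String, String> column : columns.entrySet()) {
            sql.append(column.getKey()).append(' ').append(column.getValue()).append(", ");
        }
        // Using the file name as the primary key is an assumption of this sketch.
        return sql.append("PRIMARY KEY (name_))").toString();
    }

    public static void main(String[] args) {
        System.out.println(createTableSql("search_index", 50));
    }
}
```

The table name `search_index` is only an example; the name column length of 50 matches the default mentioned above.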
The Jdbc directory provides the following operations on top of the ones forced by the Directory interface:
Table B.3. Extended Jdbc Directory Operations
Operation Name | Description |
---|---|
create | Creates the database table (with the above mentioned schema). The create operation drops the table first. |
delete | Drops the table from the database. |
deleteContent | Deletes all the rows from the table in the database. |
tableExists | Returns if the table exists or not. Only supported on some of the databases. |
deleteMarkDeleted | Deletes all the file entries that are marked as deleted and were marked at least "delta" time ago (based on database time, if the dialect supports it). The delta is taken from the JdbcDirectorySettings, or provided as a parameter to the deleteMarkDeleted operation. |
The Jdbc directory requires a Dialect implementation that is specific to the database used with it. The following is a table listing the current dialects supported with the Jdbc directory:
Table B.4. Jdbc Directory SQL Dialects
Dialect | RDBMS | Blob Locator Support* |
---|---|---|
org.apache.lucene.store.jdbc.dialect.OracleDialect | Oracle | Oracle Jdbc Driver - Yes |
org.apache.lucene.store.jdbc.dialect.SQLServerDialect | Microsoft SQL Server | jTds 1.2 - No. Microsoft Jdbc Driver - Unknown |
org.apache.lucene.store.jdbc.dialect.MySQLDialect | MySQL | MySQL Connector J 3.1/5 - Yes with emulateLocators=true in connection string. |
org.apache.lucene.store.jdbc.dialect.MySQLInnoDBDialect | MySQL with InnoDB. | See MySQL |
org.apache.lucene.store.jdbc.dialect.MySQLMyISAMDialect | MySQL with MyISAM | See MySQL |
org.apache.lucene.store.jdbc.dialect.PostgreSQLDialect | PostgreSQL | Postgres Jdbc Driver - Yes. |
org.apache.lucene.store.jdbc.dialect.SybaseDialect | Sybase / Sybase Anywhere | Unknown. |
org.apache.lucene.store.jdbc.dialect.InterbaseDialect | Interbase | Unknown. |
org.apache.lucene.store.jdbc.dialect.FirebirdDialect | Firebird | Unknown. |
org.apache.lucene.store.jdbc.dialect.DB2Dialect | DB2 / DB2 AS400 / DB2 OS390 | Unknown. |
org.apache.lucene.store.jdbc.dialect.DerbyDialect | Derby | Derby Jdbc Driver- Unknown. |
org.apache.lucene.store.jdbc.dialect.HSQLDialect | HypersonicSQL | HSQL Jdbc Driver - No. |
* A Blob locator is a pointer to the actual data, which allows fetching only portions of the Blob at a time. Databases (or Jdbc drivers) that do not use locators usually fetch all the Blob data for each query (which makes using them impractical for large indexes). Note that the support documented here does not cover all possible Jdbc drivers; please refer to your Jdbc driver documentation for more information.
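The practical effect of a locator is that a portion of a Blob can be requested through the standard java.sql.Blob API without materializing the whole value. The sketch below shows such a partial read; it uses the JDK's in-memory SerialBlob purely as a stand-in for a driver-provided Blob, and the class and method names are hypothetical. Whether a real driver fetches only the requested range depends on its locator support, as listed in the table above.

```java
import java.sql.Blob;
import java.sql.SQLException;
import java.util.Arrays;
import javax.sql.rowset.serial.SerialBlob;

/** Hypothetical helper showing a ranged read through the java.sql.Blob API. */
final class BlobChunkSketch {

    /** Reads up to length bytes starting at a zero-based offset. */
    static byte[] readChunk(Blob blob, long offset, int length) throws SQLException {
        long remaining = blob.length() - offset;
        int toRead = (int) Math.max(0, Math.min(length, remaining));
        if (toRead == 0) {
            return new byte[0]; // nothing left to read at this offset
        }
        // Blob positions are one-based, hence the +1.
        return blob.getBytes(offset + 1, toRead);
    }

    public static void main(String[] args) throws SQLException {
        // SerialBlob stands in for a Blob returned by a Jdbc driver.
        Blob blob = new SerialBlob(new byte[] {10, 20, 30, 40, 50, 60});
        System.out.println(Arrays.toString(readChunk(blob, 2, 3)));
    }
}
```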
Minor performance improvements can be gained if JdbcTable is cached and used to create different JdbcDirectory instances.
It is best to use a pooled data source (like Jakarta Commons DBCP), so that Connections are reused from the pool rather than created every time.
Most of the time, when working with the Jdbc directory, it is best to work with a non compound index format. Since a database has no problem of too many open files, the larger number of files is not an issue. The package comes with a set of utilities to compound or uncompound an index, located in the org.apache.lucene.index.LuceneUtils class, in case you already have an index and it is in the wrong structure.
When indexing data, a possible performance improvement is to index the data into the file system or memory, and then copy the contents of the index over to the database. org.apache.lucene.index.LuceneUtils comes with a utility that copies one directory to another and can change the compound state of the index while copying.
JdbcDirectory performs no transaction management. All database related operations WITHIN IT work in the following manner:

    Connection conn = DataSourceUtils.getConnection(dataSource);
    // perform any database related operation using the connection
    DataSourceUtils.releaseConnection(conn);

As you can see, neither commit nor rollback is called on the connection, which allows any type of transaction management to be performed outside of the actual JdbcDirectory related operations. Also, the fact that we are using the Jdbc DataSource allows for pluggable transaction management support (usually based on a DataSource delegate and a Connection proxy). DataSourceUtils is a utility class that comes with the Jdbc directory, and its usage will be made clear in the following sections.
There are several options when it comes to transaction management, and they are:
When configuring the DataSource or the Connection to use autoCommit (set it to true), no transaction management is required. Additional benefit is that any existing Lucene code will work as is with the JdbcDirectory (assuming that the Directory class was used instead of the actual implementation type).
The main problems with using the Jdbc directory in autoCommit mode are that performance suffers, and that not all databases allow Blobs to be used with autoCommit. As you will see later on, the other transaction management options are simple to use, and the Jdbc directory comes with a set of helper classes that make the transition to "Jdbc directory enabled code" simple.
When the application does not use any transaction managers (like JTA or Spring's PlatformTransactionManager), the Jdbc directory comes with a simple local transaction management based on Connection proxy and thread bound Connections.
The TransactionAwareDataSourceProxy can wrap a DataSource, returning Jdbc Connection only if there is no existing Connection that was opened before (within the same thread) and not closed yet. Any call to the close method on this type of Connection (which we call a "not controlled" connection) will result in a no op. The DataSourceUtils#releaseConnection will also take care and not close the Connection if it is not controlled.
So, how do we rollback or commit the Connection? DataSourceUtils has two methods, commitConnectionIfPossible and rollbackConnectionIfPossible, which will only commit/rollback the Connection if it was proxied by the TransactionAwareDataSourceProxy, and it is a controlled Connection.
A simple code that performs the above mentioned:

    JdbcDirectory jdbcDir = // ... create the jdbc directory
    Connection conn = DataSourceUtils.getConnection(dataSource);
    try {
        // you can also use an already open IndexReader
        IndexReader indexReader = IndexReader.open(jdbcDir);
        // ...
        // will commit the connection if controlling it
        DataSourceUtils.commitConnectionIfPossible(conn);
    } catch (IOException e) {
        DataSourceUtils.safeRollbackConnectionIfPossible(conn);
        throw e;
    } finally {
        DataSourceUtils.releaseConnection(conn);
    }

Note, that the above code will also work when you do have a transaction manager (as described in the next section), and it forms the basis for the DirectoryTemplate (described later) that comes with Jdbc directory.
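The distinction between a controlled and a not controlled Connection can be illustrated with a small self-contained sketch. It is not the library's TransactionAwareDataSourceProxy: all class and method names here are hypothetical, and the Connection is a recording stub built with java.lang.reflect.Proxy so the example runs without a database. The first acquisition on a thread opens and controls the Connection; nested acquisitions on the same thread reuse it, and their commit and release calls are ignored.

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of thread bound connections; not the library implementation. */
final class ThreadBoundConnectionSketch {
    private final Supplier<Connection> opener;
    private final ThreadLocal<Connection> bound = new ThreadLocal<>();

    ThreadBoundConnectionSketch(Supplier<Connection> opener) {
        this.opener = opener;
    }

    /** A lease remembers whether it controls the underlying connection. */
    final class Lease {
        final Connection target;
        final boolean controlled;

        Lease(Connection target, boolean controlled) {
            this.target = target;
            this.controlled = controlled;
        }

        /** Commits only when this lease opened the connection. */
        void commitIfPossible() throws SQLException {
            if (controlled) {
                target.commit();
            }
        }

        /** Closes only when this lease opened the connection; otherwise a no op. */
        void release() throws SQLException {
            if (controlled) {
                bound.remove();
                target.close();
            }
        }
    }

    Lease acquire() {
        Connection existing = bound.get();
        if (existing != null) {
            return new Lease(existing, false); // reuse, not controlled
        }
        Connection fresh = opener.get();
        bound.set(fresh);
        return new Lease(fresh, true); // first on this thread, controlled
    }

    /** Stub connection that records the name of every method called on it. */
    static Connection recordingConnection(List<String> calls) {
        return (Connection) Proxy.newProxyInstance(
                Connection.class.getClassLoader(),
                new Class<?>[] {Connection.class},
                (proxy, method, methodArgs) -> {
                    calls.add(method.getName());
                    return null;
                });
    }

    public static void main(String[] args) throws SQLException {
        List<String> calls = new ArrayList<>();
        ThreadBoundConnectionSketch sketch =
                new ThreadBoundConnectionSketch(() -> recordingConnection(calls));
        Lease outer = sketch.acquire();
        Lease inner = sketch.acquire(); // same thread: not controlled
        inner.commitIfPossible();       // ignored
        inner.release();                // ignored
        outer.commitIfPossible();
        outer.release();
        System.out.println(calls);      // prints [commit, close]
    }
}
```

Only the outer lease reaches the stub, which is why nested code can call commit and release freely without ending the surrounding unit of work.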
For environments that use external transaction managers (like JTA or Spring's PlatformTransactionManager), the transaction management should be performed outside of the code that uses the Jdbc directory. Do not use the Jdbc directory TransactionAwareDataSourceProxy.
For JTA, for example, if a Container Managed transaction is used, the executing code should reside within it. If not, the JTA transaction should be executed programmatically.
When using Spring, the executing code should reside within a transactional context, using either a transaction proxy (AOP), or the PlatformTransactionManager and the TransactionTemplate programmatically. IMPORTANT: When using Spring, you should wrap the DataSource with Spring's own TransactionAwareDataSourceProxy.
Since transaction management might require specific code to be written, the Jdbc directory comes with a DirectoryTemplate class, which allows code to be written without regard to the Directory implementation or the transaction management in use. The directory template performs transaction management support code only if the Directory is of type JdbcDirectory and the transaction management is a local one (DataSource transaction management).
Each directory based operation (done by Lucene IndexReader, IndexSearcher and IndexWriter) should be wrapped by the DirectoryTemplate. An example of using it:

    DirectoryTemplate template = new DirectoryTemplate(dir); // use a pre-configured directory
    template.execute(new DirectoryTemplate.DirectoryCallbackWithoutResult() {
        protected void doInDirectoryWithoutResult(Directory dir) throws IOException {
            IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
            // index write operations
            writer.close();
        }
    });

    // or, for example, if we have a cached IndexSearcher
    template.execute(new DirectoryTemplate.DirectoryCallbackWithoutResult() {
        protected void doInDirectoryWithoutResult(Directory dir) throws IOException {
            // indexSearcher operations
        }
    });

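The template and callback pattern used here can be sketched independently of Lucene and Jdbc. The example below is an illustration only, with hypothetical names throughout: the Transaction interface stands in for the Connection handling that the real template delegates to DataSourceUtils. It shows the control flow the template is described as providing: commit after the callback succeeds, roll back when it throws, and release in every case.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a commit/rollback/release template; not DirectoryTemplate itself. */
final class TemplateSketch {

    /** Stand-in for the transactional resource. */
    interface Transaction {
        void commit();
        void rollback();
        void release();
    }

    /** The user code that runs inside the template. */
    interface Callback<T> {
        T doInTransaction() throws IOException;
    }

    static <T> T execute(Transaction tx, Callback<T> callback) throws IOException {
        try {
            T result = callback.doInTransaction();
            tx.commit();   // reached only when the callback succeeded
            return result;
        } catch (IOException e) {
            tx.rollback(); // undo the work, then let the caller see the failure
            throw e;
        } finally {
            tx.release();  // always runs, after commit or rollback
        }
    }

    /** Transaction that records what happened to it, for demonstration only. */
    static Transaction recording(List<String> log) {
        return new Transaction() {
            public void commit() { log.add("commit"); }
            public void rollback() { log.add("rollback"); }
            public void release() { log.add("release"); }
        };
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        try {
            execute(recording(log), () -> "indexed");
            execute(recording(log), () -> { throw new IOException("write failed"); });
        } catch (IOException expected) {
            log.add("caller saw failure");
        }
        System.out.println(log);
    }
}
```

Keeping the transaction handling in one place means the callback body contains only index operations, which is what lets the same calling code run against any Directory implementation.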
A FileEntryHandler is an interface used by the Jdbc directory to delegate file level operations to it. The JdbcDirectorySettings has a default file entry handler which handles all unmapped file names. It also provides the ability to register a FileEntryHandler against either an exact file name, or a file extension (3 characters after the '.').
When the JdbcDirectory is created, all the different file entry handlers that are registered with the directory settings are created and configured. They will then be used to handle files based on the file names.
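As a rough illustration of name based handler selection, the sketch below resolves a handler for a file name by trying an exact name match, then the extension after the last '.', then the default. The class and its methods are hypothetical, handlers are represented by plain strings, and the lookup order is an assumption of this sketch rather than documented library behavior.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical lookup sketch; handlers are represented by their simple names. */
final class HandlerLookupSketch {
    private final Map<String, String> byExactName = new HashMap<>();
    private final Map<String, String> byExtension = new HashMap<>();
    private final String defaultHandler;

    HandlerLookupSketch(String defaultHandler) {
        this.defaultHandler = defaultHandler;
    }

    void registerName(String fileName, String handler) {
        byExactName.put(fileName, handler);
    }

    void registerExtension(String extension, String handler) {
        byExtension.put(extension, handler);
    }

    /** Exact name first, then extension, then the default (assumed order). */
    String resolve(String fileName) {
        String handler = byExactName.get(fileName);
        if (handler == null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0) {
                handler = byExtension.get(fileName.substring(dot + 1));
            }
        }
        return handler != null ? handler : defaultHandler;
    }

    public static void main(String[] args) {
        HandlerLookupSketch lookup = new HandlerLookupSketch("MarkDeleteFileEntryHandler");
        lookup.registerName("segments", "ActualDeleteFileEntryHandler");
        lookup.registerExtension("tmp", "ActualDeleteFileEntryHandler"); // hypothetical registration
        System.out.println(lookup.resolve("segments"));
        System.out.println(lookup.resolve("_1.tmp"));
        System.out.println(lookup.resolve("_1.fdt"));
    }
}
```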
When registering a new file entry handler, it must be registered with JdbcFileEntrySettings. The JdbcFileEntrySettings is a fancy wrapper around java Properties in order to provide an open way for configuring file entry handlers. When creating a new JdbcFileEntrySettings it already has sensible defaults (refer to the javadoc for them), but of course they can be changed. One important configuration setting is the type of the FileEntryHandler, which should be set under the constant setting name: JdbcFileEntrySettings#FILE_ENTRY_HANDLER_TYPE and should be the fully qualified class name of the file entry handler.
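Configuring a handler by its fully qualified class name implies that the class is loaded and instantiated reflectively from the settings. A minimal sketch of that mechanism with plain java.util.Properties follows. The property key, the Handler interface, and the NoOpHandler class are all hypothetical stand-ins; the real key is the value of JdbcFileEntrySettings#FILE_ENTRY_HANDLER_TYPE, which is not reproduced here.

```java
import java.util.Properties;

/** Hypothetical sketch of class name based configuration; not the library's settings class. */
final class HandlerSettingsSketch {

    /** Made-up key for illustration only. */
    static final String HANDLER_TYPE_KEY = "example.fileEntryHandler.type";

    /** Stand-in for the FileEntryHandler interface. */
    interface Handler {
        String describe();
    }

    /** Stand-in handler with a no-argument constructor. */
    static final class NoOpHandler implements Handler {
        public String describe() {
            return "no-op";
        }
    }

    /** Loads the configured class by name and instantiates it. */
    static Handler create(Properties settings) throws ReflectiveOperationException {
        // Fall back to the no-op handler when nothing is configured.
        String className = settings.getProperty(HANDLER_TYPE_KEY, NoOpHandler.class.getName());
        Class<?> type = Class.forName(className);
        return (Handler) type.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Properties settings = new Properties();
        settings.setProperty(HANDLER_TYPE_KEY, NoOpHandler.class.getName());
        System.out.println(create(settings).describe());
    }
}
```

A string based setting like this is what keeps the configuration open: new handler implementations can be plugged in without changing the settings class.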
The Jdbc directory package comes with three different FileEntryHandlers. They are:
Table B.5. File Entry Handler Types
Type | Description |
---|---|
org.apache.lucene.store.jdbc.handler.NoOpFileEntryHandler | Performs no operations. |
org.apache.lucene.store.jdbc.handler.ActualDeleteFileEntryHandler | Performs actual delete from the database when the different delete operations are called. Also supports configurable IndexInput and IndexOutput (described later). |
org.apache.lucene.store.jdbc.handler.MarkDeleteFileEntryHandler | Marks entries in the database as deleted (using the deleted column) when the different delete operations are called. Also supports configurable IndexInput and IndexOutput (described later). |
Most of the files use the MarkDeleteFileEntryHandler, since there might be other currently open IndexReaders or IndexSearchers that use the files. The JdbcDirectory provides the deleteMarkDeleted() and deleteMarkDeleted(delta) operations to actually purge old entries that are marked as deleted. They should be scheduled and executed once in a while in order to keep the database table compact.
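The mark then purge cycle can be modeled in memory to show why the delta matters. In the sketch below (hypothetical names, a map in place of the database table), deleting a file only flags its row, and a later purge removes rows that were flagged at least delta milliseconds ago. It assumes that marking a row also updates its last modified time, which is what the purge compares against.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory model of mark-delete and purge; not the library implementation. */
final class MarkDeleteSketch {

    private static final class Row {
        boolean deleted;
        long lastModifiedMillis;
    }

    private final Map<String, Row> rows = new LinkedHashMap<>();

    void write(String name, long nowMillis) {
        Row row = new Row();
        row.lastModifiedMillis = nowMillis;
        rows.put(name, row);
    }

    /** Flags the row instead of removing it, so open readers can still use it. */
    void markDeleted(String name, long nowMillis) {
        Row row = rows.get(name);
        if (row != null) {
            row.deleted = true;
            row.lastModifiedMillis = nowMillis; // assumption: marking updates the timestamp
        }
    }

    /** Removes rows flagged at least deltaMillis ago and returns how many were purged. */
    int deleteMarkDeleted(long deltaMillis, long nowMillis) {
        int purged = 0;
        for (Iterator<Row> it = rows.values().iterator(); it.hasNext();) {
            Row row = it.next();
            if (row.deleted && nowMillis - row.lastModifiedMillis >= deltaMillis) {
                it.remove();
                purged++;
            }
        }
        return purged;
    }

    boolean exists(String name) {
        return rows.containsKey(name);
    }

    public static void main(String[] args) {
        MarkDeleteSketch table = new MarkDeleteSketch();
        table.write("_1.fdt", 0);
        table.write("_2.fdt", 0);
        table.markDeleted("_1.fdt", 1_000);  // flagged long ago
        table.markDeleted("_2.fdt", 50_000); // flagged recently
        int purged = table.deleteMarkDeleted(60_000, 70_000);
        System.out.println(purged + " purged, _2.fdt still present: " + table.exists("_2.fdt"));
    }
}
```

The recently flagged row survives the purge, which gives readers that opened before the delete a window in which the data is still available.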
When creating new JdbcDirectorySettings, it already registers different file entry handlers for specific files automatically. For example, the deleted file is registered against a NoOpFileEntryHandler since we will always be able to delete entries from the database (the deleted file is used to store files that could not be deleted from the file system). This results in better performance since no operations are executed against the deleted (or deleted related files). Another example, is registering the ActualDeleteFileEntryHandler against the segments file, since we do want to delete it and replace it with a new one when it is written.
Each file entry handler can be associated with an implementation of IndexInput. The IndexInput type should be set under the constant JdbcFileEntrySettings#INDEX_INPUT_TYPE_SETTING and be the fully qualified class name of the IndexInput implementation.
The Jdbc directory comes with the following IndexInput types:
Table B.6. Index Input Types
Type | Description |
---|---|
org.apache.lucene.store.jdbc.index.FetchOnOpenJdbcIndexInput | Fetches and caches all the binary data from the database when the IndexInput is opened. Perfect for small sized file entries (like the segments file). |
org.apache.lucene.store.jdbc.index.FetchOnBufferReadJdbcIndexInput | Extends the JdbcBufferedIndexInput class, and fetches the data from the database every time the internal buffer needs to be refilled. The JdbcBufferedIndexInput allows setting the buffer size using the JdbcBufferedIndexInput#BUFFER_SIZE_SETTING. Remember that you can set a different buffer size for different files by registering different file entry handlers with the JdbcDirectorySettings. |
org.apache.lucene.store.jdbc.index.FetchPerTransactionJdbcIndexInput | Caches blobs per transaction. Only supported for dialects that support blobs per transaction. Note that using this index input requires calling FetchPerTransactionJdbcIndexInput#releaseBlobs(java.sql.Connection) when the transaction ends. This is taken care of automatically when using TransactionAwareDataSourceProxy. If using JTA, for example, a transaction synchronization should be registered with JTA to clear the blobs. Like FetchOnBufferReadJdbcIndexInput, it extends the JdbcBufferedIndexInput class, fetches the data from the database every time the internal buffer needs to be refilled, and allows setting the buffer size using the JdbcBufferedIndexInput#BUFFER_SIZE_SETTING. |
The JdbcDirectorySettings automatically registers sensible defaults for the default file entry handler and specific ones for specific files. Please refer to the javadocs for the defaults.
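The trade-off between fetching everything on open and fetching on each buffer refill comes down to how many round trips are made and how much data each one carries. The following sketch (hypothetical names, an in-memory byte array in place of the Blob) shows the refill approach: the reader asks its source for the next chunk only when the buffer is exhausted, and counts the fetches so the effect of the buffer size is visible.

```java
import java.util.Arrays;

/** Hypothetical sketch of buffer-refill reading; not the library's JdbcBufferedIndexInput. */
final class BufferedFetchSketch {

    /** Stand-in for a ranged Blob read against the database. */
    interface ChunkSource {
        byte[] fetch(long offset, int length);
        long length();
    }

    private final ChunkSource source;
    private final int bufferSize;
    private byte[] buffer = new byte[0];
    private int bufferPos = 0;
    private long nextOffset = 0;
    int fetchCount = 0; // number of simulated database round trips

    BufferedFetchSketch(ChunkSource source, int bufferSize) {
        this.source = source;
        this.bufferSize = bufferSize;
    }

    byte readByte() {
        if (bufferPos == buffer.length) {
            refill();
        }
        return buffer[bufferPos++];
    }

    /** Fetches the next chunk, at most bufferSize bytes. */
    private void refill() {
        int toFetch = (int) Math.min(bufferSize, source.length() - nextOffset);
        if (toFetch <= 0) {
            throw new IllegalStateException("read past end of file entry");
        }
        buffer = source.fetch(nextOffset, toFetch);
        nextOffset += toFetch;
        bufferPos = 0;
        fetchCount++;
    }

    static ChunkSource inMemory(byte[] data) {
        return new ChunkSource() {
            public byte[] fetch(long offset, int length) {
                return Arrays.copyOfRange(data, (int) offset, (int) offset + length);
            }
            public long length() {
                return data.length;
            }
        };
    }

    public static void main(String[] args) {
        byte[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        BufferedFetchSketch input = new BufferedFetchSketch(inMemory(data), 4);
        for (int i = 0; i < data.length; i++) {
            input.readByte();
        }
        System.out.println("fetches: " + input.fetchCount); // prints fetches: 3
    }
}
```

Ten bytes with a buffer of four take three fetches; a buffer at least as large as the entry degenerates into a single fetch, which is the behavior that suits small entries such as the segments file.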
Each file entry handler can be associated with an implementation of IndexOutput. The IndexOutput type should be set under the constant JdbcFileEntrySettings#INDEX_OUTPUT_TYPE_SETTING and be the fully qualified class name of the IndexOutput implementation.
The Jdbc directory comes with the following IndexOutput types:
Table B.7. Index Output Types
Type | Description |
---|---|
org.apache.lucene.store.jdbc.index.RAMJdbcIndexOutput | Extends the JdbcBufferedIndexOutput class, and stores the data to be written in memory (within a growing list of bufferSize sized byte arrays). The JdbcBufferedIndexOutput allows setting the buffer size using the JdbcBufferedIndexOutput#BUFFER_SIZE_SETTING. Perfect for small sized file entries (like the segments file). |
org.apache.lucene.store.jdbc.index.FileJdbcIndexOutput | Extends the JdbcBufferedIndexOutput class, and stores the data to be written in a temporary file. The JdbcBufferedIndexOutput allows setting the buffer size using the JdbcBufferedIndexOutput#BUFFER_SIZE_SETTING (a write is performed every time the buffer is flushed). |
org.apache.lucene.store.jdbc.index.RAMAndFileJdbcIndexOutput | A special index output that starts with a RAM based index output and, if a configurable threshold is met, switches to a file based index output. The threshold can be configured using RAMAndFileJdbcIndexOutput#INDEX_OUTPUT_THRESHOLD_SETTING. |
The JdbcDirectorySettings automatically registers sensible defaults for the default file entry handler and specific ones for specific files. Please refer to the javadocs for the defaults.
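The idea of starting in memory and switching to a temporary file once a threshold is passed can be sketched with standard java.io and java.nio classes. This is an illustration under stated assumptions, not the library's RAMAndFileJdbcIndexOutput: the class name is hypothetical, the threshold is counted in bytes, and the sketch hands back a byte array at the end where a real implementation would write the data to the Blob column.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of an output that spills from RAM to a temporary file. */
final class RamThenFileOutputSketch extends OutputStream {
    private final long thresholdBytes;
    private ByteArrayOutputStream ram = new ByteArrayOutputStream();
    private OutputStream file; // non-null once the data has spilled to disk
    private Path tempFile;
    private long written;

    RamThenFileOutputSketch(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    @Override
    public void write(int b) throws IOException {
        if (file == null && written + 1 > thresholdBytes) {
            spillToFile();
        }
        if (file != null) {
            file.write(b);
        } else {
            ram.write(b);
        }
        written++;
    }

    /** Copies what was buffered in memory into a temporary file and continues there. */
    private void spillToFile() throws IOException {
        tempFile = Files.createTempFile("jdbc-index-output", ".tmp");
        file = Files.newOutputStream(tempFile);
        ram.writeTo(file);
        ram = null;
    }

    boolean isFileBacked() {
        return file != null;
    }

    /** Returns everything written so far and removes the temporary file, if any. */
    byte[] finish() throws IOException {
        if (file == null) {
            return ram.toByteArray();
        }
        file.close();
        byte[] all = Files.readAllBytes(tempFile);
        Files.delete(tempFile);
        return all;
    }

    public static void main(String[] args) throws IOException {
        RamThenFileOutputSketch out = new RamThenFileOutputSketch(4);
        out.write(new byte[] {1, 2, 3});
        System.out.println("file backed after 3 bytes: " + out.isFileBacked());
        out.write(new byte[] {4, 5, 6});
        System.out.println("file backed after 6 bytes: " + out.isFileBacked());
        System.out.println("total bytes: " + out.finish().length);
    }
}
```

Small entries never touch the disk, while large ones stop growing the heap once the threshold is crossed, which is the motivation for combining the two output types.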