Indexing & Searching

This chapter presents the architecture of the indexing and search service in Nuxeo EP 5.

10.1. Introduction

This chapter is under construction.

10.2. Configuration

For performance and volume reasons, the search service neither indexes all the content of the application nor provides the full content in search results. What gets indexed and what gets returned must be specified in the configuration, along with how the data is to be processed.

The search service is configured through the standard Nuxeo Runtime extension point system. The target is org.nuxeo.ecm.core.search.service.SearchServiceImpl. For a complete example, check the default configuration file: nuxeo-platform-search-core/OSGI-INF/nxsearch-contrib.xml (from Nuxeo EP sources).

10.2.1. Concepts

The main concepts are Resource and Field. A Resource holds several fields. It has a name and a prefix, which is used, e.g., in queries. Resources are supposed to be unsplittable, but they are usually aggregated. It's up to the backend to decide how to handle that aggregation; this goes beyond the scope of the present documentation.

At this point the search service handles documents only, so it's safe to say that documents correspond to aggregates of resources, and resources correspond to schemas. In the future, there'll be more types of resources.

10.2.2. The `indexableDocType` extension point

The core types of documents to index have to be registered against this extension point, in which the schemas to index are bound to each document type.

Here's an example demonstrating the available patterns:

<extension target="org.nuxeo.ecm.core.search.service.SearchServiceImpl"
           point="indexableDocType">
(...)
  <indexableDocType name="DocType1" indexAllSchemas="true"/>
  <indexableDocType name="DocType2" indexAllSchemas="true">
    <exclude>unwanted_schema</exclude>
  </indexableDocType>
  <indexableDocType name="DocType3">
    <resource>indexed_schema</resource>
  </indexableDocType>
</extension>

In this example, the Search service will index all schemas for documents of type DocType1, all schemas except unwanted_schema for type DocType2 and only indexed_schema for type DocType3.

Each of these indexable schemas will be processed according to the corresponding indexable schema resource if one is available in the configuration, or according to the default one otherwise.

In particular, data from a given schema is processed in the same way across all document types.
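The three patterns of the example can be summarized in a few lines of Java. This is only a sketch of the selection rules as described in this section, not the Search Service's actual code: the class IndexableDocTypeSketch, its method, and the other_schema name are hypothetical, and combinations that this section doesn't describe (for instance resource elements together with indexAllSchemas="true") are not modelled.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper mirroring the indexableDocType rules described above.
final class IndexableDocTypeSketch {

    /**
     * @param docSchemas      all schemas carried by the document type
     * @param indexAllSchemas value of the indexAllSchemas attribute
     * @param excluded        contents of the exclude elements
     * @param resources       contents of the resource elements
     */
    static Set<String> schemasToIndex(List<String> docSchemas, boolean indexAllSchemas,
                                      Set<String> excluded, Set<String> resources) {
        Set<String> result = new LinkedHashSet<>();
        for (String schema : docSchemas) {
            boolean selected = indexAllSchemas
                    ? !excluded.contains(schema)   // DocType1 and DocType2 patterns
                    : resources.contains(schema);  // DocType3 pattern
            if (selected) {
                result.add(schema);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> schemas = List.of("indexed_schema", "unwanted_schema", "other_schema");
        // DocType1: everything is indexed
        System.out.println(schemasToIndex(schemas, true, Set.of(), Set.of()));
        // DocType2: everything except unwanted_schema
        System.out.println(schemasToIndex(schemas, true, Set.of("unwanted_schema"), Set.of()));
        // DocType3: only indexed_schema
        System.out.println(schemasToIndex(schemas, false, Set.of(), Set.of("indexed_schema")));
    }
}
```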

10.2.3. The `resource` extension point

This extension point is used to declare an indexable resource. In 5.1M2, the only provided indexable resource type is the schema resource type, but the logic will stay the same for future types. Recall that resources are made of fields. Here's an example of schema resource declarations, without the details of the fields.

<extension target="org.nuxeo.ecm.core.search.service.SearchServiceImpl"
           point="resource" type="schema">

  <resource name="dublincore" prefix="dc" indexAllFields="true">
    <excludedField>issued</excludedField>
    <field name="title" (...) />
  </resource>

  <resource name="book" prefix="bk" type="schema">
    <field name="barcode" (...) />
  </resource>

</extension>

The type attribute specifies that the resource is a document schema resource.

The name and prefix attributes are mandatory. They should match the ones from the schema extension point of org.nuxeo.ecm.core.schema.TypeService as in, e.g.:

<schema name="dublincore" prefix="dc" src="schema/dublincore.xsd"/>
<schema name="common" src="schema/common.xsd" />

A missing prefix in the core configuration, as in the common schema declaration in the example above, will default to the schema name.

The prefix is important, since all subsequent references to the fields (in queries and raw search results) take the prefix:fieldName form.
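To make the defaulting rule concrete, here is a minimal Java sketch of how a prefixed field name could be derived. The class PrefixedNameSketch and the field name someField are hypothetical and only serve the illustration; they are not part of the Nuxeo API.

```java
// Hypothetical helper; not part of the Nuxeo API.
final class PrefixedNameSketch {

    /** Returns the declared prefix, or the schema name if no prefix is declared. */
    static String effectivePrefix(String schemaName, String declaredPrefix) {
        return (declaredPrefix == null || declaredPrefix.isEmpty()) ? schemaName : declaredPrefix;
    }

    /** Builds the prefix:fieldName form used in queries and raw search results. */
    static String prefixedName(String schemaName, String declaredPrefix, String fieldName) {
        return effectivePrefix(schemaName, declaredPrefix) + ":" + fieldName;
    }

    public static void main(String[] args) {
        // dublincore declares the dc prefix
        System.out.println(prefixedName("dublincore", "dc", "title"));
        // common declares no prefix, so the schema name is used instead
        System.out.println(prefixedName("common", null, "someField"));
    }
}
```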

10.2.4. Field configuration

Field behavior is designed to be uniform across the different kinds of resources.

10.2.4.1. The `type` attribute

The field type tells the search engine how to process the field. This attribute is mandatory and case-insensitive.

The following table summarizes the available types. The listed Java classes are guaranteed to work. The backend might implement more converting facilities.

Type	Java classes	Comment
Keyword	String	Meant for technical, non-binary strings such as vocabulary keys and user identifiers. Equality matches are guaranteed to be exact.
Text	String	Upon indexing, these fields are tokenized and analyzed to support fulltext queries.
Path	org.nuxeo.common.utils.Path, String	Dedicated to `STARTSWITH` queries. Equality queries are not guaranteed.
Date	java.util.Calendar	For date and time indexing
Int	int	-
Builtin	-	Reserved for internal use

10.2.5. Text fields and analyzers

At indexing time, the contents of text fields go through the process of tokenization and analysis, whose main goal is to provide fulltext search capabilities, usually at the expense of exact matches.

Tokenization means converting the textual content into a stream of words (tokens). During the analysis step, this raw stream is altered to better suit the needs of fulltext searches. It is, for example, common practice to strip the stream of so-called stop words (the most common words in the language) and to degrade accented characters. One can apply further linguistic processing, such as stemming or synonym analysis.
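The following Java sketch illustrates these two steps on a single string: it degrades accented characters, splits the text into tokens, and drops stop words. It is a deliberately naive illustration built on the JDK only. Real analyzers are provided by the backend; the class name AnalysisSketch and the stop word list are assumptions made for the example.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Naive illustration of tokenization and analysis; not the backend's analyzer.
final class AnalysisSketch {

    static List<String> analyze(String text, Set<String> stopWords) {
        // Degrade accented characters: decompose them, then drop the combining marks.
        String folded = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")
                .toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        // Tokenization: split on anything that is neither a letter nor a digit.
        for (String token : folded.split("[^\\p{L}\\p{N}]+")) {
            // Analysis: discard empty tokens and stop words.
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> frenchStopWords = Set.of("le", "la", "de");
        // "caf\u00e9" is "café" written with a Unicode escape
        System.out.println(analyze("Le caf\u00e9 de la gare", frenchStopWords));
    }
}
```

With this sketch, a search for "cafe" and a search for "café" produce the same token, which is the kind of trade-off against exact matches mentioned above.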

The name of the analyzer to use on a given text field is specified through the analyzer attribute. The Search Service acts as a pure forwarder, sending the raw text to the backend, along with the specified analyzer name.

The default value for the analyzer attribute is default. The attribute is simply ignored for other field types.

10.2.6. Boolean attributes

To enable queries on a given field, one must set the indexed attribute to true.
To make it possible to provide the full field value in search results, one must set the stored attribute to true. Keep in mind that the purpose is to present the user with a limited yet sufficient set of information along with search results, i.e., more quickly than by fetching it from the core, and not to duplicate the content in the search databases.
Multivalued fields have to be flagged by setting the multiple attribute to true.
Depending on the field type and on the processing done by the backend, using the field value as a sort key might require some additional resources. To force the backend to make this extra effort, set the sortable attribute to true.
The binary attribute is used to mark binary fields, e.g., to trigger conversions (not used in 5.1M2).

10.2.7. Schema resources and fields without configuration

10.2.8. Schema resources

For a schema resource that isn't explicitly declared and is nevertheless used, for instance because of an indexAllSchemas statement, a default configuration is inferred, with the prefix read from the Core configuration, and fields configured as described below, i.e., as if there were an indexAllFields="true" attribute.

10.2.9. Automatic fields configuration

Auto-configured fields are unstored.
If there is only one relevant type from the table above, it is applied.
The multiple attribute is properly set.
Auto-configured String fields get the keyword type.

<resource name="dublincore" prefix="dc" indexAllFields="true" type="schema">
  <field name="title" analyzer="default"
         stored="true" indexed="true" type="Text"
         binary="false" sortable="true"/>
</resource>

Example 10.1. Relying on automatic field configuration on all fields but one
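The type selection rules for automatic fields can be pictured as a lookup from the Java class of the value to a type from the table of Section 10.2.4.1. The sketch below is a hypothetical model of those rules, not the code used by the Search Service. The class name AutoFieldSketch is an assumption, org.nuxeo.common.utils.Path is left out to keep the example free of dependencies, and classes the table doesn't list simply yield no type.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Optional;

// Hypothetical model of the automatic type selection; the real inference may differ.
final class AutoFieldSketch {

    /** Picks a field type name for a Java class, if the type table lists one. */
    static Optional<String> inferType(Class<?> javaClass) {
        if (String.class.equals(javaClass)) {
            // String is listed for several types; auto-configuration picks keyword.
            return Optional.of("keyword");
        }
        if (Calendar.class.isAssignableFrom(javaClass)) {
            return Optional.of("date");
        }
        if (Integer.class.equals(javaClass) || int.class.equals(javaClass)) {
            return Optional.of("int");
        }
        // Anything else is outside what this section describes.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(inferType(String.class));
        System.out.println(inferType(GregorianCalendar.class));
        System.out.println(inferType(Object.class));
    }
}
```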

10.3. Programmatic Searching

The search service exposes the searchQuery method as its single entry point. The method takes as input an instance of ComposedNXQuery, which encapsulates the parsed NXQL query and a SearchPrincipal instance that will be checked against the security indexes, together with paging information for the results.

Although the input of searchQuery is an already parsed NXQL statement, we'll use NXQL query strings in what follows for clarity. NXQL query strings are parsed by the static parse method of the org.nuxeo.ecm.core.query.sql.SQLQueryParser class.

10.3.1. Fields and literals

Within NXQL requests, references to field values have to follow the "prefix:fieldName" scheme, where prefix and fieldName have been specified through the configuration extension points (recall that for automatically indexed schema resources, the prefix defaults to the one defined in the schema definition).

Literals (constants) follow the JCR specifications. Notably:

String literals have to be enclosed in single quotes.
Lists have to be enclosed in parentheses.

Recall that "dc" is the prefix for the "dublincore" schema.

SELECT * FROM Document WHERE dc:title='Nuxeo book' ORDER BY dc:created DESC

Example 10.2. Sample NXQL queries

10.3.2. `WHERE` statements

Most WHERE statements behave as described in the JCR specification, which is itself based on general SQL. Instead of covering every aspect of NXQL, in the current state of this documentation, we'll focus on differences and behaviours that might appear to be counterintuitive.

10.3.2.1. Text fields

Although the Search Service is meant to provide a unified abstraction over the tasks of indexing and querying, text fields have to be somewhat of an exception. Indeed, search engines have very different capabilities, depending on the provided analyzers. They are nonetheless all expected to provide a direct syntax for full text searches that an end user can use from, e.g., an input box on a web page. Given the very special kind of constraint that indexing a text field represents, it's not guaranteed that exact matches are supported.

See Section 10.4.4, “Text fields behavior” from the documentation of the Compass backend to get a more concrete view on this (with examples).

Conclusions

The backend uses the closest thing to exact matches it supports to treat = predicates.
The syntax of LIKE predicates is backend dependent. It follows the backend's full text query syntax.
The CONTAINS from JCR is not supported. Use a LIKE statement on the main full-text field (see Section 10.3.2.3, “Pseudo fields”).

10.3.2.2. Multi-valued fields

On a multi-valued field, the = operator is true if the right operand belongs to the set of field values. The IN operator is true if the intersection between the set of field values and the right operand is non-empty.

SELECT * FROM Document WHERE dc:contributors = 'sally'

Example 10.3. Finding documents on which user sally contributed

SELECT * FROM Document WHERE dc:contributors IN ('sally', 'harry')

Example 10.4. Finding documents on which user sally or harry contributed

This behavior is in conformance with the JCR specification, which states it in the following more general terms:

In the WHERE clause the comparison operators function the same way they do in XPath when applied to multi-value properties: if the predicate is true of at least one value of a multi-value property then it is true for the property as whole.
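The semantics of both operators can be written down as two small predicates. The Java sketch below only models the rule stated in this section on in-memory collections. It says nothing about how a backend actually evaluates the query, and the class and method names are hypothetical.

```java
import java.util.Collection;
import java.util.List;

// Models the = and IN semantics on multi-valued fields; names are hypothetical.
final class MultiValuedSketch {

    /** field = operand: true if the operand is one of the field values. */
    static boolean matchesEquals(Collection<String> fieldValues, String operand) {
        return fieldValues.contains(operand);
    }

    /** field IN (operands): true if the field values and the operands share a value. */
    static boolean matchesIn(Collection<String> fieldValues, Collection<String> operands) {
        return operands.stream().anyMatch(fieldValues::contains);
    }

    public static void main(String[] args) {
        List<String> contributors = List.of("sally", "john");
        // dc:contributors = 'sally'
        System.out.println(matchesEquals(contributors, "sally"));
        // dc:contributors IN ('sally', 'harry')
        System.out.println(matchesIn(contributors, List.of("sally", "harry")));
        // dc:contributors IN ('harry')
        System.out.println(matchesIn(contributors, List.of("harry")));
    }
}
```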

10.3.2.3. Pseudo fields

The following fields are available on all document resources. They don't correspond to document fields and aren't configurable, which is why they're called pseudo-fields.

The names of these fields are synchronized with constants from the class BuiltinDocumentFields. Any use from Java code should rely on these.

Constant	Field name	Description
FIELD_FULLTEXT	ecm:fulltext	The default full-text aggregator (string)
FIELD_DOC_PATH	ecm:path	The document path (string)
FIELD_DOC_NAME	ecm:name	The document name (last component of the path, string)
FIELD_DOC_URL	ecm:url	The document URL (string)
FIELD_DOC_REF	ecm:id	The `DocumentRef` as fetched from the core
FIELD_DOC_PARENT_REF	ecm:parentId	The parent `DocumentRef`
FIELD_DOC_TYPE	ecm:primaryType	The Core type (string)
FIELD_DOC_FACETS	ecm:mixinType	The facets (multiple)
FIELD_DOC_LIFE_CYCLE	ecm:currentLifeCycleState	The document life cycle (string)
FIELD_DOC_VERSION_LABEL	ecm:versionLabel	The version label (string)
FIELD_DOC_IS_CHECKED_IN_VERSION	ecm:isCheckedInVersion	A boolean (0 or 1) that states whether the document is a frozen version (neither live nor a proxy)
FIELD_DOC_IS_PROXY	ecm:isProxy	A boolean (0 or 1) that states whether the document is a proxy (targeting other documents)
FIELD_DOC_REPOSITORY_NAME	ecm:repositoryName	The document repository name (string)

10.4. The Compass plugin

Compass is a popular wrapper around Apache Lucene. This plugin makes it possible to use it as a backend for the Search Service.

10.4.1. Configuring Compass

Compass configuration is split into a master XML configuration file and one or several mapping files. The latter specify how resources and fields (properties in Compass terminology) are to be treated, while the former is used to tune JTA transactions and data sources, and to register the configuration of analyzers and converters.

The contents of these files are covered in great detail in the Compass 1.1 documentation. In the present documentation, we'll focus on integration matters with the Nuxeo Search Service.

10.4.1.1. Configuration files location

All Compass-specific configuration files are located relative to the classpath of the Compass plugin. A default configuration is provided for the Nuxeo EP 5 WebApp. To customize it, one sadly has to put the configuration at the right place within the backend's JAR.

Here is an Ant fragment to perform this in a JBoss context, assuming that the Search Service has already been installed in the application server and that the server's deployment directory is stored in the deploy.dir property.

<copy todir="${deploy.dir}/nuxeo.ear/platform/nuxeo-platform-search-compass-plugin-1.0.0-SNAPSHOT.jar"
      overwrite="true" failonerror="false">
  <fileset dir="src/main/resources">
    <include name="myfile.xml" />
  </fileset>
</copy>

The backend's JAR is included as a directory in nuxeo.ear during the Maven build of nuxeo-platform-ear for this single purpose. This is prone to change in the future.

10.4.1.2. Specifying the master configuration XML file name

The Compass backend itself is registered against the Search Service through the searchEngineBackend extension point of org.nuxeo.ecm.core.search.service.SearchServiceImpl. Your component can use the configurationFileName element to specify a path to the master configuration file, like this:

<searchEngineBackend name="compass" default="true"
    class="org.nuxeo.ecm.core.search.backend.compass.CompassBackend">
  <configurationFileName>/mycompass.cfg.xml</configurationFileName>
</searchEngineBackend>

The default path is /compass.cfg.xml.

10.4.2. Global configuration

10.4.2.1. Storage

Compass supports several storage possibilities, called connections in Compass configuration objects. The configuration is done through a Nuxeo Runtime extension point and possibly within the Compass master configuration file. The extension point always takes precedence over the Compass file, but can be used to fall back to the Compass file, which currently offers more possibilities.

The target is org.nuxeo.ecm.core.search.backend.compass.CompassBackend, and the point is connection. Contributions are made of a single XML element; the latest one wins.

To set the connection to a file system Lucene store, put a file element in the contribution, and set the path attribute to the target location. If the path doesn't start with /, it will be interpreted as being relative to Nuxeo Runtime's home directory, e.g., /opt/jboss/server/default/data/NXRuntime in the default Nuxeo EP installation on JBoss.

Other connection types, notably JDBC, are defined in the Compass configuration file. To use them, one has to put the default XML element in the contribution, like this:

<extension target="org.nuxeo.ecm.core.search.backend.compass.CompassBackend"
           point="connection">
  <default/>
</extension>

The default connection is a relative file system one, hosted in the nxsearch-compass subdirectory of Nuxeo Runtime's home.
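The rule for relative and absolute paths of file system connections can be sketched with the JDK's path API. This is an illustration of the rule as stated in this section, not the backend's implementation: the class ConnectionPathSketch is hypothetical, and /var/lib/nxsearch is an arbitrary absolute path chosen for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrates how a configured path relates to Nuxeo Runtime's home; names are hypothetical.
final class ConnectionPathSketch {

    static Path resolveStorePath(String configuredPath, Path runtimeHome) {
        if (configuredPath.startsWith("/")) {
            // Paths starting with / are taken as they are.
            return Paths.get(configuredPath);
        }
        // Anything else is relative to Nuxeo Runtime's home directory.
        return runtimeHome.resolve(configuredPath).normalize();
    }

    public static void main(String[] args) {
        Path home = Paths.get("/opt/jboss/server/default/data/NXRuntime");
        // The default connection: a subdirectory of the runtime home
        System.out.println(resolveStorePath("nxsearch-compass", home));
        // An absolute location, independent of the runtime home
        System.out.println(resolveStorePath("/var/lib/nxsearch", home));
    }
}
```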

10.4.2.2. Analyzers

The master configuration file holds the definition and configuration of analyzers: a lookup name gets associated with an analyzer type and options. The Compass backend makes Compass use the name declared in the Search Service configuration directly as the lookup name; one therefore has to ensure that all of these names do exist in the Compass configuration.

Together with the registration itself comes the configuration of analyzers. For instance, an analyzer discarding stop words might be given the full list of stop words.

Compass comes with two predefined analyzers: default and search. You can reconfigure them as well.

See the relevant part in Compass documentation for details and sample configurations.

10.4.2.3. Converters

Lucene only knows about character strings. Therefore, typed data such as dates and integers must be converted into strings to get into Lucene and back. Compass provides helpers for this and the Compass backend uses them directly.

In the master configuration file, one registers available converters in the form of a lookup name and a Java class. Lots of converters are already registered by default, covering most basic types. The compass.cfg.xml file shipping with the Compass backend redefines one (the date converter).
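To give an idea of what such a conversion involves, here is a minimal Java sketch of a date converter that goes to a string and back. It is not the converter shipped with Compass or with the backend: the class name and the date format are assumptions, and java.time is used as a stand-in for java.util.Calendar to keep the example short. The format is chosen so that the alphabetical order of the strings matches the chronological order of the dates, which is what range queries and sorting on a string-only index need.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative date <-> string conversion; the format is an assumption, not Compass's.
final class DateConverterSketch {

    // Most significant unit first, fixed width: string order equals date order.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss");

    static String toIndexString(LocalDateTime date) {
        return FORMAT.format(date);
    }

    static LocalDateTime fromIndexString(String indexed) {
        return LocalDateTime.parse(indexed, FORMAT);
    }

    public static void main(String[] args) {
        LocalDateTime created = LocalDateTime.of(2008, 1, 1, 9, 30, 0);
        String indexed = toIndexString(created);
        System.out.println(indexed);
        // Converting back yields the original value
        System.out.println(fromIndexString(indexed).equals(created));
    }
}
```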

10.4.3. Mappings for Nuxeo

For the time being, a part of the Search Service configuration has to be duplicated in the Compass mappings XML file.

10.4.3.1. What to describe and syntax

Currently, the Compass backend can't force Compass to use a given converter and/or analyzer on a given field. These must therefore be specified in the mappings file, which is itself loaded from the master configuration file. The default name for this file is nxdocument.cpm.xml.

Here's a sample, inspired by the mappings file provided with the backend.

<?xml version="1.0"?>
<!DOCTYPE compass-core-mapping PUBLIC
    "-//Compass/Compass Core Mapping DTD 1.0//EN"
    "http://www.opensymphony.com/compass/dtd/compass-core-mapping.dtd">

<compass-core-mapping>

  <resource alias="nxdoc"
            sub-index="nxdocs"
            analyzer="default"
            all="false">
    <resource-id name="nxdoc_id"/>

    <resource-property name="dc:created" converter="date"
                       store="yes" index="un_tokenized" />
    <resource-property name="dc:title" analyzer="french" />

  </resource>
</compass-core-mapping>

In Compass' terminology, fields are called properties. The name of the Compass property corresponding to a Nuxeo indexed field coincides with the field's prefixed name.

Some important remarks:

Start from the current mappings file shipping with your version of the Compass backend and keep it up to date afterwards.
Don't change anything besides resource-property elements.
An exception to the above rules: you may experiment with the sub-index attribute, according to your performance needs.

10.4.3.2. Installing the mappings file

Follow the instructions from Section 10.4.1.1, “Configuration files location”.

10.4.4. Text fields behavior

Everything in this part applies to fields that have explicitly been declared with a "text" type through the extension point. Anything that's meant for text fields in the mappings configuration files will be ignored if the field has another type.

10.4.4.1. Indexing

At indexing time, the text field is analyzed using the analyzer from the Compass configuration file, regardless of what has been configured through the Search Service extension point.

10.4.4.2. Searching

Equality statements in WHERE clauses are transformed into the closest thing that Lucene can provide on an analyzed field, namely a phrase query.

On the other hand, LIKE clauses are directly fed to Lucene's QueryParser. To search documents whose title starts with "nux", one may write:

SELECT * FROM Document WHERE dc:title LIKE 'nux*'

The following two statements are equivalent. The second one is the QueryParser syntax for phrase queries.

... WHERE dc:title='Nuxeo Book'
... WHERE dc:title LIKE '"Nuxeo Book"'

Lucene's QueryParser syntax is really powerful: you can specify how close two words can be, apply fine-grained boosts for the relevance ordering, and more. The only restriction on LIKE statements for text fields within the Compass backend concerns the choice of field: the colon character is escaped, so the literal cannot be used to target another field.
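In QueryParser syntax, a colon separates a field name from the terms to look up in that field, and a backslash escapes a special character. The Java sketch below shows what escaping the colon in a LIKE literal amounts to. How the Compass backend performs this internally isn't described here, so the class and method names are purely illustrative.

```java
// Illustrates colon escaping in a LIKE literal; not the backend's actual code.
final class LikeLiteralSketch {

    /** Escapes ':' so that QueryParser reads it as text, not as a field selector. */
    static String escapeColons(String literal) {
        return literal.replace(":", "\\:");
    }

    public static void main(String[] args) {
        // Without escaping, QueryParser would look for "book" in the dc:description field.
        System.out.println(escapeColons("nuxeo dc:description:book"));
    }
}
```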

You may need to query the date fields provided by the Nuxeo Platform, such as the creation, modification or expiration date. In these cases, use BETWEEN clauses together with the DATE keyword, which converts strings to date values.

Documents created in the first quarter of 2008:

... WHERE dc:created BETWEEN DATE '2008-01-01' AND DATE '2008-03-31'

Documents modified in May 2007:

... WHERE dc:modified BETWEEN DATE '2007-05-01' AND DATE '2007-05-31'

Example 10.5. Date queries
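When such clauses are built from Java code, the bounds of a month are easy to get wrong by hand (think of February). Here is a small sketch that builds the BETWEEN clause for a whole month with the JDK's java.time API. The class and method names are hypothetical, and the resulting string still has to be parsed and wrapped as described in Section 10.3.

```java
import java.time.YearMonth;

// Builds the BETWEEN ... DATE clause covering a whole month; names are hypothetical.
final class DateRangeSketch {

    static String monthRange(String field, YearMonth month) {
        // LocalDate.toString() yields the yyyy-MM-dd form used in the examples above.
        return field + " BETWEEN DATE '" + month.atDay(1)
                + "' AND DATE '" + month.atEndOfMonth() + "'";
    }

    public static void main(String[] args) {
        String where = monthRange("dc:modified", YearMonth.of(2007, 5));
        System.out.println("SELECT * FROM Document WHERE " + where);
    }
}
```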

You should be aware of the following trap in Lucene queries: purely negative queries don't match anything, even if they are themselves nested in a boolean query that has positive statements. The Compass backend uses the standard way to circumvent this limitation, provided that the negative aspect can be seen from the NXQL structure, i.e., not enclosed in a Lucene QueryParser literal.

Queries that won't return anything:

... WHERE dc:title LIKE '-book'
... WHERE dc:title LIKE 'nuxeo' AND dc:title LIKE '-book'

Queries that should work as intended (the last three being equivalent):

... WHERE dc:title NOT LIKE 'book'
... WHERE dc:title LIKE 'nuxeo' AND dc:title NOT LIKE 'book'
... WHERE dc:title LIKE 'nuxeo' AND NOT dc:title LIKE 'book'
... WHERE dc:title LIKE '+nuxeo -book'

Example 10.6. Negative queries
