This chapter presents the architecture of the indexing and search service in Nuxeo EP 5.
For obvious performance and volume reasons, the search service doesn't index all the content of the application, nor does it provide the full content in search results. What is indexed and what is returned must be specified in the configuration, along with how the data is to be processed.
The search service is configured through the standard Nuxeo Runtime extension point system. The target is org.nuxeo.ecm.core.search.service. For a complete example, check the default configuration file, nuxeo-platform-search-core/OSGI_INF/nxsearch-contrib.xml (from Nuxeo EP sources).
The main concepts are Resource and Field. A Resource holds several fields. It has a name and a prefix, which is used, e.g., in queries. Resources are meant to be indivisible, but they are usually aggregated. It's up to the backend to decide how to handle that aggregation; this goes beyond the scope of the present documentation.
At this point the search service handles documents only, so it's safe to say that documents correspond to aggregates of resources, and resources correspond to schemas. In the future, there'll be more types of resources.
The core document types to index have to be registered against the indexableDocType extension point, which binds the schemas to index to each document type.
Here's an example demonstrating the available patterns:
<extension target="org.nuxeo.ecm.core.search.service.SearchServiceImpl" point="indexableDocType">
  (...)
  <indexableDocType name="DocType1" indexAllSchemas="true"/>
  <indexableDocType name="DocType2" indexAllSchemas="true">
    <exclude>unwanted_schema</exclude>
  </indexableDocType>
  <indexableDocType name="DocType3">
    <resource>indexed_schema</resource>
  </indexableDocType>
</extension>
In this example, the Search Service will index all schemas for documents of type DocType1, all schemas except unwanted_schema for type DocType2, and only indexed_schema for type DocType3.
Each of these indexable schemas will be processed according to the corresponding indexable schema resource if one is available in the configuration, or according to the default one otherwise.
In particular, the behavior of data from a given schema is homogeneous across all document types.
The resource extension point is used to declare an indexable resource. In 5.1M2, the only provided indexable resource type is the schema resource type, but the logic will stay the same for future types. Recall that resources are made of fields. Here's an example of a schema indexing resource, without field details.
<extension target="org.nuxeo.ecm.core.search.service.SearchServiceImpl" point="resource" type="schema">
  <resource name="dublincore" prefix="dc" indexAllFields="true">
    <excludedField>issued</excludedField>
    <field name="title" (...) />
  </resource>
  <resource name="book" prefix="bk" type="schema">
    <field name="barcode" (...) />
  </resource>
</extension>
The type attribute specifies that the resource is a document schema resource.
The name and prefix attributes are mandatory. They should match the ones from the schema extension point of org.nuxeo.ecm.core.schema.TypeService, as in, e.g.:
<schema name="dublincore" prefix="dc" src="schema/dublincore.xsd"/>
<schema name="common" src="schema/common.xsd" />
A missing prefix in the core configuration, as in the common schema declaration in the example above, will default to the schema name.
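For instance, assuming one wants to declare an indexing resource for the common schema above, the prefix attribute would repeat the schema name. This is a sketch under that assumption, not a fragment of the default configuration:

```xml
<!-- Illustrative sketch: the core declaration of "common" has no prefix,
     so the prefix defaults to the schema name and must be repeated here. -->
<extension target="org.nuxeo.ecm.core.search.service.SearchServiceImpl" point="resource">
  <resource name="common" prefix="common" type="schema" indexAllFields="true"/>
</extension>
```

Fields of this resource would then be referenced as common:fieldName.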
The prefix is important, since all subsequent references to the fields (in queries and raw search results) take the prefix:fieldName form.
Field behavior is designed to be uniform across the different kinds of resources.
The field type tells the search engine how to process the field. This is a mandatory, case-insensitive, attribute.
The following table summarizes the available types. The listed Java classes are guaranteed to work. The backend might implement more converting facilities.
Type | Java classes | Comment |
---|---|---|
Keyword | String | Meant for technical non binary strings such as vocabulary keys, user identifiers etc. Equality matches are guaranteed to be exact. |
Text | String | Upon indexation, these fields are tokenized and analyzed to support fulltext queries. |
Path | org.nuxeo.common.utils.Path, String | Dedicated to |
Date | java.util.Calendar | For date and time indexing |
Int | int | |
Builtin | | Reserved for internal use |
At indexing time, the contents of text fields go through the process of tokenization and analysis, whose main goal is to provide fulltext search capabilities, usually at the expense of exact matches.
Tokenization means converting the textual content into a stream of words (tokens). During the analysis step, this raw stream is altered to better suit the needs of fulltext searches. It is for example common practice to strip the stream of so-called stop words (the most common words in the language) and to degrade accented characters. One can apply further linguistic processing, such as stemming or synonym analysis.
The name of the analyzer to use on a given text field is specified through the analyzer attribute. The Search Service acts as a pure forwarder, sending the raw text to the backend, along with the specified analyzer name.
The default value for the analyzer attribute is default. The attribute is simply ignored for other field types.
To enable queries on a given field, one must set the indexed attribute to true.
To make it possible to provide the full field value in search results, one must set the stored attribute to true. Keep in mind that the purpose is to present the user with a limited yet sufficient set of information along with search results, i.e., more quickly than by fetching it from the core, and not to duplicate the content in the search databases.
Multivalued fields have to be flagged by setting the multiple attribute to true.
Depending on the field type and on the processing done by the backend, using the field value as a sort key might require some additional resources. To force the backend to make this extra effort, set the sortable attribute to true.
The binary attribute is used to mark binary fields, e.g., to trigger conversions (not used in 5.1M2).
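As a minimal sketch of how these attributes combine, the following resource declares a multi-valued keyword field that can be queried and returned in results, and a date field meant to serve as a sort key. The field names contributors and created are taken from the dc:contributors and dc:created references used elsewhere in this chapter; the attribute combination itself is an illustrative assumption, not the shipped configuration.

```xml
<resource name="dublincore" prefix="dc" type="schema">
  <!-- Multi-valued technical strings: queryable with exact matches,
       and returned with search results -->
  <field name="contributors" type="Keyword" indexed="true" stored="true"
      multiple="true"/>
  <!-- Date field that is queryable and usable as a sort key,
       but not returned with search results -->
  <field name="created" type="Date" indexed="true" stored="false"
      sortable="true"/>
</resource>
```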
For a schema resource that isn't explicitly declared and is nevertheless used, for instance because of an indexAllSchemas statement, a default configuration is inferred, with the prefix read from the Core configuration and fields configured as below, i.e., as if there were an indexAllFields="true" attribute.
- Auto-configured fields are unstored.
- If there is only one relevant type from the table above, it is applied.
- The multiple attribute is properly set.
- Auto-configured String fields get the keyword type.
<resource name="dublincore" prefix="dc" indexAllFields="true" type="schema">
  <field name="title" analyzer="default" stored="true" indexed="true" type="Text" binary="false" sortable="true"/>
</resource>
Example 11.1. Relying on automatic field configuration on all fields but one
The search service exposes the searchQuery method as its single entry point. The method takes as input an instance of ComposedNXQuery, which encapsulates the parsed NXQL query and a SearchPrincipal instance that will be checked against the security indexes, as well as paging information for the results.
Although the input of searchQuery is an already parsed NXQL statement, we'll use NXQL query strings in what follows for clarity. NXQL query strings are parsed by the static parse method of the org.nuxeo.ecm.core.query.sql.SQLQueryParser class.
Within NXQL requests, references to field values have to follow the "prefix:fieldName" scheme, where prefix and fieldName have been specified through the configuration extension points (recall that for automatically indexed schema resources, the prefix defaults to the one defined in the schema definition).
Literals (constants) follow the JCR specifications. Notably:
- String literals have to be enclosed in single quotes.
- Lists have to be enclosed in parentheses.
Recall that "dc" is the prefix for the "dublincore" schema.
SELECT * FROM Document WHERE dc:title='Nuxeo book' ORDER BY dc:created DESC
Example 11.2. Sample NXQL queries
Most WHERE statements behave as described in the JCR specification, which is itself based on general SQL. Instead of covering every aspect of NXQL, in the current state of this documentation, we'll focus on differences and behaviours that might appear counterintuitive.
Although the Search Service is meant to provide a unified abstraction over the tasks of indexing and querying, text fields have to be something of an exception. Indeed, search engines have very different capabilities, depending on the provided analyzers. They are nonetheless all expected to provide a direct syntax for full-text searches that an end user can use from, e.g., an input box on a web page. Given the very special kind of constraint that indexing a text field represents, exact matches are not guaranteed to be supported.
See Section 11.4.4, “Text fields behavior” from the documentation of the Compass backend to get a more concrete view on this (with examples).
- The backend uses the closest thing to exact matches it supports to treat = predicates.
- The syntax of LIKE predicates is backend dependent. It follows the backend's full-text query syntax.
- The CONTAINS from JCR is not supported. Use a LIKE statement on the main full-text field (see Section 11.3.2.3, “Pseudo fields”).
On a multi-valued field, the = operator is true if the right operand belongs to the set of field values. The IN operator is true if the intersection between the set of field values and the right operand is non-empty.
SELECT * FROM Document WHERE dc:contributors = 'sally'
Example 11.3. Finding documents on which user sally contributed
SELECT * FROM Document WHERE dc:contributors IN ('sally', 'harry')
Example 11.4. Finding documents on which user sally or harry contributed
This behavior conforms to the JCR specification, which states it in the following, more general terms:
In the WHERE clause the comparison operators function the same way they do in XPath when applied to multi-value properties: if the predicate is true of at least one value of a multi-value property then it is true for the property as whole.
The following fields are available on all document resources. They don't correspond to document fields and aren't configurable, which is why they're called pseudo-fields.
The names of these fields are synchronized with constants from the BuiltinDocumentFields class. Any use from Java code should rely on these constants.
Constant | Field name | Description |
---|---|---|
FIELD_FULLTEXT | ecm:fulltext | The default full-text aggregator (string) |
FIELD_DOC_PATH | ecm:path | The document path (string) |
FIELD_DOC_NAME | ecm:name | The document name (last component of the path, string) |
FIELD_DOC_URL | ecm:url | The document URL (string) |
FIELD_DOC_REF | ecm:id | The DocumentRef as fetched from the core |
FIELD_DOC_PARENT_REF | ecm:parentId | The parent DocumentRef |
FIELD_DOC_TYPE | ecm:primaryType | The Core type (string) |
FIELD_DOC_FACETS | ecm:mixinType | The facets (multiple) |
FIELD_DOC_LIFE_CYCLE | ecm:currentLifeCycleState | The document life cycle (string) |
FIELD_DOC_VERSION_LABEL | ecm:versionLabel | The version label (string) |
FIELD_DOC_IS_CHECKED_IN_VERSION | ecm:isCheckedInVersion | A boolean (0 or 1) that states if document is a frozen version (not live nor proxy) |
FIELD_DOC_IS_PROXY | ecm:isProxy | A boolean (0 or 1) that states if document is a proxy (targeting other documents) |
FIELD_DOC_REPOSITORY_NAME | ecm:repositoryName | The document repository name (string) |
Compass is a popular wrapper around Apache Lucene. This plugin allows it to be used as a backend for the Search Service.
Compass configuration is split into a master XML configuration file and one or several mapping files. The latter specify the treatment that resources and fields (properties in Compass terminology) undergo, while the former is used to tune JTA transactions and data sources, and to register the configuration of analyzers and converters.
The contents of these files are covered in great detail in the Compass 1.1 documentation. In the present documentation, we'll focus on integration matters with the Nuxeo Search Service.
All Compass-specific configuration files are relative to the classpath of the Compass plugin. A default configuration is provided for Nuxeo EP 5 WebApp. To customize it, one sadly has to put the configuration at the right place within the backend's JAR.
Here is an Ant fragment to perform this in a JBoss context, assuming that the Search Service has already been installed in the application server and that the server's deployment directory is stored in the deploy.dir property.
<copy todir="${deploy.dir}/nuxeo.ear/system/nuxeo-platform-search-compass-plugin-1.0.0-SNAPSHOT.jar" overwrite="true" failonerror="false">
  <fileset dir="src/main/resources">
    <include name="myfile.xml" />
  </fileset>
</copy>
The backend's JAR is included as a directory in nuxeo.ear during the Maven build of nuxeo-platform-ear for this sole purpose. This is likely to change in the future.
The Compass backend itself is registered against the Search Service through the searchEngineBackend extension point of org.nuxeo.ecm.core.search.service.SearchServiceImpl.
Your component can use the configurationFileName element to specify a path to the master configuration file, like this:
<searchEngineBackend name="compass" default="true" class="org.nuxeo.ecm.core.search.backend.compass.CompassBackend">
  <configurationFileName>/mycompass.cfg.xml</configurationFileName>
</searchEngineBackend>
The default path is /compass.cfg.xml.
Compass supports several storage options, called connections in Compass configuration objects. The connection is configured through a Nuxeo Runtime extension point and possibly within the Compass master configuration file. The extension point always takes precedence over the Compass file, but it can be used to fall back to the Compass file, which currently offers more possibilities.
The target is org.nuxeo.ecm.core.search.backend.compass.CompassBackend, and the point is connection. Contributions are made of a single XML element; the latest one wins.
To set the connection to a file system Lucene store, put a file element in the contribution, and set the path attribute to the target location. If the path doesn't start with /, it will be interpreted as being relative to Nuxeo Runtime's home directory, e.g., /opt/jboss/server/default/data/NXRuntime in the default Nuxeo EP installation on JBoss.
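A file system connection could look like the following sketch. The file element and its path attribute are the ones described above; the path value is a hypothetical location chosen for illustration.

```xml
<extension target="org.nuxeo.ecm.core.search.backend.compass.CompassBackend" point="connection">
  <!-- Hypothetical absolute location of the Lucene store -->
  <file path="/var/lucene/nxsearch"/>
</extension>
```

Since the path starts with /, it is taken as absolute. A value without the leading slash would be resolved against Nuxeo Runtime's home directory.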
Other connection types, notably JDBC, are defined in the Compass configuration file. To use them, put the default XML element in the contribution, like this:
<extension target="org.nuxeo.ecm.core.search.backend.compass.CompassBackend" point="connection">
  <default/>
</extension>
The default connection is a relative file system one, hosted in the nxsearch-compass subdirectory of Nuxeo Runtime's home.
The master configuration file holds the definition and configuration of analyzers: a lookup name gets associated with an analyzer type and options. The Compass backend makes Compass use the analyzer name declared in the Search Service configuration directly as the lookup name; one therefore has to ensure that all of these names do exist in the Compass configuration.
Together with the registration itself comes the configuration of analyzers. For instance, an analyzer discarding stop words might be given the full list of stop words.
Compass comes with two predefined analyzers: default and search. You can reconfigure them as well.
See the relevant part in Compass documentation for details and sample configurations.
Lucene only knows about character strings. Therefore, typed data such as dates and integers must be converted into strings to get into Lucene and back. Compass provides helpers for this and the Compass backend uses them directly.
In the master configuration file, one registers available converters in the form of a lookup name and a Java class. Lots of converters are already registered by default, covering most basic types. The compass.cfg.xml file shipping with the Compass backend redefines one (the date converter).
For the time being, a part of the Search Service configuration has to be duplicated in the Compass mappings XML file.
Currently, the Compass backend can't force Compass to use a given converter and/or analyzer on a given field. These must therefore be specified in the mappings file, which is itself loaded from the master configuration file. The default name for this file is nxdocument.cpm.xml.
Here's a sample, inspired by the mappings file provided with the backend.
<?xml version="1.0"?>
<!DOCTYPE compass-core-mapping PUBLIC "-//Compass/Compass Core Mapping DTD 1.0//EN" "http://www.opensymphony.com/compass/dtd/compass-core-mapping.dtd">
<compass-core-mapping>
  <resource alias="nxdoc" sub-index="nxdocs" analyzer="default" all="false">
    <resource-id name="nxdoc_id"/>
    <resource-property name="dc:created" converter="date" store="yes" index="un_tokenized" />
    <resource-property name="dc:title" analyzer="french" />
  </resource>
</compass-core-mapping>
In Compass' terminology, fields are called properties. The name of the Compass property corresponding to a Nuxeo indexed field coincides with the field's prefixed name.
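To illustrate the duplication, here is a sketch of the two declarations that have to agree for the dc:title field. The first fragment would go into a Search Service resource contribution and the second into the Compass mappings file; the analyzer name french is borrowed from the sample above and is assumed to be registered in the Compass master configuration file.

```xml
<!-- Search Service side: field declared in a resource contribution -->
<resource name="dublincore" prefix="dc" type="schema">
  <field name="title" type="Text" analyzer="french" indexed="true" stored="true"/>
</resource>

<!-- Compass side: matching property in the mappings file,
     named after the prefixed field name -->
<resource-property name="dc:title" analyzer="french"/>
```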
Some important remarks:
- Start from the current mappings file shipping with your version of the Compass backend and keep it up to date afterwards.
- Don't change anything besides resource-property elements.
- An exception to the above rules: you may experiment with the sub-index attribute, according to your performance needs.
- Follow the instructions from Section 11.4.1.1, “Configuration files location”.
Everything in this part applies to fields that have explicitly been declared with a "text" type through the extension point. Anything that's meant for text fields in the mappings configuration files will be ignored if the field has another type.
At indexing time, the text field is analyzed using the analyzer from the Compass configuration file, regardless of what has been configured through the Search Service extension point.
Equality statements in WHERE clauses are transformed into the closest thing that Lucene can provide on an analyzed field, namely a phrase query.
On the other hand, LIKE clauses are directly fed to Lucene's QueryParser. To search for documents whose title starts with "nux", one may write:
SELECT * FROM Document WHERE dc:title LIKE 'nux*'
The following two statements are equivalent. The second one uses the QueryParser syntax for phrase queries.
... WHERE dc:title='Nuxeo Book'
... WHERE dc:title LIKE '"Nuxeo Book"'
Lucene's QueryParser syntax is really powerful: you can specify how close two words can be, apply fine-grained boosts for the relevance ordering, and more. The only restriction on LIKE statements for text fields within the Compass backend is the choice of field: the colon character is escaped, so the literal can't target another field.
You may need to query date fields provided by Nuxeo Platform, such as the creation, modification or expiration date. In these cases, BETWEEN clauses are useful, combined with the DATE keyword, which converts strings to date values.
Documents created in the first quarter of 2008:
... WHERE dc:created BETWEEN DATE '2008-01-01' AND DATE '2008-03-31'
Documents modified in May 2007:
... WHERE dc:modified BETWEEN DATE '2007-05-01' AND DATE '2007-05-31'
Example 11.5. Date queries
You should be aware of the following trap in Lucene queries: purely negative queries don't match anything, even if they are themselves nested in a boolean query that has positive statements. The Compass backend uses the standard way to circumvent this limitation, provided that the negative aspect can be seen from the NXQL structure, i.e., not enclosed in a Lucene QueryParser literal.
Queries that won't return anything:
... WHERE dc:title LIKE '-book'
... WHERE dc:title LIKE 'nuxeo' AND dc:title LIKE '-book'
Queries that should work as intended (the last three being equivalent):
... WHERE dc:title NOT LIKE 'book'
... WHERE dc:title LIKE 'nuxeo' AND dc:title NOT LIKE 'book'
... WHERE dc:title LIKE 'nuxeo' AND NOT dc:title LIKE 'book'
... WHERE dc:title LIKE '+nuxeo -book'
Example 11.6. Negative queries