
Chapter 7. XSEM - Xml to Search Engine Mapping

7.1. Introduction

Compass provides the ability to map XML structure to the underlying Search Engine through simple XML mapping files. We call this technology XSEM (XML to Search Engine Mapping). XSEM provides a rich syntax for describing XML mappings using Xpath expressions. Compass uses the XSEM files to extract the required xml elements from the xml structure at run-time and to insert the required meta-data into the Search Engine index.

7.2. Xml Object

At the core of XSEM support is the XmlObject abstraction on top of the actual XML library implementation. The XmlObject represents an XML element (document, node, attribute, ...) which is usually the result of an Xpath expression. It allows getting the name and value of the given element, and executing Xpath expressions against it (for more information please see the XmlObject javadoc).

Here is an example of how XmlObject is used with Compass:

CompassSession session = compass.openSession();
// ...
XmlObject xmlObject = // create the actual XmlObject implementation (we will see how soon)
session.save("alias", xmlObject);

An extension to the XmlObject interface is the AliasedXmlObject interface. It represents an xml object that is also associated with an alias. This means that saving the object does not require explicitly specifying the alias that it will be saved under.

CompassSession session = compass.openSession();
// ...
AliasedXmlObject xmlObject = // create the actual XmlObject implementation (we will see how soon)
session.save(xmlObject);

Compass comes with support for the dom4j and JSE 5 xml libraries. Here is an example of how to use the dom4j API to create a dom4j xml object:

CompassSession session = compass.openSession();
// ...
SAXReader saxReader = new SAXReader();
Document doc = saxReader.read(new StringReader(xml));
AliasedXmlObject xmlObject = new Dom4jAliasedXmlObject(alias, doc.getRootElement());
session.save(xmlObject);

And here is a simple example of how to use JSE 5:

CompassSession session = compass.openSession();
// ...
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
AliasedXmlObject xmlObject = new NodeAliasedXmlObject(alias, doc);
session.save(xmlObject);

7.3. Xml Content Handling

Up until now, Compass has no knowledge of how to parse and create an actual XmlObject implementation, or how to convert an XmlObject into its xml representation. This is perfectly fine, but it also means that systems will not be able to work with XmlObject for read/search operations. Again, this is perfectly ok for some applications, since they can always work with the underlying Resource representation, but other applications would still like to store the actual xml content in the search engine, and work with the XmlObject for read/search operations.

Compass XSEM support allows defining the xml-content mapping (described below), which causes Compass to store the xml representation in the search engine as well. It also means that for read/search operations, the application will be able to get an XmlObject back (for example, using the CompassSession#get operation).

In order to support this, Compass must be configured with how to parse the xml content into an XmlObject, and how to convert an XmlObject into an xml string. Compass comes with built in converters that do exactly that:

Table 7.1. Compass XmlContentConverters

| XmlContentConverter | Name | Description |
| --- | --- | --- |
| org.compass.core.xml.javax.converter.NodeXmlContentConverter | javax-node | Support for JSE 5 xml libraries. Not recommended on account of performance. |
| org.compass.core.xml.javax.converter.StaxNodeXmlContentConverter | javax-stax | Support for JSE 5 xml libraries. Parses the Document model using StAX. Not recommended on account of performance. |
| org.compass.core.xml.dom4j.converter.SAXReaderXmlContentConverter | dom4j-sax | Support dom4j SAXReader for parsing, and XMLWriter to write the raw xml data. |
| org.compass.core.xml.dom4j.converter.XPPReaderXmlContentConverter | dom4j-xpp | Support dom4j XPPReader for parsing, and XMLWriter to write the raw xml data. |
| org.compass.core.xml.dom4j.converter.XPP3ReaderXmlContentConverter | dom4j-xpp3 | Support dom4j XPP3Reader for parsing, and XMLWriter to write the raw xml data. |
| org.compass.core.xml.dom4j.converter.STAXReaderXmlContentConverter | dom4j-stax | Support dom4j STAXEventReader for parsing, and XMLWriter to write the raw xml data. |
| org.compass.core.xml.jdom.converter.SAXBuilderXmlContentConverter | jdom-sax | Support JDOM SAXBuilder for parsing, and XMLOutputter to write the raw xml data. |
| org.compass.core.xml.jdom.converter.STAXBuilderXmlContentConverter | jdom-stax | Support JDOM STAXBuilder for parsing, and XMLOutputter to write the raw xml data. |

Most of the time, better performance can be achieved by pooling XmlContentConverter implementations. Compass handling of XmlContentConverter allows for three different instantiation models: prototype, pool, and singleton. prototype creates a new XmlContentConverter each time, singleton uses a shared XmlContentConverter for all operations, and pool pools XmlContentConverter instances. The default is prototype.
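
The difference between the three models can be pictured with a small, library-independent sketch. Everything in it (InstantiationModelsSketch, FakeConverter, Wrapper, countCreations) is a hypothetical name invented for illustration and is not part of the Compass API; it only assumes that a converter is costly to create and counts how many instances each model creates for the same number of operations.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical illustration only; this is not how Compass implements its wrappers.
class InstantiationModelsSketch {

    // Counts every converter instance that gets created.
    static final AtomicInteger CREATED = new AtomicInteger();

    // Stand-in for a converter that is expensive to construct.
    static final class FakeConverter {
        FakeConverter() { CREATED.incrementAndGet(); }
        String toXml(String content) { return content.trim(); }
    }

    interface Wrapper {
        String toXml(String content);
    }

    // prototype: a new converter for every operation.
    static Wrapper prototype() {
        return content -> new FakeConverter().toXml(content);
    }

    // singleton: one converter shared by all operations.
    static Wrapper singleton() {
        FakeConverter shared = new FakeConverter();
        return shared::toXml;
    }

    // pool: borrow an idle converter (or create one), use it, return it.
    static Wrapper pool() {
        ConcurrentLinkedQueue<FakeConverter> idle = new ConcurrentLinkedQueue<>();
        return content -> {
            FakeConverter converter = idle.poll();
            if (converter == null) {
                converter = new FakeConverter();
            }
            try {
                return converter.toXml(content);
            } finally {
                idle.offer(converter);
            }
        };
    }

    // Returns how many converters were created while building the wrapper
    // and running the given number of sequential operations through it.
    static int countCreations(Supplier<Wrapper> factory, int calls) {
        int before = CREATED.get();
        Wrapper wrapper = factory.get();
        for (int i = 0; i < calls; i++) {
            wrapper.toXml(" <a/> ");
        }
        return CREATED.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("prototype: " + countCreations(InstantiationModelsSketch::prototype, 3));
        System.out.println("singleton: " + countCreations(InstantiationModelsSketch::singleton, 3));
        System.out.println("pool:      " + countCreations(InstantiationModelsSketch::pool, 3));
    }
}
```

With sequential calls the pooled and singleton variants create a single instance, while the prototype variant creates one per call. Under concurrent load a pool would grow to the number of simultaneous users, which is the trade-off that makes it attractive for converters that are not safe to share.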

Here is an example of a Compass schema based configuration that registers a global Xml Content converter:

<compass-core-config ...
   <compass name="default">

       <connection>
           <file path="target/test-index" />
       </connection>
       
       <settings>
           <setting name="compass.xsem.contentConverter.type" value="jdom-stax" />
           <setting name="compass.xsem.contentConverter.wrapper" value="pool" />
       </settings>

   </compass>
</compass-core-config>

And here is an example of properties based configuration:

compass.xsem.contentConverter.type=jdom-stax
compass.xsem.contentConverter.wrapper=pool    

And last, here is how it can be configured programmatically:

settings.setSetting(CompassEnvironment.Xsem.XmlContent.TYPE, CompassEnvironment.Xsem.XmlContent.Jdom.TYPE_STAX);

Note that a specific converter can be associated with a specific xml-object mapping. To do so, simply register the converter under a different name (compass.converter.xmlContentMapping is the default name that Compass will use when nothing is configured), and use that name in the converter attribute of the xml-content mapping.

Based on internal performance testing, the preferable configuration is a pooled converter that uses either dom4j or JDOM with a pull parser (StAX or XPP).

7.4. Raw Xml Object

If Compass is configured with an Xml Content converter, it knows how to parse xml content into an XmlObject. This allows us to further simplify the creation of XmlObjects from raw xml data. Compass comes with a wrapper XmlObject implementation, which handles raw (unparsed) xml data. Here is how it can be used:

Reader xmlData = // construct an xml reader over raw xml content
AliasedXmlObject xmlObject = new RawAliasedXmlObject(alias, xmlData);
session.save(xmlObject);

Here, Compass will identify that it is a RawAliasedXmlObject, and will use the registered converter (or the one configured against the xml-content mapping for the given alias) to convert it to the appropriate XmlObject implementation. Note that when performing any read/search operation, the actual XmlObject that will be returned is the one that the registered converter creates, and not the raw xml object.

7.5. Mapping Definition

XML/Search Engine mappings are defined in an XML document and map XML data structures. The mappings are xml centric, meaning that mappings are constructed around the XML data structures themselves and not internal Resources. If we take the following as a sample XML data structure:

<xml-fragment>
     <data>
         <id value="1"/>
         <data1 value="data11attr">data11</data1>
         <data1 value="data12attr">data12</data1>
     </data>
     <data>
         <id value="2"/>
         <data1 value="data21attr">data21</data1>
         <data1 value="data22attr">data22</data1>
     </data>
</xml-fragment>

We can map it using the following XSEM xml mapping definition file:

<?xml version="1.0"?>
<!DOCTYPE compass-core-mapping PUBLIC
      "-//Compass/Compass Core Mapping DTD 2.2//EN"
      "http://www.compass-project.org/dtd/compass-core-mapping-2.2.dtd">

<compass-core-mapping>

  <xml-object alias="data1" xpath="/xml-fragment/data[1]">
      <xml-id name="id" xpath="id/@value" />
      <xml-property xpath="data1/@value" />
      <xml-property name="eleText" xpath="data1" />
  </xml-object>

  <xml-object alias="data2" xpath="/xml-fragment/data">
      <xml-id name="id" xpath="id/@value" />
      <xml-property xpath="data1/@value" />
      <xml-property name="eleText" xpath="data1" />
  </xml-object>

  <xml-object alias="data3" xpath="/xml-fragment/data">
      <xml-id name="id" xpath="id/@value" />
      <xml-property xpath="data1/@value" />
      <xml-property name="eleText" xpath="data1" />
      <xml-content name="content" />
  </xml-object>

</compass-core-mapping>

Or using the following JSON based mapping configuration:

{
  "compass-core-mapping" : {
    xml : [
      {
        alias : "data1",
        xpath : "/xml-fragment/data[1]",
        id : { name : "id", xpath : "id/@value" },
        property : [
          { xpath : "data1/@value" },
          { name : "eleText", xpath : "data1" }
        ]
      },
      {
        alias : "data2",
        xpath : "/xml-fragment/data",
        id : { name : "id", xpath : "id/@value" },
        property : [
          { xpath : "data1/@value" },
          { name : "eleText", xpath : "data1" }
        ]
      },
      {
        alias : "data3",
        xpath : "/xml-fragment/data",
        id : { name : "id", xpath : "id/@value" },
        property : [
          { xpath : "data1/@value" },
          { name : "eleText", xpath : "data1" }
        ],
        content : { name : "content" }
      }
    ]
  }   
}

Or last, we can use the programmatic builder API to construct the mapping:

import static org.compass.core.mapping.xsem.builder.XSEM.*;

conf.addMapping(
    xml("data1").xpath("/xml-fragment/data[1]")
        .add(id("id/@value").indexName("id"))
        .add(property("data1/@value"))
        .add(property("data1").indexName("eleText"))
);

conf.addMapping(
    xml("data2").xpath("/xml-fragment/data")
        .add(id("id/@value").indexName("id"))
        .add(property("data1/@value"))
        .add(property("data1").indexName("eleText"))
);

conf.addMapping(
    xml("data3").xpath("/xml-fragment/data")
        .add(id("id/@value").indexName("id"))
        .add(property("data1/@value"))
        .add(property("data1").indexName("eleText"))
        .add(content("content"))
);

The mapping definition here shows three different mappings (that will work with the sample xml). The different mappings are registered under different aliases, where the alias acts as the connection between the actual XML saved and the mappings definition.

The xml mapping also supports xpath with namespaces easily. For example, if we have the following xml fragment:

<xml-fragment>
    <data xmlns="http://mynamespace.org">
        <id value="1"/>
        <data1 value="data11attr">data11</data1>
        <data1 value="data12attr">data12</data1>
    </data>
</xml-fragment>

We can define the following mapping:

<?xml version="1.0"?>
<!DOCTYPE compass-core-mapping PUBLIC
     "-//Compass/Compass Core Mapping DTD 2.2//EN"
     "http://www.compass-project.org/dtd/compass-core-mapping-2.2.dtd">

<compass-core-mapping>

 <xml-object alias="data1" xpath="/xml-fragment/mynamespace:data">
     <xml-id name="id" xpath="mynamespace:id/@value" />
 </xml-object>
</compass-core-mapping>

In this case, we need to define the mapping between the mynamespace prefix used in the xpath definition and the http://mynamespace.org URI. Within the Compass configuration, a simple setting should be set: compass.xsem.namespace.mynamespace.uri=http://mynamespace.org. Other namespaces can be added using similar settings: compass.xsem.namespace.[prefix].uri=[uri].
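
The prefix-to-URI binding can be illustrated with plain JDK XPath, independent of Compass. The sketch below is an assumption-laden illustration rather than Compass's own implementation: the class and method names (NamespaceXPathSketch, evaluate) are hypothetical, and the NamespaceContext plays the role that the compass.xsem.namespace.mynamespace.uri setting plays in the configuration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration using only JDK classes; not Compass code.
class NamespaceXPathSketch {

    static final String PREFIX = "mynamespace";
    static final String URI = "http://mynamespace.org";

    // A shortened version of the namespaced sample fragment.
    static final String XML =
          "<xml-fragment>"
        + "<data xmlns=\"http://mynamespace.org\">"
        + "<id value=\"1\"/>"
        + "<data1 value=\"data11attr\">data11</data1>"
        + "</data>"
        + "</xml-fragment>";

    // Evaluates the expression against the xml with the prefix bound to the URI.
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace matching
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return PREFIX.equals(prefix) ? URI : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String namespaceURI) {
                    return URI.equals(namespaceURI) ? PREFIX : null;
                }

                @Override
                public Iterator<String> getPrefixes(String namespaceURI) {
                    return URI.equals(namespaceURI)
                            ? Collections.singletonList(PREFIX).iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Failed to evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The prefixed expression finds the id; the unprefixed one matches nothing.
        System.out.println(evaluate(XML, "/xml-fragment/mynamespace:data/mynamespace:id/@value"));
        System.out.println(evaluate(XML, "/xml-fragment/data/id/@value").isEmpty());
    }
}
```

Note that the id element inherits the default namespace declared on data, which is why the xml-id xpath in the mapping above also carries the mynamespace prefix.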

An xml-object mapping can have an associated xpath expression, which narrows down the actual xml elements that will represent the top level xml object mapped to the search engine. A nice benefit here is that the xpath can return multiple xml objects, which in turn results in multiple Resources saved to the search engine.
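
To make the effect of the xml-object xpath concrete, the following sketch evaluates the two expressions from the sample mappings with plain JDK XPath. It is only an illustration of the xpath semantics, not of Compass internals; MultiResultXPathSketch and describe are hypothetical names, and each returned string stands in for one Resource that would be saved.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration using only JDK classes; not Compass code.
class MultiResultXPathSketch {

    // The sample xml fragment from this section.
    static final String XML =
          "<xml-fragment>"
        + "<data><id value=\"1\"/>"
        + "<data1 value=\"data11attr\">data11</data1>"
        + "<data1 value=\"data12attr\">data12</data1></data>"
        + "<data><id value=\"2\"/>"
        + "<data1 value=\"data21attr\">data21</data1>"
        + "<data1 value=\"data22attr\">data22</data1></data>"
        + "</xml-fragment>";

    // One entry per element selected by objectXPath, showing the id
    // (id/@value) and all property values (data1/@value) found under it.
    static List<String> describe(String xml, String objectXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            NodeList roots = (NodeList) xpath.evaluate(objectXPath, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < roots.getLength(); i++) {
                Node root = roots.item(i);
                String id = xpath.evaluate("id/@value", root);
                NodeList attrs = (NodeList) xpath.evaluate("data1/@value", root, XPathConstants.NODESET);
                List<String> values = new ArrayList<>();
                for (int j = 0; j < attrs.getLength(); j++) {
                    values.add(attrs.item(j).getNodeValue());
                }
                result.add("id=" + id + " values=" + values);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Failed to evaluate " + objectXPath, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(XML, "/xml-fragment/data[1]")); // one entry
        System.out.println(describe(XML, "/xml-fragment/data"));    // two entries
    }
}
```

The data[1] expression selects a single element, while /xml-fragment/data selects both, mirroring the difference between the data1 and data2 aliases in the sample mapping.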

Each xml object mapping must have at least one xml-id mapping definition associated with it. It is used in order to update/delete existing xml objects.

In the mapping definition associated with the data3 alias, the xml-content mapping is used, which stores the actual xml content in the search engine as well. This makes it possible to unmarshall the xml back into an XmlObject representation. For the first two mappings (data1 and data2), search/read operations will only be able to work on the Resource level.

7.5.1. Converters

Actual value mappings (the xml-property) can use the extensive set of converters that come built in with Compass. Xml value converters are a special case, since their job is normalization: converting a String to a normalized String. For example, a number with a certain format may need to be normalized into a number with a different (padded, for example) format.

For example, let's use the following xml:

<xml-fragment>
   <data>
       <id value="1"/>
       <data1 value="21.2">03-12-2001</data1>
   </data>
</xml-fragment>

We can define the following mapping:

<?xml version="1.0"?>
<!DOCTYPE compass-core-mapping PUBLIC
    "-//Compass/Compass Core Mapping DTD 2.2//EN"
    "http://www.compass-project.org/dtd/compass-core-mapping-2.2.dtd">

<compass-core-mapping>

  <xml-object alias="data6" xpath="/xml-fragment/data">
      <xml-id name="id" xpath="id/@value" />
      <xml-property xpath="data1/@value" format="000000.00" value-converter="float" />
      <xml-property name="eleText" xpath="data1" format="yyyy-MM-dd||dd-MM-yyyy" value-converter="date" />
      <xml-content name="content" />
  </xml-object>
</compass-core-mapping>

In this case, we create a mapping under the alias name data6. We can see that the value attribute of the data1 element is a float number. When we index it, we would like to convert it to the following format: 000000.00. We can see, within the mappings, that we defined the converter to be of the float type and with the requested format. The actual data1 element text is of type date, so again we configure the converter to be of type date, and we use the support for "format" based converters in Compass to accept several formats (the first is the one used when converting from an Object). So, in the date case, the converter will try to convert the 03-12-2001 date and will succeed because the second date format matches it. It will then convert the Date created back to a String using the first format, giving us a searchable date format in the index.
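
The String-to-normalized-String step described above can be sketched with JDK formatting classes. This is an illustration of the idea only and does not reproduce the Compass converters: ValueNormalizationSketch, normalizeNumber, and normalizeDate are hypothetical names, and splitting the format on "||" is an assumption based on the format attribute shown in the mapping.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Hypothetical illustration using only JDK classes; not Compass code.
class ValueNormalizationSketch {

    // Re-formats a numeric string using the given pattern, e.g. 000000.00.
    static String normalizeNumber(String raw, String pattern) {
        DecimalFormat format =
                new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.ROOT));
        return format.format(new BigDecimal(raw));
    }

    // Tries each "||"-separated pattern in turn when parsing, and always
    // writes the result back using the first pattern.
    static String normalizeDate(String raw, String formats) {
        String[] patterns = formats.split("\\|\\|");
        DateTimeFormatter output = DateTimeFormatter.ofPattern(patterns[0], Locale.ROOT);
        for (String pattern : patterns) {
            try {
                LocalDate date =
                        LocalDate.parse(raw, DateTimeFormatter.ofPattern(pattern, Locale.ROOT));
                return date.format(output);
            } catch (DateTimeParseException e) {
                // This pattern does not match; fall through to the next one.
            }
        }
        throw new IllegalArgumentException("No format in [" + formats + "] matches " + raw);
    }

    public static void main(String[] args) {
        System.out.println(normalizeNumber("21.2", "000000.00"));
        System.out.println(normalizeDate("03-12-2001", "yyyy-MM-dd||dd-MM-yyyy"));
    }
}
```

Padding numbers and writing dates year-first means the indexed strings sort lexicographically in the same order as the underlying values, which is what makes them usable for range searches.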

7.5.2. xml-object

You may declare an xml object mapping using the xml-object element:

<xml-object
       alias="aliasName"
       sub-index="sub index name"
       xpath="optional xpath expression"
       analyzer="name of the analyzer"
 />
     all?,
     sub-index-hash?,
     xml-id*,
     (xml-analyzer?),
     (xml-boost?),
     (xml-property)*,
     (xml-content?)
 

Table 7.2. xml-object mapping

| Attribute | Description |
| --- | --- |
| alias | The name of the alias that represents the XmlObject. |
| sub-index (optional, defaults to the alias value) | The name of the sub-index that the alias will map to. |
| xpath (optional, will not execute an xpath expression if not specified) | An optional xpath expression to narrow down the actual xml elements that will represent the top level xml object which will be mapped to the search engine. A nice benefit here is that the xpath can return multiple xml objects, which in turn will result in multiple Resources saved to the search engine. |
| analyzer (optional, defaults to the default analyzer) | The name of the analyzer that will be used to analyze ANALYZED properties. Defaults to the default analyzer, which is one of the internal analyzers that come with Compass. Note that when using the xml-analyzer mapping (a child mapping of the xml object mapping, for an xml element that controls the analyzer), the analyzer attribute will have no effect. |

7.5.3. xml-id

Mapped XmlObjects must declare at least one xml-id. The xml-id element defines the XmlObject (element, attribute, ...) that identifies the root XmlObject for the specified alias.

<xml-id
      name="the name of the xml id"
      xpath="xpath expression"
      value-converter="value converter lookup name"
      converter="converter lookup name"
/>

Table 7.3. xml-id mapping

| Attribute | Description |
| --- | --- |
| name | The name of the xml-id. Will be used when constructing the xml-id internal path. |
| xpath | The xpath expression used to identify the xml-id. Must return a single xml element. |
| value-converter (optional, defaults to Compass SimpleXmlValueConverter) | The global converter lookup name registered with the configuration. This is a converter associated with converting the actual value of the xml-id. Acts as a convenient extension point for custom value converter implementations (for example, date formatters). SimpleXmlValueConverter will usually act as a base class for such extensions. |
| converter (optional) | The global converter lookup name registered with the configuration. The converter is responsible for converting the xml-id mapping. |

An important note regarding the xml-id mapping is that it will always act as an internal Compass Property. This means that if one wishes to have it as part of the searchable content, it will have to be mapped with xml-property as well.

7.5.4. xml-property

Declaring and using the xml-property element.

<xml-property
      xpath="xpath expression"
      name="optionally the name of the xml property"
      store="yes|no|compress"
      index="analyzed|not_analyzed|no"
      boost="boost value for the property"
      analyzer="name of the analyzer"
      reverse="no|reader|string"
      override="true|false"
      exclude-from-all="no|yes|no_analyzed"
      value-converter="value converter lookup name"
      format="a format string for value converters that support this"
      converter="converter lookup name"
/>

Table 7.4. xml-property mapping

| Attribute | Description |
| --- | --- |
| name (optional, will use the xml object (element, attribute, ...) name if not set) | The name that the value will be saved under. It is optional, and if not set, will use the xml object name (the result of the xpath expression). |
| xpath | The xpath expression used to identify the xml-property. Can return no xml objects, one xml object, or many xml objects. |
| store (optional, defaults to yes) | Whether the value of the xml property is going to be stored in the index. |
| index (optional, defaults to analyzed) | Whether the value of the xml property is going to be indexed (searchable). If it is, this also controls whether the value is going to be broken down and analyzed (analyzed), or used as is (not_analyzed). |
| boost (optional, defaults to 1.0f) | Controls the boost level for the xml property. |
| analyzer (optional, defaults to the xml mapping analyzer decision scheme) | The name of the analyzer that will be used to analyze ANALYZED xml property mappings defined for the given property. Defaults to the xml mapping analyzer decision scheme based on the analyzer set, or the xml-analyzer mapping. |
| exclude-from-all (optional, defaults to no) | Excludes the property from participating in the "all" meta-data. If set to no_analyzed, not_analyzed properties will be analyzed when added to the all property (the analyzer can be controlled using the analyzer attribute). |
| override (optional, defaults to true) | If there is another definition with the same mapping name, controls whether it will be overridden or added as an additional mapping. Mainly used to override definitions made in extended mappings. |
| reverse (optional, defaults to no) | The meta-data will have its value reversed. Can have the values of no - no reverse will happen, string - the reverse will happen and the value stored will be a reversed string, and reader - a special reader will wrap the string and reverse it. The reader option is more performant, but the store and index settings will be discarded. |
| value-converter (optional, defaults to Compass SimpleXmlValueConverter) | The global converter lookup name registered with the configuration. This is a converter associated with converting the actual value of the xml-property. Acts as a convenient extension point for custom value converter implementations (for example, date formatters). SimpleXmlValueConverter will usually act as a base class for such extensions. |
| converter (optional) | The global converter lookup name registered with the configuration. The converter is responsible for converting the xml-property mapping. |

7.5.5. xml-analyzer

Declaring an analyzer controller property using the xml-analyzer element.

<xml-analyzer
       name="property name"
       xpath="xpath expression"
       null-analyzer="analyzer name if value is null"
       converter="converter lookup name"
 >
 </xml-analyzer>

Table 7.5. xml-analyzer mapping

| Attribute | Description |
| --- | --- |
| name | The name of the xml-analyzer (results in a Property). |
| xpath | The xpath expression used to identify the xml-analyzer. Must return a single xml element. |
| null-analyzer (optional, defaults to error in case of a null value) | The name of the analyzer that will be used if the property has a null value, or the xpath expression returned no elements. |
| converter (optional) | The global converter lookup name registered with the configuration. |

The analyzer xml property mapping controls the analyzer that will be used when indexing the XmlObject. If the mapping is defined, it will override the xml object mapping analyzer attribute setting.

For example, suppose Compass is configured with two additional analyzers, one called an1 (with settings in the form of compass.engine.analyzer.an1.*) and another called an2. The values that the xml property can hold are then: default (an internal Compass analyzer, which can be configured as well), an1, and an2. If the property may have a null value, and this is acceptable for the application, a null-analyzer can be configured that will be used in that case. If the property has a value but there is no matching analyzer, an exception will be thrown.
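
The decision rules in the previous paragraph can be written down as a small lookup. The sketch below is a hypothetical restatement of those rules in plain Java (AnalyzerLookupSketch and resolve are invented names, and the exception types are assumptions); it is not the code Compass runs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical restatement of the xml-analyzer rules; not Compass code.
class AnalyzerLookupSketch {

    // The analyzers assumed to be configured in the example above.
    static final Set<String> CONFIGURED =
            new HashSet<>(Arrays.asList("default", "an1", "an2"));

    // value:        the value found by the xml-analyzer xpath (null if none)
    // nullAnalyzer: the null-analyzer attribute (null if not configured)
    static String resolve(Set<String> configured, String value, String nullAnalyzer) {
        if (value == null) {
            if (nullAnalyzer == null) {
                // No value and no fallback configured: this is an error.
                throw new IllegalStateException("Analyzer value is null and no null-analyzer is set");
            }
            return nullAnalyzer;
        }
        if (!configured.contains(value)) {
            // A value is present but does not name a known analyzer.
            throw new IllegalArgumentException("No analyzer registered under name: " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(resolve(CONFIGURED, "an1", null));  // an1
        System.out.println(resolve(CONFIGURED, null, "an2"));  // an2
    }
}
```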

7.5.6. xml-boost

Declaring a dynamic boost mapping controlling the boost level using the xml-boost element.

<xml-boost
       name="property name"
       xpath="xpath expression"
       default="the boost default value when no property value is present"
       converter="converter lookup name"
 >
 </xml-boost>

Table 7.6. xml-boost mapping

| Attribute | Description |
| --- | --- |
| name | The name of the xml-boost (results in a Property). |
| xpath | The xpath expression used to identify the xml-boost. Must return a single xml element. |
| default (optional, defaults to 1.0) | The default boost value if no value is found. |
| converter (optional) | The global converter lookup name registered with the configuration. |

The boost xml property mapping controls the boost associated with the Resource created based on the mapped property. The value of the property must be convertible to a float.

7.5.7. xml-content

Declaring an xml content mapping using the xml-content element.

<xml-content
       name="property name"
       store="yes|compress"
       converter="converter lookup name"
 >
 </xml-content>

Table 7.7. xml-content mapping

| Attribute | Description |
| --- | --- |
| name | The name the xml content will be saved under. |
| store (optional, defaults to yes) | How to store the actual xml content. |
| converter (optional) | The global converter lookup name registered with the configuration. |

The xml-content mapping causes Compass to store the actual xml content in the search engine as well. This makes it possible to unmarshall the xml back into an XmlObject representation. For an xml-object mapping without an xml-content mapping, search/read operations will only be able to work on the Resource level.