Metadata Extraction Service

Table of Contents

22.1. Introduction
22.2. Metadata extraction module
22.2.1. Defining a contribution for metadata extraction
22.2.2. Specifying input parameters
22.2.3. Chaining extractions
22.2.4. Creating a plugin for metadata extraction
22.2.5. Using a metadata extraction plugin

22.1. Introduction

In some cases, information can be retrieved from attached files and stored as regular Document properties. This information is referred to as metadata, because it describes the document (file) from which it is extracted. For example, given an MS Word document (file), we can retrieve metadata such as author, title and creation date simply by reading the document headers with an appropriate library (one that knows how to parse the internal structure of an MS Word document).

22.2. Metadata extraction module

The metadata extraction feature is composed of several projects:

- nuxeo-platform-metadataext-api

- nuxeo-platform-metadataext-core

- nuxeo-platform-metadataext-facade

- nuxeo-platform-metadataext-plugins

The core part defines the metadata extraction component with a MetaDataExtractionManager implementation: org.nuxeo.ecm.platform.metadataext.services.MetaDataExtractionService.

The service is normally invoked by a dedicated CoreListener, which passes it a DocumentModel for which an extraction may be defined. Metadata can also be extracted by invoking the service directly from client code through the existing EJB3 facade.

The plugins part currently provides one plugin (more to come), which is specific to the MS Word document type. The plugin is a Transformation Plugin (as defined by the Transformation service) and is invoked as part of a defined transformation chain. The metadata extraction service calls the appropriate transformation if there is a contribution matching the given Document type, etc.

22.2.1. Defining a contribution for metadata extraction

A contribution can be defined by adding an XML file with the following structure:

<extension
    target="org.nuxeo.ecm.platform.metadataext.services.MetaDataExtraction"
    point="extractions">

  <meta-data-extraction inputField="file:content"
      transformationName="MSWordMDExt">

    <outputParams>
      <param propertyName="dc:title">title</param>
      <param propertyName="dc:description">comments</param>
      <param propertyName="dc:created">creationDate</param>
      <param propertyName="dc:modified">creationDate</param>
      <param propertyName="dc:contributors">authors</param>
    </outputParams>

    <coreEvent>documentCreated</coreEvent>
    <coreEvent>documentModified</coreEvent>

    <docType>File</docType>

  </meta-data-extraction>

</extension>

The important aspects of a contribution to the MetaDataExtraction service are:

- the source (blob) from which metadata will be extracted. This is defined by the inputField attribute, whose value should be a blob document property.

- the mapping of output parameters. This defines a correspondence between the map entries returned as the result of the transformation (extraction) and the names of the Document properties to which the results will be written back.

- the list of core events for which the extraction will be performed.

- the list of document types for which the extraction can be applied.
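The output parameter mapping described above can be illustrated with a minimal sketch. This is not the Nuxeo implementation: the class OutputParamMappingSketch and its method are hypothetical, and plain maps stand in for the contribution, the transformation result and the document. Skipping keys that the plugin did not return is also an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in, not Nuxeo code: it only illustrates how <outputParams>
// entries route extraction results to document property names.
final class OutputParamMappingSketch {

    /**
     * @param outputParams document property name (propertyName attribute)
     *                     mapped to the key in the extraction result (param text)
     * @param extracted    key/value pairs as returned by the transformation
     * @return document property name mapped to the value to write back
     */
    static Map<String, Object> mapToProperties(Map<String, String> outputParams,
            Map<String, Object> extracted) {
        Map<String, Object> properties = new LinkedHashMap<>();
        for (Map.Entry<String, String> param : outputParams.entrySet()) {
            Object value = extracted.get(param.getValue());
            if (value != null) { // assumption: keys missing from the result are skipped
                properties.put(param.getKey(), value);
            }
        }
        return properties;
    }
}
```

With the contribution shown earlier, both dc:created and dc:modified are mapped to the same result key (creationDate), so a single extracted value would be written to both properties.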

22.2.2. Specifying input parameters

If the extraction requires additional information besides the provided blob, input parameters can be added to the extraction definition like this:

<inputParams>
  <param propertyName="dc:title">title</param>
</inputParams>

A map of input parameters is passed to the extraction plugin. The key of each entry is the text of the corresponding param tag (in this case 'title').
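As a rough sketch of how such a map might be assembled, the following uses plain maps in place of the contribution and the document. The class InputParamSketch and its method are hypothetical, and it is an assumption that the value of each entry is the current value of the document property named by propertyName.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, not Nuxeo code.
final class InputParamSketch {

    /**
     * @param inputParams        document property name (propertyName attribute)
     *                           mapped to the param tag text
     * @param documentProperties current property values of the document
     * @return options map keyed by the param tag text
     */
    static Map<String, Serializable> buildOptions(Map<String, String> inputParams,
            Map<String, Serializable> documentProperties) {
        Map<String, Serializable> options = new HashMap<>();
        for (Map.Entry<String, String> param : inputParams.entrySet()) {
            // key: text of the <param> tag; value: the document property (assumed)
            options.put(param.getValue(), documentProperties.get(param.getKey()));
        }
        return options;
    }
}
```

For the definition above, a plugin would then find the document's dc:title value under the key 'title'.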

22.2.3. Chaining extractions

In extreme cases, metadata extraction can be performed in several steps, using the parameters extracted in one phase as input parameters for the next phase. This is achieved by specifying input params whose values have been set by a previous extraction (transformation).
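The idea of chaining can be sketched with two phases sharing a document, where the second phase reads a property written by the first. Everything here is hypothetical and heavily simplified: a map of strings stands in for the document, and the two lambdas stand in for real extraction plugins.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in, not Nuxeo code: a map of strings plays the document.
final class ChainedExtractionSketch {

    /** One phase: reads a property, runs an extractor, writes the result back. */
    static void runPhase(Map<String, String> document, String inputProperty,
            String outputProperty, Function<String, String> extractor) {
        String input = document.get(inputProperty);
        document.put(outputProperty, extractor.apply(input));
    }

    static Map<String, String> runChain(String blobText) {
        Map<String, String> document = new HashMap<>();
        document.put("file:content", blobText);
        // Phase 1: derive a title from the first line of the content.
        runPhase(document, "file:content", "dc:title",
                text -> text.split("\n", 2)[0].trim());
        // Phase 2: its input is the dc:title value set by phase 1.
        runPhase(document, "dc:title", "dc:description",
                title -> "Document titled: " + title);
        return document;
    }
}
```

The ordering matters: phase 2 only produces a meaningful result because phase 1 has already written dc:title.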

22.2.4. Creating a plugin for metadata extraction

We first define a class implementing the org.nuxeo.ecm.platform.transform.interfaces.Plugin interface, for example:

public class MSWordMDExtractorPlugin extends AbstractPlugin {

    @Override
    public List<TransformDocument> transform(Map<String, Serializable> options,
            TransformDocument... sources) throws Exception {
        ...
    }

    ...

}

Overriding the transform method is mandatory. In this case there is only one TransformDocument as source.

The blob from which metadata is extracted can be retrieved from the TransformDocument like this: Blob srcBlob = sources[0].getBlob().

After the useful information is extracted, the properties should be set on a TransformDocument that will be returned:

TransformDocumentImpl res = new TransformDocumentImpl();
res.setPropertyValue("title", extractedTitle);

22.2.5. Using a metadata extraction plugin

Once we have a plugin class that knows how to extract information from a specific type of file (MS Word document, MS Excel, PDF, OO doc, etc.), we can add the contributions necessary for the metadata extraction to take place.

We first need to define the transformation plugin and the transformation contributions. After that, the defined transformation can be referred to in the metadata extraction contribution.

Step 1: define the transformation plugin:

<extension
    target="org.nuxeo.ecm.platform.transform.service.TransformService"
    point="plugins">

  <documentation>
    Set of default transformation plugins for metadata extraction.
  </documentation>

  <plugin name="MSWordMDExtPlugin"
      class="org.nuxeo.ecm.platform.metadataext.plugins.MSWordMDExtractorPlugin"
      destinationMimeType="application/msword">
    <sourceMimeType>application/msword</sourceMimeType>
  </plugin>
</extension>

Step 2: define the transformation:

<extension
    target="org.nuxeo.ecm.platform.transform.service.TransformService"
    point="transformers">

  <documentation>
    Set of default transformation chains for metadata extraction.
  </documentation>

  <transformer name="MSWordMDExt"
      class="org.nuxeo.ecm.platform.transform.transformer.TransformerImpl">
    <plugins>
      <plugin name="MSWordMDExtPlugin" />
    </plugins>
  </transformer>
</extension>

Step 3: define the metadata extraction specific contribution:

<extension
    target="org.nuxeo.ecm.platform.metadataext.services.MetaDataExtraction"
    point="extractions">

  <meta-data-extraction inputField="file:content"
      transformationName="MSWordMDExt">

...

This last contribution refers to the transformation defined above.
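The three contributions are tied together only by name: transformationName must match a transformer name, and each plugin listed in that transformer must match a registered plugin name. The following sketch models that lookup with hypothetical in-memory registries. It is not how the Transformation service is implemented, only an illustration of which names have to agree.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registries standing in for the contributions; not Nuxeo code.
final class ContributionWiringSketch {

    // plugin name -> plugin class name (Step 1)
    final Map<String, String> pluginClasses = new HashMap<>();
    // transformer name -> ordered plugin names (Step 2)
    final Map<String, List<String>> transformers = new HashMap<>();

    /** Resolves a transformationName (Step 3) to the plugin classes it would run. */
    List<String> resolve(String transformationName) {
        List<String> pluginNames = transformers.get(transformationName);
        if (pluginNames == null) {
            throw new IllegalArgumentException("Unknown transformer: " + transformationName);
        }
        List<String> classes = new ArrayList<>();
        for (String name : pluginNames) {
            String className = pluginClasses.get(name);
            if (className == null) {
                throw new IllegalArgumentException("Unknown plugin: " + name);
            }
            classes.add(className);
        }
        return classes;
    }
}
```

With the names used in the steps above, resolving "MSWordMDExt" would go through the plugin name "MSWordMDExtPlugin" to the MSWordMDExtractorPlugin class. A misspelled name at any step leaves a dangling reference.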