Content Transformation

Table of Contents

18.1. Introduction
18.2. Plugins module
18.2.1. Creating a plugin
18.2.2. Declaring a plugin module
18.2.3. Using a transform plugin
18.3. Available transforms
18.3.1. Document conversion
18.3.2. Pdfbox
18.3.3. OLE objects extraction
18.3.4. Office files merger
18.3.5. XSL Transformation

18.1. Introduction

Transforms are operations that perform any action modifying an input document. This can cover file conversion as well as mail merging or information extraction.

18.2. Plugins module

The various plugins are stored in the nuxeo-plateform-transforms-plugins.* module. They are declared as a contribution of the org.nuxeo.ecm.platform.transform.service.TransformService component at extension point plugins.

The plugins are the engines that perform the needed transformations.

18.2.1. Creating a plugin

A transform plugin follows a framework defined in the nuxeo-plateform-transforms-core module. It is basically a class that extends org.nuxeo.ecm.platform.transform.interfaces.Plugin. The main method that has to be exposed is transform:

    List<TransformDocument> transform(
            Map<String, Serializable> options, TransformDocument... sources);

The transform method accepts an options map and one or more TransformDocument sources, and returns a list of TransformDocument. The TransformDocument object is defined in org.nuxeo.ecm.platform.transform.interfaces.TransformDocument and holds the binary content as well as other information such as the mime-type.

The options map holds all the necessary options to be passed to the plugin. Please note that the keys are the plugin names. See the officeMerger plugin below for an implementation example.

18.2.2. Declaring a plugin module

We first declare a contribution to the extension point plugins of org.nuxeo.ecm.platform.transform.service.TransformService:

<extension target="org.nuxeo.ecm.platform.transform.service.TransformService" point="plugins">
  <plugin name="any2odt"
      class="org.nuxeo.ecm.platform.transform.plugin.joooconverter.impl.JOOoConvertPluginImpl"
      destinationMimeType="application/vnd.oasis.opendocument.text">
    <sourceMimeType>text/xml</sourceMimeType>
    <sourceMimeType>text/plain</sourceMimeType>
    <sourceMimeType>text/rtf</sourceMimeType>

    <!-- Microsoft office documents -->
    <sourceMimeType>application/msword</sourceMimeType>

    <!-- OpenOffice.org 1.x documents -->
    <sourceMimeType>application/vnd.sun.xml.writer</sourceMimeType>
    <sourceMimeType>application/vnd.sun.xml.writer.template</sourceMimeType>

    <!-- OpenOffice.org 2.x documents -->
    <sourceMimeType>application/vnd.oasis.opendocument.text</sourceMimeType>
    <sourceMimeType>application/vnd.oasis.opendocument.text-template</sourceMimeType>

    <option name="ooo_host_name">localhost</option>
    <option name="ooo_host_port">8100</option>
  </plugin>
</extension>

The name attribute is used to reference the plugin when declaring a transform chain. The class attribute is the class that actually performs the transformation, while destinationMimeType is the mime-type of the result of the transform.

After the attributes, the <sourceMimeType> nodes define the input mime-types that the transform supports. In this case, the any2odt plugin can handle text, Microsoft Word, OOo 1.x and OpenDocument (OOo 2.x) files and outputs an application/vnd.oasis.opendocument.text OpenDocument file. Options can also be added, as shown by the <option> tags named ooo_host_name and ooo_host_port.

Plugins can be combined to build transform chains. These chains are declared in a transformer, which is a contribution to the transformers extension point of the org.nuxeo.ecm.platform.transform.service.TransformService component.

<extension target="org.nuxeo.ecm.platform.transform.service.TransformService"
           point="transformers">
  <transformer name="any2text"
               class="org.nuxeo.ecm.platform.transform.transformer.TransformerImpl">
    <plugins>
      <plugin name="any2pdf"/>
      <plugin name="pdf2text"/>
    </plugins>
  </transformer>
</extension>

A transformer is defined by its name. This is the name used to retrieve the transformer from the transform service.

The plugins involved in the chain are then listed. We can see that our any2text transformer is composed of two chained plugins: any2pdf, then pdf2text. A transformer with a single plugin is also legal, as shown by the following use of our previous any2odt plugin.

<extension target="org.nuxeo.ecm.platform.transform.service.TransformService"
           point="transformers">
  <!-- This transformer uses the OOo plugin to transform documents to ODT -->
  <transformer name="any2odt"
               class="org.nuxeo.ecm.platform.transform.transformer.TransformerImpl">
    <plugins>
      <plugin name="any2odt"/>
    </plugins>
  </transformer>
</extension>
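
The chaining described above can be pictured with a minimal sketch. It is not the TransformerImpl code: ChainSketch, runChain and the two toy plugins are hypothetical names, used only to show that plugins run in declaration order and that each one receives the output of the previous one.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a transformer: it only models the chaining order,
// not the real TransformerImpl class.
class ChainSketch {

    // Applies each plugin in declaration order; the output of one plugin
    // becomes the input of the next one.
    static String runChain(List<UnaryOperator<String>> plugins, String source) {
        String current = source;
        for (UnaryOperator<String> plugin : plugins) {
            current = plugin.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Toy plugins that only tag the content with the step they represent.
        UnaryOperator<String> any2pdf = doc -> "pdf(" + doc + ")";
        UnaryOperator<String> pdf2text = doc -> "text(" + doc + ")";

        System.out.println(runChain(Arrays.asList(any2pdf, pdf2text), "report.odt"));
    }
}
```

A single-plugin transformer such as any2odt is simply a chain of length one in this model.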

18.2.3. Using a transform plugin

Once a transform plugin has been declared and the transformer is known, we can use it to perform various transformation actions.

TransformService service = NXTransform.getTransformService();
Transformer transformer = service.getTransformerByName("any2pdf");

List<TransformDocument> results = transformer.transform(null,
        new TransformDocumentImpl(sourceStream, "application/vnd.oasis.opendocument.text"));

SerializableInputStream resultStream = results.get(0).getStream();

We first get the TransformService that exposes all the available transforms. Then a specific transformer is built with the getTransformerByName method of the service. The name is the one that has been declared in the contribution to the transformers extension-point of the org.nuxeo.ecm.platform.transform.service.TransformService component.

The transformer then exposes a transform method that returns a list of TransformDocument. Its arguments are:

  • options: the plugin options (the keys are the plugin names) - we pass null here as we do not have any options. See officeMerger plugin for a more detailed example

  • list of sources as TransformDocument instances

There are three levels of options, each overriding the previous one. First, the plugin options are the defaults. Then, any option that the transformer defines for this plugin overrides them. Finally, any code-defined options are merged in, overriding any option that may have been defined previously. Please note again that options are defined on a plugin name basis.
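
The precedence between the three levels can be sketched as follows for the options of a single plugin. This is only an illustration under stated assumptions: OptionLevelsSketch and merge are hypothetical names, and the real service may merge options differently in its details.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: models the override order for one plugin's options.
class OptionLevelsSketch {

    // Later levels override earlier ones key by key; a null level is skipped.
    static Map<String, Serializable> merge(Map<String, Serializable> pluginDefaults,
            Map<String, Serializable> transformerOptions,
            Map<String, Serializable> codeOptions) {
        Map<String, Serializable> merged = new HashMap<>();
        if (pluginDefaults != null) {
            merged.putAll(pluginDefaults);
        }
        if (transformerOptions != null) {
            merged.putAll(transformerOptions);
        }
        if (codeOptions != null) {
            merged.putAll(codeOptions);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Serializable> defaults = new HashMap<>();
        defaults.put("ooo_host_name", "localhost");
        defaults.put("ooo_host_port", "8100");

        Map<String, Serializable> codeLevel = new HashMap<>();
        codeLevel.put("ooo_host_port", "8200"); // overrides the plugin default

        System.out.println(merge(defaults, null, codeLevel));
    }
}
```

Passing null at the code level, as in the transform call above, leaves the plugin and transformer values untouched.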

A TransformDocument instance can be constructed from:

  • a source stream

  • a source stream and its mime-type

  • a blob

In our example, we use the second way and give a sourceStream and the mime-type of an ODF document.

Once the input list has been processed, the results list contains all the transformed files as TransformDocument instances, from which you can retrieve the SerializableInputStream with the getStream method, the mime-type with getMimetype and the blob with getBlob.

An alternate way of using the transform is to call it directly from the service:

List<TransformDocument> results = service.transform(converter,
        null, new TransformDocumentImpl(stream, sourceMimetype));

The arguments are the same, and the converter name is given as the first argument. Another variant that takes blobs as input instead of TransformDocument instances is also available. Before calling a transform, we can also check that the source mime-type is supported by calling the isMimetypeSupportedByPlugin method. Be careful, though: the plugin name may be different from the transformer name.

Transforms can be called directly, but they are also part of the docModifier framework which, reacting to events, can call transforms to alter existing information or generate new information (see the oleExtract plugin below or the docModifier documentation).

18.3. Available transforms

All the transform plugin packages start with org.nuxeo.ecm.platform.transform.plugin. Some of them are optional and not included in the default platform.

18.3.1. Document conversion

The document conversion plugin is a generic transform delivered with Nuxeo that converts a file from one format to another. It uses the third-party JODConverter tool. The transform is implemented in the org.nuxeo.ecm.platform.transform.plugin.joooconverter.impl.JOOoConvertPluginImpl class and defined as a plugin contribution to org.nuxeo.ecm.platform.transform.service.TransformService.

It can then be called as explained in the previous section:

TransformService service = NXTransform.getTransformService();
Transformer transformer = service.getTransformerByName("any2pdf");

List<TransformDocument> results = transformer.transform(null,
        new TransformDocumentImpl(sourceStream, "application/vnd.oasis.opendocument.text"));

SerializableInputStream resultStream = results.get(0).getStream();

The transform call is relayed to JOOoConvertPluginImpl, which first connects to OOo on the host and port defined in the contribution (usually localhost:8100) or in the options (not defined in our example). It then acquires an OpenOfficeDocumentConverter from the JODConverter tool and calls its convert method with the source file and the requested target mimetype. Note that in a future version, the StreamOpenOfficeDocumentConverter class will be used to avoid dealing with File objects. This limitation will be lifted once a new version of OpenOffice.org (2.3) is released that fixes a regression on loading streams.

Before any call to the underlying OpenOffice.org converter, the transformation engine JOOoConvertPluginImpl tests the source document mimetype. If it is the same as the destinationMimeType defined by the plugin, the source file is returned immediately as the result, unless that mimetype is also listed in the <sourceMimeType> entries of the plugin, which allows self-transformations. By default, the any2pdf plugin returns immediately if an application/pdf file is submitted, while the OpenDocument transformations (any2odt, any2ods, any2odp) process such files because each one lists its own mimetype, like <sourceMimeType>application/vnd.oasis.opendocument.text</sourceMimeType>. This makes it possible to clean and validate hand-crafted files and possibly apply automatic processing on the OpenOffice.org side.
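
The short-circuit rule described above can be written as a small predicate. This is a sketch of the rule as stated in the text, not the JOOoConvertPluginImpl code; SelfTransformSketch and returnsSourceAsIs are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper modelling the "return immediately" test.
class SelfTransformSketch {

    // True when the source already has the destination mime-type and the plugin
    // does not list that mime-type among its accepted sources.
    static boolean returnsSourceAsIs(String sourceMimeType, String destinationMimeType,
            Set<String> declaredSourceMimeTypes) {
        return sourceMimeType.equals(destinationMimeType)
                && !declaredSourceMimeTypes.contains(sourceMimeType);
    }

    public static void main(String[] args) {
        String odt = "application/vnd.oasis.opendocument.text";
        Set<String> any2pdfSources = new HashSet<>(Arrays.asList(odt, "application/msword"));
        Set<String> any2odtSources = new HashSet<>(Arrays.asList(odt, "application/msword"));

        // any2pdf receiving a PDF: nothing to do, the source is returned.
        System.out.println(returnsSourceAsIs("application/pdf", "application/pdf", any2pdfSources));
        // any2odt receiving an ODT: self-transformation is allowed, so OOo is called.
        System.out.println(returnsSourceAsIs(odt, odt, any2odtSources));
    }
}
```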

18.3.1.1. Remote engine

Since OOo 2.3.0, it is possible to use InputStreams as source documents, so File objects are no longer needed. This is useful if we want to isolate the OOo server on a separate machine (and, if the workload is heavy, use a farm with load balancing). As this feature is only available with OOo 2.3+ (due to a bug in previous versions), there is a Nuxeo property to define the OOo version used.

org.nuxeo.ecm.platform.transform.ooo.version=2.3.0

Eventually, the JODConverter tool should be able to return the OOo version it is using, so that this property will no longer be needed. For the moment, the property has to be set. However, as OOo stream loading has been reported by JODConverter users as less efficient than File/URL loading, the streaming method is only used if OOo really is a remote instance. OOo is considered local if the ooo_host_name transformer option is 'localhost' or starts with '127.0.0'. And of course, streaming is only used if the declared OOo version is 2.3.0 or greater.
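
The decision described above can be sketched as two checks. RemoteEngineSketch and its methods are hypothetical names, and the dotted-number version comparison is an assumption about how the declared version is interpreted.

```java
// Hypothetical helper: decides whether the streaming method would be used.
class RemoteEngineSketch {

    // Local means 'localhost' or any address starting with '127.0.0'.
    static boolean isLocal(String oooHostName) {
        return "localhost".equals(oooHostName) || oooHostName.startsWith("127.0.0");
    }

    // Compares dotted numeric versions component by component;
    // missing components count as 0 (assumption).
    static boolean isAtLeast(String version, String minimum) {
        String[] left = version.split("\\.");
        String[] right = minimum.split("\\.");
        for (int i = 0; i < Math.max(left.length, right.length); i++) {
            int l = i < left.length ? Integer.parseInt(left[i]) : 0;
            int r = i < right.length ? Integer.parseInt(right[i]) : 0;
            if (l != r) {
                return l > r;
            }
        }
        return true;
    }

    // Streams are used only for a remote OOo whose declared version is 2.3.0 or greater.
    static boolean useStreaming(String oooHostName, String oooVersion) {
        return !isLocal(oooHostName) && isAtLeast(oooVersion, "2.3.0");
    }

    public static void main(String[] args) {
        System.out.println(useStreaming("ooo.example.org", "2.3.0"));
        System.out.println(useStreaming("localhost", "2.3.0"));
    }
}
```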

18.3.1.2. OpenOffice.org loading options

As the underlying engine is based on OpenOffice.org, one can extend the document converter with OOo loading options. The transform plugin options mechanism is fully supported, so they can be defined at plugin, transformer and code level. The available fields that can be passed to OOo are listed in the MediaDescriptor IDL reference and have to be bound to the plugin name.

Here is an example that passes loading options for a special document: it is password-protected and has a macro bound to the open event that deletes the content of the file and replaces it with the word DELETED. This simulates any other heavy process such as document merging or automatic mail-merge from a database.

Map<String, Map<String, Serializable>> options = new HashMap<String, Map<String, Serializable>>();
Map<String, Serializable> pdfOptions = new HashMap<String, Serializable>();

pdfOptions.put("Password", "TheDocumentPassWord");
short ALWAYS_EXECUTE_NO_WARN = 4;
pdfOptions.put("MacroExecutionMode", ALWAYS_EXECUTE_NO_WARN);
pdfOptions.put("ReadOnly", false);
pdfOptions.put("Hidden", false);
options.put("any2pdf", pdfOptions);

// Note: due to the password, the file mimetype cannot be sniffed,
// so the mimetype has to be supplied to TransformDocumentImpl
results = service.transform("any2pdf", options,
        new TransformDocumentImpl(getBlobFromPath(path), "application/vnd.oasis.opendocument.text"));
assertTrue(results.size() > 0);

// The macro on the open event replaces the text by "DELETED"
assertEquals("pdf content", "DELETED", DocumentTestUtils.readPdfText(pdfFile));

First, we build the plugin options map. The Password field sends the password string to the OOo loader so that the file can be opened. The MacroExecutionMode field defines how macros and scripts are handled at startup. By default, as a consequence of the -headless OOo mode, the NEVER_EXECUTE = 0 value is used. One can change this default behaviour by using any value listed in the OOo MacroExecMode constant group. Note that types matter: Password requires a String argument while MacroExecutionMode requires a short. The types have to be correct, otherwise the field will not be handled by OOo.

By default, JODConverter sets the ReadOnly option to true. As we want to modify the document, we have to set the ReadOnly flag to false. The Hidden flag also has to be set to false. This trick works around a problem in the OpenOffice.org PDF engine, which seems unable to handle document modification while Hidden. For a headless, server-deployed OOo instance, this should not be a major problem.

Once the options have been defined, they are added to the global options map under the plugin name key (here, any2pdf).

Then the transform can be called as usual. Please note that the mimetype of a password-protected document has to be passed explicitly to the TransformDocumentImpl constructor, as it cannot be sniffed by the Mimetype service for the moment.

Finally, as expected a PDF document is returned with its content changed to the DELETED string.

18.3.2. Pdfbox

18.3.3. OLE objects extraction

OLE objects are objects included in office files that can be edited as standalone ones. For example, a spreadsheet table may be included in a report so that the presented data is always up to date.

This plugin is located in the nuxeo-plateform-transform-plugin-oleextract module and is not included by default in the platform.

18.3.3.1. Implementation

The purpose of this plugin is to extract all these OLE objects and provide them as standalone files so that they can be checked individually. It has also been extended to extract images. It has a classical transform plugin structure: the plugin name is oleExtractPlugin, bound to org.nuxeo.ecm.platform.transform.plugin.oleextract.impl.OfficeOleExtractorPluginImpl. The transform name is oleExtract.

As an example, it can be called like this:

List<TransformDocument> results =
        service.transform(TRANSFORMER_NAME, null, new TransformDocumentImpl(stream, mimetype));
TransformDocument result = results.get(0);
List<Map<String, Serializable>> ole =
        (List<Map<String, Serializable>>) result.getPropertyValue("ole:olecontents");

First, the transform service is called as usual, with TRANSFORMER_NAME set to oleExtract. After processing, the single TransformDocument returned contains an ole:olecontents property that gives the list of embedded objects that have been extracted.

The olecontents schema is defined in olecontent.xsd. Each element of the list contains the following fields:

<xs:complexType name="olecontent">
  <xs:sequence>
    <xs:element name="displayname" type="xs:string" />
    <xs:element name="filename" type="xs:string" />
    <xs:element name="mime-type" type="xs:string" />
    <xs:element name="data" type="nxs:content"/>
    <xs:element name="thumbnail-mime-type" type="xs:string"/>
    <xs:element name="thumbnail-data" type="nxs:content"/>
  </xs:sequence>
</xs:complexType>

Each olecontent element contains the file data (data and mime-type) and the thumbnail data (thumbnail-mime-type and thumbnail-data). The displayname is the name retrieved from the office file if the object was named, otherwise the internal name it was given in the office file. The filename field is built from the displayname and the extension deduced from the mime-type.

Based on this new schema, the File doctype is extended in the org.nuxeo.ecm.platform.transform.oleextract.coretypes component, which contributes the type and schema needed by oleExtract:

<extension target="org.nuxeo.ecm.core.schema.TypeService" point="doctype">
  <doctype name="FileWithOle" extends="File">
    <schema name="olecontents" />
  </doctype>
</extension>

A new FileWithOle doctype, based on File, is defined. It can be a subtype of Workspace and Folder and is defined as a contribution to org.nuxeo.ecm.platform.types.TypeService, which allows new documents based on it to be created.

The oleExtract transform plugin has been bound to the oleExtractModifier docModifier. The contribution to the extension point is defined as follows:

<extension target="org.nuxeo.ecm.platform.modifier.service.DocModifierService"
           point="docTypeToTransformer">
  <documentation>
    docModifier for oleExtract transform plugin.
  </documentation>
  <docModifier name="oleExtractModifier"
               documentType="FileWithOle"
               transformationPluginName="oleExtract"
               sourceFieldName="file:content"
               destinationFieldName="file:content">
    <coreEvent>documentCreated</coreEvent>
    <coreEvent>documentModified</coreEvent>
    <customField name="olecontents:olecontents" transformParamName="ole:olecontents"/>
    <customOutputField outputParamName="ole:olecontents" name="olecontents:olecontents"/>
  </docModifier>
</extension>

This contribution reacts to document creation and modification. It receives the initial office file from file:content and gives back olecontents:olecontents, mapped to the ole:olecontents property we saw above. The initial file is returned unchanged.

With this docModifier reacting to these events, the oleExtract results can now be integrated in the application. The org.nuxeo.ecm.platform.transform.oleextract.action component defines an ActionService contribution:

<action id="TAB_OLEOBJECT" link="/incl/tabs/document_oleobjects.xhtml"
        enabled="true" label="action.view.ole" order="49">
  <category>VIEW_ACTION_LIST</category>
  <filter id="view_ole">
    <rule grant="true">
      <type>FileWithOle</type>
    </rule>
  </filter>
</action> 

The TAB_OLEOBJECT action defines a new tab that lists the olecontents:olecontents elements and provides links for retrieving each file, provided the current document is of type FileWithOle.

18.3.3.2. Extraction details

OleExtract is based on the parsing of OpenDocument File format. If the submitted file is an ODF one, then it is unzipped and processed without any connection to OpenOffice.org. If the file is not an ODF one, then a converter plugin is used according to the source mime type. In this case, OpenOffice.org is required as a JODConverter dependency.

Once we have an ODF file, it is unzipped and its content.xml is parsed to find the draw:object-ole, draw:object and draw:image tags. For each one, the name of the resource is retrieved if it exists. Then, for each resource found, the alternate view is retrieved so that a preview can be offered when listing this content (still under development).

ODF resources in an ODF file (think of a spreadsheet chart embedded in a text document) are stored as directories of flat XML files, while other resources are stored in a binary format. So, for ODF resources, the global manifest file is parsed to isolate the files with their correct manifest:media-type so that the new ODF archive for the OLE object can be built. Once the new manifest file is created, the embedded ODF directory is zipped and this binary form is returned.
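
The content.xml scan described above can be illustrated with the JDK zip and DOM APIs. This is a minimal sketch, not the OfficeOleExtractorPluginImpl code: OdfEmbeddedScanSketch, listEmbeddedResources and sampleArchive are hypothetical names, and the sample archive only contains a hand-written content.xml. Reading the resource name from the xlink:href attribute is an assumption based on the OpenDocument format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch of the scan step only (no manifest handling, no re-zipping).
class OdfEmbeddedScanSketch {

    static final String DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    // Returns "tag:href" for every draw:object-ole, draw:object and draw:image
    // element found in the content.xml entry of the given ODF archive.
    static List<String> listEmbeddedResources(byte[] odfArchive) {
        List<String> found = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(odfArchive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"content.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the entry first so that the XML parser never closes the zip stream.
                ByteArrayOutputStream xml = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int read;
                while ((read = zip.read(chunk)) != -1) {
                    xml.write(chunk, 0, read);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document content = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml.toByteArray()));
                for (String tag : new String[] {"object-ole", "object", "image"}) {
                    NodeList nodes = content.getElementsByTagNameNS(DRAW_NS, tag);
                    for (int i = 0; i < nodes.getLength(); i++) {
                        String href = ((Element) nodes.item(i)).getAttributeNS(XLINK_NS, "href");
                        if (!href.isEmpty()) {
                            found.add(tag + ":" + href);
                        }
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Cannot scan ODF archive", e);
        }
        return found;
    }

    // Builds a tiny in-memory archive holding only a content.xml, for demonstration.
    static byte[] sampleArchive() {
        String contentXml = "<office:document-content"
                + " xmlns:office='urn:oasis:names:tc:opendocument:xmlns:office:1.0'"
                + " xmlns:draw='" + DRAW_NS + "' xmlns:xlink='" + XLINK_NS + "'>"
                + "<draw:frame><draw:object xlink:href='./Object 1'/></draw:frame>"
                + "<draw:frame><draw:image xlink:href='Pictures/logo.png'/></draw:frame>"
                + "</office:document-content>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write(contentXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (java.io.IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(listEmbeddedResources(sampleArchive()));
    }
}
```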

18.3.4. Office files merger

This transform plugin is contained in the nuxeo-plateform-transform-plugin-officemerger module and is not included by default in the platform. Its purpose is to build a new file by merging the list of files given as parameters. It uses OpenOffice.org merging capabilities, mainly through the insertDocumentFromURL UNO method for text documents.

The public methods merge and mergeStreams are available from OfficeMergerImpl and return a SerializableInputStream containing the resulting document.

OfficeMergerImpl merger = new OfficeMergerImpl();

// Merge from File objects
SerializableInputStream result = merger.merge(sourceFiles, engineType, converter,
        outlineRank, withPageBreaks);

// Or merge from streams
SerializableInputStream streamResult = merger.mergeStreams(sourceStreams, engineType,
        converter, outlineRank, withPageBreaks);

Some options have been added to enhance the building of the main document. Here is the list of the arguments of the merge method:

  • sourceFiles/sourceStreams: Ordered array of File objects or streams to be merged.

  • engineType: The OpenOffice.org API to use depends on the nature of the source documents. If the source files are text files, the user will probably want a text file as a result, while slides are expected to produce a presentation. This String argument tells which engine to use (text, presentation, spreadsheet; only text is implemented so far) - Default text

  • converter: Once the document is built, the Document Converter plugin can be called automatically to create the final document. This is the name of the converter to use (e.g. any2pdf); an exception is raised if it does not exist or if the mime type deduced from the engineType is not supported. If an empty string is provided, no transform occurs at the end of the merging - Default empty

  • outlineRank: This is the rank (compared to the file list) where a Table Of Content may appear. For example, if the value is 3, then the first two files of the file list are inserted, the T.O.C is built and inserted and then the remaining files are processed. The Table of Content is refreshed at the end of the whole insertion. A value of 0 means no T.O.C. - Default 0

  • withPageBreaks: A boolean that adds or removes page breaks between file insertions - Default true
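
The effect of outlineRank on the insertion order can be sketched as follows. OutlineRankSketch and insertionOrder are hypothetical names; the sketch only models the ordering described in the list above, and clamping a rank larger than the number of files is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: models where the table of contents lands among the files.
class OutlineRankSketch {

    static final String TOC = "<table of contents>";

    // outlineRank is 1-based; 0 means no table of contents.
    static List<String> insertionOrder(List<String> sourceFiles, int outlineRank) {
        List<String> steps = new ArrayList<>(sourceFiles);
        if (outlineRank > 0) {
            // Ranks past the end of the list are clamped (assumption).
            int index = Math.min(outlineRank - 1, steps.size());
            steps.add(index, TOC);
        }
        return steps;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("cover.odt", "intro.odt", "body.odt", "annex.odt");
        // With rank 3, the first two files come before the table of contents.
        System.out.println(insertionOrder(files, 3));
    }
}
```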

The plugin engine's merge method is available directly, but it will mainly be used through a Transform call. The Transform name is officeMerger and the package name is org.nuxeo.ecm.platform.transform.plugin.officemerger:

<plugin name="OfficeMergerPlugin"
        class="org.nuxeo.ecm.platform.transform.plugin.officemerger.impl.OfficeMergerImpl"
        destinationMimeType="application/vnd.oasis.opendocument.text">
  <sourceMimeType>application/msword</sourceMimeType>
  <sourceMimeType>application/vnd.oasis.opendocument.text</sourceMimeType>
  <sourceMimeType>application/vnd.sun.xml.writer</sourceMimeType>
  <option name="ooo_host_name">localhost</option>
  <option name="ooo_host_port">8100</option>
</plugin>

Note that OpenOffice.org has to be listening for incoming UNO connections on the specified interface ooo_host_name and port ooo_host_port. Otherwise, an OpenOfficeException will be raised.

Once defined, the officeMerger transform can be called, passing the options like this:

Map<String, Serializable> mergingOptions = new HashMap<String, Serializable>();
mergingOptions.put("engineType", "text");
mergingOptions.put("converter", "any2pdf");
mergingOptions.put("outlineRank", 0);
mergingOptions.put("withPageBreaks", false);

// Options are keyed by plugin name
Map<String, Map<String, Serializable>> options =
        new HashMap<String, Map<String, Serializable>>();
options.put("officeMerger", mergingOptions);

List<TransformDocument> results = transformer.transform(options, sourceFiles);

Note that mergingOptions can be incomplete or even null. The options will then take their default values.

The results list contains the final merged document (converted, if requested) at its first index.

18.3.5. XSL Transformation

The XSLT plugin provides XSL transformations in Nuxeo. It allows you to transform XML documents using an XSL stylesheet as defined in the XSLT Specification. The plugin is implemented in the org.nuxeo.ecm.platform.transform.plugin.xslt.impl.XSLTPluginImpl class and defined as a plugin contribution to org.nuxeo.ecm.platform.transform.service.TransformService.

The XSLT plugin accepts XML documents as source files, and you must provide the XSL stylesheet as a plugin option named stylesheet. The XSL stylesheet must be provided as a Blob.

Then, you can easily transform your documents:

final Map<String, Serializable> pluginOptions =
        new HashMap<String, Serializable>();
pluginOptions.put("stylesheet", (FileBlob) getXSLStylesheetBlob(...));
final Map<String, Map<String, Serializable>> options =
        new HashMap<String, Map<String, Serializable>>();
options.put("xslt", pluginOptions);

TransformServiceCommon service = TransformServiceDelegate.getLocalTransformService();
final List<TransformDocument> results = service.transform(
        "xslt", options, xmlSourceFiles);

The mime-type of the resulting documents depends on the method attribute of the xsl:output element specified in the XSL stylesheet. If no method attribute is defined in the XSL stylesheet, a default value is chosen for it as defined in the XSLT Specification.

The mime-type is chosen as follows:

  • text/html: if the output method is html.

  • text/plain: if the output method is text.

  • text/xml: if the output method is xml.
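
The mapping above can be sketched with the standard JAXP API. This is an illustration only, not the XSLTPluginImpl code: XsltMimeTypeSketch, mimeTypeFor and declaredOutputMethod are hypothetical names, and reading the method through getOutputProperty is an assumption about how such a lookup could be done.

```java
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper modelling the output method to mime-type mapping.
class XsltMimeTypeSketch {

    // Maps an XSLT output method to the mime-type listed in the text.
    static String mimeTypeFor(String outputMethod) {
        switch (outputMethod) {
            case "html":
                return "text/html";
            case "text":
                return "text/plain";
            case "xml":
                return "text/xml";
            default:
                throw new IllegalArgumentException("Unsupported output method: " + outputMethod);
        }
    }

    // Compiles the stylesheet and reads the method declared by xsl:output.
    static String declaredOutputMethod(String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            return transformer.getOutputProperty(OutputKeys.METHOD);
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException("Invalid stylesheet", e);
        }
    }

    public static void main(String[] args) {
        String stylesheet = "<xsl:stylesheet version='1.0'"
                + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
                + "<xsl:output method='text'/>"
                + "<xsl:template match='/'>ok</xsl:template>"
                + "</xsl:stylesheet>";
        System.out.println(mimeTypeFor(declaredOutputMethod(stylesheet)));
    }
}
```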

If there is no XSL stylesheet provided to the plugin or if an error occurs during the transformation (corrupted xml or xsl for instance), a TransformException is thrown.