Transforms are operations that perform any action modifying an input document. This can cover file conversion as well as mail merging or information extraction.
The various plugins are stored in the nuxeo-plateform-transforms-plugins.* module. They are declared as contributions to the org.nuxeo.ecm.platform.transform.service.TransformService component at the plugins extension point.
The plugins are the engines that perform the needed transformations.
A transform plugin follows a framework defined in the nuxeo-plateform-transforms-core module. It is basically a class that extends org.nuxeo.ecm.platform.transform.interfaces.Plugin.
The main common method that has to be exposed is transform:
List<TransformDocument> transform( Map<String, Serializable> options, TransformDocument... sources);
The transform method returns a list of TransformDocument and accepts an options map and TransformDocument sources.
The TransformDocument object is defined in org.nuxeo.ecm.platform.transform.interfaces.TransformDocument and holds the binary content as well as other information such as the mime-type.
The options map holds all the options to be passed to the plugin. Please note that the keys are the plugin names. See the officeMerger plugin below for an implementation example.
We first declare a contribution to the plugins extension point of org.nuxeo.ecm.platform.transform.service.TransformService:
<extension target="org.nuxeo.ecm.platform.transform.service.TransformService" point="plugins">
  <plugin name="any2odt"
      class="org.nuxeo.ecm.platform.transform.plugin.joooconverter.impl.JOOoConvertPluginImpl"
      destinationMimeType="application/vnd.oasis.opendocument.text">
    <sourceMimeType>text/xml</sourceMimeType>
    <sourceMimeType>text/plain</sourceMimeType>
    <sourceMimeType>text/rtf</sourceMimeType>
    <!-- Microsoft office documents -->
    <sourceMimeType>application/msword</sourceMimeType>
    <!-- OpenOffice.org 1.x documents -->
    <sourceMimeType>application/vnd.sun.xml.writer</sourceMimeType>
    <sourceMimeType>application/vnd.sun.xml.writer.template</sourceMimeType>
    <!-- OpenOffice.org 2.x documents -->
    <sourceMimeType>application/vnd.oasis.opendocument.text</sourceMimeType>
    <sourceMimeType>application/vnd.oasis.opendocument.text-template</sourceMimeType>
    <option name="ooo_host_name">localhost</option>
    <option name="ooo_host_port">8100</option>
  </plugin>
</extension>
The name attribute will be used to declare the transform chain. The class attribute is the class that does the actual work of the transformation, while destinationMimeType is the mime-type of the result of the transform.
After the attributes, the <sourceMimeType> nodes define the input mime-types that the transform supports. In the presented case, we can see that the any2odt plugin is able to handle text, Microsoft Word, OOo 1.x and OpenDocument (OOo 2.x) files, and outputs an application/vnd.oasis.opendocument.text OpenDocument file. Options can also be added, as shown by the <option> tags named ooo_host_name and ooo_host_port.
Plugins can be combined to build transform chains. These chains are declared in a transformer, which is a contribution to the transformers extension point of the org.nuxeo.ecm.platform.transform.service.TransformService component.
<extension target="org.nuxeo.ecm.platform.transform.service.TransformService" point="transformers">
  <transformer name="any2text"
      class="org.nuxeo.ecm.platform.transform.transformer.TransformerImpl">
    <plugins>
      <plugin name="any2pdf"/>
      <plugin name="pdf2text"/>
    </plugins>
  </transformer>
</extension>
A transformer is defined by its name. It is this name that is used to retrieve the transformer from the transform service.
Then the plugins involved in the chain are listed. We can see that our any2text transformer is composed of two chained plugins: any2pdf, then pdf2text.
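The chaining idea can be illustrated with a minimal, self-contained sketch. It is not the TransformerImpl code: ChainSketch and runChain are hypothetical names, and each stand-in "plugin" only tracks the mime-type of its output, assuming that the result of one plugin is fed as the source of the next.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration of a transform chain; not part of the Nuxeo API.
final class ChainSketch {

    /** Applies each plugin in order, feeding the output of one to the next. */
    static <T> T runChain(List<UnaryOperator<T>> plugins, T source) {
        T current = source;
        for (UnaryOperator<T> plugin : plugins) {
            current = plugin.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins for the any2pdf and pdf2text plugins: they only return
        // the mime-type that the real plugin would produce.
        UnaryOperator<String> any2pdf = mime -> "application/pdf";
        UnaryOperator<String> pdf2text = mime -> "text/plain";

        String result = runChain(Arrays.asList(any2pdf, pdf2text), "application/msword");
        System.out.println(result); // prints "text/plain"
    }
}
```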
Obviously, a transformer made of a single plugin is legal, as we can see with our previous any2odt plugin.
<extension target="org.nuxeo.ecm.platform.transform.service.TransformService" point="transformers">
  <!-- This transformer uses the OOo plugin to transform documents to ODT -->
  <transformer name="any2odt"
      class="org.nuxeo.ecm.platform.transform.transformer.TransformerImpl">
    <plugins>
      <plugin name="any2odt"/>
    </plugins>
  </transformer>
</extension>
Once a transform plugin has been declared and the transformer is known, we can use it to perform various transformation actions.
TransformService service = NXTransform.getTransformService();
Transformer transformer = service.getTransformerByName("any2pdf");
List<TransformDocument> results = transformer.transform(null,
    new TransformDocumentImpl(sourceStream, "application/vnd.oasis.opendocument.text"));
SerializableInputStream resultStream = results.get(0).getStream();
We first get the TransformService that exposes all the available transforms. Then a specific transformer is retrieved with the getTransformerByName method of the service. The name is the one that has been declared in the contribution to the transformers extension point of the org.nuxeo.ecm.platform.transform.service.TransformService component.
Then the transformer exposes a transform method that returns a list of TransformDocument. The arguments are:
options: the plugin options (the keys are the plugin names). We pass null here as we do not have any options. See the officeMerger plugin for a more detailed example.
the list of sources, as TransformDocument instances.
There are three levels of options, each overriding the previous one. First, the plugin options are the defaults. Then any option declared in the transformer for this plugin overrides them. Finally, any code-defined options are merged, overriding any option that may already have been defined. Please note again that options are defined on a plugin name basis.
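The precedence between the three levels can be sketched with plain maps. This is only an illustration of the override order described above, not the service's actual merging code; OptionLayers, merge and the option values used in main are hypothetical.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the Nuxeo API: it only illustrates
// the "later level wins" precedence for the options of one plugin.
final class OptionLayers {

    /** Merges the layers in order; later layers override earlier ones, null layers are skipped. */
    @SafeVarargs
    static Map<String, Serializable> merge(Map<String, Serializable>... layers) {
        Map<String, Serializable> merged = new HashMap<>();
        for (Map<String, Serializable> layer : layers) {
            if (layer != null) {
                merged.putAll(layer);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Serializable> pluginDefaults = new HashMap<>();
        pluginDefaults.put("ooo_host_name", "localhost");
        pluginDefaults.put("ooo_host_port", "8100");

        Map<String, Serializable> transformerOptions = new HashMap<>();
        transformerOptions.put("ooo_host_port", "8200"); // overrides the plugin default

        Map<String, Serializable> codeOptions = new HashMap<>();
        codeOptions.put("ooo_host_name", "ooo.example.org"); // overrides both previous levels

        Map<String, Serializable> effective = merge(pluginDefaults, transformerOptions, codeOptions);
        System.out.println(effective.get("ooo_host_name")); // prints "ooo.example.org"
        System.out.println(effective.get("ooo_host_port")); // prints "8200"
    }
}
```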
A TransformDocument instance can be constructed from:
a source stream
a source stream and its mime-type
a blob
In our example, we use the second way and give a sourceStream and the mime-type of an ODF document.
Once the input list is processed, the results list contains all the transformed files as TransformDocument instances, from which you can retrieve the SerializableInputStream stream with the getStream method, the mime-type with getMimetype and the blob with the getBlob method.
An alternate way of using the transform is to call it directly from the service:
List<TransformDocument> results = service.transform(converter, null, new TransformDocumentImpl(stream, sourceMimetype));
The arguments are the same, and the converter name is given as the first argument. Another variant taking blobs as input instead of TransformDocument is also available.
Before calling a transform, we can also check that the source mime-type is supported by calling the isMimetypeSupportedByPlugin method. Be careful though: this method expects a plugin name, which may be different from the transformer name.
Transforms can be called directly, but they are also part of the docModifier framework which, reacting to events, can call transforms to alter existing information or generate new information (see the oleExtract plugin below or the docModifier documentation).
All the transform plugin packages start with org.nuxeo.ecm.platform.transform.plugin. Some of them are optional and not included in the default platform.
The document conversion plugin is a generic transform delivered with Nuxeo that allows transforming a file from one format to another. It uses the third-party JODConverter tool. The transform is implemented in the org.nuxeo.ecm.platform.transform.plugin.joooconverter.impl.JOOoConvertPluginImpl class and defined as a plugin contribution to org.nuxeo.ecm.platform.transform.service.TransformService.
Then it can be classically called as explained in the previous section:
TransformService service = NXTransform.getTransformService();
Transformer transformer = service.getTransformerByName("any2pdf");
List<TransformDocument> results = transformer.transform(null,
    new TransformDocumentImpl(sourceStream, "application/vnd.oasis.opendocument.text"));
SerializableInputStream resultStream = results.get(0).getStream();
The transform call relays to JOOoConvertPluginImpl, which first connects to OOo on the port and host defined in the contribution (usually localhost:8100) or in the options (not defined in our example). It then acquires an OpenOfficeDocumentConverter from the JODConverter tool and calls its convert method with the requested target mimetype and the source file. Note that in a future version, the StreamOpenOfficeDocumentConverter class will be used to avoid dealing with File objects. This limitation will be lifted when a new version of OpenOffice.org (2.3) is released and fixes a regression on loading streams.
Before any call to the underlying OpenOffice.org converter, once in the transformation engine JOOoConvertPluginImpl, the source document mimetype is tested. If it is the same as the destinationMimeType defined by the plugin, the source file is returned immediately as the result, unless the mimetype also occurs in the <sourceMimeType> nodes of the plugin, which allows self-transformations. By default, the any2pdf plugin returns immediately if an application/pdf file is submitted, while any OpenDocument transformation (any2odt, any2ods, any2odp) processes the files, as each one lists its own mimetype, like <sourceMimeType>application/vnd.oasis.opendocument.text</sourceMimeType>. This makes it possible to clean and validate files forged by hand and possibly apply automatic treatments on the OpenOffice.org side.
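The short-circuit rule above boils down to a small predicate. The following is a minimal sketch of that rule as described in the text, not the JOOoConvertPluginImpl source; SelfTransformCheck and returnsSourceUnchanged are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the mimetype test described above.
final class SelfTransformCheck {

    /**
     * Returns true when the source would be handed back untouched: its mimetype
     * already equals the destination mimetype and the plugin does not list that
     * mimetype among its accepted source mimetypes.
     */
    static boolean returnsSourceUnchanged(String sourceMime, String destinationMime,
            Set<String> declaredSourceMimes) {
        return sourceMime.equals(destinationMime) && !declaredSourceMimes.contains(sourceMime);
    }

    public static void main(String[] args) {
        String pdf = "application/pdf";
        String odt = "application/vnd.oasis.opendocument.text";

        // any2pdf-like plugin: PDF is not listed as a source, so a PDF is returned as is.
        Set<String> any2pdfSources = new HashSet<>(Arrays.asList("application/msword", odt));
        System.out.println(returnsSourceUnchanged(pdf, pdf, any2pdfSources)); // prints "true"

        // any2odt-like plugin: ODT is listed as a source, so the file is processed again.
        Set<String> any2odtSources = new HashSet<>(Arrays.asList("application/msword", odt));
        System.out.println(returnsSourceUnchanged(odt, odt, any2odtSources)); // prints "false"
    }
}
```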
Since OOo 2.3.0, it is possible to use InputStreams as source documents, so that File objects are no longer needed. This is useful if we want to isolate the OOo server on a separate machine (and, if heavy work is expected, use a farm with load balancing). As this feature is only available with OOo 2.3+ (due to a bug in previous versions), there is a nuxeo.property to define the OOo version used.
org.nuxeo.ecm.platform.transform.ooo.version=2.3.0
Eventually, the JODConverter tool should be able to return the OOo version it is using, so that this property will no longer be needed. For the moment, this property has to be set. However, as OOo stream loading has been reported by JODConverter users as less efficient than File/URL loading, the streaming method is only used if OOo is really a remote instance. OOo is considered local if the ooo_host_name transformer option is 'localhost' or starts with '127.0.0'. And of course, streaming is only used if the declared OOo version is 2.3.0 or later.
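The decision described above can be sketched as follows. This is an illustration under stated assumptions, not the plugin's code: StreamingDecision and its methods are hypothetical, the version is assumed to be a dot-separated list of integers, and "2.3.0 or later" is assumed to be the intended threshold.

```java
// Hypothetical illustration of the "stream only for a remote OOo 2.3+" rule.
final class StreamingDecision {

    /** Local hosts, as described in the text: 'localhost' or anything starting with '127.0.0'. */
    static boolean isLocalHost(String host) {
        return "localhost".equals(host) || host.startsWith("127.0.0");
    }

    /** Compares dot-separated numeric versions; missing parts count as 0 (assumption). */
    static int compareVersions(String left, String right) {
        String[] a = left.split("\\.");
        String[] b = right.split("\\.");
        int length = Math.max(a.length, b.length);
        for (int i = 0; i < length; i++) {
            int x = i < a.length ? Integer.parseInt(a[i]) : 0;
            int y = i < b.length ? Integer.parseInt(b[i]) : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    /** Streams are used only for a remote instance whose declared version is at least 2.3.0. */
    static boolean useStreams(String oooHostName, String declaredVersion) {
        return !isLocalHost(oooHostName) && compareVersions(declaredVersion, "2.3.0") >= 0;
    }

    public static void main(String[] args) {
        System.out.println(useStreams("localhost", "2.3.0"));       // prints "false"
        System.out.println(useStreams("ooo.example.org", "2.3.0")); // prints "true"
        System.out.println(useStreams("ooo.example.org", "2.2.1")); // prints "false"
    }
}
```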
As the underlying engine is based on OpenOffice.org, one can extend the document converter by supporting OOo loading options. The transform plugin options mechanism is fully supported, so they can be defined at the plugin, transformer and code levels. The fields that can be passed to OOo are listed in the MediaDescriptor IDL reference and have to be bound to the plugin name.
Here is an example passing options for loading a special document that is password protected and has a macro bound to the autostart event, which deletes the content of the file and puts the word DELETED in the document. This simulates any other heavy process such as document merging or an automatic mail-merge from a database.
Map<String, Map<String, Serializable>> options = new HashMap<String, Map<String, Serializable>>();
Map<String, Serializable> pdfOptions = new HashMap<String, Serializable>();
pdfOptions.put("Password", "TheDocumentPassWord");
short ALWAYS_EXECUTE_NO_WARN = 4;
pdfOptions.put("MacroExecutionMode", ALWAYS_EXECUTE_NO_WARN);
pdfOptions.put("ReadOnly", false);
pdfOptions.put("Hidden", false);
options.put("any2pdf", pdfOptions);
// Note: due to the password, the file mimetype cannot be sniffed,
// so the mimetype has to be supplied to TransformDocumentImpl
results = service.transform("any2pdf", options,
    new TransformDocumentImpl(getBlobFromPath(path), "application/vnd.oasis.opendocument.text"));
assertTrue(results.size() > 0);
// The macro on the open event replaces the text by "DELETED"
assertEquals("pdf content", "DELETED", DocumentTestUtils.readPdfText(pdfFile));
First, we build the plugin options map. The Password field is in charge of sending the password string to the OOo loader so that the file can be opened. The MacroExecutionMode field defines how macros and scripts are handled at startup. By default, as a consequence of the -headless OOo mode, the NEVER_EXECUTE = 0 value is used. One can change this default behaviour by using any value listed in the OOo MacroExecMode constant group. One important thing to note is that the type matters: Password requires a String argument while MacroExecutionMode requires a short one. The types have to be correct, otherwise the field will not be handled by OOo.
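The type requirement is easy to get wrong in Java because of autoboxing: a short variable is stored in the map as a java.lang.Short, whereas a bare integer literal is stored as a java.lang.Integer. The sketch below only demonstrates this language behaviour; OptionTypes and the key MacroExecutionModeAsInt are hypothetical and exist purely for the comparison.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo of how option values are boxed when put in the options map.
final class OptionTypes {

    static Map<String, Serializable> build() {
        Map<String, Serializable> options = new HashMap<>();
        options.put("Password", "TheDocumentPassWord");

        short alwaysExecuteNoWarn = 4;
        // A short variable is boxed as java.lang.Short: the expected type.
        options.put("MacroExecutionMode", alwaysExecuteNoWarn);

        // Hypothetical key, for comparison only: an int literal is boxed as java.lang.Integer.
        options.put("MacroExecutionModeAsInt", 4);
        return options;
    }

    public static void main(String[] args) {
        Map<String, Serializable> options = build();
        System.out.println(options.get("MacroExecutionMode").getClass().getName());      // prints "java.lang.Short"
        System.out.println(options.get("MacroExecutionModeAsInt").getClass().getName()); // prints "java.lang.Integer"
    }
}
```

Declaring the value in a short variable (or casting the literal with (short) 4) before putting it in the map avoids the silent Integer boxing.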
By default, JODConverter sets the ReadOnly option to true. As we want to modify the document, we have to set the ReadOnly flag to false. The Hidden flag also has to be set to false. This trick is due to a problem in the OpenOffice.org PDF engine, which does not seem able to handle document modification while Hidden. For a headless, server-deployed OOo instance, this should not be a major problem.
Once the options have been defined, they are added to the global options map under the plugin name key (here, any2pdf).
Then the transform can be called as usual. Please note that the mimetype of a password-protected document has to be passed explicitly to the TransformDocumentImpl constructor, as it cannot be sniffed by the Mimetype service for the moment.
Finally, as expected, a PDF document is returned with its content changed to the DELETED string.
OLE objects are objects included in office files that can be edited as standalone ones. For example, a spreadsheet table may be included in a report so that the presented data is always up to date.
This plugin is located in the nuxeo-plateform-transform-plugin-oleextract module and is not included by default in the platform.
The purpose of this plugin is to extract all these OLE objects and provide them as standalone files so that they can be checked individually. It has also been extended to extract images. It has a classical transform plugin structure: the plugin name is oleExtractPlugin, bound to org.nuxeo.ecm.platform.transform.plugin.oleextract.impl.OfficeOleExtractorPluginImpl. The transform name is oleExtract.
As an example, it can be called like this:
List<TransformDocument> results = service.transform(TRANSFORMER_NAME, null,
    new TransformDocumentImpl(stream, mimetype));
TransformDocument result = results.get(0);
List<Map<String, Serializable>> ole =
    (List<Map<String, Serializable>>) result.getPropertyValue("ole:olecontents");
First, the transform service is called in the usual way, TRANSFORMER_NAME being set to oleExtract. After processing, the only TransformDocument returned contains an ole:olecontents property that gives the list of embedded objects that have been extracted.
The olecontents schema is defined in olecontent.xsd. Each element of the list contains the following fields:
<xs:complexType name="olecontent">
  <xs:sequence>
    <xs:element name="displayname" type="xs:string" />
    <xs:element name="filename" type="xs:string" />
    <xs:element name="mime-type" type="xs:string" />
    <xs:element name="data" type="nxs:content"/>
    <xs:element name="thumbnail-mime-type" type="xs:string"/>
    <xs:element name="thumbnail-data" type="nxs:content"/>
  </xs:sequence>
</xs:complexType>
Each olecontent element contains the file data (data and mime-type) and the thumbnail data (thumbnail-mime-type and thumbnail-data). The displayname is the name retrieved from the office file if the object was named, otherwise the internal name given to it in the office file. The filename field is built from the displayname and the extension deduced from the mime-type.
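How a filename can be derived from the displayname and the mime-type is sketched below. This is not the plugin's implementation: OleFileNames, buildFilename, the small extension table and the fallback for unknown mime-types are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of building a filename from displayname + mime-type.
final class OleFileNames {

    // Small illustrative table; the real plugin presumably relies on a fuller mapping.
    private static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("application/vnd.oasis.opendocument.text", "odt");
        EXTENSIONS.put("application/vnd.oasis.opendocument.spreadsheet", "ods");
        EXTENSIONS.put("image/png", "png");
    }

    /** Appends the extension deduced from the mime-type; unknown types keep the bare name (assumption). */
    static String buildFilename(String displayName, String mimeType) {
        String extension = EXTENSIONS.get(mimeType);
        return extension == null ? displayName : displayName + "." + extension;
    }

    public static void main(String[] args) {
        System.out.println(buildFilename("Budget", "application/vnd.oasis.opendocument.spreadsheet")); // prints "Budget.ods"
    }
}
```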
Based on this new schema, the File doctype is extended in org.nuxeo.ecm.platform.transform.oleextract.coretypes, which contributes the type and schema needed by oleExtract:
<extension target="org.nuxeo.ecm.core.schema.TypeService" point="doctype">
  <doctype name="FileWithOle" extends="File">
    <schema name="olecontents" />
  </doctype>
</extension>
A new FileWithOle doctype, based on File, is defined. It can be a subtype of Workspace and Folder, and it is also declared as a contribution to org.nuxeo.ecm.platform.types.TypeService, which allows new documents based on it to be created.
The oleExtract transform plugin has been bound to the oleExtractModifier docModifier. The contribution to the extension point is the following:
<extension target="org.nuxeo.ecm.platform.modifier.service.DocModifierService" point="docTypeToTransformer">
  <documentation>
    docModifier for oleExtract transform plugin.
  </documentation>
  <docModifier name="oleExtractModifier"
      documentType="FileWithOle"
      transformationPluginName="oleExtract"
      sourceFieldName="file:content"
      destinationFieldName="file:content">
    <coreEvent>documentCreated</coreEvent>
    <coreEvent>documentModified</coreEvent>
    <customField name="olecontents:olecontents" transformParamName="ole:olecontents"/>
    <customOutputField outputParamName="ole:olecontents" name="olecontents:olecontents"/>
  </docModifier>
</extension>
This contribution reacts to document creation and modification. It receives the initial office file from file:content and gives back olecontents:olecontents, mapped to the ole:olecontents property we saw above. The initial file is returned unchanged.
With this docModifier reacting to some events, the oleExtract results can now be integrated in the application. The org.nuxeo.ecm.platform.transform.oleextract.action component defines an ActionService contribution:
<action id="TAB_OLEOBJECT"
    link="/incl/tabs/document_oleobjects.xhtml"
    enabled="true"
    label="action.view.ole"
    order="49">
  <category>VIEW_ACTION_LIST</category>
  <filter id="view_ole">
    <rule grant="true">
      <type>FileWithOle</type>
    </rule>
  </filter>
</action>
The TAB_OLEOBJECT action defines a new tab listing the olecontents:olecontents elements and providing links for retrieving each file, provided the current document is of type FileWithOle.
OleExtract is based on the parsing of OpenDocument File format. If the submitted file is an ODF one, then it is unzipped and processed without any connection to OpenOffice.org. If the file is not an ODF one, then a converter plugin is used according to the source mime type. In this case, OpenOffice.org is required as a JODConverter dependency.
Once we have an ODF file, it is unzipped and its content.xml is parsed to find the draw:object-ole, draw:object and draw:image related tags. For each one, the name of the resource is retrieved if it exists. Then, for each resource found, the alternate view is retrieved so that a preview can be proposed when listing this content (still under development).
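The tag lookup can be sketched with the standard JAXP DOM API. This is a minimal illustration, not the plugin's parser: OdfEmbeddedScanner and findEmbeddedHrefs are hypothetical names, the XML is passed as a string rather than read from the unzipped archive, and it assumes the resource location is carried by the xlink:href attribute, as in the OpenDocument format.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical illustration of scanning content.xml for embedded resources.
final class OdfEmbeddedScanner {

    static final String DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    /** Returns the xlink:href of every draw:object-ole, draw:object and draw:image element. */
    static List<String> findEmbeddedHrefs(String contentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required to look elements up by namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(contentXml.getBytes(StandardCharsets.UTF_8)));
            List<String> hrefs = new ArrayList<>();
            for (String tag : new String[] {"object-ole", "object", "image"}) {
                NodeList nodes = doc.getElementsByTagNameNS(DRAW_NS, tag);
                for (int i = 0; i < nodes.getLength(); i++) {
                    // getAttributeNS returns an empty string when the attribute is absent
                    String href = ((Element) nodes.item(i)).getAttributeNS(XLINK_NS, "href");
                    if (!href.isEmpty()) {
                        hrefs.add(href);
                    }
                }
            }
            return hrefs;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot parse content.xml", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root xmlns:draw=\"" + DRAW_NS + "\" xmlns:xlink=\"" + XLINK_NS + "\">"
                + "<draw:object xlink:href=\"./Object 1\"/>"
                + "<draw:image xlink:href=\"Pictures/chart.png\"/>"
                + "</root>";
        System.out.println(findEmbeddedHrefs(xml)); // prints "[./Object 1, Pictures/chart.png]"
    }
}
```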
ODF resources in an ODF file (think of a spreadsheet diagram embedded in a text document) are stored in directories, in flat XML form, while other resources are stored in a binary format. So for ODF resources, the global manifest file is parsed to isolate the files with their correct manifest:media-type so that the new ODF archive for the OLE object can be built. Once the new manifest file is created, the embedded ODF directory is zipped and this binary form is returned.
This transform plugin is contained in the nuxeo-plateform-transform-plugin-officemerger module and is not included by default in the platform. Its purpose is to build a new file as the result of merging the list of files given as parameters. It uses OpenOffice.org merging capabilities, mainly through the insertDocumentFromURL UNO method for text documents.
The public method merge is available from OfficeMergerImpl and returns a SerializableInputStream containing the resulting document.
OfficeMergerImpl merger = new OfficeMergerImpl();
// Merge from File objects
SerializableInputStream result = merger.merge(sourceFiles, engineType, converter, outlineRank, withPageBreaks);
// Or merge from streams
SerializableInputStream streamResult = merger.mergeStreams(sourceStreams, engineType, converter, outlineRank, withPageBreaks);
Some options have been added to enhance the building of the main document. Here is the list of the arguments of the merge method:
sourceFiles/sourceStreams: Ordered array of File objects or streams to be merged.
engineType: Depending on the nature of the source documents, the OpenOffice.org API to be used is not the same. If the source files are text files, the user will probably want a text file as a result, while with slides the result is expected to be a presentation. This String argument tells which engine is to be used (text, presentation, spreadsheet; only text is implemented so far). Default: text
converter: Once the document is built, the Document Converter plugin can be called automatically to create the final document. This is the converter name that is expected (e.g. any2pdf), and an exception is raised if it does not exist or if the mime type deduced from the engineType is not supported. If an empty string is provided, no transform occurs at the end of the merging. Default: empty
outlineRank: This is the rank (relative to the file list) where a table of contents may appear. For example, if the value is 3, then the first two files of the file list are inserted, the T.O.C. is built and inserted, and then the remaining files are processed. The table of contents is refreshed at the end of the whole insertion. A value of 0 means no T.O.C. Default: 0
withPageBreaks: A boolean that adds or removes page breaks between file insertions. Default: true
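The meaning of outlineRank can be made concrete with a small sketch that computes the insertion order. It is not taken from OfficeMergerImpl: MergeOrder, insertionOrder and the "<T.O.C.>" marker are hypothetical, and clamping a rank larger than the file list to the end of the document is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of where the table of contents lands for a given outlineRank.
final class MergeOrder {

    static final String TOC = "<T.O.C.>";

    /** Returns the files in insertion order, with the T.O.C. at the given 1-based rank (0 = none). */
    static List<String> insertionOrder(List<String> files, int outlineRank) {
        List<String> order = new ArrayList<>(files);
        if (outlineRank > 0) {
            // Rank 3 means: two files first, then the T.O.C., then the remaining files.
            int index = Math.min(outlineRank - 1, order.size());
            order.add(index, TOC);
        }
        return order;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("cover.odt", "intro.odt", "chapter1.odt", "chapter2.odt");
        System.out.println(insertionOrder(files, 3));
        // prints "[cover.odt, intro.odt, <T.O.C.>, chapter1.odt, chapter2.odt]"
        System.out.println(insertionOrder(files, 0));
        // prints "[cover.odt, intro.odt, chapter1.odt, chapter2.odt]"
    }
}
```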
The plugin engine, that is the merge method, is available directly, but its principal use will occur through a Transform call. The Transform name is officeMerger and the package name is org.nuxeo.ecm.platform.transform.plugin.officemerger.
<plugin name="OfficeMergerPlugin"
    class="org.nuxeo.ecm.platform.transform.plugin.officemerger.impl.OfficeMergerImpl"
    destinationMimeType="application/vnd.oasis.opendocument.text">
  <sourceMimeType>application/msword</sourceMimeType>
  <sourceMimeType>application/vnd.oasis.opendocument.text</sourceMimeType>
  <sourceMimeType>application/vnd.sun.xml.writer</sourceMimeType>
  <option name="ooo_host_name">localhost</option>
  <option name="ooo_host_port">8100</option>
</plugin>
Note that OpenOffice.org has to be listening for incoming UNO connections on the specified interface ooo_host_name and port ooo_host_port. If this is not the case, an OpenOfficeException will be raised.
So, once defined, the officeMerger transform can be called, passing the options like this:
Map<String, Map<String, Serializable>> options = new HashMap<String, Map<String, Serializable>>();
Map<String, Serializable> mergingOptions = new HashMap<String, Serializable>();
mergingOptions.put("engineType", "text");
mergingOptions.put("converter", "any2pdf");
mergingOptions.put("outlineRank", 0);
mergingOptions.put("withPageBreaks", false);
// Options are registered under the plugin name key
options.put("officeMerger", mergingOptions);
List<TransformDocument> results = transformer.transform(options, sourceFiles);
Note that mergingOptions can be incomplete or even null. The options will then take their default values.
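The fallback to default values can be pictured as follows. This is a sketch of the behaviour described above, using the defaults listed for the merge arguments; MergingDefaults and withDefaults are hypothetical names and the plugin may resolve its defaults differently.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of completing partial (or null) merging options.
final class MergingDefaults {

    /** Starts from the documented defaults and applies whatever the caller provided. */
    static Map<String, Serializable> withDefaults(Map<String, Serializable> given) {
        Map<String, Serializable> effective = new HashMap<>();
        effective.put("engineType", "text");
        effective.put("converter", "");        // empty: no conversion after merging
        effective.put("outlineRank", 0);       // 0: no table of contents
        effective.put("withPageBreaks", true);
        if (given != null) {
            effective.putAll(given);
        }
        return effective;
    }

    public static void main(String[] args) {
        Map<String, Serializable> partial = new HashMap<>();
        partial.put("converter", "any2pdf");

        Map<String, Serializable> effective = withDefaults(partial);
        System.out.println(effective.get("converter"));      // prints "any2pdf"
        System.out.println(effective.get("withPageBreaks")); // prints "true"
        System.out.println(withDefaults(null).get("engineType")); // prints "text"
    }
}
```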
The results list contains the final merged document (converted if requested) at the first index.
The XSLT plugin provides XSL transformations in Nuxeo. It allows you to transform XML documents using an XSL stylesheet as defined in the XSLT Specification. The plugin is implemented in the org.nuxeo.ecm.platform.transform.plugin.xslt.impl.XSLTPluginImpl class and defined as a plugin contribution to org.nuxeo.ecm.platform.transform.service.TransformService.
The XSLT plugin accepts XML documents as source files, and you must provide the XSL stylesheet as a plugin option named stylesheet. The XSL stylesheet must be provided as a Blob.
Then, you can easily transform your documents:
final Map<String, Serializable> pluginOptions = new HashMap<String, Serializable>();
pluginOptions.put("stylesheet", (FileBlob) getXSLStylesheetBlob(...));
final Map<String, Map<String, Serializable>> options = new HashMap<String, Map<String, Serializable>>();
options.put("xslt", pluginOptions);
TransformServiceCommon service = TransformServiceDelegate.getLocalTransformService();
final List<TransformDocument> results = service.transform("xslt", options, xmlSourceFiles);
The mime-type of the resulting documents is set depending on the method attribute of the xsl:output element specified in the XSL stylesheet. If there is no method attribute defined in the XSL stylesheet, a default value is chosen for the method attribute as defined in the XSLT Specification.
The mime-type is chosen as follows:
text/html: if the output method is html.
text/plain: if the output method is text.
text/xml: if the output method is xml.
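The mapping above fits in a single lookup. The sketch below restates it in code; XsltOutputMimeTypes and mimeTypeFor are hypothetical names, and rejecting any other output method with an exception is an assumption, since the text does not say how the plugin treats them.

```java
// Hypothetical illustration of the output method to mime-type mapping.
final class XsltOutputMimeTypes {

    /** Maps an xsl:output method to the mime-type listed above. */
    static String mimeTypeFor(String outputMethod) {
        switch (outputMethod) {
            case "html":
                return "text/html";
            case "text":
                return "text/plain";
            case "xml":
                return "text/xml";
            default:
                // Assumption: other methods are not covered by the documented mapping.
                throw new IllegalArgumentException("Unsupported output method: " + outputMethod);
        }
    }

    public static void main(String[] args) {
        System.out.println(mimeTypeFor("html")); // prints "text/html"
        System.out.println(mimeTypeFor("text")); // prints "text/plain"
        System.out.println(mimeTypeFor("xml"));  // prints "text/xml"
    }
}
```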
If no XSL stylesheet is provided to the plugin, or if an error occurs during the transformation (corrupted XML or XSL, for instance), a TransformException is thrown.