In some cases, information can be retrieved from attached files and stored as regular Document properties. This information is referred to as metadata, since it describes the document (file) from which it is extracted. For example, given an MS Word document (file), we could retrieve information (metadata) such as author, title, and creation date simply by reading the document headers with an appropriate library (one that knows how to parse the internal structure of an MS Word document).
The metadata extraction feature is composed of several projects:
- nuxeo-platform-metadataext-api
- nuxeo-platform-metadataext-core
- nuxeo-platform-metadataext-facade
- nuxeo-platform-metadataext-plugins
The core part defines the metadata extraction component with a MetaDataExtractionManager implementation: org.nuxeo.ecm.platform.metadataext.services.MetaDataExtractionService.
The service is normally invoked by a dedicated CoreListener that passes a DocumentModel for which an extraction may be defined. Metadata can also be extracted by direct invocation from client code through the existing EJB3 facade.
The plugin part currently provides one plugin (more to come), which is specific to the MS Word document type. The plugin is a Transformation Plugin (as defined by the Transformation service) and is invoked as part of a defined transformation chain. The metadata extraction service calls the appropriate transformation if there is a contribution matching the given Document type, core event, and so on.
A contribution can be defined by adding an XML file with the following structure:
<extension target="org.nuxeo.ecm.platform.metadataext.services.MetaDataExtraction"
    point="extractions">
  <meta-data-extraction inputField="file:content"
      transformationName="MSWordMDExt">
    <outputParams>
      <param propertyName="dc:title">title</param>
      <param propertyName="dc:description">comments</param>
      <param propertyName="dc:created">creationDate</param>
      <param propertyName="dc:modified">creationDate</param>
      <param propertyName="dc:contributors">authors</param>
    </outputParams>
    <coreEvent>documentCreated</coreEvent>
    <coreEvent>documentModified</coreEvent>
    <docType>File</docType>
  </meta-data-extraction>
</extension>
The important aspects for the MetaDataExtraction service are:
- the specification of a source (blob) from which metadata will be extracted. This is defined by the value of inputField, which should be a blob document property.
- the mapping of output parameters. This defines a correspondence between the map entries returned as the result of the transformation (extraction) and the names of the Document properties to which the results will be written back.
- the list of core events for which the extraction will be performed.
- the list of document types to which the extraction can be applied.
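The output parameter mapping can be pictured as a simple key translation. The following is a minimal sketch in plain Java, not the service's actual code: the class OutputMappingSketch and the method applyMapping are hypothetical names, and plain maps stand in for the transformation result and for the document properties.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of the outputParams mapping; not part of Nuxeo. */
class OutputMappingSketch {

    /**
     * Translates extraction results into document property values.
     *
     * @param mapping   document property name -> key in the extraction result
     *                  (for example "dc:title" -> "title")
     * @param extracted key -> value map returned by the extraction
     * @return document property name -> value, for the keys that were found
     */
    static Map<String, Object> applyMapping(Map<String, String> mapping,
                                            Map<String, Object> extracted) {
        Map<String, Object> properties = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : mapping.entrySet()) {
            Object value = extracted.get(entry.getValue());
            if (value != null) {
                // Only properties whose key was actually extracted are set.
                properties.put(entry.getKey(), value);
            }
        }
        return properties;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("dc:title", "title");
        mapping.put("dc:description", "comments");

        Map<String, Object> extracted = new LinkedHashMap<>();
        extracted.put("title", "Quarterly report");

        System.out.println(applyMapping(mapping, extracted));
    }
}
```

Because the mapping is keyed by property name, the same extracted key can feed several properties, as creationDate does for dc:created and dc:modified in the contribution above.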
If the extraction requires additional information besides the provided blob, input parameters can be added to the extraction definition like this:
<inputParams>
  <param propertyName="dc:title">title</param>
</inputParams>
A map of input parameters is passed to the extraction plugin. Each entry is keyed by the value of the param tag (in this case 'title').
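As a rough picture of how such a map might be assembled, here is a minimal sketch in plain Java. It assumes that each entry's value is the current value of the document property named by propertyName, which the description implies but does not state. InputParamsSketch and buildInputParams are hypothetical names, and a plain map stands in for the document properties.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of the inputParams map; not part of Nuxeo. */
class InputParamsSketch {

    /**
     * Builds the options map handed to the extraction plugin.
     *
     * @param inputParams   document property name -> param tag value
     *                      (for example "dc:title" -> "title")
     * @param docProperties stand-in for the document's current property values
     * @return options keyed by the param tag value
     */
    static Map<String, Serializable> buildInputParams(
            Map<String, String> inputParams,
            Map<String, Serializable> docProperties) {
        Map<String, Serializable> options = new HashMap<>();
        for (Map.Entry<String, String> entry : inputParams.entrySet()) {
            // Key: the param tag value. Value (assumed): the property's current value.
            options.put(entry.getValue(), docProperties.get(entry.getKey()));
        }
        return options;
    }
}
```

The resulting map has the same type as the options argument of the plugin's transform method, Map<String, Serializable>.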
In extreme cases, metadata extraction can be performed in several steps, using the parameters extracted in one phase as input parameters for the next. This is achieved by specifying input parameters whose values were set by a previous extraction (transformation).
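The multi-step idea can be sketched in plain Java as follows. This is only an illustration under stated assumptions: TwoPhaseExtractionSketch and runPhase are hypothetical names, a plain map stands in for the document properties, and a java.util.function.Function stands in for the transformation. The point is that the second phase reads a property that the first phase wrote.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical illustration of chained extractions; not part of Nuxeo. */
class TwoPhaseExtractionSketch {

    /**
     * Runs one extraction phase against the stand-in document properties.
     *
     * @param docProperties property name -> value; updated in place
     * @param inputParams   property name -> option key passed to the extractor
     * @param outputParams  property name -> key in the extractor's result
     * @param extractor     stand-in for the transformation
     */
    static void runPhase(Map<String, Object> docProperties,
                         Map<String, String> inputParams,
                         Map<String, String> outputParams,
                         Function<Map<String, Object>, Map<String, Object>> extractor) {
        // Collect the input options from the current property values.
        Map<String, Object> options = new HashMap<>();
        inputParams.forEach((property, key) -> options.put(key, docProperties.get(property)));

        Map<String, Object> result = extractor.apply(options);

        // Write the mapped results back so that a later phase can read them.
        outputParams.forEach((property, key) -> {
            if (result.containsKey(key)) {
                docProperties.put(property, result.get(key));
            }
        });
    }
}
```

Calling runPhase twice on the same property map, with the second call listing as an input parameter a property that the first call set, reproduces the chaining described above.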
We first define a class implementing the org.nuxeo.ecm.platform.transform.interfaces.Plugin interface, for example:
public class MSWordMDExtractorPlugin extends AbstractPlugin {

    @Override
    public List<TransformDocument> transform(
            Map<String, Serializable> options,
            TransformDocument... sources) throws Exception {
        ...
    }
    ...
}
Overriding the transform method is mandatory. In this case there is only one TransformDocument as source. The blob from which metadata is extracted can be retrieved from the TransformDocument like this: Blob srcBlob = sources[0].getBlob();
After the useful information is extracted, the properties should be set on a TransformDocument that will be returned:
TransformDocumentImpl res = new TransformDocumentImpl();
res.setPropertyValue("title", extractedTitle);
Once we have a plugin class that knows how to extract information from a specific type of file (MS Word document, MS Excel, PDF, OO doc, etc.), we can add the contributions necessary for the metadata extraction to take place.
We first need to define the transformation plugin and the transformation contributions. After that, the defined transformation can be referenced in the metadata extraction specific contribution.
Step 1: define the transformation plugin:
<extension target="org.nuxeo.ecm.platform.transform.service.TransformService"
    point="plugins">
  <documentation>
    Set of default transformation plugins for metadata extraction.
  </documentation>
  <plugin name="MSWordMDExtPlugin"
      class="org.nuxeo.ecm.platform.metadataext.plugins.MSWordMDExtractorPlugin"
      destinationMimeType="application/msword">
    <sourceMimeType>application/msword</sourceMimeType>
  </plugin>
</extension>
Step 2: define the transformation:
<extension target="org.nuxeo.ecm.platform.transform.service.TransformService"
    point="transformers">
  <documentation>
    Set of default transformation chains for metadata extraction.
  </documentation>
  <transformer name="MSWordMDExt"
      class="org.nuxeo.ecm.platform.transform.transformer.TransformerImpl">
    <plugins>
      <plugin name="MSWordMDExtPlugin" />
    </plugins>
  </transformer>
</extension>
Step 3: define the metadata extraction specific contribution:
<extension
target="org.nuxeo.ecm.platform.metadataext.services.MetaDataExtraction"
point="extractions">
<meta-data-extraction inputField="file:content"
transformationName="MSWordMDExt">
...
This last contribution refers to the transformation defined above through its transformationName attribute.