The import / export service provides an API to export a set of documents from the repository in an XML format and then re-import them.
The service can also be used to create document trees in batch from valid import archives, or to provide a simple way of creating and retrieving repository data. This could be used, for example, to expose repository data through REST or raw HTTP requests.
The export and import mechanism is extensible, so you can easily create your own custom format for exported data. The default format provided by Nuxeo EP is described below.
The import / export module is part of the nuxeo-core-api bundle and is located under the org.nuxeo.ecm.core.api.io package.
A document is exported as a directory named after the document node name and containing a document.xml file, which holds the document metadata and properties as defined by the document schemas. Document blobs, if any, are by default exported as separate files inside the document directory. There is also an option to export blobs inlined as Base64-encoded data inside document.xml.
When exporting trees, document children are put as subdirectories inside the parent document's directory.
Optionally, each Nuxeo service that stores persistent data related to documents, such as the workflow, relation or annotation services, may also export its own data inside the document folder as XML files.
A document tree is exported as a directory tree. Here is an example of an export tree containing relation information for a workspace named workspace1:
+ workspace1
    + document.xml
    + relations.xml
    + doc1
        + document.xml
        + relations.xml
    + doc2
        + document.xml
        + relations.xml
        + file1.blob
    + doc3
        + document.xml
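As a rough illustration of this layout, the sketch below walks an export directory and lists every folder that holds a document.xml file. It relies only on java.nio.file and is not part of the Nuxeo API; the class name ExportTreeLister and its method are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not a Nuxeo class): finds exported documents on disk.
class ExportTreeLister {

    /**
     * Returns the directories under root that contain a document.xml file,
     * as sorted paths relative to root ("." stands for root itself).
     */
    static List<String> listDocuments(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isDirectory)
                    // a directory is an exported document only if it holds document.xml
                    .filter(dir -> Files.isRegularFile(dir.resolve("document.xml")))
                    .map(dir -> root.relativize(dir).toString().replace('\\', '/'))
                    .map(rel -> rel.isEmpty() ? "." : rel)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Calling ExportTreeLister.listDocuments on the workspace1 directory of the example above would be expected to list the workspace itself plus doc1, doc2 and doc3.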
Here is the XML that corresponds to a document containing a blob. The blob is exported as a separate file:
<?xml version="1.0" encoding="UTF-8"?>
<document repository="default" id="633cf240-0c03-4326-8b3b-0960cf1a4d80">
  <system>
    <type>File</type>
    <path>/default-domain/workspaces/ws/test</path>
    <lifecycle-state>project</lifecycle-state>
    <lifecycle-policy>default</lifecycle-policy>
    <access-control>
      <acl name="inherited">
        <entry principal="administrators" permission="Everything" grant="true"/>
        <entry principal="members" permission="Read" grant="true"/>
        <entry principal="members" permission="Version" grant="true"/>
        <entry principal="Administrator" permission="Everything" grant="true"/>
      </acl>
    </access-control>
  </system>
  <schema xmlns="http://www.nuxeo.org/ecm/schemas/files/" name="files">
    <files/>
  </schema>
  <schema xmlns:dc="http://www.nuxeo.org/ecm/schemas/dublincore/" name="dublincore">
    <dc:valid/>
    <dc:issued/>
    <dc:coverage></dc:coverage>
    <dc:title>test</dc:title>
    <dc:modified>Fri Sep 21 20:49:26 CEST 2007</dc:modified>
    <dc:creator>Administrator</dc:creator>
    <dc:subjects/>
    <dc:expired/>
    <dc:language></dc:language>
    <dc:rights>test</dc:rights>
    <dc:contributors>
      <item>Administrator</item>
    </dc:contributors>
    <dc:created>Fri Sep 21 20:48:53 CEST 2007</dc:created>
    <dc:source></dc:source>
    <dc:description/>
    <dc:format></dc:format>
  </schema>
  <schema xmlns="http://www.nuxeo.org/ecm/schemas/file/" name="file">
    <content>
      <encoding></encoding>
      <mime-type>application/octet-stream</mime-type>
      <data>cd1f161f.blob</data>
    </content>
    <filename>error.txt</filename>
  </schema>
  <schema xmlns="http://project.nuxeo.com/geide/schemas/uid/" name="uid">
    <minor_version>0</minor_version>
    <uid/>
    <major_version>1</major_version>
  </schema>
  <schema xmlns="http://www.nuxeo.org/ecm/schemas/common/" name="common">
    <icon-expanded/>
    <icon/>
    <size/>
  </schema>
</document>
You can see that the generated document contains one <system> section and one or more <schema> sections. The system section contains all system (internal) document properties, such as the document type, path, lifecycle state and access control configuration. For each schema defined by the document type there is a schema entry containing the document properties that belong to that schema. The XSD that corresponds to a schema can be used to validate the content of its schema section. However, this holds only when blobs are inlined. By default, for performance reasons, blobs are written outside the XML file, each in its own file.
So instead of encoding the blob in the XML file, a reference to an external file is stored: cd1f161f.blob
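To make the structure described above concrete, here is a minimal sketch that parses such a document.xml with the JDK's built-in DOM parser and extracts the document type, the path and the names of the schema sections. It deliberately avoids the Nuxeo and dom4j APIs; the class ExportedDocumentSummary is a hypothetical name used only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical value class (not a Nuxeo class) summarizing an exported document.
class ExportedDocumentSummary {
    final String type;
    final String path;
    final List<String> schemaNames;

    private ExportedDocumentSummary(String type, String path, List<String> schemaNames) {
        this.type = type;
        this.path = path;
        this.schemaNames = schemaNames;
    }

    /** Parses the text of a document.xml file; assumes a <system> section is present. */
    static ExportedDocumentSummary parse(String documentXml)
            throws ParserConfigurationException, SAXException, IOException {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(documentXml)));
        // system (internal) properties live under the single <system> element
        Element system = (Element) dom.getElementsByTagName("system").item(0);
        String type = text(system, "type");
        String path = text(system, "path");
        // one <schema name="..."> element per schema of the document type
        List<String> names = new ArrayList<>();
        NodeList schemas = dom.getElementsByTagName("schema");
        for (int i = 0; i < schemas.getLength(); i++) {
            names.add(((Element) schemas.item(i)).getAttribute("name"));
        }
        return new ExportedDocumentSummary(type, path, names);
    }

    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}
```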
Here is how the same blob will be serialized when inlining blobs (an option of the repository reader):
<schema xmlns="http://www.nuxeo.org/ecm/schemas/file/" name="file">
  <content>
    <encoding></encoding>
    <mime-type>application/octet-stream</mime-type>
    <data>
      b3JnLmpib3NzLnJlbW90aW5nLkNhbm5vdENvbm5lY3RFeGNlcHRpb246IENhbiBub3QgZ2V0IGNv
      bm5lY3Rpb24gdG8gc2VydmVyLiAgUHJvYmxlbSBlc3RhYmxpc2hpbmcgc29ja2V0IGNvbm5lY3Rp
      [...]
    </data>
  </content>
  <filename>error.txt</filename>
</schema>
There is an option to inline the blob content in the XML file as Base64-encoded text. This is less efficient than writing the raw blob data to external files, but it is the canonical format for exporting document data when the document schemas are to be validated against their XSDs, and it encodes the entire document content in a single file in a well-known format.
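The inlined form can be reproduced with the JDK alone. The sketch below assumes that the data element holds standard Base64 wrapped into lines, as the sample suggests; whether Nuxeo uses exactly the MIME line length is an assumption. InlineBlobCodec is a hypothetical helper, not a Nuxeo class.

```java
import java.util.Base64;

// Hypothetical helper (not a Nuxeo class) for Base64-inlined blob content.
class InlineBlobCodec {

    /** Encodes raw blob bytes as Base64 text wrapped into lines of at most 76 characters. */
    static String encode(byte[] blob) {
        return Base64.getMimeEncoder().encodeToString(blob);
    }

    /**
     * Decodes the text content of a data element. The MIME decoder skips
     * line breaks and indentation, so the element text can be passed as-is.
     */
    static byte[] decode(String dataText) {
        return Base64.getMimeDecoder().decode(dataText);
    }
}
```

Decoding the first line of the sample above with InlineBlobCodec.decode yields text that begins with org.jboss.remoting.CannotConnectException, which matches the exported error.txt file name.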
By default, blobs are not inlined when exporting documents from the repository. To activate inlining, call the following method on the DocumentModelReader you are using to fetch data from the repository:
reader.setInlineBlobs(boolean inlineBlobs);
An export process is a chain of three sub processes:
fetching data from repository
transforming the data if necessary
writing the data to an external system
In the same way an import can be defined as a chain of three sub processes:
fetching data from external sources
transforming the data if necessary
writing the data into the repository
We call the process chain used to perform imports and exports a Document Pipe.
In both cases (imports and exports) a document pipe deals with the same types of objects:
A document reader
Zero or more document transformers
A document writer
So the DocumentPipe uses a reader to fetch data, which is passed through the registered transformers and then written out by a document writer.
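The reader, transformer and writer chain can be sketched in a few lines of plain Java. The class below is a simplified, hypothetical stand-in and not the Nuxeo DocumentPipe API: it works on any document type D, takes an Iterator as the reader, UnaryOperator instances as transformers and a Consumer of pages as the writer, and hands documents to the writer one page at a time.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a document pipe; this is NOT the Nuxeo DocumentPipe API.
class SketchPipe<D> {
    private final int pageSize;
    private final List<UnaryOperator<D>> transformers = new ArrayList<>();
    private Iterator<D> reader;         // source of documents
    private Consumer<List<D>> writer;   // destination, receives one page at a time

    SketchPipe(int pageSize) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be at least 1");
        }
        this.pageSize = pageSize;
    }

    void setReader(Iterator<D> reader) { this.reader = reader; }

    void addTransformer(UnaryOperator<D> transformer) { transformers.add(transformer); }

    void setWriter(Consumer<List<D>> writer) { this.writer = writer; }

    /** Reads every document, applies the transformers in order, writes in pages. */
    void run() {
        List<D> page = new ArrayList<>(pageSize);
        while (reader.hasNext()) {
            D doc = reader.next();
            for (UnaryOperator<D> transformer : transformers) {
                doc = transformer.apply(doc);
            }
            page.add(doc);
            if (page.size() == pageSize) {
                writer.accept(page);
                page = new ArrayList<>(pageSize);
            }
        }
        if (!page.isEmpty()) {
            writer.accept(page); // flush the last, partially filled page
        }
    }
}
```

With a page size of 2 and three input documents, the writer in this sketch is called twice: once with two documents and once with the remaining one.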
See the API Examples for examples on how to use a Document Pipe.
A document reader is responsible for reading some input data and converting it into a DOM representation. The DOM representation uses the format explained in the Document XML section. Currently, dom4j Documents are used as the DOM objects.
For example, a reader may extract documents from the repository and output them as XML DOM objects. Or it may read files from a file system and convert them into DOM objects so that they can be imported into a Nuxeo repository.
To change the way documents are extracted and transformed into a DOM representation, you can implement your own Document Reader. Currently Nuxeo provides several flavors of document readers:
Repository readers - this category of readers is used to extract data from the repository as DOM objects. All of these readers extend DocumentModelReader:
SingleDocumentReader - reads a single document given its ID and exports it as a dom4j Document.
DocumentChildrenReader - reads the children of a given document and exports each one as a dom4j Document.
DocumentTreeReader - reads the entire subtree rooted in the given document and exports each node in the tree as a dom4j Document.
DocumentListReader - takes a list of document models as input and exports them as dom4j Documents. This is useful for exporting a search result, for example.
External readers - used to read data as DOM objects from external sources such as file systems or databases. The following readers are provided:
XMLDirectoryReader - reads a directory tree in the format supported by Nuxeo (as described in the Export Format section). This can be used to import extracted Nuxeo archives or hand-created document directories.
NuxeoArchiveReader - reads Nuxeo EP exported archives to import them into a repository. Note that only zip archives created by the Nuxeo exporter are supported.
ZipReader - reads a zip archive and outputs DOM objects. This reader can read both Nuxeo zip archives and regular (hand-made) zip archives. Reading a Nuxeo archive is more efficient, because the entries of a Nuxeo zip archive are added in a predefined order that makes it possible to read the entire archive tree on the fly, without first unzipping the archive content onto the file system. If the zip archive is not recognized as a Nuxeo archive, it is extracted to a temporary folder on the file system and the XMLDirectoryReader is used to read the content.
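Reading an archive on the fly, as described for the ZipReader, can be illustrated with the JDK's ZipInputStream, which visits entries in archive order without extracting anything to disk. This is only a sketch of the general technique, not the ZipReader implementation; ZipStreamScanner is a hypothetical name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper (not a Nuxeo class): scans a zip stream entry by entry.
class ZipStreamScanner {

    /** Returns the names of all document.xml entries, in archive order. */
    static List<String> documentEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // entries are consumed sequentially; nothing is written to the file system
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                boolean isDocumentXml = name.equals("document.xml") || name.endsWith("/document.xml");
                if (!entry.isDirectory() && isDocumentXml) {
                    names.add(name);
                }
            }
        }
        return names;
    }
}
```

A sequential scan like this only works for tree reconstruction when parents appear before their children in the archive, which is why a predefined entry order matters.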
To create a custom reader you need to implement the interface org.nuxeo.ecm.core.api.io.DocumentReader
A document writer is responsible for writing the documents that exit the pipe to a document store. This store can be a file system, a Nuxeo repository or any database or data storage, as long as you have a writer that supports it.
The following DocumentWriters are provided by Nuxeo:
Repository Writers - these write documents to a Nuxeo repository. They are useful for performing imports into the repository.
DocumentModelWriter - writes documents inside a Nuxeo repository. This writer creates a new document model for each of the imported documents.
DocumentModelUpdater - writes documents inside a Nuxeo repository. This writer updates documents that have the same ID as the imported ones, or creates new documents otherwise.
External Writers - these write documents to an external storage. They are useful for performing exports from the repository.
XMLDocumentWriter - writes a document as an XML file with blobs inlined.
XMLDocumentTreeWriter - writes a list of documents inside a single XML file with blobs inlined. The document tags are enclosed in a root tag <documents> .. </documents>.
XMLDirectoryWriter - writes documents as a folder tree on the file system. To read back the exported tree you may use XMLDirectoryReader.
NuxeoArchiveWriter - writes documents inside a Nuxeo zip archive. To read back the archive you may use NuxeoArchiveReader.
To create a custom writer you need to implement the interface org.nuxeo.ecm.core.api.io.DocumentWriter
Document transformers are used to transform documents after they enter the pipe and before they are sent to the writer. This way you can remove, add or modify properties of the documents, or other information contained in the exported DOM object.
As documents are expressed as XML DOM objects you can also use XSLT transformations inside your transformer.
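As an illustration of the XSLT approach, the sketch below applies an identity stylesheet with one extra empty template that drops the access-control element from a document. It uses the JDK's javax.xml.transform API on plain XML strings rather than dom4j objects, and the class AclStrippingTransform is a hypothetical name; wiring such a transformation into a DocumentTransformer is left out.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper (not a Nuxeo class): removes ACLs from an exported document.
class AclStrippingTransform {

    // Identity transform plus an empty template that matches <access-control>,
    // so that element and its content are not copied to the output.
    private static final String XSLT =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:template match='@*|node()'>"
          + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
          + "</xsl:template>"
          + "<xsl:template match='access-control'/>"
          + "</xsl:stylesheet>";

    /** Returns the document XML with every access-control element removed. */
    static String stripAcl(String documentXml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(documentXml)), new StreamResult(out));
        return out.toString();
    }
}
```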
To create a custom transformer you need to implement the interface org.nuxeo.ecm.core.api.io.DocumentTransformer
Performing exports and imports can be done by following these steps:
Instantiate a new DocumentPipe:
// create a pipe that will process 10 documents on each iteration
DocumentPipe pipe = new DocumentPipeImpl(10);
The page size argument is important when you run the pipe on a machine other than the one holding the source of the data (the one the reader fetches data from). This way you can fetch several documents at once, improving performance.
Create a new DocumentReader that will be used to fetch data and put it into the pipe. Depending on the data you want to import, you can choose one of the existing DocumentReader implementations or write your own if needed:
reader = new DocumentTreeReader(docMgr, src, true);
pipe.setReader(reader);
In this example we use a DocumentTreeReader, which reads an entire subtree from the repository rooted in the 'src' document.
The docMgr argument represents a session to the repository, 'src' is the root of the tree to export, and the 'true' flag means that the root is excluded from the exported tree.
Create a DocumentWriter that will be used to write the documents output by the pipe.
writer = new XMLDirectoryWriter(new File("/tmp/export"));
pipe.setWriter(writer);
In this example we instantiate a writer that will write exported data onto the file system as a folder tree.
Optionally, you may add one or more Document Transformers to transform documents that enter the pipe.
MyTransformer transformer = new MyTransformer();
pipe.addTransformer(transformer);
And now run the pipe ...
pipe.run();
// Export the tree rooted in 'src' to a Nuxeo zip archive
DocumentReader reader = null;
DocumentWriter writer = null;
try {
    DocumentModel src = getTestWorkspace();
    reader = new DocumentTreeReader(docMgr, src, true);
    writer = new NuxeoArchiveWriter(new File("/tmp/export.zip"));
    // creating a pipe
    DocumentPipe pipe = new DocumentPipeImpl(10);
    pipe.setReader(reader);
    pipe.setWriter(writer);
    pipe.run();
} finally {
    // always release the reader and the writer
    if (reader != null) {
        reader.close();
    }
    if (writer != null) {
        writer.close();
    }
}
// Import a zip archive into the repository
DocumentReader reader = null;
DocumentWriter writer = null;
try {
    DocumentModel src = getTestWorkspace();
    reader = new ZipReader(new File("/tmp/export.zip"));
    writer = new DocumentModelWriter(docMgr, "import-domain/Workspaces/ws");
    // creating a pipe
    DocumentPipe pipe = new DocumentPipeImpl(10);
    pipe.setReader(reader);
    pipe.setWriter(writer);
    pipe.run();
} finally {
    // always release the reader and the writer
    if (reader != null) {
        reader.close();
    }
    if (writer != null) {
        writer.close();
    }
}
// Export a single document as one XML file with inlined blobs
DocumentReader reader = null;
DocumentWriter writer = null;
try {
    DocumentModel src = getTestWorkspace();
    reader = new SingleDocumentReader(docMgr, src);
    // inline blobs (setInlineBlobs is defined on DocumentModelReader,
    // which all repository readers extend)
    ((DocumentModelReader) reader).setInlineBlobs(true);
    writer = new XMLDocumentWriter(new File("/tmp/export.xml"));
    // creating a pipe
    DocumentPipe pipe = new DocumentPipeImpl();
    // optionally adding a transformer
    pipe.addTransformer(new MyTransformer());
    pipe.setReader(reader);
    pipe.setWriter(writer);
    pipe.run();
} finally {
    // always release the reader and the writer
    if (reader != null) {
        reader.close();
    }
    if (writer != null) {
        writer.close();
    }
}