The org.nuxeo.ecm.platform.mimetype.* packages provide the tools needed to find the mimetype of a document. They offer two guessing approaches: using file extensions, and using the guessing library Jmimemagic (a third-party tool providing enhanced detection methods, based on the binary signature of files).
All the recognized mimetypes are stored in the MimetypeRegistry. Each mimetype definition is a contribution to the mimetype extension point of the org.nuxeo.ecm.platform.mimetype.service.MimetypeRegistryService component.
<component name="org.nuxeo.ecm.platform.mimetype.service.MimetypeRegistryService">
  <extension target="org.nuxeo.ecm.platform.mimetype.service.MimetypeRegistryService"
      point="mimetype">
    <mimetype normalized="application/vnd.oasis.opendocument.text"
        binary="true" iconPath="odt.png" oleSupported="true">
      <mimetypes>
        <mimetype>application/vnd.oasis.opendocument.text</mimetype>
      </mimetypes>
      <extensions>
        <extension>odt</extension>
      </extensions>
    </mimetype>
  </extension>
</component>
A mimetype node, bound to a MimetypeDescriptor, defines a normalized mimetype with the following information:
normalized: the mimetype entry that is described and that will be returned
binary: a boolean that indicates whether the file is a binary one
iconPath: the filename of the image representing the icon
oleSupported: whether this mimetype is supported by the oleExtract transform plugin (default is false)
onlineEditable: whether this mimetype is supported by online Edit (default is false)
mimetypes: list of mimetypes bound to this normalized mimetype
extensions: list of extensions bound to this normalized mimetype
Another extension point, named extension, allows ambiguous extensions to be registered in order to force mimetype sniffing.
<component name="org.nuxeo.ecm.platform.mimetype.service.MimetypeRegistryService">
  <extension target="org.nuxeo.ecm.platform.mimetype.service.MimetypeRegistryService"
      point="extension">
    <fileExtension name="xml" mimetype="text/xml" ambiguous="true" />
  </extension>
</component>
A fileExtension node, bound to ExtensionDescriptor, has the following attributes:
name: the file extension
mimetype: the associated mimetype
ambiguous: forces the mimetype sniffing
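The effect of the ambiguous flag can be illustrated with a minimal sketch. It is not the Nuxeo implementation: the class AmbiguousExtensionSketch, its methods, and the fallback order (sniff first for ambiguous or unknown extensions, then fall back to the registered mimetype) are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for the extension table; not the Nuxeo implementation. */
final class AmbiguousExtensionSketch {

    /** Mirrors the mimetype and ambiguous attributes of a fileExtension node. */
    private static final class Entry {
        final String mimetype;
        final boolean ambiguous;

        Entry(String mimetype, boolean ambiguous) {
            this.mimetype = mimetype;
            this.ambiguous = ambiguous;
        }
    }

    private final Map<String, Entry> byExtension = new HashMap<>();

    public void register(String name, String mimetype, boolean ambiguous) {
        byExtension.put(name.toLowerCase(Locale.ROOT), new Entry(mimetype, ambiguous));
    }

    /**
     * Trusts the extension when it is registered and not ambiguous.
     * Otherwise asks the sniffer (assumed to return null when it cannot decide)
     * and falls back to the registered mimetype, if any.
     */
    public String resolve(String filename, Supplier<String> sniffer) {
        int dot = filename.lastIndexOf('.');
        String extension = dot < 0 ? "" : filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        Entry entry = byExtension.get(extension);
        if (entry != null && !entry.ambiguous) {
            return entry.mimetype;
        }
        String sniffed = sniffer.get();
        if (sniffed != null) {
            return sniffed;
        }
        return entry != null ? entry.mimetype : null;
    }

    public static void main(String[] args) {
        AmbiguousExtensionSketch registry = new AmbiguousExtensionSketch();
        registry.register("odt", "application/vnd.oasis.opendocument.text", false);
        registry.register("xml", "text/xml", true);
        // The sniffers below are fixed lambdas standing in for a content-based detection.
        System.out.println(registry.resolve("report.odt", () -> "application/zip"));
        System.out.println(registry.resolve("drawing.xml", () -> "image/svg+xml"));
    }
}
```

With this design the cheap extension lookup stays the default, and the more expensive content analysis is only paid for extensions that are known to be unreliable.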
At the MimetypeRegistry level, methods are provided to register (or unregister) mimetypes and extensions dynamically from code:
public void testMimetypeRegistration() {
    MimetypeEntry mimetype = getMimetypeSample();
    mimetypeRegistry.registerMimetype(mimetype);
    assertEquals(
            mimetypeRegistry.getMimetypeEntryByName(mimetype.getNormalized()),
            mimetype);
}

private static MimetypeEntryImpl getMimetypeSample() {
    String normalizedMimetype = "application/msword";

    List<String> mimetypes = new ArrayList<String>();
    mimetypes.add("application/msword");
    // fake
    mimetypes.add("app/whatever-word");

    List<String> extensions = new ArrayList<String>();
    extensions.add("doc");
    extensions.add("xml");

    String iconPath = "icons/doc.png";
    boolean binary = true;
    boolean onlineEditable = true;
    boolean oleSupported = true;

    return new MimetypeEntryImpl(normalizedMimetype, mimetypes, extensions,
            iconPath, binary, onlineEditable, oleSupported);
}
The registerMimetype method takes a MimetypeEntry argument and adds the given entry to the registry. After the mimetype has been added, the test checks that the MimetypeEntry retrieved by its normalized name is equal to the one that was registered. The getMimetypeSample method, a helper of the TestMimetypeRegistryService test class, shows how a MimetypeEntry can be defined in code.
Based on the previous entry definition, two registries are built. They make it possible to retrieve the normalized mimetype either from the file extension (efficient) or by sniffing (accurate).
This section describes the underlying process involved in mimetype detection.
File extension detection relies on the filename, which has to be correct. As it is a simple association between a string and a mimetype, it is not developed further here. Just note that the file has to be correctly named and has to have an extension for a mimetype to be retrieved with this method.
Sniffing analyses the file content itself in order to guess the mimetype. The third-party tool Jmimemagic is used for this, and it has been widely enhanced with new supported mimetypes and detectors. Jmimemagic uses two files, magic.xml and magic_1_0.dtd. We redefine the XML one to add new detections (Jmimemagic does not define an extension point for this).
Basically, the default mimetype sniffing is based on searching a sequence of characters (or binary values) at a specified offset.
<match>
  <mimetype>application/pdf</mimetype>
  <extension>pdf</extension>
  <description>PDF document</description>
  <test offset="0" type="string" comparator="=">%PDF-</test>
</match>
A match is the definition of a magic entry. It contains a mimetype, an extension and a textual description of the defined mimetype. A test node, containing the operation to perform, is also defined. Here it declares that, for an application/pdf mimetype, the file has to contain the string %PDF- at offset 0.
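What such a test amounts to can be sketched in a few lines of plain Java. This is only an illustration of comparing a byte sequence at a given offset, not the Jmimemagic code; the class OffsetMagicSketch and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of an offset-based magic test; not the Jmimemagic code. */
final class OffsetMagicSketch {

    /** Returns true when the signature is found in data, starting exactly at offset. */
    public static boolean matchesAt(byte[] data, int offset, byte[] signature) {
        if (offset < 0 || data.length - offset < signature.length) {
            return false; // not enough bytes to hold the signature at this offset
        }
        for (int i = 0; i < signature.length; i++) {
            if (data[offset + i] != signature[i]) {
                return false;
            }
        }
        return true;
    }

    /** Equivalent of the PDF test above: the string %PDF- at offset 0. */
    public static boolean looksLikePdf(byte[] data) {
        return matchesAt(data, 0, "%PDF-".getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        byte[] header = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(header));
    }
}
```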
While this method is usually suitable for a lot of files (i.e. one can find some invariants in the format), a simple offset (or a combination of offsets) is not enough for more complex formats, and the detection algorithm has to be refined. That is what detectors are made for. We have defined some for the two major office file formats, MsOffice and OpenOffice.org.
For OpenOffice.org, the zip detection is enhanced:
<match>
  <mimetype>application/zip</mimetype>
  <extension>zip</extension>
  <description>Zip archive data</description>
  <test offset="0" type="string" comparator="=">PK\003\004</test>
  <match-list>
    <!-- opendocument & OOo 1.x -->
    <match>
      <mimetype>OOo</mimetype>
      <extension>OOo</extension>
      <description>OOo 1.x and OpenDocument file</description>
      <test type="detector" offset="0" length="" bitmask="" comparator="=">
        org.nuxeo.ecm.platform.mimetype.detectors.OOoMimetypeSniffer
      </test>
    </match>
  </match-list>
</match>
First, a simple offset detection is performed to qualify a zip file; then a sub-match is defined. The detector type indicates that org.nuxeo.ecm.platform.mimetype.detectors.OOoMimetypeSniffer has to be called, and it is the sniffer's responsibility to give all the valid information (mimetype, extension, description) if the file is of the correct type.
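To give an idea of what a zip-based detector can rely on: an OpenDocument package stores its mimetype in an entry named mimetype, placed first in the archive. The sketch below reads that entry with java.util.zip only. It is not the OOoMimetypeSniffer source; the class OpenDocumentSniffSketch and its helper zipOf are hypothetical names used for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical illustration of a zip-based detector; not the OOoMimetypeSniffer source. */
final class OpenDocumentSniffSketch {

    /** Returns the content of a leading "mimetype" entry, or null if there is none. */
    public static String sniff(byte[] archive) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry first = zip.getNextEntry();
            if (first == null || !"mimetype".equals(first.getName())) {
                return null;
            }
            ByteArrayOutputStream content = new ByteArrayOutputStream();
            byte[] buffer = new byte[256];
            int read;
            while ((read = zip.read(buffer)) != -1) {
                content.write(buffer, 0, read);
            }
            return new String(content.toByteArray(), StandardCharsets.US_ASCII).trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a one-entry archive in memory, only to feed the demonstration. */
    public static byte[] zipOf(String entryName, String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(text.getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] odt = zipOf("mimetype", "application/vnd.oasis.opendocument.text");
        System.out.println(sniff(odt));
    }
}
```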
For MS-Office files, the "magic numbers" (the values to be found at a certain offset) are not that clear, as the magic number defined by Jmimemagic (based on the resource file of the Linux file command) is the same for all MS-Office applications. We therefore invoke, for each application, a detector that uses the POI library to determine what kind of file we are dealing with.
<match>
  <mimetype>application/msword</mimetype>
  <extension>doc</extension>
  <description>Microsoft Office Document</description>
  <test offset="0" type="string" comparator="=">\320\317\021\340\241\261</test>
  <match-list>
    <!-- XLS file by detector -->
    <match>
      <mimetype>application/vnd.ms-excel</mimetype>
      <extension>xls</extension>
      <description>Excel File</description>
      <test type="detector" offset="0" length="" bitmask="" comparator="=">
        org.nuxeo.ecm.platform.mimetype.detectors.XlsMimetypeSniffer
      </test>
    </match>
    <!-- PPT file by detector -->
    <match>
      <mimetype>application/vnd.ms-powerpoint</mimetype>
      <extension>ppt</extension>
      <description>Powerpoint File</description>
      <test type="detector" offset="0" length="" bitmask="" comparator="=">
        org.nuxeo.ecm.platform.mimetype.detectors.PptMimetypeSniffer
      </test>
    </match>
  </match-list>
</match>
Once a Microsoft Office Document has been detected at offset 0 in the first match, two sub-match detectors are defined, for application/vnd.ms-excel (org.nuxeo.ecm.platform.mimetype.detectors.XlsMimetypeSniffer) and application/vnd.ms-powerpoint (org.nuxeo.ecm.platform.mimetype.detectors.PptMimetypeSniffer). If neither returns a correct mimetype, the only remaining possibility is application/msword. (This may be slightly refactored for simplicity in the near future.)
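The control flow described here (shared signature first, then sub-detectors, then a default) can be sketched as follows. The class OfficeFallbackSketch is hypothetical, and the sub-detectors are plain functions standing in for the POI-based sniffers, which are not reproduced. The signature bytes are the octal sequence of the match above written in hexadecimal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of the match / sub-match fallback; not the Jmimemagic engine. */
final class OfficeFallbackSketch {

    /** \320\317\021\340\241\261 from the match above, in hexadecimal. */
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1
    };

    /** Each sub-detector returns a mimetype, or null when the file is not its own. */
    private final List<Function<byte[], String>> subDetectors = new ArrayList<>();

    public void addSubDetector(Function<byte[], String> detector) {
        subDetectors.add(detector);
    }

    public String detect(byte[] data) {
        if (!startsWith(data, OLE2_MAGIC)) {
            return null; // the parent match failed, sub-matches are not evaluated
        }
        for (Function<byte[], String> detector : subDetectors) {
            String mimetype = detector.apply(data);
            if (mimetype != null) {
                return mimetype;
            }
        }
        return "application/msword"; // mimetype of the parent match
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        OfficeFallbackSketch chain = new OfficeFallbackSketch();
        // Toy detector: a marker byte after the signature stands in for a real POI check.
        chain.addSubDetector(d -> d.length > 6 && d[6] == 'X' ? "application/vnd.ms-excel" : null);
        byte[] sample = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 'W'};
        System.out.println(chain.detect(sample));
    }
}
```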
A detector is a class that implements net.sf.jmimemagic.MagicDetector. The public process method has to detect whether the file fulfills the condition. If successful, it returns the mimetypes supported by this file. The public methods getHandledExtensions and getHandledTypes define the String arrays that are used by Jmimemagic to build the final match.
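The shape of a detector can be sketched with a local stand-in interface that keeps only the three methods named above. It deliberately does not reproduce net.sf.jmimemagic.MagicDetector, whose exact signatures should be checked in the Jmimemagic sources. SketchDetector and SvgSketchDetector are hypothetical names, and the SVG check is a toy condition chosen only to keep the example short.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical, simplified stand-in; NOT the real net.sf.jmimemagic.MagicDetector. */
interface SketchDetector {
    /** Returns the detected mimetypes, or an empty array when the data does not qualify. */
    String[] process(byte[] data);

    String[] getHandledExtensions();

    String[] getHandledTypes();
}

/** Toy detector: qualifies the data as SVG when an svg tag opens near the beginning. */
final class SvgSketchDetector implements SketchDetector {

    @Override
    public String[] process(byte[] data) {
        int length = Math.min(data.length, 1024); // only inspect the head of the file
        String head = new String(data, 0, length, StandardCharsets.ISO_8859_1);
        return head.contains("<svg") ? getHandledTypes() : new String[0];
    }

    @Override
    public String[] getHandledExtensions() {
        return new String[] {"svg"};
    }

    @Override
    public String[] getHandledTypes() {
        return new String[] {"image/svg+xml"};
    }

    public static void main(String[] args) {
        byte[] svg = "<?xml version=\"1.0\"?><svg></svg>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(String.join(",", new SvgSketchDetector().process(svg)));
    }
}
```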
All the detection is based on a MimetypeRegistry-like object. When invoked, the registry is populated with the information described previously. The registry implements the interface org.nuxeo.ecm.platform.mimetype.interfaces.MimetypeRegistry.
Once the registry is available, directly or from a bean, any File or Blob can be analyzed, and information such as the mimetype name or the list of supported extensions can be retrieved.
import org.nuxeo.ecm.platform.mimetype.ejb.MimetypeRegistryBean;
...
private MimetypeRegistryBean mimetypeRegistry;
...
public void testSniffWordFromFile() throws Exception {
    File file = FileUtils.getResourceFileFromContext("test-data/hello.doc");

    String mimetype = mimetypeRegistry.getMimetypeFromFile(file);
    assertEquals("application/msword", mimetype);

    List<String> extensions = mimetypeRegistry.getExtensionsFromMimetypeName(mimetype);
    assertTrue(extensions.contains("doc"));
}
In the above example, the mimetypeRegistry object used is a MimetypeRegistryBean. The purpose is to sniff the mimetype of a given file. The MS-Word file is first read and the getMimetypeFromFile method is called. Once the mimetype is retrieved, getExtensionsFromMimetypeName can be called to get the associated extensions from the registry.
Note that the API of org.nuxeo.ecm.platform.mimetype.interfaces.MimetypeRegistry contains various ways to ask for a mimetype, dealing with File or Blob objects, with or without default responses. It is worth a look to avoid unneeded work.