Mimetype detection

Table of Contents

21.1. Introduction
21.2. MimetypeRegistry
21.3. Mimetype sniffing
21.4. Invoking the mimetype detection

21.1. Introduction

The org.nuxeo.ecm.platform.mimetype.* packages provide the tools needed to find the mimetype of a document. They offer two guessing approaches: one based on file extensions, and one based on the guessing library Jmimemagic (a third-party tool, enhanced here, that provides detection methods based on the binary signature of files).

21.2. MimetypeRegistry

All the recognized mimetypes are stored in the MimetypeRegistry. Each mimetype definition is a contribution to the mimetype extension point of the org.nuxeo.ecm.platform.mimetype.service.MimetypeRegistryService component.

<component
  name="org.nuxeo.ecm.platform.mimetype.service.MimetypeRegistryService">
  <extension
    target="org.nuxeo.ecm.platform.mimetype.service.MimetypeRegistryService"
    point="mimetype">
    <mimetype normalized="application/vnd.oasis.opendocument.text"
      binary="true" iconPath="odt.png" oleSupported="true">
      <mimetypes>
        <mimetype>application/vnd.oasis.opendocument.text</mimetype>
      </mimetypes>
      <extensions>
        <extension>odt</extension>
      </extensions>
    </mimetype>
  </extension>
</component>

A mimetype node, bound to a MimetypeDescriptor, defines a normalized mimetype with the following information:

normalized: the mimetype entry that is described and that will be returned
binary: a boolean that indicates if the file is a binary one
iconPath: the filename of the image representing the icon
oleSupported: this file mimetype is supported by the oleExtract transform plugin - default is False
onlineEditable: this mimetype is supported by online Edit - default is False
mimetypes: list of mimetypes bound to this normalized mimetype
extensions: list of extensions bound to this normalized mimetype

Another extension point, named extension, allows registering ambiguous file extensions in order to force mimetype sniffing.

<component
  name="org.nuxeo.ecm.platform.mimetype.service.MimetypeRegistryService">
  <extension
    target="org.nuxeo.ecm.platform.mimetype.service.MimetypeRegistryService"
    point="extension">
    <fileExtension name="xml" mimetype="text/xml" ambiguous="true" />
  </extension>
</component>

A fileExtension node, bound to ExtensionDescriptor, has the following attributes:

name: the file extension
mimetype: the associated mimetype
ambiguous: when true, forces mimetype sniffing for this extension
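
The role of the ambiguous flag can be illustrated with a minimal sketch. It is not the Nuxeo implementation: the ExtensionLookupSketch class, its ExtensionInfo holder and the resolve method are hypothetical names, and the sniffer is passed in as a plain Supplier so that the example stays self-contained.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch: extension lookup that defers to sniffing for ambiguous extensions. */
class ExtensionLookupSketch {

    /** Mirrors the three fileExtension attributes (name, mimetype, ambiguous). */
    static final class ExtensionInfo {
        final String name;
        final String mimetype;
        final boolean ambiguous;

        ExtensionInfo(String name, String mimetype, boolean ambiguous) {
            this.name = name;
            this.mimetype = mimetype;
            this.ambiguous = ambiguous;
        }
    }

    private final Map<String, ExtensionInfo> byExtension = new HashMap<>();

    void register(ExtensionInfo info) {
        byExtension.put(info.name, info);
    }

    /**
     * Returns the mimetype bound to the filename's extension. When the extension
     * is unknown or flagged ambiguous, the sniffer is consulted first; the
     * registered mimetype (if any) is only used when sniffing yields nothing.
     */
    String resolve(String filename, Supplier<String> sniffer) {
        int dot = filename.lastIndexOf('.');
        String ext = dot < 0 ? "" : filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        ExtensionInfo info = byExtension.get(ext);
        if (info == null || info.ambiguous) {
            String sniffed = sniffer.get();
            if (sniffed != null) {
                return sniffed;
            }
        }
        return info == null ? null : info.mimetype;
    }
}
```

With xml registered as ambiguous, resolve("a.xml", sniffer) asks the sniffer before falling back to text/xml, whereas a non-ambiguous extension such as pdf is answered from the map alone.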

At the MimetypeRegistry level, methods are provided to register (or unregister) mimetypes and extensions dynamically from code:

public void testMimetypeRegistration() {
    MimetypeEntry mimetype = getMimetypeSample();
    mimetypeRegistry.registerMimetype(mimetype);
    // JUnit convention: expected value first, actual value second
    assertEquals(
            mimetype,
            mimetypeRegistry.getMimetypeEntryByName(mimetype.getNormalized()));
}

private static MimetypeEntryImpl getMimetypeSample() {

    String normalizedMimetype = "application/msword";

    List<String> mimetypes = new ArrayList<String>();
    mimetypes.add("application/msword");
    // fake
    mimetypes.add("app/whatever-word");

    List<String> extensions = new ArrayList<String>();
    extensions.add("doc");
    extensions.add("xml");

    String iconPath = "icons/doc.png";

    boolean binary = true;
    boolean onlineEditable = true;
    boolean oleSupported = true;

    return new MimetypeEntryImpl(normalizedMimetype, mimetypes, extensions,
            iconPath, binary, onlineEditable, oleSupported);
}

The registerMimetype method takes a MimetypeEntry argument and adds it to the registry. The test then checks that the MimetypeEntry retrieved using the normalized name equals the one that was registered. The getMimetypeSample method, a helper of the TestMimetypeRegistryService test class, shows how to define a MimetypeEntry in code.

Based on the previous entry definition, two registries are built, allowing the normalized mimetype to be retrieved either through the file extension (efficient) or by sniffing (accurate).
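
As an illustration only, the two lookups can be pictured as two maps filled from the same entry. The DualIndexSketch class and its method names are assumptions made for this sketch, not the actual registry code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: one entry feeds both an extension index and a mimetype index. */
class DualIndexSketch {

    // extension -> normalized mimetype (cheap lookup from the filename)
    private final Map<String, String> normalizedByExtension = new HashMap<>();

    // any known mimetype alias -> normalized mimetype (used after sniffing)
    private final Map<String, String> normalizedByMimetype = new HashMap<>();

    void register(String normalized, List<String> mimetypes, List<String> extensions) {
        normalizedByMimetype.put(normalized, normalized);
        for (String mimetype : mimetypes) {
            normalizedByMimetype.put(mimetype, normalized);
        }
        for (String extension : extensions) {
            normalizedByExtension.put(extension, normalized);
        }
    }

    /** Returns the normalized mimetype for an extension, or null if unknown. */
    String fromExtension(String extension) {
        return normalizedByExtension.get(extension);
    }

    /** Returns the normalized mimetype for a sniffed mimetype, or null if unknown. */
    String fromSniffedMimetype(String sniffed) {
        return normalizedByMimetype.get(sniffed);
    }
}
```

Registering the sample entry from getMimetypeSample this way makes both "doc" and the fake "app/whatever-word" alias resolve to application/msword.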

21.3. Mimetype sniffing

This section describes the underlying process involved in mimetype detection.

File extension detection relies on the filename being correct. As it simply associates a string with a mimetype, we do not develop it further. Just note that the file has to be correctly named and have an extension for a mimetype to be retrieved with this method.

Sniffing analyses the file content itself in order to guess the mimetype. The third-party tool Jmimemagic is used for this, and it has been substantially extended with new supported mimetypes and detectors. Jmimemagic uses two files, magic.xml and magic_1_0.dtd. We redefine the XML file to add new detections (Jmimemagic does not define an extension point for this).

Basically, the default mimetype sniffing is based on searching a sequence of characters (or binary values) at a specified offset.

<match>
  <mimetype>application/pdf</mimetype>
  <extension>pdf</extension>
  <description>PDF document</description>
  <test offset="0" type="string" comparator="=">%PDF-</test>
</match>

A match is the definition of a magic entry. It contains a mimetype, an extension and a textual description of the defined mimetype. A test node, containing the operation to perform, is also defined. Here it declares that for an application/pdf mimetype, the file has to contain the string %PDF- at offset 0.
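
The test described by such a match amounts to comparing a few bytes at a fixed position. A minimal sketch of that comparison follows; it assumes nothing about Jmimemagic internals, and OffsetMagicSketch and matchesAt are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a "string at offset" magic test. */
class OffsetMagicSketch {

    /** Returns true when data contains the magic string exactly at the given offset. */
    static boolean matchesAt(byte[] data, int offset, String magic) {
        byte[] expected = magic.getBytes(StandardCharsets.ISO_8859_1);
        if (offset < 0 || data.length - offset < expected.length) {
            return false; // not enough bytes to hold the signature
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] header = "%PDF-1.4\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(matchesAt(header, 0, "%PDF-")); // prints "true"
    }
}
```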

While this method is suitable for many files (i.e. those whose format contains some invariants), a simple offset test (or a combination of them) is not enough for more complex formats, and the detection algorithm has to be refined. That is what detectors are for; we have defined some for the two major office file formats, MS-Office and OpenOffice.org.

For OpenOffice.org, the zip detection is enhanced:

<match>
  <mimetype>application/zip</mimetype>
  <extension>zip</extension>
  <description>Zip archive data</description>
  <test offset="0" type="string" comparator="=">PK\003\004</test>
  <match-list>
  <!--  opendocument & OOo 1.x -->
    <match>
      <mimetype>OOo</mimetype>
      <extension>OOo</extension>
      <description>OOo 1.x and OpenDocument file</description>
      <test type="detector" offset="0" length="" bitmask="" comparator="=">
           org.nuxeo.ecm.platform.mimetype.detectors.OOoMimetypeSniffer
      </test>
    </match>
  </match-list>
</match>

First, a simple offset detection is performed to qualify a zip file; then a sub-match is defined. The detector type indicates that org.nuxeo.ecm.platform.mimetype.detectors.OOoMimetypeSniffer has to be called, and it is the detector's responsibility to provide all the valid information (mimetype, extension, description) if the file is of the expected type.
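
One way a zip-based detector can refine its answer is to look inside the archive. OpenDocument packages carry a zip entry named mimetype that holds the media type as plain text, so a sketch can simply read that entry. Whether OOoMimetypeSniffer works this way is an assumption; ZipMimetypeEntrySketch and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch: read the media type declared inside a zip package. */
class ZipMimetypeEntrySketch {

    /** Returns the text of the "mimetype" zip entry, or null if there is none. */
    static String readDeclaredMimetype(byte[] archive) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("mimetype".equals(entry.getName())) {
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    byte[] chunk = new byte[256];
                    int read;
                    while ((read = zip.read(chunk)) != -1) {
                        buffer.write(chunk, 0, read);
                    }
                    return new String(buffer.toByteArray(), StandardCharsets.US_ASCII).trim();
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a tiny in-memory archive, for demonstration purposes only. */
    static byte[] sampleArchive(String declaredMimetype) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("mimetype"));
            zip.write(declaredMimetype.getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

The archive produced by sampleArchive starts with the PK\003\004 signature matched by the outer test, and readDeclaredMimetype then returns the more specific type stored inside it.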

For MS-Office files, the "magic numbers" (the values to be found at a certain offset) are not that discriminating, as the magic number defined by Jmimemagic (based on the resource file of the Linux file command) is the same for all MS-Office applications. We therefore invoke a detector for each application, which uses the POI library to determine what kind of file we are dealing with.

<match>
  <mimetype>application/msword</mimetype>
  <extension>doc</extension>
  <description>Microsoft Office Document</description>
  <test offset="0" type="string" comparator="=">\320\317\021\340\241\261</test>
  <match-list>
    <!--  XLS file by detector -->
    <match>
      <mimetype>application/vnd.ms-excel</mimetype>
      <extension>xls</extension>
      <description>Excel File</description>
      <test type="detector" offset="0" length="" bitmask="" comparator="=">
           org.nuxeo.ecm.platform.mimetype.detectors.XlsMimetypeSniffer
      </test>
    </match>
   <!-- PPT file by detector -->
    <match>
      <mimetype>application/vnd.ms-powerpoint</mimetype>
      <extension>ppt</extension>
      <description>Powerpoint File</description>
      <test type="detector" offset="0" length="" bitmask="" comparator="=">
           org.nuxeo.ecm.platform.mimetype.detectors.PptMimetypeSniffer
      </test>
    </match>
  </match-list>
</match>

Once a Microsoft Office document has been detected at offset 0 by the first match, two sub-match detectors are tried, one for application/vnd.ms-excel (org.nuxeo.ecm.platform.mimetype.detectors.XlsMimetypeSniffer) and one for application/vnd.ms-powerpoint (org.nuxeo.ecm.platform.mimetype.detectors.PptMimetypeSniffer). If neither returns a mimetype, the only remaining possibility is application/msword. (This may be slightly refactored for simplicity in the near future.)
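
The control flow of this nested match can be sketched as follows. The octal sequence \320\317\021\340\241\261 corresponds to the bytes D0 CF 11 E0 A1 B1; everything else here (the NestedMatchSketch class, and sub-detectors modelled as simple predicates instead of POI-based sniffers) is an assumption made to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: outer magic test, ordered sub-detectors, then a fallback. */
class NestedMatchSketch {

    /** The signature \320\317\021\340\241\261 written in hexadecimal. */
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1
    };

    // Insertion order is preserved, so sub-detectors run in declaration order.
    private final Map<String, Predicate<byte[]>> subDetectors = new LinkedHashMap<>();

    void addSubDetector(String mimetype, Predicate<byte[]> detector) {
        subDetectors.put(mimetype, detector);
    }

    static boolean hasOle2Magic(byte[] data) {
        if (data.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (data[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns null when the outer test fails, else the first sub-match or the fallback. */
    String detect(byte[] data) {
        if (!hasOle2Magic(data)) {
            return null;
        }
        for (Map.Entry<String, Predicate<byte[]>> sub : subDetectors.entrySet()) {
            if (sub.getValue().test(data)) {
                return sub.getKey();
            }
        }
        return "application/msword"; // mimetype of the enclosing match
    }
}
```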

A detector is a class that implements net.sf.jmimemagic.MagicDetector. The public process method has to detect if the file fulfills the condition. If successful, it returns the mimetypes supported by this file. The public methods getHandledExtensions and getHandledTypes define the String arrays that are used by Jmimemagic to build the final match.
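
To give an idea of the shape of a detector, here is a deliberately simplified sketch. The DetectorSketch interface is a stand-in written for this example and is not the real net.sf.jmimemagic.MagicDetector, whose process methods take more parameters; only the three method names are taken from the description above. The SVG detector itself is hypothetical.

```java
import java.nio.charset.StandardCharsets;

/**
 * Simplified stand-in for a detector contract. This is NOT the real
 * net.sf.jmimemagic.MagicDetector interface; it only reuses three method names.
 */
interface DetectorSketch {
    String[] getHandledTypes();

    String[] getHandledExtensions();

    /** Returns the detected mimetypes, or an empty array when nothing matches. */
    String[] process(byte[] data);
}

/** Hypothetical detector: reports SVG when an <svg tag appears near the start. */
class SvgDetectorSketch implements DetectorSketch {

    // Only the beginning of the content is inspected.
    private static final int WINDOW = 1024;

    @Override
    public String[] getHandledTypes() {
        return new String[] {"image/svg+xml"};
    }

    @Override
    public String[] getHandledExtensions() {
        return new String[] {"svg"};
    }

    @Override
    public String[] process(byte[] data) {
        int length = Math.min(data.length, WINDOW);
        String head = new String(data, 0, length, StandardCharsets.ISO_8859_1);
        return head.contains("<svg") ? getHandledTypes() : new String[0];
    }
}
```

Such a detector is a case where a fixed offset is not enough: the <svg tag may come after an XML declaration of variable length, so the content has to be scanned.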

21.4. Invoking the mimetype detection

All detection goes through a MimetypeRegistry-like object. When invoked, the registry is populated with the information described in the previous sections. The registry implements the interface org.nuxeo.ecm.platform.mimetype.interfaces.MimetypeRegistry.

Once the registry is available, directly or through a bean, any File or Blob can be analyzed to retrieve information such as the mimetype name or the list of supported extensions.

import org.nuxeo.ecm.platform.mimetype.ejb.MimetypeRegistryBean;
...
    private MimetypeRegistryBean mimetypeRegistry;
...
    public void testSniffWordFromFile() throws Exception {

        File file = FileUtils.getResourceFileFromContext("test-data/hello.doc");

        String mimetype = mimetypeRegistry.getMimetypeFromFile(file);
        assertEquals("application/msword", mimetype);

        List<String> extensions = mimetypeRegistry.getExtensionsFromMimetypeName(mimetype);
        assertTrue(extensions.contains("doc"));
    }

In the above example, the mimetypeRegistry object is a MimetypeRegistryBean, and the goal is to sniff the mimetype of a given file. The MS-Word file is first loaded, then the getMimetypeFromFile method is called. Once the mimetype is retrieved, the getExtensionsFromMimetypeName method can be called to get the associated extensions from the registry.

Note that the API of org.nuxeo.ecm.platform.mimetype.interfaces.MimetypeRegistry contains various ways to ask for a mimetype, dealing with File or Blob objects, with or without default responses. It is worth a look to avoid unneeded work.
