MMBase Internationalization and Localization

Author: Michiel Meeuwissen
Date: 2002-10-19

This software is OSI Certified Open Source Software. OSI Certified is a certification mark of the Open Source Initiative.

The license (Mozilla version 1.0) can be read at the MMBase site. See http://www.mmbase.org/license

Table of Contents

1. Introduction

2. Internationalization - Encodings

2.1. Non-ascii in Images

3. Localization - Translations

3.1. Builders
3.2. Generic JSP-editors
3.3. Editwizards

4. Language setting

1. Introduction

Since MMBase is basically a server product, it has only a small user interface, and the question of which human language (or other `locale' setting) is in use is relatively unimportant. A few words can be said about it though, and if you want to adapt MMBase to the needs of your country and language, you can hopefully find what you need in this document.

Happily MMBase is relatively well `internationalized', which means that effort has been taken to make it easy to `localize' it (i.e. basically: add translations). MMBase was developed in a Dutch environment, rather than an English-only one, which ensured that this issue was always kept in mind.

Part of the process of `internationalization' (also referred to as `i18n') is of course the ability to represent all kinds of characters; the first section elaborates on this subject.

The second part of this document will point out where the different `translatable' parts of MMBase are, and how to add or improve a translation.

2. Internationalization - Encodings

In MMBase 1.5 and before, the default character encoding of the strings in the database was ISO-8859-1, which is a superset of ASCII. In `iso-1' every character is represented by exactly one byte, so there can be at most 256 different characters. Using ISO-8859-1, most western European languages - including English, French, German and Dutch - can be represented without much trouble.

MMBase however is coded in Java, and Java strings can contain far more than 256 different characters. In fact Java strings hold characters from the Unicode character set, which is a huge set, sufficient to represent texts in virtually all languages of the world. See http://www.unicode.org.
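The one-byte-per-character property of ISO-8859-1 can be checked with the standard Java charset classes. The following is only a small sketch; the class and method names are made up for this illustration and are not part of MMBase.

```java
import java.nio.charset.StandardCharsets;

class Latin1Demo {
    /** Number of bytes needed to store the text in ISO-8859-1. */
    static int latin1Length(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    /** True if every character of the text exists in ISO-8859-1. */
    static boolean fitsInLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String cafe = "caf\u00e9";         // ends in e-acute, a western European letter
        String greek = "\u03b1\u03b2";     // Greek alpha and beta
        System.out.println(latin1Length(cafe));  // 4 characters, 4 bytes
        System.out.println(fitsInLatin1(cafe));  // true
        System.out.println(fitsInLatin1(greek)); // false
    }
}
```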

Therefore in MMBase 1.6 the encoding of the database strings was made configurable. Of course this is only needed if the database (and JDBC driver) does not understand Unicode encodings by itself. E.g. MySQL (at the time of writing) does not support Unicode by itself, so MMBase has to force it.

By default this `encoding' parameter (which can be found in mmbaseroot.xml) is now set to UTF-8, the most common Unicode encoding. UTF-8 is a multibyte encoding, which means that one character can be represented by one or more bytes. The ASCII characters are represented by the same single bytes as in ASCII, so any text encoded in ASCII is also valid UTF-8. UTF-8 is therefore backwards-compatible with ASCII, though not with the non-ASCII part of ISO-8859-1.
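To make the multibyte behaviour of UTF-8 concrete, here is a minimal sketch using only the standard Java library. The class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Demo {
    /** Encodes the text as UTF-8 bytes. */
    static byte[] utf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** True if the UTF-8 bytes are identical to the plain ASCII bytes. */
    static boolean sameBytesAsAscii(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.US_ASCII), utf8(text));
    }

    public static void main(String[] args) {
        System.out.println(sameBytesAsAscii("MMBase")); // true: ASCII text is unchanged
        System.out.println(utf8("\u00e9").length);      // 2: e-acute needs two bytes
        System.out.println(utf8("\u20ac").length);      // 3: the euro sign needs three
    }
}
```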

Most configuration in MMBase is done in XML nowadays. Unicode characters can always be represented in XML documents, either by `entities' (numeric character references such as &#233;) or by encoding the document in UTF-8, which is the encoding an XML parser assumes when no other encoding is declared or detected.
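That both ways of writing a character in XML lead to the same result can be seen with the XML parser that ships with Java. This is only a sketch: the element name `name' and the class are hypothetical and do not come from any MMBase configuration file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlCharRefDemo {
    /** Parses a small XML document (sent as UTF-8 bytes) and returns the root element's text. */
    static String textOf(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // No encoding is declared, so the parser falls back to UTF-8.
        String viaReference = textOf("<name>caf&#233;</name>");
        String viaUtf8 = textOf("<name>caf\u00e9</name>");
        System.out.println(viaReference.equals(viaUtf8)); // true
    }
}
```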

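Though Java strings can contain any Unicode character, Java source code is not necessarily stored in e.g. UTF-8. The compiler reads source files in a default encoding that depends on the platform and JDK version, unless told otherwise (javac has an -encoding option for this). So if a Unicode character cannot be relied upon to survive in the encoding of the source file, it is best escaped by use of the \u notation, which works regardless of the file's encoding.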
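As a small illustration of the \u notation, the following source file is pure ASCII, yet its strings contain characters outside ISO-8859-1. The class and constant names are invented for the example.

```java
class UnicodeEscapeDemo {
    // Greek small letter alpha, written as a unicode escape.
    static final String ALPHA = "\u03b1";
    // The euro sign, which is not part of ISO-8859-1 either.
    static final String EURO = "\u20ac";

    public static void main(String[] args) {
        System.out.println(ALPHA.length());        // 1: the escape is a single character
        System.out.println((int) ALPHA.charAt(0)); // 945, i.e. hexadecimal 3b1
        System.out.println(EURO.codePointAt(0) == 0x20ac); // true
    }
}
```

Note that the compiler also processes such escapes inside comments, so a backslash followed by `u' should not be written there unless four hexadecimal digits follow.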

This is also valid for Java `property files'. Property files are often used as resources for translations, so it is important to realize that they must be encoded in ISO-8859-1, but that it is still possible to use characters from the complete Unicode set by means of the same \u escapes. You might want to use a tool to do the escaping for you though, if you write in a language of which most characters are not in ISO-8859-1 (e.g. Greek, Russian and most Asian languages).
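The sketch below shows how an escaped property file entry is read back as real Unicode text by java.util.Properties, together with a very small escaping helper of the kind such a tool would provide. The key `greeting', the class and the helper are hypothetical and only serve the illustration; the helper simply escapes everything outside 7-bit ASCII.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesEscapeDemo {
    /** Loads property-file content the way it would be read from an ISO-8859-1 file. */
    static Properties load(String fileContent) {
        Properties props = new Properties();
        try {
            byte[] bytes = fileContent.getBytes(StandardCharsets.ISO_8859_1);
            props.load(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    /** Replaces every character outside 7-bit ASCII by its unicode escape. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String greek = "\u0393\u03b5\u03b9\u03b1";       // a Greek word
        String fileContent = "greeting=" + escape(greek) + "\n";
        System.out.println(fileContent.trim());           // pure ASCII, safe in the file
        System.out.println(load(fileContent).getProperty("greeting").equals(greek)); // true
    }
}
```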

2.1. Non-ascii in Images

Usually the `convert' tool of ImageMagick is used to resize, rotate or otherwise convert MMBase images. In this way it is also possible to add text to existing images.

The Java strings are then fed to the `convert' tool by a runtime call. The convert tool expects the string to be encoded in UTF-8. Runtime.exec offers no way to specify the encoding of the arguments, nor is it clear how else Java should be informed about this.

The current implementation works, also for non-ASCII characters, if the Java environment is configured for UTF-8. On Linux this is the case if e.g. the LC_ALL environment variable is set to "en_US.UTF-8" or "nl_NL.UTF-8".

3. Localization - Translations

In a few parts of MMBase, texts can be found in different languages. A language is always indicated by a two-letter code from ISO 639 (note that this is for languages, not countries).

The internationalization of MMBase currently extends only to the real user interface parts, which in practice means the `content editors'. An internationalized administration user interface does not exist yet. Error messages are another candidate for internationalization, specifically those which can appear to users and templaters; all error messages from the MMCI and the MMBase Taglib currently appear only in English (or broken English). Some of these messages, like security exceptions (`permission denied'), can occasionally appear to editor users.
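The difference between a language code and a country code can be illustrated with java.util.Locale, which knows the two-letter ISO 639 language codes. This is a sketch with invented class and method names; it says nothing about how MMBase itself validates such codes.

```java
import java.util.Locale;

class LanguageCodeDemo {
    /** True if the code is one of the two-letter ISO 639 language codes known to Java. */
    static boolean isIso639Language(String code) {
        for (String known : Locale.getISOLanguages()) {
            if (known.equals(code)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Locale dutch = Locale.forLanguageTag("nl");
        System.out.println(dutch.getLanguage());    // nl
        System.out.println(isIso639Language("nl")); // true: Dutch is a language
        System.out.println(isIso639Language("us")); // false: "us" is a country code, not a language
    }
}
```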

4. Language setting

In mmbaseroot.xml the `default language' can also be set, which is used if no language is set explicitly.

The language can also be set explicitly in the Cloud of the bridge (using the `setLocale' function). GUI names obtained from such a cloud will be localized according to this setting (if possible).

The MMBase taglib provides the `locale' tag, which sets the language in the cloud tags it surrounds, but also informs other, less MMBase-specific tags that it contains (e.g. the time tag) about the locale setting.
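The fallback rule described here (an explicitly set language wins, otherwise the default language applies) can be sketched in a few lines. This is not MMBase code: the class and method are hypothetical, and the default language is passed in as a plain two-letter code rather than read from mmbaseroot.xml.

```java
import java.util.Locale;

class LanguageFallback {
    /**
     * Returns the explicitly chosen locale, or a locale for the
     * configured default language when none was chosen (null).
     */
    static Locale effectiveLocale(Locale explicit, String defaultLanguage) {
        if (explicit != null) {
            return explicit;
        }
        return Locale.forLanguageTag(defaultLanguage);
    }

    public static void main(String[] args) {
        // Nothing set explicitly: the default language is used.
        System.out.println(effectiveLocale(null, "nl").getLanguage());           // nl
        // An explicit choice overrides the default.
        System.out.println(effectiveLocale(Locale.FRENCH, "nl").getLanguage());  // fr
    }
}
```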