Transformation Formats in Using Character Conversion (CHARCONV)

The UCS-2 form of the Unicode character set encodes each character as 2 bytes (16 bits total). However it does not specify which of the bytes is most significant. The byte order, or endian-ness, is left up to the discretion of a particular operating system.

While this is not important within a system, it does mean that text encoded as UCS-2 cannot easily be shared between systems using a different endian-ness. To overcome this problem the Unicode Consortium has defined two transformation formats for sharing Unicode text. The transformation formats explicitly specify byte order, and cannot be misinterpreted by computers using a different byte order.

The two transformation formats, UTF-7 and UTF-8, are described below. For the full definition of these formats, see The Unicode Standard published by the Unicode Consortium.

UTF-7 allows Unicode characters to be encoded and transmitted as 8-bit bytes, of which only 7 bits are used. UTF-7 divides the set of Unicode characters into three subsets, which are encoded and transmitted differently.

Set D, is the set of characters which are encoded as a single byte. It includes lower and upper case A to Z, the numeric digits, and nine other characters.
Set O includes the characters ! " # $ % & * ; < = > @ [ ] ^ _ { | }. These characters can be encoded as a single byte, or with the modified base 64encoding used for set B characters. When encoded as a single byte, set O characters can be misinterpreted by some applications encoding as modified base 64 overcomes this problem.
Set B comprises the remaining characters, which are encoded as an escape byte followed by 2 or 3 bytes. The encoding format is a modified form of base 64 encoding.

UTF-8 encodes and transmits Unicode characters as a string of 8-bit bytes. All the ASCII characters 0 to 127 are encoded without change; the most significant bit being set to zero is a signal that they have not been changed. Unicode characters U0080 to U07FF are encoded in two bytes, the remaining Unicode characters except for the surrogates are encoded in three bytes. The Unicode surrogate characters are supported by the Character Conversion API, but are not currently supported by all Symbian OS components.

A variant of UTF-8 used internally by Java differs from standard UTF-8 in two ways. First, the specific case of the NULL character (0x0000) is encoded in the two-byte format, and second, only the one-, two- and three-byte formats are used, not the four-byte format which is normally used for Unicode surrogate-pairs. An argument to ConvertFromUnicodeToUtf8 controls whether the UTF-8 generated by this is the Java variant. Support for this was removed in v6.0.

Transformation formats

UTF-7

UTF-8