
Introduction to i18n
Chapter 3 - Important Concepts for Character Coding Systems


The character coding system is one of the fundamental elements of software and information processing. Without proper handling of character codes, your software is far from being internationalized. Thus the author begins this document with a discussion of character codes.

In this chapter, basic concepts such as coded character set and encoding are introduced. These terms will be needed to read this document and other documents on internationalization and character codes including Unicode.


3.1 Basic Terminology

I begin this chapter by defining a few very important words.

As many people point out, there is confusion about terminology, since words are used in various different ways. The author does not want to add new terminology to an already confusing ocean of terminologies. Therefore, the terminology of RFC 2130 will be adopted in this document, with the one exception of the term 'character set'.

Character
A character is an individual unit of which sentences and texts consist. A character is an abstract notion.
Glyph
A glyph is a specific visual instance of a character. Character and glyph form a pair of terms. Sometimes a character has multiple glyphs (for example, '$' may have one or two vertical bars, Arabic characters have up to four glyphs each, and some CJK ideograms have many glyphs). Sometimes two or more characters form one glyph (for example, the ligature 'fi'). In almost all cases, text data, which is intended to carry abstract ideas rather than visual information, does not need information on glyphs, since differences between glyphs do not affect the meaning of the text. However, the distinction between different glyphs of a single CJK ideogram is sometimes important for proper nouns such as names of persons and places. There is so far no standardized method for plain text to carry information on glyphs. As a result, plain text cannot be used in some special fields such as citizen registration systems and serious DTP such as newspaper systems.
Encoding
An encoding is a rule by which characters and texts are expressed as combinations of bits or bytes so that computers can handle them. Terms such as character coding system, character code, and charset are used with the same meaning. Basically, an encoding deals with characters, not glyphs. There are many official and de facto standard encodings, such as ASCII, ISO 8859-{1,2,...,15}, ISO 2022-{JP, JP-1, JP-2, KR, CN, CN-EXT, INT-1, INT-2}, EUC-{JP, KR, CN, TW}, Johab, UHC, Shift-JIS, Big5, TIS 620, VISCII, VSCII, the so-called 'CodePages', UTF-7, UTF-8, UTF-16LE, UTF-16BE, KOI8-R, and so on. To construct an encoding, we have to consider the following concepts (encoding = one or more CCS + one CES).
Character Set
A character set is a set of characters. It determines the range of characters that an encoding can handle. In contrast to a coded character set, it is often called a non-coded character set.
Coded Character Set (CCS)
Coded character set (CCS) is a term defined in RFC 2130. It means a character set in which every character is assigned a unique number by some method. There are many national and international standards for CCS. Many national standards for CCS are coded so that they conform to an international standard such as ISO 646 or ISO 2022. ASCII, BS 4730, JISX 0201 Roman, and so on are examples of ISO 646 variants. All ISO 646 variants, ISO 8859-*, JISX 0208, JISX 0212, KSX 1001, GB 2312, CNS 11643, CCCII, TIS 620, TCVN 5712, and so on are examples of ISO 2022-compliant CCS. VISCII and Big5 are examples of non-ISO 2022-compliant CCS. UCS-2 and UCS-4 (ISO 10646) are also examples of CCS.
Character Encoding Scheme (CES)
Character Encoding Scheme (CES) is also a term defined in RFC 2130. It refers to a method of constructing an encoding from one or more CCS. This is important when two or more CCS are used to construct an encoding. ISO 2022 is a method of constructing an encoding from one or more ISO 2022-compliant CCS. ISO 2022 is a very complex system, so subsets of ISO 2022 are usually used, such as EUC-JP (ASCII and JISX 0208), ISO-2022-KR (ASCII and KSX 1001), and so on. CES is not important for encodings with only one 8-bit CCS. The UTF series (UTF-8, UTF-16LE, UTF-16BE, and so on) can be regarded as CES whose CCS is Unicode or ISO 10646.
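
As a rough illustration of "encoding = CCS + CES", the following sketch takes one character from the JISX 0208 coded character set and prints its bytes in three encodings built on that CCS. It assumes Python and its standard codec names (euc_jp, iso2022_jp, shift_jis), which are the Python library's names and not terms of this document.

```python
# A sketch: one CCS (JISX 0208) seen through three different encodings.
# In JISX 0208 the Hiragana letter A has the code point 0x24 0x22.
ch = "\u3042"  # HIRAGANA LETTER A

def hexdump(data):
    """Format bytes as space-separated hexadecimal."""
    return " ".join(f"{b:02x}" for b in data)

euc_jp = ch.encode("euc_jp")          # sets the high bit of each byte
iso2022_jp = ch.encode("iso2022_jp")  # wraps the code point in escape sequences
shift_jis = ch.encode("shift_jis")    # applies an arithmetic transformation

for name, data in [("EUC-JP", euc_jp),
                   ("ISO-2022-JP", iso2022_jp),
                   ("Shift-JIS", shift_jis)]:
    print(f"{name:12} {hexdump(data)}")

# In EUC-JP the JISX 0208 code point stays visible:
# clearing the high bit of each byte recovers it.
jis_code_point = bytes(b & 0x7F for b in euc_jp)
print("JISX 0208   ", hexdump(jis_code_point))
```

The numbers assigned by the CCS are the same in all three cases; only the way they are turned into bytes differs.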

Some other words are often used in relation to character codes.

Character code is a widely used term for encoding. It is a primitive and crude term for the way a computer handles characters by assigning numbers to them. For example, character code can refer to an encoding and can also refer to a coded character set. Thus the term can be used only where the two can be regarded as belonging to the same category. It should be avoided in serious discussions, and this document will not use it hereafter.

Codeset is a word that refers to an encoding or a character encoding scheme. [3]

charset is also a widely used word. It appears, for example, in MIME (as in Content-Type: text/plain; charset=iso-8859-1), in XLFD (X Logical Font Description) font names (the CharSetRegistry and CharSetEncoding fields), and so on. Note that charset in MIME means encoding, while charset in an XLFD font name means coded character set. This is very confusing. In this document, charset and character set are used in the XLFD sense, since I think character set should mean a set of characters, not an encoding.

Ken Lunde's "CJKV Information Processing" uses the term encoding method. He says that ISO-2022, EUC, Big5, and Shift-JIS are examples of encoding methods. His encoding method seems to correspond to CES in this document. However, we should notice that Big5 and Shift-JIS are encodings while ISO-2022 and EUC are not. [4]

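Character Encoding Model, Unicode Technical Report #17 (hereafter, "the Report") suggests a five-level model:

ACR: Abstract Character Repertoire
CCS: Coded Character Set
CEF: Character Encoding Form
CES: Character Encoding Scheme
TES: Transfer Encoding Syntax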

TES is also suggested in RFC 2130. Some examples of TES are base64, uuencode, BinHex, quoted-printable, gzip, and so on. A TES is a transformation of encoded data, which may or may not include textual data. Thus, TES is not a part of character encoding. However, TES is important in Internet data exchange.
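
The layering can be sketched as follows, assuming Python and its standard base64 and quopri modules. The point shown is that a TES operates on already encoded bytes and knows nothing about characters.

```python
import base64
import quopri

text = "\u3042"                    # HIRAGANA LETTER A
encoded = text.encode("utf-8")     # character encoding: text -> bytes

b64 = base64.b64encode(encoded)    # TES: bytes -> ASCII-safe bytes
qp = quopri.encodestring(encoded)  # another TES: quoted-printable

print(encoded, b64, qp)

# Undoing the TES gives back the encoded bytes, not characters.
# A character decoding step is still needed afterwards.
restored = base64.b64decode(b64)
print(restored == encoded, restored.decode("utf-8") == text)
```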

When using a computer, we rarely have occasion to deal with ACR. It is true that CJK people have national standards for ACR (for example, a standard for ideograms which can be used in personal names), and some of us may need to handle such ACR with computers (for example, in a citizen registration system). However, this theme is too heavy for this document, because there are no standardized or recommended methods for handling such ACR. You may have to build the whole system yourself for such purposes. Good luck!

CCS in "the Report" is the same as what I described in this document. It has concrete examples: ASCII, ISO 8859-{1,2,...,15}, JISX 0201, JISX 0208, JISX 0212, KSX 1001, KSX 1002, GB 2312, Big5, CNS 11643, TIS 620, VISCII, TCVN 5712, UCS-2, UCS-4, and so on. Some of them are national standards, some are international standards, and others are de facto standards.

CEF and CES in "the Report" together correspond to CES in this document. This document will not distinguish the two, since I think this causes no inconvenience. An encoding with a significant CEF does not have a significant CES (in the sense of "the Report"), and vice versa, so why should we distinguish them? The only exception is the UTF-16 series, where UTF-16 is a CEF and UTF-16BE is a CES. This is the only case where we need the distinction between CEF and CES.
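
For the UTF-16 case, the two levels can be separated in a small sketch. The helper names utf16_code_units and serialize are hypothetical names chosen for this illustration, and the code assumes well-formed input (no lone surrogates). The first step produces 16-bit code units (the CEF level), and the second writes them out in a fixed byte order (the CES level).

```python
import struct

def utf16_code_units(text):
    """CEF step: map each code point to one or two 16-bit code units."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            units.append(cp)
        else:
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))    # high surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
    return units

def serialize(units, big_endian):
    """CES step: write the 16-bit code units as bytes in a fixed byte order."""
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)

# A Hiragana letter and an ideograph outside the 16-bit range (U+20000).
text = "\u3042\U00020000"
units = utf16_code_units(text)

print([f"{u:04X}" for u in units])
print(serialize(units, big_endian=True).hex())   # same bytes as UTF-16BE
print(serialize(units, big_endian=False).hex())  # same bytes as UTF-16LE
```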

Now, CES is a concrete concept with concrete examples: ASCII, ISO 8859-{1,2,...,15}, EUC-JP, EUC-KR, ISO 2022-JP, ISO 2022-JP-1, ISO 2022-JP-2, ISO 2022-CN, ISO 2022-CN-EXT, ISO 2022-KR, ISO 2022, VISCII, UTF-7, UTF-8, UTF-16LE, UTF-16BE, and so on. These are encodings themselves.

The most important concept in this section is the distinction between coded character set and encoding. A coded character set is a component of an encoding. Text data are described in an encoding, not in a coded character set.


3.2 Stateless and Stateful

To construct an encoding from two or more CCS, the CES has to supply a method to avoid collisions between these CCS. There are two ways to do that. One is to give all characters in all the CCS unique code points. The other is to allow characters from different CCS to share the same code point and to provide a code, such as an escape sequence, that switches the SHIFT STATE, that is, selects one character set.

An encoding with shift states is called STATEFUL and one without shift states is called STATELESS.

Examples of stateful encodings are: ISO 2022-JP, ISO 2022-KR, ISO 2022-INT-1, ISO 2022-INT-2, and so on.

For example, in ISO 2022-JP, the two bytes 0x24 0x2c may mean the Japanese Hiragana character 'GA' or the two ASCII characters '$' and ',', depending on the shift state.
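
This example can be tried out with a short sketch, assuming Python's iso2022_jp codec. The constant names are hypothetical; the escape sequences themselves are the ISO 2022-JP designations for JISX 0208 and ASCII.

```python
ESC = b"\x1b"
TO_JISX0208 = ESC + b"$B"  # escape sequence selecting JISX 0208
TO_ASCII = ESC + b"(B"     # escape sequence selecting ASCII

payload = b"\x24\x2c"      # the same two bytes in both cases

# Initial shift state is ASCII: the bytes are '$' and ','.
as_ascii = payload.decode("iso2022_jp")

# After switching to JISX 0208: the bytes are one Hiragana character.
as_kana = (TO_JISX0208 + payload + TO_ASCII).decode("iso2022_jp")

print(repr(as_ascii), len(as_ascii))
print(repr(as_kana), len(as_kana))
```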


3.3 Multibyte encodings

Encodings are classified into multibyte ones and the others, according to the relationship between number of characters and number of bytes in the encoding.

In a non-multibyte encoding, one character is always expressed by one byte. On the other hand, one character may be expressed by one or more bytes in a multibyte encoding. Note that the number of bytes per character is not fixed even within a single encoding.

Examples of multibyte encodings are: EUC-JP, EUC-KR, ISO 2022-JP, Shift-JIS, Big5, UHC, UTF-8, and so on. Note that all of UTF-* are multibyte.

Examples of non-multibyte encodings are: ISO 8859-1, ISO 8859-2, TIS 620, VISCII, and so on.

Note that even in a non-multibyte encoding, the number of characters and the number of bytes may differ if the encoding is stateful.
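
The difference between the two classes can be observed with a small sketch, assuming Python's codecs for UTF-8, EUC-JP, and ISO 8859-1. The helper bytes_per_char is a hypothetical name, and it is only meaningful for stateless encodings, since it encodes each character separately.

```python
def bytes_per_char(text, encoding):
    """Return the number of bytes each character takes in a stateless encoding."""
    return [len(ch.encode(encoding)) for ch in text]

latin = "a\u00e9"             # 'a' and 'e with acute accent'
mixed = "a\u00e9\u3042"       # the same plus HIRAGANA LETTER A
japanese = "a\u3042"          # 'a' and HIRAGANA LETTER A

print(bytes_per_char(latin, "iso8859_1"))  # non-multibyte: always 1
print(bytes_per_char(mixed, "utf_8"))      # multibyte: the length varies
print(bytes_per_char(japanese, "euc_jp"))  # multibyte: the length varies
```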

Ken Lunde's "CJKV Information Processing" [5] classifies encoding methods into the following three categories:

modal
non-modal
fixed-length

Modal corresponds to stateful in this document. The other two are stateless, where non-modal is multibyte and fixed-length is non-multibyte. However, I think stateful versus stateless and multibyte versus non-multibyte are independent concepts. [6]


3.4 Number of Bytes, Number of Characters, and Number of Columns

One ASCII character is always expressed by one byte and occupies one column on the console or in X terminal emulators (with a fixed-width font for X). You must not make such an assumption in I18N programming; you have to clearly distinguish the number of bytes, the number of characters, and the number of columns.

As for the relationship between characters and bytes, in multibyte encodings two or more bytes may be needed to express one character. In stateful encodings, escape sequences do not correspond to any characters.

The number of columns is not defined in any standard. However, CJK ideograms, Japanese Hiragana and Katakana, and Korean Hangul usually occupy two columns on the console or in X terminal emulators. Note that 'Full-width forms' in the UCS-2 and UCS-4 coded character sets occupy two columns and 'Half-width forms' occupy one column. Combining characters used for Thai and other scripts can be regarded as zero-column characters. Though there is no standard, you can use wcwidth() and wcswidth() for this purpose. See Number of Columns, Section 7.1.2 for details.
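
The three counts can be compared in a rough sketch. It does not call wcwidth(); instead it approximates the column rules described above with Python's unicodedata module, and the function name columns is hypothetical. Real terminals and C libraries may differ in details, and control characters are not handled.

```python
import unicodedata

def columns(text):
    """Approximate the number of terminal columns occupied by text."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue  # combining marks are treated as zero columns
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and full-width characters
        else:
            width += 1
    return width

samples = [
    "abc",                 # ASCII
    "\u65e5\u672c\u8a9e",  # three CJK ideograms
    "\uff21",              # full-width Latin capital A
    "\uff71",              # half-width Katakana A
    "\u0e01\u0e31",        # Thai consonant plus a combining vowel sign
]

for s in samples:
    print(len(s.encode("utf-8")), "bytes,", len(s), "characters,",
          columns(s), "columns")
```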





17 June 2006

Tomohiro KUBOTA [email protected]