CHARSETCONVERTERS

Syntax

Envariable: N/A
Element: <CHARSETCONVERTERS>
charset-filter-specification
</CHARSETCONVERTERS>
Command-line Option: N/A

Description

The CHARSETCONVERTERS resource specifies Perl routines to call for filtering characters of a character set to legal HTML characters. The filtering occurs for message header data encoded according to the MIME standard. The following example shows a header with encoded data:

From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
CC: =?ISO-8859-1?Q?Andr=E9_?= Pirard <PIRARD@vm1.ulg.ac.be>
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
 =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
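
To make the structure of such encoded words concrete, the following is a minimal Perl sketch that splits a single encoded word into its character set and raw bytes. It is only an illustration, not MHonArc's own decoder; the routine name decode_encoded_word is hypothetical, and the sketch assumes a well-formed word with no surrounding text.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Split one encoded word of the form =?charset?encoding?text?= into
# (charset, raw bytes). Returns an empty list if the word is malformed.
# Hypothetical helper for illustration only.
sub decode_encoded_word {
    my ($word) = @_;
    my ($charset, $enc, $text) =
        $word =~ /^=\?([^?]+)\?([QqBb])\?([^?]*)\?=$/
        or return;

    my $bytes;
    if (uc $enc eq 'B') {
        # "B" encoding is plain base64.
        $bytes = decode_base64($text);
    } else {
        # "Q" encoding: "_" stands for a space, "=XX" for the byte 0xXX.
        ($bytes = $text) =~ tr/_/ /;
        $bytes =~ s/=([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    }
    return (lc $charset, $bytes);
}

my ($charset, $bytes) =
    decode_encoded_word('=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?=');
# $charset is "iso-8859-1"; $bytes is "Keld J\xF8rn Simonsen"
```

The raw bytes produced by this step still contain the Latin-1 byte 0xF8. Making such bytes safe for HTML is the job of the routines registered through CHARSETCONVERTERS.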

The CHARSETCONVERTERS resource is also used by text-based MIMEFILTERS for message body text.

The CHARSETCONVERTERS resource can only be defined via the resource file. Each line of the element specifies a character set, the Perl routine for filtering the character set, and the Perl source file containing the routine.

Example:

<CharsetConverters>
iso-8859-1; MHonArc::CharEnt::str2sgml; MHonArc/CharEnt.pm
</CharsetConverters>

The first field is the character set specification. The second field is the routine name (which should contain a package qualifier). The third field is the source file in which the routine is defined. The source file is searched for as defined by the PERLINC resource.

There are some special character set specifications. They are as follows:

plain: This specifies text that is not explicitly encoded in a specific character set. The MIME RFCs specify that unencoded data should be treated as us-ascii. However, in some locales, this may not be the case.
default: The default routine to invoke for encoded data if no converter is defined for the given character set.
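
The role of the two special specifications can be pictured with the small Perl sketch below. It only models the lookup order described above and is not taken from MHonArc's source. The %converters table and the pick_converter routine are hypothetical, the case-insensitive match on the charset name is an assumption, and CHARSETALIASES mapping is not modelled.

```perl
use strict;
use warnings;

# Hypothetical table mirroring resource-file entries: charset => routine.
my %converters = (
    'plain'      => 'mhonarc::htmlize',
    'us-ascii'   => 'mhonarc::htmlize',
    'iso-8859-1' => 'MHonArc::CharEnt::str2sgml',
    'default'    => '-ignore-',
);

# Pick the routine name for a piece of data. $charset is undef when the
# data is not explicitly encoded in a specific character set.
sub pick_converter {
    my ($charset) = @_;

    # Unencoded text uses the "plain" entry.
    return $converters{'plain'} unless defined $charset;

    # Encoded text uses its own entry if there is one, else "default".
    my $key = lc $charset;
    return exists $converters{$key}
        ? $converters{$key}
        : $converters{'default'};
}
```

With this table, pick_converter(undef) selects the plain entry, pick_converter('ISO-8859-1') selects the entry registered for iso-8859-1, and an unregistered charset falls through to the default entry.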

There are some special character set converter routine values. They are as follows:

-ignore-

Leave the data "as-is"; that is, the MIME encoding will be preserved.

-decode-

Just decode the data. This is useful if it is known that the character set is the native character set for the system.

WARNING:

If the decoded data contains any of the characters '<', '>', or '&', it may conflict with HTML markup. -decode- should only be used if DECODEHEADS is active. See Examples below and DECODEHEADS for example uses of -decode-.

Each charset converter function is invoked as follows:

$converted_data = &function($data, $charset);

The data passed in will already be decoded from quoted-printable or base64 (as specified by the MIME syntax). Therefore, the called routine will be passed the raw byte data. It is important that the routine convert the data into a format suitable for inclusion within HTML markup.
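
As an illustration of this calling convention, the following is a minimal sketch of a custom converter. Everything named here is an assumption made for the example: the package MyConv, the routine str2html, and the file name MyConv.pm are hypothetical and not part of MHonArc. The sketch relies on Perl's standard Encode module to interpret the raw bytes, and falls back to iso-8859-1 for charset names Encode does not recognize, which is an arbitrary choice.

```perl
package MyConv;    # hypothetical package, not part of MHonArc
use strict;
use warnings;
use Encode qw(find_encoding);

my %HTML_SPECIAL = (
    '<' => '&lt;', '>' => '&gt;', '&' => '&amp;', '"' => '&quot;',
);

# Invoked as: $converted_data = &function($data, $charset);
# $data holds raw bytes, already decoded from quoted-printable or base64.
sub str2html {
    my ($data, $charset) = @_;

    # Interpret the bytes according to the charset (assumed fallback:
    # iso-8859-1 when the name is unknown to Encode).
    my $enc  = find_encoding($charset) || find_encoding('iso-8859-1');
    my $text = $enc->decode($data);

    # Escape HTML specials first, then replace every non-ASCII character
    # with a numeric character reference so the result is pure ASCII.
    $text =~ s/([<>&"])/$HTML_SPECIAL{$1}/g;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%04X;', ord($1))/ge;
    return $text;
}

# Registered in the resource file (hypothetical file name) as:
#   iso-8859-2; MyConv::str2html; MyConv.pm

1;
```

For example, calling MyConv::str2html("\xAE <b>", 'iso-8859-2') on this sketch returns the string '&#x017D; &lt;b&gt;', since byte 0xAE in iso-8859-2 is the character U+017D.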

Available Converters

The standard MHonArc distribution provides the following converters:

`mhonarc::htmlize`

Usage

<CharsetConverters>
charset-name; mhonarc::htmlize
</CharsetConverters>

mhonarc::htmlize is provided by the MHonArc core code base, so no source file specification is required.

Description

mhonarc::htmlize does a simple replacement of HTML special characters with entity references. The characters '<', '>', '&', and '"' are converted to '&lt;', '&gt;', '&amp;', and '&quot;', respectively.

This converter is appropriate for us-ascii data and for situations where the given character set is an 8-bit set that matches the locale settings for the archives. For example, if an archive contains iso-8859-7 (Greek) text data and archive readers' browsers are set to iso-8859-7 as the default encoding, then mhonarc::htmlize can be used to prevent the overhead of Greek characters being converted to entity references.

If you will be managing archives that include messages with multiple character encodings, it is recommended to limit the use of mhonarc::htmlize to us-ascii only.

`MHonArc::CharEnt::str2sgml`

Usage

<CharsetConverters>
charset-name; MHonArc::CharEnt::str2sgml; MHonArc/CharEnt.pm
</CharsetConverters>

Description

MHonArc::CharEnt::str2sgml converts a variety of character encodings into HTML 4 standard character entity references (e.g. &AElig;) and/or Unicode character entity references (e.g. &#x017D;). Characters in the us-ascii domain are left as-is, with the exception of HTML specials, which are converted as mhonarc::htmlize does. MHonArc::CharEnt::str2sgml attempts to be locale neutral and should be sufficient for most locales.

The following character sets/encodings are supported:

Charset/encoding	Description
`us-ascii`	US ASCII
`iso-8859-1`	Latin 1
`iso-8859-2`	Latin 2
`iso-8859-3`	Latin 3
`iso-8859-4`	Latin 4
`iso-8859-5`	Cyrillic
`iso-8859-6`	Arabic
`iso-8859-7`	Greek
`iso-8859-8`	Hebrew
`iso-8859-9`	Latin 5
`iso-8859-10`	Latin 6
`iso-8859-11`	Thai
`iso-8859-13`	Latin 7 (Baltic Rim)
`iso-8859-14`	Latin 8 (Celtic)
`iso-8859-15`	Latin 9 (aka Latin 0)
`iso-8859-16`	Latin 10
`iso-2022-jp`	Japanese
`iso-2022-kr`	Korean
`euc-jp`	Japanese
`utf-8`	Unicode UTF-8
`cp866`	MS-DOS Cyrillic
`cp932`	Japanese (Shift-JIS)
`cp936`	Chinese (GBK)
`cp949`	Korean
`cp950`	Windows Chinese
`cp1250`	Windows Latin 2
`cp1251`	Windows Cyrillic
`cp1252`	Windows Latin 1
`cp1253`	Windows Greek
`cp1254`	Windows Turkish
`cp1255`	Windows Hebrew
`cp1256`	Windows Arabic
`cp1257`	Windows Baltic
`cp1258`	Windows Vietnamese
`koi-0`	Cyrillic
`koi-7`	Cyrillic
`koi8-a`	Cyrillic
`koi8-b`	Cyrillic
`koi8-e`	Cyrillic
`koi8-f`	Cyrillic
`koi8-r`	Cyrillic
`koi8-u`	Cyrillic
`gost-19768-87`	Cyrillic
`viscii`	Vietnamese
`big5-eten`	Chinese (Taiwan)
`big5-hkscs`	Chinese (Hong Kong)
`gb2312`	Chinese
`macarabic`	Apple Arabic
`maccentraleurroman`	Apple Central Europe
`maccroatian`	Apple Croatian
`maccyrillic`	Apple Cyrillic
`macgreek`	Apple Greek
`machebrew`	Apple Hebrew
`macicelandic`	Apple Icelandic
`macromanian`	Apple Romanian
`macroman`	Apple Roman (Latin)
`macthai`	Apple Thai
`macturkish`	Apple Turkish
`hp-roman8`	HP Roman (Latin)

Most of the above listed charsets are also known by different names. See the CHARSETALIASES resource for details.

`MHonArc::UTF8::str2sgml`

Usage

<CharsetConverters override>
plain;    mhonarc::htmlize
default;  MHonArc::UTF8::str2sgml; MHonArc/UTF8.pm
</CharsetConverters>

<!-- Need to also register UTF-8-aware text clipping function -->
<TextClipFunc>
MHonArc::UTF8::clip; MHonArc/UTF8.pm
</TextClipFunc>

Description

MHonArc::UTF8::str2sgml converts data to UTF-8, with HTML specials converted to entity references as mhonarc::htmlize does.

Typical usage is to register it for all charsets, since only one TEXTCLIPFUNC can be specified. A mixture of UTF-8 and non-UTF-8 data can cause clipping problems in resource variables that include a length specifier.
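
The clipping problem can be seen in a few lines of Perl. The sketch below is only an illustration of why a UTF-8-aware clipping function is needed; it does not show how MHonArc::UTF8::clip is implemented, and it uses Perl's standard Encode module rather than any MHonArc routine.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "naive" with a diaeresis on the i, encoded as UTF-8:
# 6 bytes but only 5 characters.
my $bytes = "na\xC3\xAFve";

# Byte-oriented clipping to 3 units cuts the two-byte character in half,
# leaving a dangling lead byte that is not valid UTF-8.
my $byte_clip = substr($bytes, 0, 3);            # "na\xC3"

# Character-aware clipping decodes first, clips, then re-encodes,
# so the third character survives intact.
my $char_clip =
    encode('UTF-8', substr(decode('UTF-8', $bytes), 0, 3));   # "na\xC3\xAF"
```

A clipping function that counts bytes is correct for single-byte charsets but produces the broken result above on UTF-8 data, which is why the converter and the clipping function should agree on the encoding.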

See the utf-8.mrc example resource file for more details on how this converter can be used.

`iso_2022_jp::str2html`

Usage

<CharsetConverters>
iso-2022-jp; iso_2022_jp::str2html; iso2022jp.pl
</CharsetConverters>

Description

iso_2022_jp::str2html is designed to work with iso-2022-jp within a Japanese locale. iso_2022_jp::str2html preserves the iso-2022-jp encoding format, but converts HTML specials into character entity references similar to mhonarc::htmlize.

NOTE:

If using iso_2022_jp::str2html, you should also use the iso-2022-jp text clipping function:

<TextClipFunc>
iso_2022_jp::clip; iso2022jp.pl
</TextClipFunc>

Some Japanese-aware processing tools do not support Unicode character entity references, like those generated by MHonArc::CharEnt::str2sgml, so iso_2022_jp::str2html may be preferred over MHonArc::CharEnt::str2sgml for handling iso-2022-jp data.

For more information about using MHonArc in a Japanese locale, see (documents in Japanese): <http://www.mhonarc.jp/>

Default Setting

NOTE:

As of MHonArc v2.6.0, filters should only be defined for base charsets. The CHARSETALIASES resource can be used to map alternate names for base charsets.

<CharsetConverters>
plain;		    mhonarc::htmlize;
us-ascii;	    mhonarc::htmlize;
iso-8859-1;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-2;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-3;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-4;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-5;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-6;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-7;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-8;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-9;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-10;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-11;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-13;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-14;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-15;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-8859-16;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-2022-jp;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
iso-2022-kr;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
euc-jp;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
utf-8;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp866;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp936;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp949;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp950;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1250;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1251;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1252;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1253;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1254;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1255;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1256;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1257;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
cp1258;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi-0;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi-7;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi8-a;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi8-b;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi8-e;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi8-f;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi8-r;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
koi8-u;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
gost-19768-87;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
viscii;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
big5-eten;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
big5-hkscs;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
gb2312;		    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
macarabic;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
maccentraleurroman; MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
maccroatian;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
maccyrillic;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
macgreek;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
machebrew;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
macicelandic;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
macromanian;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
macroman;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
macthai;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
macturkish;	    MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
hp-roman8;          MHonArc::CharEnt::str2sgml;     MHonArc/CharEnt.pm
default;            -ignore-
</CharsetConverters>

Resource Variables

N/A

Examples

The following example tells MHonArc to just decode iso-8859-1 character data since it is the default character set used by most browsers:

<DecodeHeads>
<CharsetConverters>
iso-8859-1;-decode-
</CharsetConverters>

MHonArc's MHonArc::CharEnt module supports the conversion of many major character sets, including UTF-8 data, into standard HTML character entity references (e.g. &AElig;) and numeric Unicode character references (e.g. &#x203E;). However, if you want archive pages to be in native UTF-8, see the utf-8.mrc resource file example.

Version

2.0

See Also

CHARSETALIASES, DECODEHEADS, MIMEDECODERS, MIMEFILTERS, PERLINC, TEXTCLIPFUNC, TEXTENCODE

$Date: 2005/05/13 18:50:38 $

MHonArc
Copyright © 1997-2001, Earl Hood, mhonarc@mhonarc.org