mnoGoSearch supports almost all known 8-bit character sets as well as some multi-byte character sets, including Korean EUC-KR, Chinese Big5 and GB2312, Japanese Shift-JIS, EUC-JP and ISO-2022-JP, and UTF-8. Some multi-byte character sets are not supported by default, because their conversion tables are large and increase the size of the executable files. See the configure parameters to enable support for extra character sets.
mnoGoSearch also supports the following Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman, MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew, MacCyrillic, MacGujarati.
Table 7-1. Supported character sets
Languages | Character sets |
Western Europe: Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, Swedish | ASCII 8, CP437, CP850, CP860, CP1252, ISO 8859-1, ISO 8859-15, MacRoman, MacIceland |
Eastern Europe: Croatian, Czech, Hungarian, Polish, Romanian, Slovak, Slovene | CP852, CP1250, ISO 8859-2, MacCentralEurope, MacRomania, MacCroatian |
Baltic: Latvian, Lithuanian, Estonian | CP1257, ISO 8859-4, ISO 8859-13 |
Cyrillic: Bulgarian, Belorussian, Macedonian, Russian, Serbian, Ukrainian | CP855, CP866, CP1251, ISO 8859-5, Koi8-r, Koi8-u, MacCyrillic |
Arabic | CP864, CP1256, ISO 8859-6, MacArabic |
Greek | CP869, CP1253, ISO 8859-7, MacGreek |
Hebrew | CP1255, ISO 8859-8, MacHebrew |
Turkish | CP857, CP1254, ISO 8859-9, MacTurkish |
Japanese | Shift-JIS, EUC-JP, ISO-2022-JP |
Simplified Chinese | GB2312 |
Traditional Chinese | Big5 |
Korean | EUC-KR |
Thai | CP874, TIS 620, MacThai |
Vietnamese | CP1258 |
Indian | MacGujarati, TSCII |
Georgian | geostd8 |
Unicode: over 650 languages | UTF-8 |
mnoGoSearch can index documents in many languages into the same database. The disk space required to store search data depends on the character set that mnoGoSearch uses for storage, which is specified by the LocalCharset command.
indexer converts all documents to the character set specified by the LocalCharset command in indexer.conf. Internally, the conversion is implemented using Unicode.
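The idea of converting through Unicode can be sketched in a few lines of Python. This is only an illustration of the principle, not mnoGoSearch's own code; the function name convert and the sample text are assumptions made for the example.

```python
def convert(data: bytes, charset_from: str, charset_to: str) -> bytes:
    """Convert bytes between two charsets by going through Unicode."""
    text = data.decode(charset_from)   # source charset -> Unicode
    return text.encode(charset_to)     # Unicode -> target charset


# Sample Russian text (chosen for illustration) stored as KOI8-R bytes.
koi8 = "Привет".encode("koi8_r")
cp1251 = convert(koi8, "koi8_r", "cp1251")
print(cp1251.decode("cp1251"))
```

In this strict form, a character that does not exist in the target charset raises an error instead of being preserved.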
Some conversions can lose data. For example, converting a text file from Greek cp1253 to Russian cp1251 loses all Greek characters. To avoid this, mnoGoSearch stores every character that cannot be converted to LocalCharset in &#nnn; notation, where nnn is the decimal Unicode code point of the character.
To avoid wasting disk space on too many &#nnn; sequences (each takes 5 to 7 bytes), it is important to choose a good value for LocalCharset. If your document collection contains documents in many scripts, such as Greek, Russian and German, UTF-8 is probably the best choice for LocalCharset.
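The storage cost of the &#nnn; notation can be seen with a small experiment. The sketch below relies on Python's built-in xmlcharrefreplace error handler, which happens to produce the same decimal notation; it is an analogy rather than mnoGoSearch's implementation, and the sample word is an assumption.

```python
# Illustration only: the "xmlcharrefreplace" handler emits decimal &#nnn;
# references for characters the target charset cannot represent.
text = "Ελλάδα"  # six Greek letters, none of which exist in cp1251

as_cp1251 = text.encode("cp1251", errors="xmlcharrefreplace")
as_utf8 = text.encode("utf-8")

print(as_cp1251)        # each Greek letter becomes a &#nnn; sequence
print(len(as_cp1251))   # bytes needed in cp1251, escapes included
print(len(as_utf8))     # bytes needed in UTF-8
```

Here each escaped letter costs six bytes in cp1251, against two bytes in UTF-8, which is why a charset that covers most of the collection saves space.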
You can use the BrowserCharset command to choose the character set in which search results are displayed. If BrowserCharset and LocalCharset have different values, mnoGoSearch applies character set conversion. As at indexing time, characters that cannot be converted to BrowserCharset are displayed using &#nnn; notation.
Each charset is recognized by any of its aliases, because different web servers may report the same charset under different names. For example, ISO-8859-2, ISO8859-2 and latin2 all refer to the same character set. The search engine understands the following character set name aliases:
Table 7-2. Charset aliases
ISO-2022-JP: | ISO-2022-JP |
ISO-8859-1: | CP819, CSISOLATIN, IBM819, ISO-8859-1, ISO-IR-100, ISO_8859-1, ISO_8859-1:1987, L1, LATIN1 |
ISO-8859-10: | CSISOLATIN6, ISO-8859-10, ISO-IR-157, ISO_8859-10, ISO_8859-10:1992, L6, LATIN6 |
ISO-8859-11: | ISO-8859-11, TIS-620, TIS620, TACTIS |
ISO-8859-13: | ISO-8859-13, ISO-IR-179, ISO_8859-13, L7, LATIN7 |
ISO-8859-14: | ISO-8859-14, ISO-IR-199, ISO_8859-14, ISO_8859-14:1998, L8, LATIN8 |
ISO-8859-15: | ISO-8859-15, ISO-IR-203, ISO_8859-15, ISO_8859-15:1998 |
ISO-8859-16: | ISO-8859-16, ISO-IR-226, ISO_8859-16, ISO_8859-16:2000 |
ISO-8859-2: | CSISOLATIN2, ISO-8859-2, ISO-IR-101, ISO_8859-2, ISO_8859-2:1987, L2, LATIN2 |
ISO-8859-3: | CSISOLATIN3, ISO-8859-3, ISO-IR-109, ISO_8859-3, ISO_8859-3:1988, L3, LATIN3 |
ISO-8859-4: | CSISOLATIN4, ISO-8859-4, ISO-IR-110, ISO_8859-4, ISO_8859-4:1988, L4, LATIN4 |
ISO-8859-5: | CSISOLATINCYRILLIC, CYRILLIC, ISO-8859-5, ISO-IR-144, ISO_8859-5, ISO_8859-5:1988 |
ISO-8859-6: | ARABIC, ASMO-708, CSISOLATINARABIC, ECMA-114, ISO-8859-6, ISO-IR-127, ISO_8859-6, ISO_8859-6:1987 |
ISO-8859-7: | CSISOLATINGREEK, ECMA-118, ELOT_928, GREEK, GREEK8, ISO-8859-7, ISO-IR-126, ISO_8859-7, ISO_8859-7:1987 |
ISO-8859-8: | CSISOLATINHEBREW, HEBREW, ISO-8859-8, ISO-IR-138, ISO_8859-8, ISO_8859-8:1988 |
ISO-8859-9: | CSISOLATIN5, ISO-8859-9, ISO-IR-148, ISO_8859-9, ISO_8859-9:1989, L5, LATIN5 |
armscii-8: | ARMSCII-8, ARMSCII8 |
big5: | BIG-5, BIG-FIVE, BIG5, BIGFIVE, CN-BIG5, CSBIG5 |
cp1250: | CP1250, MS-EE, WINDOWS-1250 |
cp1251: | CP1251, MS-CYRL, WINDOWS-1251 |
cp1252: | CP1252, MS-ANSI, WINDOWS-1252 |
cp1253: | CP1253, MS-GREEK, WINDOWS-1253 |
cp1254: | CP1254, MS-TURK, WINDOWS-1254 |
cp1255: | CP1255, MS-HEBR, WINDOWS-1255 |
cp1256: | CP1256, MS-ARAB, WINDOWS-1256 |
cp1257: | CP1257, WINBALTRIM, WINDOWS-1257 |
cp1258: | CP1258, WINDOWS-1258 |
cp437: | 437, CP437, IBM437 |
cp850: | 850, CP850, CSPC850MULTILINGUAL, IBM850 |
cp852: | 852, CP852, IBM852 |
cp855: | 855, CP855, IBM855 |
cp857: | 857, CP857, IBM857 |
cp860: | 860, CP860, IBM860 |
cp861: | 861, CP861, IBM861 |
cp862: | 862, CP862, IBM862 |
cp863: | 863, CP863, IBM863 |
cp864: | 864, CP864, IBM864 |
cp865: | 865, CP865, IBM865 |
cp866: | 866, CP866, CSIBM866, IBM866 |
cp869: | 869, CP869, IBM869 |
cp874: | CP874, WINDOWS-874 |
EUC-JP: | CSEUCJP, EUC-JP, EUCJP, UJIS, X-EUC-JP |
EUC-KR: | CSEUCKR, EUC-KR, EUCKR |
GB2312: | CHINESE, CSGB2312, CSISO58GB231280, GB2312, GB_2312-80, ISO-IR-58 |
koi8-r: | CSKOI8R, KOI8-R, KOI8R |
koi8-u: | KOI8-U, KOI8U |
shift-JIS: | CSSHIFTJIS, MS_KANJI, S-JIS, SHIFT-JIS, SHIFT_JIS, SJIS |
cp367: | ANSI_X3.4-1968, ASCII, CP367, CSASCII, IBM367, ISO-IR-6, ISO646-US, ISO_646.IRV:1991, US, US-ASCII |
UTF8: | UTF-8, UTF8 |
viscii: | CSVISCII, VISCII, VISCII1.1-1 |
MacCyrillic: | MACCYRILLIC, X-MAC-CYRILLIC |
MacRoman: | MACROMAN, MACINTOSH, CSMACINTOSH, MAC |
MacCentralEurope: | MACCENTRALEUROPE, MACCE |
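An alias table like the one above is typically used as a lookup from any reported name to one canonical name. The following is a minimal sketch of that lookup using a few rows copied from Table 7-2. The function name canonical_charset is hypothetical, and the case-insensitive matching is an assumption, not documented mnoGoSearch behavior.

```python
# A few rows copied from Table 7-2; the remaining rows would be added the same way.
ALIASES = {
    "ISO-8859-2": ["CSISOLATIN2", "ISO-8859-2", "ISO-IR-101", "ISO_8859-2",
                   "ISO_8859-2:1987", "L2", "LATIN2"],
    "cp1251": ["CP1251", "MS-CYRL", "WINDOWS-1251"],
    "koi8-r": ["CSKOI8R", "KOI8-R", "KOI8R"],
    "UTF8": ["UTF-8", "UTF8"],
}

# Invert the table into alias -> canonical name, keyed by the upper-cased alias.
CANONICAL = {
    alias.upper(): name
    for name, aliases in ALIASES.items()
    for alias in aliases
}


def canonical_charset(name):
    """Return the canonical charset name for an alias, or None if unknown."""
    return CANONICAL.get(name.strip().upper())


print(canonical_charset("latin2"))
print(canonical_charset("windows-1251"))
```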
The indexer detects the document charset in this order:
1. The HTTP header: "Content-type: text/html; charset=xxx"
2. The <META NAME="Content-Type" CONTENT="text/html; charset=xxx"> tag (for HTML documents) or the <?xml version="1.0" encoding="xxx"?> declaration (for XML documents). This step can be switched off with the GuesserUseMeta no command in your indexer.conf.
3. The default taken from the "Charset" setting of the corresponding Server or Realm command.
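The detection order described above can be sketched as a simple fallback chain. This is a rough illustration under stated assumptions, not the indexer's actual code: the function detect_charset, its parameters, and the regular expressions are all hypothetical, and the document body is assumed to be already available as text.

```python
import re

# Hypothetical patterns for the three places a charset may be declared.
HEADER_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
META_RE = re.compile(
    r"<meta[^>]*content\s*=\s*[\"'][^\"'>]*charset\s*=\s*([\w.:-]+)",
    re.IGNORECASE,
)
XML_RE = re.compile(
    r"<\?xml[^>]*encoding\s*=\s*[\"']([\w.:-]+)[\"']", re.IGNORECASE
)


def detect_charset(content_type, body, default_charset, use_meta=True):
    """Pick a charset: HTTP header first, then META/XML, then the default."""
    # 1. The Content-Type HTTP header.
    if content_type:
        match = HEADER_RE.search(content_type)
        if match:
            return match.group(1)
    # 2. The META tag or XML declaration (use_meta=False mimics "GuesserUseMeta no").
    if use_meta:
        match = META_RE.search(body) or XML_RE.search(body)
        if match:
            return match.group(1)
    # 3. The configured default for the server.
    return default_charset


page = '<META NAME="Content-Type" CONTENT="text/html; charset=koi8-r">'
print(detect_charset("text/html", page, "cp1251"))
```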
Since 3.2.0, mnoGoSearch has had an automatic charset and language guesser. It currently recognizes more than 100 charsets and languages. Charset and language detection is implemented using the "N-Gram-Based Text Categorization" technique. There are a number of so-called "language map" files, one for each language-charset pair. By default they are installed in the /usr/local/mnogosearch/etc/langmap/ directory; look there for the list of currently provided charset-language pairs. The guesser works well for texts longer than 500 characters. Shorter texts may not be guessed reliably.
To build your own language map, use the mguesser utility. You also need to collect files with language samples in the desired charset. To create a new language map, use the following command:
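To give a feel for how n-gram-based text categorization works in general, here is a minimal sketch of the technique: each language sample is reduced to a ranked list of its most frequent character n-grams, and a document is assigned to the language whose ranking is closest. The function names, the n-gram length, the profile size and the tiny samples are all assumptions for illustration; the guesser in mnoGoSearch uses its own language map files and parameters.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"          # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get the maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def guess(text, profiles):
    """Return the name of the profile with the smallest out-of-place distance."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))


# Tiny illustrative samples; real language maps are built from much larger texts.
EN = "the quick brown fox jumps over the lazy dog and then the cat sat on the mat"
DE = "der schnelle braune fuchs springt über den faulen hund und dann sitzt die katze"
profiles = {"en": ngram_profile(EN), "de": ngram_profile(DE)}
print(guess("the dog and the cat", profiles))
```

With samples this small the ranking is noisy, which mirrors the point that short texts are harder to guess than long ones.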
mguesser -p -c charset -l language < FILENAME > language.charset.lm
You can also use the mguesser utility to guess a document's language and charset using the existing language maps. To do this, use the following command:
mguesser [-n maxhits] < FILENAME
For some languages, several different charsets are in use. To convert from one charset supported by mnoGoSearch to another, use the mconv utility:
mconv [OPTIONS] -f charset_from -t charset_to [configfile] < infile > outfile
By default, both the mguesser and mconv utilities are installed into the /usr/local/mnogosearch/sbin/ directory.
Since version 3.2.14, mnoGoSearch can update language and charset maps automatically while indexing, if the remote server supplies pages with an explicitly specified language and charset. To enable this function, specify the command
LangMapUpdate yes
in your indexer.conf file.
Use the RemoteCharset indexer.conf command to choose the default character set of indexed servers.
You can set the default language for servers with the DefaultLang indexer.conf command. This is useful for restricting search results by language.