DataparkSearch supports almost all known 8-bit character sets as well as several multi-byte charsets, including Korean euc-kr, Chinese big5 and gb2312, Japanese shift-jis, euc-jp and iso-2022-jp, and UTF-8. Some multi-byte character sets are not supported by default because their conversion tables are rather large and increase the size of the executable files. See the configure parameters to enable support for these charsets.
DataparkSearch also supports the following Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman, MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew, MacCyrillic, MacGujarati.
Table 7-1. Language groups
Language group | Character sets |
Arabic | cp864, ISO-8859-6, MacArabic, windows-1256 |
Armenian | armscii-8 |
Baltic | cp775, ISO-8859-13, ISO-8859-4, windows-1257 |
Celtic | ISO-8859-14 |
Central European | cp852, ISO-8859-16, ISO-8859-2, MacCE, MacCroatian, MacRomania, windows-1250 |
Chinese Simplified | GB2312, GBK |
Chinese Traditional | Big5, Big5-HKSCS, cp950, GB-18030 |
Cyrillic | cp855, cp866, cp866u, ISO-8859-5, KOI-7, KOI8-R, KOI8-U, MacCyrillic, windows-1251 |
Georgian | geostd8 |
Greek | cp869, cp875, ISO-8859-7, MacGreek, windows-1253 |
Hebrew | cp862, ISO-8859-8, MacHebrew, windows-1255 |
Icelandic | cp861, MacIceland |
Indian | MacGujarati, tscii |
Iranian | ISIRI3342 |
Japanese | EUC-JP, ISO-2022-JP, Shift_JIS |
Korean | EUC-KR |
Lao | cp1133 |
Nordic | cp865, ISO-8859-10 |
South European | ISO-8859-3 |
Tajik | KOI8-T |
Thai | cp874, ISO-8859-11, MacThai |
Turkish | cp1026, cp857, ISO-8859-9, MacTurkish, windows-1254 |
Unicode | sys-int, UTF-16BE, UTF-16LE, UTF-8 |
Vietnamese | VISCII, windows-1258 |
Western | cp437, cp500, cp850, cp860, cp863, IBM037, ISO-8859-1, ISO-8859-15, MacRoman, US-ASCII, windows-1252 |
Each charset is recognized by a number of aliases, because web servers may report the same charset under different names. For example, iso-8859-2, iso8859-2 and latin2 all denote the same charset. The search engine understands the following charset name aliases:
Table 7-2. Charset aliases
Charset | Aliases |
armscii-8 | armscii-8, armscii8 |
Big5 | big-5, big-five, big5, bigfive, cn-big5, csbig5 |
Big5-HKSCS | big5-hkscs, big5_hkscs, big5hk, hkscs |
cp1026 | 1026, cp-1026, cp1026, ibm1026 |
cp1133 | 1133, cp-1133, cp1133, ibm1133 |
cp437 | 437, cp437, ibm437 |
cp500 | 500, cp500, ibm500 |
cp775 | 775, cp775, ibm775 |
cp850 | 850, cp850, cspc850multilingual, ibm850 |
cp852 | 852, cp852, ibm852 |
cp855 | 855, cp855, ibm855 |
cp857 | 857, cp857, ibm857 |
cp860 | 860, cp860, ibm860 |
cp861 | 861, cp861, ibm861 |
cp862 | 862, cp862, ibm862 |
cp863 | 863, cp863, ibm863 |
cp864 | 864, cp864, ibm864 |
cp865 | 865, cp865, ibm865 |
cp866 | 866, cp866, csibm866, ibm866 |
cp866u | 866u, cp866u |
cp869 | 869, cp869, csibm869, ibm869 |
cp874 | 874, cp874, cs874, ibm874, windows-874 |
cp875 | 875, cp875, ibm875, windows-875 |
cp950 | 950, cp950, windows-950 |
EUC-JP | cseucjp, euc-jp, euc_jp, eucjp, ujis, x-euc-jp |
EUC-KR | cseuckr, euc-kr, euc_kr, euckr |
GB-18030 | gb-18030, gb18030 |
GB2312 | chinese, cn-gb, csgb2312, csiso58gb231280, euc-cn, euc_cn, euccn, gb2312, gb_2312-80, iso-ir-58 |
GBK | cp936, gbk, windows-936 |
geostd8 | geo8-gov, geostd8 |
IBM037 | 037, cp037, csibm037, ibm037 |
ISIRI3342 | isiri-3342, isiri3342 |
ISO-2022-JP | csiso2022jp, iso-2022-jp |
ISO-8859-1 | cp819, csisolatin1, ibm819, iso-8859-1, iso-ir-100, iso8859-1, iso_8859-1, iso_8859-1:1987, l1, latin1 |
ISO-8859-10 | csisolatin6, iso-8859-10, iso-ir-157, iso8859-10, iso_8859-10, iso_8859-10:1992, l6, latin6 |
ISO-8859-11 | iso-8859-11, iso8859-11, iso_8859-11, iso_8859-11:1992, tactis, thai, tis-620, tis620 |
ISO-8859-13 | iso-8859-13, iso-ir-179, iso8859-13, iso_8859-13, l7, latin7 |
ISO-8859-14 | iso-8859-14, iso-ir-199, iso8859-14, iso_8859-14, iso_8859-14:1998, l8, latin8 |
ISO-8859-15 | iso-8859-15, iso-ir-203, iso8859-15, iso_8859-15, iso_8859-15:1998, l9, latin0, latin9 |
ISO-8859-16 | iso-8859-16, iso-ir-226, iso8859-16, iso_8859-16, iso_8859-16:2000 |
ISO-8859-2 | csisolatin2, iso-8859-2, iso-ir-101, iso8859-2, iso_8859-2, iso_8859-2:1987, l2, latin2 |
ISO-8859-3 | csisolatin3, iso-8859-3, iso-ir-109, iso8859-3, iso_8859-3, iso_8859-3:1988, l3, latin3 |
ISO-8859-4 | csisolatin4, iso-8859-4, iso-ir-110, iso8859-4, iso_8859-4, iso_8859-4:1988, l4, latin4 |
ISO-8859-5 | csisolatincyrillic, cyrillic, iso-8859-5, iso-ir-144, iso8859-5, iso_8859-5, iso_8859-5:1988 |
ISO-8859-6 | arabic, asmo-708, csisolatinarabic, ecma-114, iso-8859-6, iso-ir-127, iso8859-6, iso_8859-6, iso_8859-6:1987 |
ISO-8859-7 | csisolatingreek, ecma-118, elot_928, greek, greek8, iso-8859-7, iso-ir-126, iso8859-7, iso_8859-7, iso_8859-7:1987 |
ISO-8859-8 | csisolatinhebrew, hebrew, iso-8859-8, iso-ir-138, iso8859-8, iso_8859-8, iso_8859-8:1988 |
ISO-8859-9 | csisolatin5, iso-8859-9, iso-ir-148, iso8859-9, iso_8859-9, iso_8859-9:1989, l5, latin5 |
KOI-7 | iso-ir-37, koi-7, koi7 |
KOI8-R | cskoi8r, koi8-r, koi8r |
KOI8-T | koi8-t, koi8t |
KOI8-U | koi8-u, koi8u |
MacArabic | macarabic |
MacCE | cmac, macce, maccentraleurope, x-mac-ce |
MacCroatian | maccroation |
MacCyrillic | maccyrillic, x-mac-cyrillic |
MacGreek | macgreek |
MacGujarati | macgujarati |
MacHebrew | machebrew |
MacIceland | macisland |
MacRoman | csmacintosh, mac, macintosh, macroman |
MacRomania | macromania |
MacThai | macthai |
MacTurkish | macturkish |
Shift_JIS | csshiftjis, ms_kanji, s-jis, shift-jis, shift_jis, sjis, x-sjis |
sys-int | sys-int |
tscii | tscii |
US-ASCII | ansi_x3.4-1968, ascii, cp367, csascii, ibm367, iso-ir-6, iso646-us, iso_646.irv:1991, us, us-ascii |
UTF-16BE | utf-16, utf-16be, utf16, utf16be |
UTF-16LE | utf-16le, utf16le |
UTF-8 | utf-8, utf8 |
VISCII | csviscii, viscii, viscii1.1-1 |
windows-1250 | cp-1250, cp1250, ms-ee, windows-1250 |
windows-1251 | cp-1251, cp1251, ms-cyr, ms-cyrl, win-1251, win1251, windows-1251 |
windows-1252 | cp-1252, cp1252, ms-ansi, windows-1252 |
windows-1253 | cp-1253, cp1253, ms-greek, windows-1253 |
windows-1254 | cp-1254, cp1254, ms-turk, windows-1254 |
windows-1255 | cp-1255, cp1255, ms-hebr, windows-1255 |
windows-1256 | cp-1256, cp1256, ms-arab, windows-1256 |
windows-1257 | cp-1257, cp1257, winbaltrim, windows-1257 |
windows-1258 | cp-1258, cp1258, windows-1258 |
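The alias lookup described by Table 7-2 can be pictured with a minimal Python sketch. It is only an illustration, not DataparkSearch's own code: the dictionary holds a small excerpt of the table, the function name `canonical_charset` is hypothetical, and the case-insensitive matching is an assumption of the sketch.

```python
# Excerpt of Table 7-2: canonical charset name -> known aliases.
ALIASES = {
    "ISO-8859-2": ["csisolatin2", "iso-8859-2", "iso-ir-101", "iso8859-2",
                   "iso_8859-2", "iso_8859-2:1987", "l2", "latin2"],
    "KOI8-R": ["cskoi8r", "koi8-r", "koi8r"],
    "UTF-8": ["utf-8", "utf8"],
    "windows-1251": ["cp-1251", "cp1251", "ms-cyr", "ms-cyrl", "win-1251",
                     "win1251", "windows-1251"],
}

# Invert the table once so that every alias points at its canonical name.
ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in ALIASES.items()
    for alias in aliases
}


def canonical_charset(name):
    """Return the canonical name for an alias, or None if it is unknown.

    Stripping whitespace and ignoring case are assumptions of this sketch.
    """
    return ALIAS_TO_CANONICAL.get(name.strip().lower())


print(canonical_charset("Latin2"))     # ISO-8859-2
print(canonical_charset("iso8859-2"))  # ISO-8859-2
```

Inverting the table up front keeps each lookup a single dictionary access, however many aliases a charset has.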
indexer recodes all documents to the character set specified by the LocalCharset command in your indexer.conf file. Internally, recoding is implemented using Unicode. Please note that if a character cannot be converted directly from one charset to another, DataparkSearch escapes it as an HTML numeric character reference (i.e. in the form &#NNN;, where NNN is the character's Unicode code). Thus, whichever LocalCharset you choose, you do not lose any information about the indexed documents, but the choice of LocalCharset affects the size of the database you get after indexing.
You may use the BrowserCharset command to choose the charset used to display search results. BrowserCharset may differ from LocalCharset; DataparkSearch will recode all data automatically.
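The fallback to numeric character references can be reproduced with Python's standard codecs. This is only a sketch of the idea, not the indexer's implementation: the function name `recode` and the sample charsets are illustrative assumptions.

```python
def recode(data, remote_charset, local_charset):
    """Recode bytes from one charset to another via Unicode.

    Characters that the target charset cannot represent are written as
    numeric character references of the form &#NNN; (decimal code point).
    """
    text = data.decode(remote_charset)
    return text.encode(local_charset, errors="xmlcharrefreplace")


# U+042F is the Cyrillic capital letter YA; it does not exist in ISO-8859-1.
source = "\u042f".encode("utf-8")

print(recode(source, "utf-8", "iso-8859-1"))  # b'&#1071;'
print(len(recode(source, "utf-8", "koi8-r")))  # 1
```

The same character takes seven bytes when escaped but a single byte in a charset that contains it, which is why the choice of LocalCharset influences the size of the database.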
indexer detects document character set in this order:
DataparkSearch has an automatic charset and language guesser. It currently recognizes more than 100 charsets and languages. Charset and language detection is implemented using the "N-Gram-Based Text Categorization" technique. There are a number of so-called "language map" files, one for each language-charset pair. They are installed under the /usr/local/dpsearch/etc/langmap/ directory by default. Look there for the list of currently provided charset-language pairs. The guesser works well for texts longer than 500 characters; shorter texts may not be guessed reliably.
To build your own language map, use the dpguesser utility. You also need to collect a file with language samples in the desired charset. To create a new language map, use the following command:
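The general idea behind N-gram-based text categorization can be sketched in a few lines of Python. This follows the commonly described form of the technique (ranked n-gram profiles compared with an "out-of-place" distance) and is not DataparkSearch's implementation or its language map file format. All function names, the profile size of 300 and the n-gram lengths 1 to 5 are assumptions of the sketch, and it works on decoded text, whereas a charset guesser would have to work on raw bytes.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top=300):
    """Return the `top` most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unknown n-grams get the maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def guess(text, profiles):
    """Return the name of the profile closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))
```

A profile built from a sample file plays the role of a language map here. Short texts produce few n-grams and therefore noisy rankings, which matches the remark that texts shorter than about 500 characters may not be guessed well.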
dpguesser -p -c charset -l language < FILENAME > language.charset.lm
You can also use the dpguesser utility to guess a document's language and charset using the existing language maps. To do this, use the following command:
dpguesser [-n maxhits] < FILENAME
Some languages may be written in several different charsets. To convert from one charset supported by DataparkSearch to another, use the dpconv utility.
dpconv [OPTIONS] -f charset_from -t charset_to [configfile] < infile > outfile
You may also specify the -e switch for dpconv to use HTML escape entities for input, and the -E switch to use them for output.
By default, both the dpguesser and dpconv utilities are installed into the /usr/local/dpsearch/sbin/ directory.
DataparkSearch can update language and charset maps automatically while indexing, if the remote server supplies an exactly specified language and charset with its pages. To enable this function, specify the following command in your indexer.conf file:
LangMapUpdate yes
By default, DataparkSearch uses only the first 8192 bytes of each indexed file to detect language and charset. You may change this value using the GuesserBytes command. Use a value of 0 to use all the text of the indexed document.
GuesserBytes 16384
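The meaning of the GuesserBytes value, including the special case of 0, can be summarized in a short Python sketch. The function name `guesser_sample` is hypothetical and the code only mirrors the rule stated above; it is not taken from the indexer.

```python
def guesser_sample(content, guesser_bytes=8192):
    """Return the part of a document that is passed to the guesser.

    A limit of 0 means "use the whole document"; otherwise only the
    first `guesser_bytes` bytes are used.
    """
    if guesser_bytes == 0:
        return content
    return content[:guesser_bytes]


document = b"x" * 20000
print(len(guesser_sample(document)))         # 8192
print(len(guesser_sample(document, 16384)))  # 16384
print(len(guesser_sample(document, 0)))      # 20000
```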
Use the RemoteCharset command in indexer.conf to choose the default charset of indexed servers.
You can set the default language for servers using the DefaultLang indexer.conf variable. This is useful when restricting search by URL language.
You may display search results in any charset supported by DataparkSearch. Use the BrowserCharset command in search.htm to select the charset for search results. This charset may differ from the LocalCharset specified. All recoding is done automatically.