Utrans is a file conversion program. It can convert a text file created with any 8-bit encoding, such as one of the various ISO-8859 standards, into UTF-8, the Unicode encoding that is the standard on the Internet.
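To make the target encoding concrete, here is a minimal Python sketch of how a single Unicode code point becomes UTF-8 bytes. It is not taken from utrans or libutf-8; the function name utf8_encode is hypothetical, and the sketch assumes the modern range up to U+10FFFF rather than the older 31-bit form.

```python
def utf8_encode(cp):
    """Encode one Unicode code point (0..0x10FFFF) as UTF-8 bytes."""
    if cp < 0x80:
        # 7 bits: a single byte, identical to ASCII.
        return bytes([cp])
    if cp < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:
        # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")
```

Every byte value below 0x80 passes through unchanged, which is why plain ASCII text is already valid UTF-8.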
It can be used as a stand-alone program, or as a filter (pipe). To build it under Unix, simply cd to its directory and type make. However, you will need utf-8.h in your include path, and libutf-8.so in your lib path or in /usr/local/lib. If you do not have these files, you can obtain them from www.whizkidtech.net/i18n/.
If you are using a Unix system which does not support .so shared libraries, edit the libutf-8 Makefile to create libutf-8.a instead, and the utrans Makefile to link with libutf-8.a instead of libutf-8.so.
utrans [-v] [-p pagefile] [-i inputfile] [-o outputfile] [-b] [-t] [-e]

-v: Verbose mode on. Default: off.

-p pagefile: The pagefile contains a translation table mapping some 8-bit encoding to Unicode (16-bit or 31-bit) values. Default: the file named by the environment variable UTRANS. If no such variable exists and no pagefile is specified, utrans assumes the input file uses ISO-8859-1 encoding.

-i inputfile: The 8-bit text file to convert. Default: stdin.

-o outputfile: The UTF-8 output file. Default: stdout.

-b: The pagefile is binary. This is the default.

-t: The pagefile is plain text. Default: binary.

-e: Ignore the environment. Default: ${UTRANS} names the pagefile and ${CHARMAPS} is the path to the pagefile.

Example: If UTRANS = ISO-8859-2 and CHARMAPS = /usr/local/utrans/charmaps, then pagefile = /usr/local/utrans/charmaps/ISO-8859-2. If -p CP1250 is also given, then pagefile = /usr/local/utrans/charmaps/CP1250.
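The pagefile lookup described above can be sketched in Python as follows. This illustrates the documented rules only and is not utrans source code: the function name resolve_pagefile and its parameters are hypothetical. The behavior when CHARMAPS is unset, or when -e is combined with -p (the name is used as given), is an assumption.

```python
import posixpath
import os


def resolve_pagefile(p_option=None, ignore_env=False, env=None):
    """Return the pagefile path, or None for the built-in ISO-8859-1 default."""
    # -e: behave as if neither UTRANS nor CHARMAPS were set (assumption).
    if ignore_env:
        env = {}
    elif env is None:
        env = os.environ
    # -p overrides the name taken from UTRANS.
    name = p_option or env.get("UTRANS")
    if not name:
        return None  # no pagefile at all: input is treated as ISO-8859-1
    # CHARMAPS, when set, is the directory that holds the pagefile.
    charmaps = env.get("CHARMAPS")
    return posixpath.join(charmaps, name) if charmaps else name
```

With UTRANS = ISO-8859-2 and CHARMAPS = /usr/local/utrans/charmaps in env, resolve_pagefile(env=env) gives the first path from the example, and resolve_pagefile("CP1250", env=env) gives the second.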
If a binary pagefile is used (default), the file must contain exactly 256 unsigned integers representing the required mapping. The size and byte order of an unsigned integer are system specific. Use mbm (make binary map) to create it from a text file (in the format described below).
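As a rough illustration of that layout, the Python sketch below stores a 256-entry table as native unsigned integers and uses it to convert bytes to UTF-8. It is not the mbm or utrans implementation; all function names are hypothetical. array("I") is used because it follows the platform's own unsigned int size and byte order, which mirrors the system-specific caveat above.

```python
from array import array

TABLE_SIZE = 256  # one entry per possible 8-bit value


def identity_table():
    """Default mapping: byte value n maps to code point n (ISO-8859-1)."""
    return array("I", range(TABLE_SIZE))


def table_to_bytes(table):
    """Serialize the table as 256 native unsigned ints."""
    if len(table) != TABLE_SIZE:
        raise ValueError("a pagefile needs exactly 256 entries")
    return table.tobytes()


def table_from_bytes(data):
    """Rebuild the table; reject data that is not exactly 256 entries long."""
    table = array("I")
    if len(data) != TABLE_SIZE * table.itemsize:
        raise ValueError("unexpected pagefile size")
    table.frombytes(data)
    return table


def convert(data, table):
    """Map each input byte through the table and encode the result as UTF-8."""
    return "".join(chr(table[b]) for b in data).encode("utf-8")
```

Because the size check depends on the local unsigned int, a table written on one platform may be rejected or misread on another, which is the reason to regenerate binary maps from the text form when in doubt.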
This package comes with sample maps (binary pagefiles) created on the 32-bit Intel platform.
I just did a foreach command on the /compat/linux/usr/share/i18n/charmaps/ directory with mbm, but did not examine each of the source files. If any of them was not in the format expected by mbm, it was probably not converted properly. However, I did examine randomly selected files, and they all were in the right format. Nevertheless, I do not guarantee they are all correct. Anyway, mbm is enclosed... Just do:
mbm < textmap > binarymap
Note that mbm is a very simple program which reads from stdin and writes to stdout. You will probably use it only once, if even that, so there was not much sense adding a nicer interface to it...
The text pagefile contains pairs of hexadecimal values representing the 8-bit character mapping of the input file to the respective Unicode values.
Utrans understands two different formats of the text pagefile.
In the first format, each 8-bit value must be preceded by =, with no intervening space. Each Unicode value must be preceded by U+, again with no intervening space.
There may be other text inside the file. Utrans will ignore it.
I have chosen this format because a number of such files are available from Roman Czyborra’s Alphabet Soup. Simply find your encoding, and click on TXT to download the file.
Here are several sample lines from his iso8859-2.txt:
=2F U+002F SOLIDUS
=30 U+0030 DIGIT ZERO
=31 U+0031 DIGIT ONE
These lines actually do nothing as they are the same as the default. You may delete them safely, but you do not have to.
=A1 U+0104 LATIN CAPITAL LETTER A WITH OGONEK
=A2 U+02D8 BREVE
=A3 U+0141 LATIN CAPITAL LETTER L WITH STROKE
These lines, on the other hand, redefine the default and are, therefore, quite necessary (let’s see if your browser can display them: Ą ˘ Ł).
The descriptions (i.e., SOLIDUS, DIGIT ZERO...) are ignored. Again, you may delete them, but you do not have to.
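A reader for this =XX U+XXXX format might look like the following Python sketch. It is an illustration under stated assumptions, not the parser used by utrans or mbm: the name parse_text_map is hypothetical, the pair is assumed to be separated by whitespace, and the table is assumed to start out as the ISO-8859-1 identity mapping.

```python
import re

# "=XX" is the 8-bit value, "U+XXXX" the Unicode value, both hexadecimal.
PAIR = re.compile(r"=([0-9A-Fa-f]{2})\s+U\+([0-9A-Fa-f]{4,8})")


def parse_text_map(text):
    """Return a 256-entry list of code points built from =XX U+XXXX pairs."""
    table = list(range(256))  # default: byte n maps to code point n
    # Anything that does not match a pair (descriptions, comments) is skipped.
    for byte_hex, code_hex in PAIR.findall(text):
        table[int(byte_hex, 16)] = int(code_hex, 16)
    return table
```

Lines such as =2F U+002F leave the table unchanged, while =A1 U+0104 replaces entry 0xA1, matching the remarks above about which lines are redundant.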
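In the second format, the 8-bit value is written as /x followed by two hexadecimal digits, and the Unicode value as <U followed by the hexadecimal digits and a closing >. A number of files in this format are at ftp://dkuug.dk/i18n/charmaps/ and, on many Linux installations, in the /usr/share/i18n/charmaps/ directory. FreeBSD installs them to the /compat/linux/usr/share/i18n/charmaps/ directory.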
Examples (from the file named HEBREW found on the ftp site):
<//> /x2F <U002F> SOLIDUS
<0> /x30 <U0030> DIGIT ZERO
<1> /x31 <U0031> DIGIT ONE
<=2> /xDF <U2017> DOUBLE LOW LINE
<A+> /xE0 <U05D0> HEBREW LETTER ALEF
<B+> /xE1 <U05D1> HEBREW LETTER BET
Now let’s really see what your browser can do: ☺ ‗ א ב. By the way, your browser should have displayed a smiling face (U+263A) after the colon.
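The charmap-style lines shown in the HEBREW examples can be read with a similar Python sketch. As before, this is an assumption-laden illustration rather than the utrans parser: parse_charmap is a hypothetical name, and only the /xXX and <UXXXX> fields are used, with the leading symbolic name and trailing description ignored.

```python
import re

# "/xXX" is the 8-bit value, "<UXXXX>" the Unicode value, both hexadecimal.
CHARMAP_PAIR = re.compile(r"/x([0-9A-Fa-f]{2})\s+<U([0-9A-Fa-f]{4,8})>")


def parse_charmap(text):
    """Return a 256-entry list of code points built from /xXX <UXXXX> pairs."""
    table = list(range(256))  # default: byte n maps to code point n
    for byte_hex, code_hex in CHARMAP_PAIR.findall(text):
        table[int(byte_hex, 16)] = int(code_hex, 16)
    return table
```

Symbolic names such as <=2> or <A+> never match the pattern, because the pattern requires /x immediately before the byte value and <U immediately before the code point.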
You can freely mix and match the two formats.
You may use utrans on different input files created with different 8-bit mappings and concatenate the results using the Unix cat(1) command, or you can just append the result of each conversion to the output file using >>.
utrans -i iso1file -o output.html
utrans -p iso8859-2.txt -i iso2file >> output.html
utrans -p cp1252.txt -i winfile | tuc >> output.html
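Appending works because each conversion produces complete UTF-8 sequences, so joining the outputs byte for byte still yields valid UTF-8. The Python sketch below shows the idea using Python's built-in codecs as a stand-in for utrans and its pagefiles. The function name append_as_utf8 is hypothetical, and the codec names are Python's, not utrans options.

```python
def append_as_utf8(chunks):
    """Convert (raw_bytes, codec_name) pairs and join the UTF-8 results."""
    # Each piece is decoded with its own 8-bit codec, then re-encoded as UTF-8.
    return b"".join(raw.decode(codec).encode("utf-8") for raw, codec in chunks)


combined = append_as_utf8([
    (b"caf\xe9 ", "iso-8859-1"),        # Latin-1 input
    (b"\xa3\xf3d\xbc ", "iso-8859-2"),  # Latin-2 input
    (b"\x93hi\x94", "cp1252"),          # Windows-1252 input
])
```

The same byte value can mean different characters in each source file, which is why every input needs its own mapping before the results are joined.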
You may convert DOS (or Windows) input files to Unix output files (or the other way) using tuc, which you can download from ftp.whizkidtech.net/unix/tuc/ (both Unix and Windows versions).