Database encoding

IBM Director is a multilingual product that must store and retrieve database data for many different locales concurrently. The current Java Database Connectivity (JDBC) drivers do not support this for double-byte character set (DBCS) languages. We have encountered JDBC drivers that would not correctly store Unicode values greater than 0xFF, and several that would not accept values greater than 0x7F. This prevented the storage of Unicode characters in the database, and even of some characters that had been converted to the locale-specific code page.

To work around this problem, IBM Director encodes some of the strings stored in the database. Each character in the string to be inserted is checked against an upper limit: 0x7F for AS/400 and for non-1252 code page locales, and 0xFF otherwise. If even one character exceeds the limit, and therefore the capability of the JDBC driver, the entire string is converted to an encoding that is similar to UTF-8. The main difference between this encoding and UTF-8 is that the high-order (eighth) bit of each byte is never set. The following table describes the encoding.

Start character   End character   Required data bits   Binary byte sequence (x = data bits)
\u0000            \u003F          6                    00xxxxxx
\u0040            \u07FF          11                   010xxxxx 01xxxxxx
\u0800            \uFFFF          16                   0110xxxx 01xxxxxx 01xxxxxx
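The decision rule and the table can be sketched in Java as follows. This is a minimal illustration, not the code that IBM Director actually uses: the class name DirectorStyleEncoder and its methods are hypothetical, each encoded byte is assumed to be stored as one character with a value of 0x7F or less, and the two-byte form is assumed to cover every character up to \u07FF, the most that 11 data bits can hold.

```java
// Hypothetical sketch of the encoding described in the table above.
// It works on Java chars (UTF-16 code units), one at a time.
final class DirectorStyleEncoder {

    // Returns true if any character exceeds the driver limit.
    // Per the text, maxChar is 0x7F for AS/400 and non-1252 code page
    // locales, and 0xFF otherwise.
    static boolean needsEncoding(String s, int maxChar) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > maxChar) {
                return true;
            }
        }
        return false;
    }

    // Encodes the whole string; every output character is 0x7F or less.
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            if (c <= 0x3F) {
                // One byte: 00xxxxxx (6 data bits)
                out.append((char) c);
            } else if (c <= 0x7FF) {
                // Two bytes: 010xxxxx 01xxxxxx (11 data bits)
                out.append((char) (0x40 | (c >> 6)));
                out.append((char) (0x40 | (c & 0x3F)));
            } else {
                // Three bytes: 0110xxxx 01xxxxxx 01xxxxxx (16 data bits)
                out.append((char) (0x60 | (c >> 12)));
                out.append((char) (0x40 | ((c >> 6) & 0x3F)));
                out.append((char) (0x40 | (c & 0x3F)));
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, the single character \u00E9 becomes the two characters Ci (0x43 0x69). Once a string is selected for encoding, even plain ASCII letters at or above \u0040 expand to two bytes, so A becomes AA.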

This encoding is important to understand if a third-party tool is used to query the database, because the strings that such queries return might need to be decoded. Most of the information in the database can be represented in code page 1252 and is therefore not encoded with this scheme. Only string data containing characters that exceed the capability of the current JDBC drivers is encoded.
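A third-party tool could reverse the encoding along the following lines. This is a sketch under stated assumptions rather than a supported routine: the class name DirectorStyleDecoder and its methods are hypothetical, each encoded byte is assumed to arrive as one character with a value of 0x7F or less, and, because this section does not say how an encoded string is distinguished from an unencoded one, the caller is assumed to know already that the value is encoded.

```java
// Hypothetical decoder for the byte sequences listed in the encoding table.
// The input must be read from the start, because continuation bytes
// (01xxxxxx) share their bit patterns with some lead bytes.
final class DirectorStyleDecoder {

    // Returns the 6 data bits of the continuation byte at index i.
    private static int dataBits(String s, int i) {
        if (i >= s.length()) {
            throw new IllegalArgumentException("truncated sequence");
        }
        int b = s.charAt(i);
        if (b < 0x40 || b > 0x7F) {
            throw new IllegalArgumentException("bad continuation byte at index " + i);
        }
        return b & 0x3F;
    }

    static String decode(String encoded) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < encoded.length()) {
            int lead = encoded.charAt(i);
            if (lead <= 0x3F) {
                // 00xxxxxx: the byte is the character itself
                out.append((char) lead);
                i += 1;
            } else if (lead <= 0x5F) {
                // 010xxxxx 01xxxxxx: 5 + 6 data bits
                out.append((char) (((lead & 0x1F) << 6) | dataBits(encoded, i + 1)));
                i += 2;
            } else if (lead <= 0x6F) {
                // 0110xxxx 01xxxxxx 01xxxxxx: 4 + 6 + 6 data bits
                out.append((char) (((lead & 0x0F) << 12)
                        | (dataBits(encoded, i + 1) << 6)
                        | dataBits(encoded, i + 2)));
                i += 3;
            } else {
                throw new IllegalArgumentException("invalid lead byte at index " + i);
            }
        }
        return out.toString();
    }
}
```

With this sketch, the stored value Ci decodes to the single character \u00E9, and a value that ends in the middle of a multi-byte sequence is rejected rather than silently truncated.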