Please contact Tendy Liu (LiuYanLu@cn.ibm.com) if you find any mistake or have any comment. A summary of all non-minor changes can be found in the Change History section located at the end of this document.

(C) Copyright IBM Corp. 1997 to 2002. All rights reserved.


Code Page Considerations in Web Client/Server Interactions


Table of Contents

Introduction

Scenario 1
The web browser requests an HTML file from the server, and displays the HTML file with entry fields created by the FORM element.

Scenario 2
The user enters some data into the HTML FORM entry fields and clicks the Submit button. The browser formats the input data into a message and sends it to the web server via either the [HTTP] GET or POST method.

Using The GET or POST (ENCTYPE="application/x-www-form-urlencoded") Method

Using The POST (ENCTYPE="multipart/form-data") Method

Recommendations

Java Servlet Development Kit (JSDK) "bug"

See also the I18N/L10N web pages at the W3C web site (www.w3.org/International/Overview.html).


Introduction

The following scenario illustrates a typical client-server interaction using HTML and the web:

Source: Publish Dynamic Applications on the Web by David R. McClanahan, Databased Web Advisor, April 1997


Scenario 1

The web browser requests an HTML file from the server, and displays the HTML file with entry fields created by the FORM element.

Questions

Answers

When the web browser requests a specific HTML file from the server via the HTTP 1.1 protocol (note that HTTP 1.0 would not work here), the browser can tell the server, during the initial content negotiation phase, which language(s) and code page(s) it can accept.

E.g.

Client  GET foo.html HTTP/1.1
Accept-Language: zh, en;q=0.5
Accept-Charset: big5, x-euc-tw;q=0.5
Accept: */*

Server 200 OK
Content-Type: text/html; charset=big5
Content-Language: zh
Content-Length: 1042
... data ...

The Accept-Language header tells the server which language(s) are acceptable to the browser, and the Accept-Charset header tells the server which code page(s) are acceptable. In the above example, the web browser asks the server to send it the HTML file called foo.html written in Chinese (although English is acceptable as an alternative). The HTML file should be encoded in the Big5 code page (although the EUC-TW code page is also acceptable as an alternative). If the server cannot satisfy the browser's specified language(s) or code page(s), it should send an error response with the 406 (Not Acceptable) status code, though sending an unacceptable response is also allowed. If no Accept-Language header is present, any language is acceptable. If no Accept-Charset header is present, any code page is acceptable.
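
The selection described above can be sketched in Java as follows. This is a minimal illustration, not the behavior of any particular web server: the class name AcceptCharsetSketch and the method choose are hypothetical, and the sketch ignores the HTTP 1.1 rule that gives ISO-8859-1 an implicit weight when it is not listed.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: picks a code page from an Accept-Charset header value. */
final class AcceptCharsetSketch {

    /**
     * Returns the supported charset with the highest q-value, or null when
     * nothing the client listed is supported (the 406 case in the text).
     * A missing header means any charset is acceptable, so the server's
     * first supported charset is returned.
     */
    static String choose(String acceptCharset, List<String> supported) {
        if (supported.isEmpty()) {
            return null;
        }
        if (acceptCharset == null || acceptCharset.trim().isEmpty()) {
            return supported.get(0);
        }
        String best = null;
        double bestQ = 0.0;
        for (String item : acceptCharset.split(",")) {
            String[] parts = item.trim().split(";");
            String name = parts[0].trim().toLowerCase(Locale.ROOT);
            double q = 1.0; // default weight when no q parameter is given
            for (int i = 1; i < parts.length; i++) {
                String p = parts[i].trim();
                if (p.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(p.substring(2));
                    } catch (NumberFormatException e) {
                        q = 0.0; // treat an unreadable weight as "not acceptable"
                    }
                }
            }
            if (q <= bestQ) {
                continue; // cannot beat the current best candidate
            }
            for (String s : supported) {
                if (name.equals("*") || name.equalsIgnoreCase(s)) {
                    best = s;
                    bestQ = q;
                    break;
                }
            }
        }
        return best;
    }
}
```

For the request shown earlier, choose("big5, x-euc-tw;q=0.5", ...) prefers big5 when the server supports it, falls back to x-euc-tw when it does not, and returns null when neither is available.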

Note that zh is used to denote the Chinese language, but does not specify whether it is Simplified Chinese or Traditional Chinese. This means the user is comfortable with any form of Chinese. The user could have set the browser's preferred language to zh-CN (for Simplified Chinese) or zh-TW (for Traditional Chinese).

To see the Accept-Language in action, set your browser such that English is your first preference. (In Netscape Navigator v7, you can do this via the Edit-->Preferences-->Navigator-->Languages interface.) Then clear the browser's memory and disk caches, and go to http://www.alis.com. You'll see their home page in English. Now set your browser such that French is your first preference, clear the caches again and reload the Alis home page. You'll now see their home page in French!

In the first case, Navigator is sending to www.alis.com the HTTP header

	Accept-Language: en,fr
and in the second case, Navigator is sending
	Accept-Language: fr,en

For Navigator v7.0 running on Windows XP, it generates

   Accept-Charset: UTF-8,*       

by default, but you can change this default via Edit-->Preferences-->Navigator-->Languages-->Default Character Encoding.

Internet Explorer v6.0 running on Windows XP does not generate any Accept-Charset header, even when it is configured to use HTTP 1.1.

The server typically would store different language versions of the same HTML file, but would not store the same HTML file encoded in different code pages multiple times. See HTML Documents Coded Character Sets Guidelines for a list of recommended code pages used to encode the HTML file, depending on the language. Suppose the Accept-Charset specifies code page X but the HTML file is encoded in code page Y, the server should perform a code page conversion of the HTML file from Y to X on the fly prior to sending it back to the requesting browser.
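
As a rough sketch of such an on-the-fly conversion, the Java fragment below decodes the stored bytes from code page Y and re-encodes them in code page X. The class and method names are hypothetical. Note that String.getBytes silently substitutes a replacement byte for any character that code page X cannot represent, so a production server would need to check for that.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: converts file content from one code page to another. */
final class TranscodeSketch {

    /** Decodes 'data' using code page 'from', then re-encodes it using 'to'. */
    static byte[] transcode(byte[] data, Charset from, Charset to) {
        String text = new String(data, from); // bytes -> Unicode
        return text.getBytes(to);             // Unicode -> target code page
    }

    public static void main(String[] args) {
        // X'A3' is the pound sterling symbol in ISO-8859-1.
        byte[] latin1 = {(byte) 0xA3};
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

For East Asian code pages such as Big5, Charset.forName("Big5") can be passed instead, assuming the Java runtime in use ships that charset.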

According to the HTML 4 specification, the web browsers (user agents in HTML terminology) must NEVER assume any default encoding code page, and should use the following algorithms (in decreasing priority order) to determine the encoding code page of the HTML file:

  1. The browser already knew for sure the code page used to encode the HTML file.

  2. If the HTTP 1.1 (RFC 2616) protocol was used, the server reports the encoding code page of the HTML file in the charset parameter of the Content-Type header.

    E.g.

    Server  200 OK
    Content-Type: text/html; charset=big5
    Content-Language: zh
    Content-Length: 1042
    ... data ...

    Here is a direct quote from Section 3.4.1 of RFC 2616:

    HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document.

    How to make the server send out the appropriate charset information? See www.w3.org/International/O-HTTP-charset.html for some answers.

  3. The content author can specify the charset inside the HTML file via the META element.

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    <HTML>
    <HEAD>
    <META http-equiv="Content-Type" content="text/html; charset=big5">
    <TITLE>...</TITLE>
    :
    </HTML>

    The META declaration must only be used when the character encoding is organized such that US-ASCII characters stand for themselves at least until the META element is parsed.

    Also if the server needs to convert the HTML file into a different code page prior to sending it to the browser, the server should also update the META element's charset value accordingly.

  4. The charset attribute on an HTML element that designates an external resource.

    E.g. <A href="/zh/tw/foo.html" charset="big5">...</A>

  5. Heuristic algorithms such as those used to determine the various Japanese encodings.

  6. User definable.

    The browser respects whatever the user has selected in the Options: Document Encoding menu.
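
The priority order in the list above can be sketched in Java as "take the first source that yields an answer". The class EncodingPrioritySketch and its methods are hypothetical, and the regular expression is a simplification that works for both a Content-Type header value and the content attribute of a META element; a real browser parses these far more carefully.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: resolves a document's code page from prioritized sources. */
final class EncodingPrioritySketch {

    // Matches charset=big5, charset="big5" and charset='big5'.
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"'>]+)", Pattern.CASE_INSENSITIVE);

    /** Extracts the charset parameter from a Content-Type value, or returns null. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the first non-null candidate; pass candidates in priority order. */
    static String firstKnown(String... candidates) {
        for (String c : candidates) {
            if (c != null) {
                return c;
            }
        }
        return null;
    }
}
```

A caller would pass the six sources in the order of the list, for example firstKnown(alreadyKnown, charsetOf(httpContentType), charsetOf(metaContent), linkCharsetAttribute, heuristicGuess, userSetting), where all six variable names are placeholders.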

XHTML v1.0 compatible web browsers issue:

Traditionally, the character encoding of an HTML document is either specified by a web server via the charset parameter of the HTTP Content-Type header, or via a meta element in the document itself. In an XML document, the character encoding of the document is specified on the XML declaration (e.g., <?xml version="1.0" encoding="gb2312"?>). To present documents with specific character encodings portably, the best approach is to ensure that the web server provides the correct headers. If this is not possible, an XHTML document that wants to set its character encoding explicitly must include both an XML declaration with an encoding declaration and a meta http-equiv statement (e.g., <meta http-equiv="Content-type" content="text/html; charset=gb2312" />). In XHTML-conforming user agents, the value of the encoding declaration of the XML declaration takes precedence.
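
A small Java sketch of the precedence rule follows. It assumes the start of the document has already been read as ASCII-compatible text; the class XmlDeclSketch and both method names are hypothetical, and the regular expression handles only the common form of the XML declaration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reads the encoding from an XML declaration. */
final class XmlDeclSketch {

    private static final Pattern ENCODING = Pattern.compile(
        "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the declared encoding, or null if there is no encoding declaration. */
    static String encodingOf(String documentStart) {
        Matcher m = ENCODING.matcher(documentStart);
        return m.find() ? m.group(1) : null;
    }

    /** The XML declaration wins; the meta charset is only a fallback. */
    static String effective(String xmlDeclEncoding, String metaCharset) {
        return xmlDeclEncoding != null ? xmlDeclEncoding : metaCharset;
    }
}
```

For example, effective(encodingOf(text), metaCharset) yields the XML declaration's value whenever one is present, and the meta element's value otherwise.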

Sidebar:

For a list of charset names recognized by the Netscape browser, see How to Specify Charset in HTML by Frank Yung-Fong Tang of Netscape.

For dynamic forms generated by the server (say from a CGI script or Server-Side Include function), the server can tell the browser what code page(s) it can accept via the Accept-Charset attribute of the FORM element.

E.g. <FORM Accept-Charset="big5, x-euc-tw" Type= ...>

In the above example, the server tells the browser it can only accept data encoded in either Big 5 or EUC-TW, but not both. The browser should then act accordingly to ensure all future user data sent to the server are encoded in either Big 5 or EUC-TW code page.

Note that each HTML file is limited to a single code page, so make sure the code page used supports all the characters in the file. Unicode (UTF-8) is a good choice if you have multilingual data in the HTML file. Even if the HTML file contains only one language and script, UTF-8 is still the encoding of choice because all major browsers support it. Otherwise, an American user may, for example, need to download and install Chinese Big5 support to browse a web page encoded in Big5.


Scenario 2

The user enters some data into the HTML FORM entry fields and clicks the Submit button. The browser formats the input data into a message and sends it to the web server via either the [HTTP] GET or POST method.

Using The GET or POST (ENCTYPE="application/x-www-form-urlencoded") Method

Both the GET method and the POST method with ENCTYPE="application/x-www-form-urlencoded" (the default) encode the user data using a restricted subset of the 7-bit US-ASCII code page. GET appends the encoded data to the URL of the application (as specified in the ACTION attribute), while POST sends it in the body of the request. Since HTML 4 and RFC 1738 allow only ASCII characters, browsers encode non-ASCII characters using escape sequences %HH...%HH, where HH are two hexadecimal digits.

E.g. The two double-byte Kanji characters that denote "Japan" appear in the URL-encoded string as %93%FA%96%7B when encoded using IBM PC code page 932.
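
The escaping itself can be sketched in Java as follows. The sketch takes bytes that have already been encoded in the browser's code page, so it does not depend on which code pages the Java runtime supports. The class PercentEncodeSketch is hypothetical, and the set of characters left unescaped is an assumption modelled on common browser behavior.

```java
/** Hypothetical helper: applies %HH escaping to already-encoded bytes. */
final class PercentEncodeSketch {

    private static final String HEX = "0123456789ABCDEF";

    /**
     * ASCII letters, digits and - _ . * pass through, a space becomes '+',
     * and every other byte becomes %HH.
     */
    static String encode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean plain = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '_' || v == '.' || v == '*';
            if (plain) {
                sb.append((char) v);
            } else if (v == ' ') {
                sb.append('+');
            } else {
                sb.append('%').append(HEX.charAt(v >> 4)).append(HEX.charAt(v & 0x0F));
            }
        }
        return sb.toString();
    }
}
```

Feeding it the four code page 932 bytes X'93', X'FA', X'96', X'7B' for "Japan" produces the escape sequence from the example. Notice that nothing in the output records which code page the bytes came from.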

Question

How does the browser let the server's application know what code page was used to encode the data, especially the %HH code points?

Answers

When an HTML file with a charset defined (via the META element) contains a FORM, the FORM data is submitted in that charset. However, the input text fields (<INPUT TYPE=TEXT> and <TEXTAREA>) are handled by the native platform, and the result can be confusing. Consider the following scenario:

On a system running ISO-8859-1, use the browser to open an HTML file with a FORM. The META element in the file specifies the charset to be ISO-8859-2. The file contains the code point X'A3', which the browser properly displays as the Latin capital letter L with stroke. The user enters the code point X'A3' via an input method editor (IME) into the input field, but now the pound sterling symbol appears on the screen. How is it that the same code point X'A3' appears as the Latin capital letter L with stroke outside the input fields and as the pound sterling symbol inside an input field?

The reason for the above behavior is that the system is responsible for the input and output of the input text fields, and the browser is responsible for the rest of the HTML file. Thus the system interprets the X'A3' code point using ISO-8859-1 (which is the pound sterling symbol), while the browser interprets the X'A3' code point using ISO-8859-2 (which is the capital L with stroke character). Just before sending the data to the server, however, the browser will convert the input to ISO-8859-2 in this case.
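
The two interpretations of X'A3' can be reproduced with a few lines of Java. This is only an illustration of the code page difference, not of what the browser does internally. The class name is hypothetical, and the sketch assumes the Java runtime provides the ISO-8859-2 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: one byte value, two code pages, two characters. */
final class SameByteTwoCharsets {

    /** Decodes a single byte value using the given code page. */
    static String decode(int byteValue, Charset cs) {
        return new String(new byte[] {(byte) byteValue}, cs);
    }

    public static void main(String[] args) {
        Charset latin2 = Charset.forName("ISO-8859-2"); // assumed to be available
        String asLatin1 = decode(0xA3, StandardCharsets.ISO_8859_1);
        String asLatin2 = decode(0xA3, latin2);
        // Print the Unicode code points rather than the glyphs, so the result
        // does not depend on the console's own code page.
        System.out.printf("ISO-8859-1: U+%04X%n", (int) asLatin1.charAt(0));
        System.out.printf("ISO-8859-2: U+%04X%n", (int) asLatin2.charAt(0));
    }
}
```

U+00A3 is the pound sterling symbol and U+0141 is the Latin capital letter L with stroke.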

Unfortunately there is no architected or standardized mechanism to communicate the code page information. Sue Williams/Somers/IBM@IBMUS has researched this problem further by looking at the specifications of HTML 3.2, HTML 4.0, and HTTP 1.1, plus Microsoft and Netscape web sites. She specifically looked for information on how the code page encoding via the FORM GET method is supposed to be handled in HTML 3.2 and HTML 4.0. She has also looked at the Netscape suggestion of coding a hidden field to specify the encoding code page for the GET method. She did not find any definitive statement as to how the browser is supposed to handle the encoding of FORM data sent back to the server. By reading between the lines and some experiments, she concluded that both Microsoft IE and Netscape Navigator encode the FORM data using the same character encoding specified in the META element of the HTML.

Upon further experimentation, the browser was found to encode the FORM data using the currently active character encoding setting. For example, if the HTML file has a meta tag that states it is encoded in ISO 8859-1, the FORM data will also be sent encoded in ISO 8859-1. If the user changes the browser's encoding to, say, ISO 8859-2 while viewing the HTML file, the FORM data will then be sent in ISO 8859-2. You cannot know in advance in what language or script the user will enter the FORM data: a Chinese customer may, for example, enter his name in Chinese into the Name field, while a French customer may enter her name in French into the same Name field. This is why using UTF-8 as the HTML file encoding is a good choice, as it preserves the integrity of all the major scripts of the world.
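
The point about UTF-8 can be checked with Java's CharsetEncoder.canEncode. In this sketch the class name and the sample names are purely illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: tests whether a code page can represent some text. */
final class CanEncodeSketch {

    /** Returns true if every character of 'text' exists in code page 'cs'. */
    static boolean canRepresent(String text, Charset cs) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String french = "Fran\u00E7oise"; // a name with c-cedilla
        String chinese = "\u4E16\u754C";   // two Chinese characters
        System.out.println(canRepresent(french, StandardCharsets.ISO_8859_1));  // true
        System.out.println(canRepresent(chinese, StandardCharsets.ISO_8859_1)); // false
        System.out.println(canRepresent(french, StandardCharsets.UTF_8));       // true
        System.out.println(canRepresent(chinese, StandardCharsets.UTF_8));      // true
    }
}
```

A form served in ISO 8859-1 cannot carry the Chinese name intact, whereas a form served in UTF-8 carries both.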

  1. Scenario A:

  2. Scenario B:

  3. Scenario C:

The server processes the data by calling the appropriate application, and sends the MIME results back to the browser. The invoked application can be a CGI-BIN program written in Perl say, or a Java servlet.

Since the server's active code page may be (and will be) different from the code page used to encode the browser's submitted data, some servers will automatically convert the browser's submitted data to the server's code page before giving them to the invoked server application. If the data is in the form of application/x-www-form-urlencoded, then the invoked application must decode the data in order to retrieve the original name-value pairs using the following steps:

  1. Convert the data back into the web browser's code page.
  2. Decode the data--such as search for the "&" character (which acts as the name-value pairs delimiter) and the "=" character (which separates a name from its value), and converting the + character back into space.
  3. Convert the decoded name-value pairs into the server's code page and then process the data.

See CGI Form data processing on Host environment for details and some sample C code.
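
For comparison with the C approach, here is a minimal Java sketch of the decoding steps. It assumes step 1 has already been done, that is, the raw data is available as an ASCII string, and it folds step 3 into decoding the unescaped bytes into a Java (Unicode) String. The class FormDecodeSketch and its methods are hypothetical, and the code page must be supplied by the caller because the data itself does not carry it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: decodes application/x-www-form-urlencoded data. */
final class FormDecodeSketch {

    /** Turns '+' and %HH escapes back into raw bytes, then decodes them with 'cs'. */
    static String unescape(String s, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '+') {
                out.write(' ');
            } else if (c == '%' && i + 2 < s.length()
                    && Character.digit(s.charAt(i + 1), 16) >= 0
                    && Character.digit(s.charAt(i + 2), 16) >= 0) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                out.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c); // plain ASCII, or a malformed escape kept as-is
            }
        }
        return new String(out.toByteArray(), cs);
    }

    /** Splits on '&' and '=' first, then unescapes each name and value. */
    static Map<String, String> parse(String data, Charset cs) {
        Map<String, String> result = new LinkedHashMap<>();
        if (data == null || data.isEmpty()) {
            return result;
        }
        for (String pair : data.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            result.put(unescape(name, cs), unescape(value, cs));
        }
        return result;
    }
}
```

Splitting before unescaping matters: an escaped %26 or %3D inside a value must not be mistaken for a delimiter.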

In MS IE version 5.0 and above, under Internet Options-->Advanced, there is an item called Always send URLs as UTF-8, which is checked by default. If the HTML file header also contains a META element specifying the charset as UTF-8, or no charset is defined, then the browser uses UTF-8 to encode the data. Remember that regardless of whether a charset is received from the server, the browser always uses the current encoding setting (View-->Character Coding) to send the FORM data back to the server.

How does IBM WebSphere Application Server (WAS) solve the problem?

WAS contains a file called bootstrap.properties located in the AppServer/properties directory, whose content is initialized by the customer webmaster. During the initialization of servlet engine, the servlet engine configures/bootstraps itself using the information in this file.

For URL-encoded FORM data, the Java servlets will need to decode the data and then convert them to Unicode. The Java Servlet Development Kit (Servlet Specification Version 2.2) from Sun always assumes ISO/IEC 8859-1 is used to encode the FORM data (see Java Servlet Development Kit "bug" below for more information). WAS uses the following algorithm to detect the encoding code page:

  1. Check the default.client.encoding entry in the bootstrap.properties file.
  2. If the entry exists, then its value denotes the encoding code page. Otherwise,
  3. Check the Accept-Charset in the HTTP protocol. If it exists, then its value denotes the encoding code page. Otherwise,
  4. Check the Accept-Language in the HTTP protocol. If it exists, then the default code page for the Accept-Language denotes the encoding code page. (WAS has an internal table that maps each language to the most popular PC code page.) Otherwise,
  5. The encoding code page is assumed to be the file.encoding value returned by the JVM.
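
The fallback chain above can be sketched in Java as follows. This is not WAS code: the class ClientEncodingSketch is hypothetical, the language table contains a few made-up illustrative entries (the real internal WAS table is not reproduced in this document), and taking only the first value listed in each header is an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of a configured-value-first encoding detection chain. */
final class ClientEncodingSketch {

    /** Illustrative entries only; not the actual WAS language table. */
    private static final Map<String, String> BY_LANGUAGE = new HashMap<>();
    static {
        BY_LANGUAGE.put("ja", "Shift_JIS");
        BY_LANGUAGE.put("zh-tw", "Big5");
        BY_LANGUAGE.put("en", "ISO-8859-1");
    }

    /** Returns the first listed value of a header, without its q parameter. */
    static String firstToken(String header) {
        if (header == null) {
            return null;
        }
        String t = header.split(",", -1)[0].split(";", -1)[0].trim();
        return (t.isEmpty() || t.equals("*")) ? null : t;
    }

    static String detect(String configured, String acceptCharset,
                         String acceptLanguage, String jvmFileEncoding) {
        if (configured != null && !configured.isEmpty()) {
            return configured;              // steps 1 and 2: default.client.encoding
        }
        String charset = firstToken(acceptCharset);
        if (charset != null) {
            return charset;                 // step 3: Accept-Charset
        }
        String language = firstToken(acceptLanguage);
        if (language != null) {
            String mapped = BY_LANGUAGE.get(language.toLowerCase(Locale.ROOT));
            if (mapped != null) {
                return mapped;              // step 4: code page for the language
            }
        }
        return jvmFileEncoding;             // step 5: the JVM's file.encoding
    }
}
```

In a running JVM the last argument would typically be System.getProperty("file.encoding").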

Using The POST (ENCTYPE="multipart/form-data") Method

The POST method with ENCTYPE="multipart/form-data" is the preferred method because the value part of each name-value pair is encapsulated in the body part of a multipart MIME body, and sent as an HTTP 1.1 entity (see section 7 of RFC1867). Each body part can (and should) be labelled with an appropriate Content-Type, including a charset parameter that specifies the character encoding scheme. Every character in the HTML Document Character Set (which is ISO/IEC 10646) can be represented using this method.

E.g.

Content-Type: multipart/form-data; charset=iso-8859-1; boundary=AaB03x

--AaB03x
Content-Disposition: form-data; name="surname"

Cheng
--AaB03x
Content-Disposition: form-data; name="given-name"

Alexis
--AaB03x--
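
To show what labelling each body part looks like, the following Java sketch assembles a multipart/form-data body in which every part carries its own Content-Type with a charset parameter. The class MultipartSketch is hypothetical, and the sketch does not escape quotes in field names or check that the boundary string is absent from the values, both of which a real implementation must handle.

```java
import java.util.Map;

/** Hypothetical helper: builds a multipart/form-data body with labelled parts. */
final class MultipartSketch {

    /**
     * Each part starts with "--" + boundary, and the body ends with
     * "--" + boundary + "--". Lines are separated by CRLF, as MIME requires.
     */
    static String build(String boundary, String charset, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("--").append(boundary).append("\r\n");
            sb.append("Content-Disposition: form-data; name=\"")
              .append(e.getKey()).append("\"\r\n");
            sb.append("Content-Type: text/plain; charset=").append(charset).append("\r\n");
            sb.append("\r\n"); // a blank line separates part headers from the value
            sb.append(e.getValue()).append("\r\n");
        }
        sb.append("--").append(boundary).append("--\r\n");
        return sb.toString();
    }
}
```

The caller would then convert the returned text to bytes using the same code page named in the charset parameter, so that the label and the actual encoding agree.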

Sidebar:

Netscape Navigator v7 and Microsoft IE v6 do use the charset specified in the META element of the HTML file and HTTP 1.0 to POST the FORM data to the server, but don't generate the charset parameter in the MIME header.

Recommendations

New submit method in XForms

XForms is an XML application that represents the next generation of forms for the Web. Its "post" submission method can send the form data as application/xml. This format expresses the instance data as XML, which is straightforward to process with off-the-shelf XML processing tools. In addition, this format is capable of submitting binary content. The character encoding of the submitted data is given in the XML declaration (e.g., <?xml version="1.0" encoding="gb2312"?>).


Java Servlet Development Kit (JSDK) "bug"

Note: The bug described in this section occurs in Java Servlet Specification Version 2.2 or lower. Servlet Specification Version 2.3 and above adds a new method, setCharacterEncoding(...), to the javax.servlet.ServletRequest interface to address this problem.

In the October 1998 issue of The VisualAge Magazine, Patsy Yu/Toronto/IBM@IBMCA wrote an article entitled Writing internationalized servlets with VisualAge for Java, e-business edition that describes a code page problem with the Java Servlet Development Kit (JSDK) v2. The following scenario illustrates the problem:

A Japanese browser sends the following URL-encoded form data to a Java servlet on a web server:

     http://...?abc=%90%A2
where X'90A2' is the double-byte code point of a Japanese Kanji character in Shift JIS. The servlet calls the Java method HttpServletRequest.getParameter(...), which returns a String object containing the values X'0090' and X'00A2' instead of the Unicode equivalent X'4E16'. The reason is that the servlet has no way to know in advance which code page was used to encode the incoming data, so it assumes the data (X'90A2' in this case) are in Latin 1 (ISO 8859-1), and the Unicode equivalent of a Latin 1 character is obtained simply by prefixing X'00' to the 8859-1 code point.
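
Because the Latin 1 assumption maps every byte to exactly one character, the original bytes can be recovered and decoded again with the correct code page. The Java sketch below shows this commonly used workaround. The class RedecodeSketch is hypothetical, and the sketch assumes the application knows the real code page by some other means.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: undoes a Latin 1 mis-decoding of form data. */
final class RedecodeSketch {

    /** Recovers the original bytes, then decodes them with the real code page. */
    static String redecode(String misdecoded, Charset actual) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actual);
    }

    public static void main(String[] args) {
        // What the servlet engine effectively does with the bytes X'90' X'A2':
        String wrong = new String(new byte[] {(byte) 0x90, (byte) 0xA2},
                                  StandardCharsets.ISO_8859_1);
        System.out.printf("U+%04X U+%04X%n", (int) wrong.charAt(0), (int) wrong.charAt(1));
        // With a runtime that provides Shift_JIS, the following call would
        // recover the intended Kanji character:
        //   redecode(wrong, Charset.forName("Shift_JIS"))
    }
}
```

The same technique works for any code page; for UTF-8 data, pass StandardCharsets.UTF_8 as the second argument.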

The following experiment demonstrates the problem.

Software configuration:

Procedures:

  1. Run the servletrunner.exe included in JSDK 2.0.

  2. Open a browser with the URL http://cycheng:8080/servlet/TestServlet?abc=%90%A2

    where X'90A2' is the double-byte code point of a Japanese Kanji character in Shift JIS, and TestServlet.java is:

    import javax.servlet.*;
    import javax.servlet.http.*;
    import java.io.*;
    import java.util.*;

    public class TestServlet extends HttpServlet
    {
        public void doGet( HttpServletRequest req, HttpServletResponse res )
            throws IOException
        {
            Enumeration params;
            String name, value;

            res.setContentType("text/html");
            PrintWriter pw = new PrintWriter( res.getOutputStream() );

            pw.println( "<HTML><HEAD>" );
            pw.println( "<TITLE>Test Servlet</TITLE>" );
            pw.println( "<META http-equiv=\"Content-Type\" content=\"text/html; charset=Shift_JIS\">" );
            pw.println( "</HEAD>" );
            pw.println( "<BODY>" );
            pw.println( "<H1>Test Servlet</H1>" );

            pw.println( "<P>" );
            params = req.getParameterNames();
            while ( params.hasMoreElements() )
            {
                name = (String)params.nextElement();
                value = req.getParameter( name );

                // The checks below look at two chars, so skip shorter values.
                if ( value.length() < 2 )
                    continue;

                // Show the two char values that the servlet engine produced.
                char ca[] = new char[2];
                value.getChars( 0,2,ca,0 );

                pw.println( "(int)value = " + (int)ca[0] + " " + (int)ca[1] + "<P>" );
                pw.println( "(char)value = " + ca[0] + " " + ca[1] + "<P>" );

                String s1 = value.substring( 0,1 );
                String s2 = value.substring( 1,2 );

                // Report which interpretation of X'90A2' was applied.
                if ( s1.compareTo("\u4e16") == 0 )
                    pw.println( "Parameter value is: X'\u4e16' [Unicode]" );
                else if ( s1.compareTo("\u0090") == 0 && s2.compareTo("\u00a2") == 0 )
                    pw.println( "Parameter value is: X'0090' and X'00A2' [???]" );
                else if ( s1.compareTo("\u90a2") == 0 )
                    pw.println( "Parameter value is: X'90A2' [Shift JIS]" );
                else
                    pw.println( "No match!" );

                pw.println( "<P>Parameter name is: <EM>" + name + "</EM>" );
                pw.println( "<BR>Parameter value is: <EM>" + value + "</EM>" );
            }

            pw.println("</BODY></HTML>");
            pw.flush();
            pw.close();

        } // end of doGet()

    }

    Output of the browser:

    Test Servlet

    (int)value = 144 162

    (char)value = ? ¢

    Parameter value is: X'0090' and X'00A2' [???]

    Parameter name is: abc
    Parameter value is:


Change History

When Who What
2002-10-29 Tendy Liu Added XHTML and XForm information on charset encoding.
2002-10-21 Tendy Liu Updated with testing results using Netscape v7.0 and Microsoft IE v6.0, JSDK bug has been solved since servlet specification version 2.3
2000-01-24 Alexis Cheng Updated with test results using Netscape Navigator v4.7 and Microsoft IE v5 to POST some FORM data, and added a Recommendations section.
1999-09-24 Alexis Cheng Referenced the I18N/L10N web pages at the W3C web site.
1999-08-25 Alexis Cheng Changed RFC 2068 to RFC 2616 (the latter obsoletes the former).
1999-07-14 Alexis Cheng Added information on how WebSphere solves the problem on how the browser lets the server's application know what code page to use to decode the data.
1999-01-27 Alexis Cheng Added information on Java Servlet Development Kit from Patsy Yu/Toronto/IBM@IBMCA.
1998-10-16 Alexis Cheng Added information from Sue Williams/Somers/IBM@IBMUS.
1998-09-21 Alexis Cheng Added information on Alis web site.
1998-03-11 Alexis Cheng Added quotes from RFC 2068 on the preference of HTTP 1.1 "charset" over the user's preference.
1998-03-09 Alexis Cheng Added information on the Accept-Charset request header.
1998-01-29 Alexis Cheng Clarified the priorities used by the browsers to determine the encoding code page of the HTML file.
1998-02-01 Alexis Cheng Added info on how Netscape tells the server the encoding code page of the URL-encoded data.
1998-01-26 Alexis Cheng The GET method is no longer deprecated in the official HTML 4.0 standard.
1997-12-17 Alexis Cheng Initial version.