
ICU Architectural Design
This chapter discusses the ICU design structure, the ICU versioning support, and the introduction of namespace in C++.
Java and ICU Basic Design Structure
The JDK internationalization components and the ICU components share the same basic architecture with regard to the following:
locales
data-driven services
ICU threading models and the open and close model
cloning customization
error handling
extensibility
resource bundle inheritance model
There are design features in ICU4C that are not in the Java Development Kit (JDK) due to programming language restrictions. These features include the following:
Locales
Locale IDs are composed of language, country, and variant information. The language and country codes are defined by ISO standards: ISO-639 for language codes and ISO-3166 for country codes. For example, Italian, Italy, and Euro are designated as: it_IT_EURO.
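The composition of a locale ID described above can be illustrated with a small splitter. This is only a sketch in plain C: the LocaleParts type and the split_locale_id() function are hypothetical names invented for this illustration and are not part of the ICU API, and the sketch assumes the simple "language_country_variant" form with underscores as separators.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical holder for the three parts of a locale ID (not an ICU type). */
typedef struct {
    char language[16];
    char country[16];
    char variant[32];
} LocaleParts;

/* Copy at most cap-1 bytes of src[0..len) into dst and terminate it. */
static void copy_field(char *dst, size_t cap, const char *src, size_t len) {
    if (len >= cap) {
        len = cap - 1;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split "language_country_variant"; missing parts become empty strings. */
void split_locale_id(const char *id, LocaleParts *out) {
    const char *first;
    const char *second;

    assert(id != NULL && out != NULL);
    first = strchr(id, '_');
    second = (first != NULL) ? strchr(first + 1, '_') : NULL;

    out->country[0] = '\0';
    out->variant[0] = '\0';

    if (first == NULL) {
        /* Language only, for example "it". */
        copy_field(out->language, sizeof out->language, id, strlen(id));
        return;
    }
    copy_field(out->language, sizeof out->language, id, (size_t)(first - id));
    if (second == NULL) {
        /* Language and country, for example "it_IT". */
        copy_field(out->country, sizeof out->country, first + 1, strlen(first + 1));
        return;
    }
    copy_field(out->country, sizeof out->country, first + 1,
               (size_t)(second - first - 1));
    copy_field(out->variant, sizeof out->variant, second + 1, strlen(second + 1));
}
```

Calling split_locale_id("it_IT_EURO", &parts) fills the three fields with "it", "IT", and "EURO" respectively.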
Data-driven Services
Data-driven services often use resource bundles for locale data. These services map a key to data. The resources are designed not only to manage system locale information but also to manage application-specific or general services data. ICU supports string, numeric, and binary data types, which can be structured into nested arrays and tables.
This results in the following:
Data used by the services can be built at compile time or run time.
For efficient loading, system data is pre-compiled to .dll files or files that can be mapped into memory.
Data for services can be added and modified without source code changes.
ICU Threading Model and Open and Close Model
The "open and close" model supports multi-threading. It enables ICU users to use the same kind of service for different locales, either in the same thread or in different threads.
For example, a thread can open many collators for different languages, and different threads can use different collators for the same locale simultaneously. Constant data can be shared so that only the current state is allocated for each collator.
The ICU threading model is designed to avoid contention for resources, and enable you to use the services for multiple locales simultaneously within the same thread. The ICU threading model, like the rest of the ICU architecture, is the same model used for the international services in Java™.
When you use a service such as collation, you open the service using an ID, typically a locale. The service allocates a small chunk of memory used for its state, with pointers to shared, read-only data in support of that service. (In Java you call getInstance() to create an object, and in C++ you call createInstance(). ICU uses the open and close metaphor in C because it is more familiar to C programmers.)
If no locale is supplied when a service is opened, ICU uses the default locale. Once a service is open, changing the default locale has no effect on it. Thus, no thread synchronization is needed between the default locale and open services.
When you open a second service for the same locale, another small chunk of memory is used for the state of the service, with pointers to the same shared, read-only data. Thus, the majority of the memory usage is shared. When any service is closed, its chunk of memory is deallocated. Other connections that point to the same shared data stay valid.
Any number of services, for the same locale or different locales, can be open within the same thread or in different threads. However, you cannot use a reference to an open service in two threads at the same time. An individual open service is not thread-safe. Rather, you must use the clone function to create a copy of the service you want and then pass this copy to the second thread. This procedure allows you to use the same service in different threads, but avoids any thread synchronization or deadlock problems.
Clone operations are designed to be much faster than reopening the service with initial parameters and copying the source's state. (With objects in C++ and Java, the clone function is also much safer than trying to recreate a service, since you get the proper subclass.) Once a service is cloned, changes will not affect the original source service, or vice-versa.
Thus, the normal mode of operation is to:
Open a service with a given locale.
Use the service as long as needed. However, do not keep opening and closing a service within a tight loop.
Clone a service if it needs to be used in parallel in another thread.
Close any clones that you open as well as any instances of the services that are owned.
Note: These service instances may be closed in any sequence. The preceding steps are given as an example.
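The open, clone, and close steps above can be sketched in plain C as follows. This is a simplified model of the idea, not ICU code: the Service and SharedData types and the service_open(), service_clone(), and service_close() functions are hypothetical names made up for this illustration. The point it shows is that each instance owns only a small state block, while the read-only data is shared and outlives any single instance.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical read-only data shared by every instance for one locale. */
typedef struct {
    const char *locale;
    int default_strength;
} SharedData;

static const SharedData kItalianData = { "it_IT", 3 };

/* Hypothetical per-instance state: small, private, points at shared data. */
typedef struct {
    const SharedData *shared;  /* never modified or freed by an instance */
    int strength;              /* mutable state owned by this instance */
} Service;

/* Open: allocate only the small state block and point it at shared data. */
Service *service_open(const SharedData *shared) {
    Service *s;
    assert(shared != NULL);
    s = malloc(sizeof *s);
    if (s != NULL) {
        s->shared = shared;
        s->strength = shared->default_strength;
    }
    return s;
}

/* Clone: copy the state; the shared pointer is reused, not duplicated. */
Service *service_clone(const Service *src) {
    Service *copy;
    assert(src != NULL);
    copy = malloc(sizeof *copy);
    if (copy != NULL) {
        *copy = *src;
    }
    return copy;
}

/* Close: release this instance only; shared data stays valid for others. */
void service_close(Service *s) {
    free(s);
}
```

In this model a second thread would receive the result of service_clone() rather than the original pointer, so the two threads never touch the same mutable state.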
Cloning Customization
Typically, the services supplied with ICU cover the vast majority of usages. However, there are circumstances where a service needs to be customized for a new locale. ICU (and Java) enable you to create customized services. For example, you can create a RuleBasedCollator by merging the rules for French and Arabic to get a custom French-Arabic collation sequence. When you merge these rules, the service's pointer does not point to a read-only table that is shared between threads. Instead, the pointer refers to a table that is specific to your particular open service. If you clone the open service, the table is copied. When you close the service, the table is destroyed.
For some services, ICU supplies registration. You can register a customized open service under an ID; ICU keeps a copy of that service even after you close the original. A client in that thread or in other threads can recreate a copy of the service by opening with that ID. These registrations are not persistent; once your program finishes, ICU flushes all the registrations. While you still might have multiple copies of data tables, it is faster to create a service from a registered ID than it is to create a service from rules.
Note: To work around the lack of persistent registration, query the service for the parameters used to create it and then store those parameters in a file on disk.
For services whose IDs are locales, such as collation, the registered IDs must also be locales. For those services (like Transliteration or Timezones) that are cross-locale, the IDs can be any string.
Prospective future enhancements for this model are:
Having custom services share data tables, by making those tables reference counted. This will reduce memory consumption and speed clone operations (a performance enhancement chiefly useful for multiple threads using the same customized service).
Expanding registration for all the international services.
Allowing persistent registration of services.
ICU Memory Usage
ICU4C APIs are designed to allow separate heaps for the ICU4C libraries and the application. This is achieved by providing functions to allocate and release objects owned by ICU4C using only ICU4C library functions. For more details see the Memory Usage section in the Coding Guidelines.
ICU Initialization and Termination
The ICU library does not normally require any explicit initialization prior to use. An application begins use simply by calling any ICU API in the usual way. (There is one exception to this, described below.)
In C++ programs, ICU objects and APIs may safely be used during static initialization of other application-defined classes or objects. There are no order-of-initialization problems between ICU and static objects from other libraries because ICU does not rely on C++ static object initialization for its normal operation.
When an application is terminating, it may optionally call the function u_cleanup(), which will free any heap storage that has been allocated and held by the ICU library. The main benefit of u_cleanup() occurs when using memory leak checking tools while debugging or testing an application. Without u_cleanup(), memory being held by the ICU library will be reported as leaks.
Initializing ICU in Multithreaded Environments
There is one specialized case where extra care is needed to safely initialize ICU. This situation will arise only when ALL of the following conditions occur:
The application main program is written in plain C, not C++.
The application is multithreaded, with the first use of ICU within the process possibly occurring simultaneously in more than one thread.
The application will be run on a platform that does not handle C++ static constructors from libraries when the main program is not in C++. Platforms known to exhibit this behavior are Mac OS X and HP-UX. Platforms that handle C++ libraries correctly include Windows, Linux and Solaris.
To safely initialize the ICU library when all of the above conditions apply, the application must explicitly arrange for a first use of ICU from a single thread before the multi-threaded use of ICU begins. A convenient ICU operation for this purpose is uloc_getDefault(), declared in the header file "unicode/uloc.h".
Error Handling
To maximize portability, this version of ICU uses only the subset of the C++ language that compiles correctly on older C++ compilers, and it provides a usable C interface. Thus, there is no use of the C++ exception mechanism in the code or Application Programming Interface (API).
To communicate errors reliably and support multi-threading, this version uses an error code parameter mechanism. Every function that can fail takes an error-code parameter by reference. This parameter is always the last parameter listed for the function.
The UErrorCode parameter is defined as an enumerated type. Zero represents no error, positive values represent errors, and negative values represent non-error status codes. Macros (U_SUCCESS and U_FAILURE) are provided to check the error code.
The UErrorCode parameter is an input-output parameter. Every function tests the error code before performing any other task and exits immediately if the code it receives indicates a failure. If the function fails later on, it sets the error code appropriately and exits without performing any other work, except for any cleanup it needs to do. If the function encounters a non-error condition that it wants to signal, such as "encountered an unmapped character" in conversion, the function sets the error code appropriately and continues. Otherwise, the function leaves the error code unchanged.
Generally, only the functions that do not take a UErrorCode parameter, but call functions that do, must declare a UErrorCode variable. Almost all functions that take a UErrorCode parameter, and also call other functions that do, merely have to propagate the error code that they were passed to the functions they call. Functions that declare a new UErrorCode variable must initialize it to U_ZERO_ERROR before calling any other functions.
ICU enables you to call several functions (that take error codes) successively without having to check the error code after each function. Each function usually must check the error code before doing any other processing, since it is supposed to stop immediately after receiving an error code. Propagating the error-code parameter down the call chain saves the programmer from having to declare the parameter in every instance and also mimics the C++ exception protocol more closely.
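The error-code convention described above can be sketched without the ICU headers. The MyErrorCode enum and the MY_SUCCESS and MY_FAILURE macros below are hypothetical stand-ins that only mirror the sign convention stated in the text (zero is no error, positive values are errors, negative values are non-error status codes); count_chars() and count_two() are made-up functions used to show the entry check and the propagation of the parameter down the call chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for UErrorCode, following the sign convention above. */
typedef enum {
    MY_UNMAPPED_WARNING = -1,      /* negative: non-error status code */
    MY_ZERO_ERROR = 0,             /* zero: no error */
    MY_ILLEGAL_ARGUMENT_ERROR = 1  /* positive: error */
} MyErrorCode;

#define MY_SUCCESS(code) ((code) <= MY_ZERO_ERROR)
#define MY_FAILURE(code) ((code) > MY_ZERO_ERROR)

/* Leaf function: the error code is the last parameter, passed by reference. */
size_t count_chars(const char *text, MyErrorCode *status) {
    size_t n = 0;
    assert(status != NULL);
    if (MY_FAILURE(*status)) {
        return 0;                  /* entry check: do nothing after a failure */
    }
    if (text == NULL) {
        *status = MY_ILLEGAL_ARGUMENT_ERROR;
        return 0;
    }
    while (text[n] != '\0') {
        n++;
    }
    return n;                      /* success: status is left unchanged */
}

/* Caller in the middle of the chain: it only propagates the parameter. */
size_t count_two(const char *a, const char *b, MyErrorCode *status) {
    size_t total;
    assert(status != NULL);
    if (MY_FAILURE(*status)) {
        return 0;
    }
    total = count_chars(a, status);
    total += count_chars(b, status);  /* does nothing if the first call failed */
    return MY_FAILURE(*status) ? 0 : total;
}
```

Only the outermost caller declares the variable, initializes it to the zero value, and checks it once after the whole sequence of calls.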
Extensibility
There are three major extensibility elements in ICU:
Data Extensibility
The user installs new locales or conversion data to enhance the existing ICU support. For more details, refer to the package tool chapter.
Code Extensibility
The classes, data, and design are fully extensible. Examples of this extensibility include the BreakIterator, RuleBasedBreakIterator and DictionaryBasedBreakIterator classes.
Error Handling Extensibility
There are mechanisms available to enhance the built-in error handling when it is necessary. For example, you can design and create your own conversion callback functions, which are called when an error occurs. Refer to the Conversion chapter callback section for more information.
Resource Bundle Inheritance Model
A resource bundle is a set of <key,value> pairs that provide a mapping from key to value. A given program can have different sets of resource bundles; one set for error messages, one for menus, and so on. However, the program may be organized to combine all of its resource bundles into a single related set.
The set is organized into a tree with "root" at the top, the language at the first level, the country at the second level, and additional variants below these levels. The set must contain a root that has all keys that can be used by the program accessing the resource bundles.
Except for the root, each resource bundle has an immediate parent. For example, if there is a resource bundle "X_Y_Z", then there must be the resource bundles: "X_Y", and "X". Each child resource bundle can omit any <key,value> pair that is identical to its parent's pair. (Such omission is strongly encouraged as it reduces data size and maintenance effort). It must override any <key,value> pair that is different from its parent's pair. If you have a resource bundle for the locale ID "language_country_variant", you must also have a bundle for the ID "language_country" and one for the ID "language."
If a program doesn't find a key in a child resource bundle, it can assume that the value for that key is the same as in the parent. The default locale has no effect on this. The particular language used for the root is commonly English, but it depends on the developer's preference. Ideally, the root should contain values that minimize the need for its children to override it.
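The inheritance of key-value pairs can be sketched as a walk up the parent links. This is a minimal model in plain C, not the ICU resource bundle implementation: the Pair and Bundle types and the bundle_get() function are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical key-value pair stored in a bundle. */
typedef struct {
    const char *key;
    const char *value;
} Pair;

/* Hypothetical bundle: its own pairs plus a link to its immediate parent. */
typedef struct Bundle {
    const char *id;
    const struct Bundle *parent;  /* NULL for the root bundle */
    const Pair *pairs;            /* only the pairs that differ from the parent */
    size_t count;
} Bundle;

/* Look up key in the bundle, then in each parent up to the root.
   Returns NULL only if no bundle in the chain, including the root, has it. */
const char *bundle_get(const Bundle *bundle, const char *key) {
    size_t i;
    assert(key != NULL);
    for (; bundle != NULL; bundle = bundle->parent) {
        for (i = 0; i < bundle->count; i++) {
            if (strcmp(bundle->pairs[i].key, key) == 0) {
                return bundle->pairs[i].value;
            }
        }
    }
    return NULL;
}
```

A child bundle that stores only the pairs it overrides still answers every key the root defines, because the lookup continues in the parent whenever the child has no entry.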
The default locale is used only when there is not a resource bundle for a given language. For example, there may not be an Italian resource bundle. (This is very different than the case where there is an Italian resource bundle that is missing a particular key.) When a resource bundle is missing, ICU uses the parent unless that parent is the root. The root is an exception because the root language may be completely different than its children. In this case, ICU uses a modified lookup and the default locale. The following are different lookup methods available:
Lookup chain: searching for a resource bundle.
en_US_some-variant
en_US
en
defaultLang_defaultCountry
defaultLang
root
Lookup chain: searching for a <key,value> pair after en_US_some-variant has been loaded. ICU does not use the default locale in this case.
en_US_some-variant
en_US
en
root
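The two lookup chains listed above can be generated by repeatedly truncating the locale ID at its last underscore. The following is a sketch in plain C under that assumption; key_chain(), bundle_chain(), and ID_CAP are hypothetical names for this illustration, and the sketch simply reproduces the listings above without removing duplicates or reflecting how ICU implements the lookup internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ID_CAP 32  /* assumed maximum ID length, including the terminator */

/* Append id and its non-root ancestors to chain, starting at index n. */
static size_t append_ancestors(const char *id, char chain[][ID_CAP],
                               size_t n, size_t max_entries) {
    char work[ID_CAP];
    char *sep;

    assert(id != NULL && strlen(id) < ID_CAP);
    strcpy(work, id);
    if (strcmp(work, "root") == 0) {
        work[0] = '\0';            /* root is appended by the callers */
    }
    while (n < max_entries && work[0] != '\0') {
        strcpy(chain[n++], work);
        sep = strrchr(work, '_');
        if (sep != NULL) {
            *sep = '\0';           /* drop the last component */
        } else {
            work[0] = '\0';        /* language only: nothing left but root */
        }
    }
    return n;
}

/* Chain for a key lookup: the ID, its parents, then root. */
size_t key_chain(const char *id, char chain[][ID_CAP], size_t max_entries) {
    size_t n = append_ancestors(id, chain, 0, max_entries);
    if (n < max_entries) {
        strcpy(chain[n++], "root");
    }
    return n;
}

/* Chain for a bundle search: the ID, its parents, the default locale
   and its parents, then root. */
size_t bundle_chain(const char *id, const char *default_id,
                    char chain[][ID_CAP], size_t max_entries) {
    size_t n = append_ancestors(id, chain, 0, max_entries);
    n = append_ancestors(default_id, chain, n, max_entries);
    if (n < max_entries) {
        strcpy(chain[n++], "root");
    }
    return n;
}
```

Both functions return the number of entries written, so a caller can iterate over the chain and stop at the first bundle or key that exists.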
Other ICU Design Principles
ICU supports extensive versioning of code and data changes and introduces namespace usage.
Version Numbers in ICU
Version changes show clients when parts of ICU change. ICU; its components (such as Collator); each resource bundle, including all the locale data resource bundles; and individual tagged items within a resource bundle, have their own version numbers. Version numbers numerically and lexically increase as changes are made. All version numbers are used in Application Programming Interfaces (APIs) with a UVersionInfo structure. The UVersionInfo structure is an array of four unsigned bytes. These bytes are:
0: Major version number
1: Minor version number
2: Milli version number
3: Micro version number
Two UVersionInfo structures may be compared using binary comparison (memcmp) to see which is larger or newer. Version numbers may be different for different services. For instance, do not compare the ICU library version number to the ICU collator version number.
UVersionInfo structures can be converted to and from string representations as dotted integers (such as "1.4.5.0") using the u_versionToString() and u_versionFromString() functions. String representations may omit trailing zeroes.
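The binary comparison and the dotted-integer form can be sketched without the ICU headers. In the following plain C sketch, MyVersionInfo is a hypothetical stand-in for the four-byte structure described above, and version_compare() and version_to_string() are made-up helpers for this illustration; in particular, the formatting rule used here (drop trailing zero fields, keep at least the major number) is an assumption of the sketch and is not claimed to match the output of u_versionToString().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the four-byte version structure:
   major, minor, milli, micro. */
typedef unsigned char MyVersionInfo[4];

/* Binary comparison: negative if a is older, zero if equal,
   positive if a is newer. */
int version_compare(const MyVersionInfo a, const MyVersionInfo b) {
    return memcmp(a, b, 4);
}

/* Write the version as dotted integers, omitting trailing zero fields.
   The buffer must hold at least 16 bytes ("255.255.255.255" plus NUL). */
void version_to_string(const MyVersionInfo v, char *out, size_t cap) {
    int fields = 4;
    int i;
    size_t used = 0;

    assert(out != NULL && cap >= 16);
    while (fields > 1 && v[fields - 1] == 0) {
        fields--;                  /* assumption: keep at least the major number */
    }
    for (i = 0; i < fields; i++) {
        used += (size_t)snprintf(out + used, cap - used,
                                 (i == 0) ? "%u" : ".%u", (unsigned)v[i]);
    }
}
```

Because the major number occupies the first byte, memcmp() orders the versions the same way a field-by-field numeric comparison would.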
The interpretation of version numbers depends on what is being described.
ICU Release Version Number
For ICU releases and the library (code) versions, a change in the major version number indicates releases that may have feature additions or may break binary compatibility. A change in minor (and possibly micro) version numbers indicates a maintenance release that is binary compatible. The micro version number is typically used between releases.
ICU is released as "reference" and "enhancement" releases. The main effect is that reference releases tend to be ported more widely and tested more thoroughly. Even minor version numbers are used for reference releases (like ICU 1.6) and odd minor version numbers are used for enhancement releases (like ICU 1.7).
Resource Bundles and Elements
The data stored in resource bundles is tagged with version numbers. A resource bundle can contain a tagged string named "Version" that declares the version number in dotted-integer format. For example,
en { Version { "1.0.3.5" } ... }
A resource bundle may omit the "Version" element and thus will inherit a version along the usual chain. For example, if the resource bundle en_US contained no "Version" element, it would inherit "1.0.3.5" from the parent en element. If inheritance passes all the way to the root resource bundle and it contains no "Version" resource, then the resource bundle receives the default version number 0.
Elements within a resource bundle may also contain version numbers. For example:
be { CollationElements { Version { "1.0.0.0" } ... } }
In this example, the CollationElements data is version 1.0.0.0. This element version is not related to the version of the bundle.
Internal version numbers
Internally, data files carry format and other version numbers. These version numbers ensure that ICU can use the data file. The interpretation depends entirely on the data file type. Often, the major number in the format version stays the same for backwards-compatible changes to a data file format. The minor format version number is incremented for additions that do not violate the backwards compatibility of the data file.
Component Version Numbers
ICU component version numbers may be found using:
u_getVersion() returns the version number of ICU as a whole. It is a C function that can also be called from C++.
ures_getVersion() and ResourceBundle::getVersion() return the version number of a ResourceBundle. This is a data version number for the bundle as a whole and subject to inheritance.
u_getUnicodeVersion() and Unicode::getUnicodeVersion() return the version number of the Unicode character data that underlies ICU. This version reflects the numbering of the Unicode releases. See http://www.unicode.org/ for more information.
Collator::getVersion() in C++ and ucol_getVersion() in C return the version number of the Collator. This is a code version number for the collation code and algorithm. It is a combination of version numbers for the collation implementation, the Unicode Collation Algorithm data (which is the data that is used for characters that are not mentioned in a locale's specific collation elements), and the collation elements.
Configuration and Management
A major new feature in ICU 2.0 is the ability to link to different versions of ICU with the same program. Using this new feature, a program can keep using ICU 1.8 collation, for example, while using ICU 2.0 for other services. ICU now can also be unloaded if needed, to free up resources, and then reloaded when it is needed.
Namespace in C++
ICU 2.0 introduces the use of namespace to avoid naming collisions between ICU exported symbols and other libraries. All the public ICU C++ classes are defined in the "icu_MajorVersionNumber_MinorVersionNumber" namespace. The ICU 2.0 public headers include a "using namespace icu_MajorVersionNumber_MinorVersionNumber" clause, so there is no need to change user programs with this update.
API Dependencies
It is sometimes useful to see a dependency chart between the public ICU APIs and ICU libraries. This chart can be useful to people that are new to ICU or to people that want only certain ICU libraries.
Here are some things to realize about the chart.
It gives a general overview of the ICU library dependencies.
Internal dependencies, like the mutex API, are left out for clarity.
Similar APIs were lumped together for clarity (e.g. Formatting). Some of these dependency details can be viewed from the ICU API reference.
The descriptions of each API can be found in our ICU API reference.
Copyright (c) 2000 - 2004 IBM and Others
User Guide for ICU v3.2 Generated 2004-11-22.