
Collation Customization
ICU uses UCA as a default starting point for ordering. Not all languages have sorting sequences that correspond with the UCA because UCA cannot simultaneously encompass the specifics of all the languages currently in use.
Therefore, ICU provides a data-driven, flexible, and run-time customizable mechanism called "tailoring". Tailoring overrides the default order of code points and the values of the ICU Collation Service attributes.
Collation Rule
A tailoring is a set of rules. Each rule contains a string of ordered characters that starts with an anchor point or a reset value.
The reset value is an absolute point that determines the order of other characters. For example, the rule "&a < g" places "g" after "a", and "a" itself does not change place. This rule has the following sorting consequences:
Without rule | With rule |
---|---|
apple Abernathy bird Boston green Graham | apple Abernathy green bird Boston Graham |
Note that only the words that start with "g" have changed place. All the words that previously sorted after "a" and "A" now sort after "g".
This is a simple example of a tailoring rule. A tailoring consists of zero or more rules and zero or more options, and it must contain at least one rule or at least one option. The rule syntax is discussed in more detail in the following sections.
Note that the tailoring rules override the UCA ordering. In addition, if a character is reordered, it automatically reorders any other equivalent characters. For example, if the rule "&e<a" is used to reorder "a" in the list, "á" is also greater than "é".
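
As a rough illustration of what a reset followed by a primary relation does, the following Python sketch models an alphabet as a plain list and moves one letter to sit right after the anchor. It is a toy model under stated assumptions (lowercase ASCII letters only, primary level only, case ignored). The names `tailor_primary` and `primary_key` are hypothetical, and this is not how ICU is implemented.

```python
import string

def tailor_primary(alphabet, anchor, letter):
    """Return a copy of `alphabet` with `letter` moved to just after `anchor`.

    Toy model of a rule such as "&a < g": the anchor keeps its place and
    only the tailored letter moves.
    """
    order = [ch for ch in alphabet if ch != letter]
    order.insert(order.index(anchor) + 1, letter)
    return order

def primary_key(word, order):
    """Map a word to a list of positions in `order`, ignoring case."""
    rank = {ch: i for i, ch in enumerate(order)}
    return [rank[ch] for ch in word.lower()]

default_order = list(string.ascii_lowercase)
tailored_order = tailor_primary(default_order, "a", "g")

words = ["bird", "green", "apple"]
without_rule = sorted(words, key=lambda w: primary_key(w, default_order))
with_rule = sorted(words, key=lambda w: primary_key(w, tailored_order))
```

With these three sample words, `without_rule` keeps the usual alphabetical order, while `with_rule` places "green" between "apple" and "bird".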
Syntax
The following table summarizes the basic syntax necessary for most usages:
Symbol | Example | Description |
---|---|---|
< | a < b | Identifies a primary (base letter) difference between "a" and "b" |
<< | a << ä | Signifies a secondary (accent) difference between "a" and "ä" |
<<< | a<<<A | Identifies a tertiary difference between "a" and "A" |
= | x = y | Signifies no difference between "x" and "y". |
& | &Z | Instructs ICU to reset at this letter. These rules will be relative to this letter from here on, but will not affect the position of Z itself. |

Note: In releases prior to 1.8, ICU used the notation ';' to represent secondary relations and ',' to represent tertiary relations. Starting in release 1.8, use the '<<' symbol to represent secondary relations and the '<<<' symbol to represent tertiary relations. Rules that use the ';' and ',' notations are still processed by ICU for compatibility; also, some of the data used for tailoring to particular locales has not yet been updated to the new syntax. However, these symbols should be considered deprecated.
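
To make the relation symbols concrete, here is a minimal Python sketch that splits a rule string into (relation, text) pairs. It is a hypothetical toy parser, not ICU's rule compiler: it assumes no escaping, options, expansions, or prefixes, and it maps the deprecated ';' and ',' notations to the secondary and tertiary relations.

```python
import re

# Longest operators first so that "<<<" is not read as "<" three times.
_TOKEN = re.compile(r"\s*(<<<|<<|<|=|&|;|,)\s*")

# The deprecated ';' and ',' notations map to secondary and tertiary.
_RELATION = {
    "&": "reset", "<": "primary", "<<": "secondary", ";": "secondary",
    "<<<": "tertiary", ",": "tertiary", "=": "identical",
}

def parse_rules(rules):
    """Split a rule string into (relation, text) pairs.

    Toy parser: no escaping, options, expansions, or prefixes.
    """
    parts = _TOKEN.split(rules.strip())
    # re.split with one capture group yields text, op, text, op, ...
    ops, texts = parts[1::2], parts[2::2]
    return [(_RELATION[op], text) for op, text in zip(ops, texts)]

parsed = parse_rules("& C < č <<< Č < ć <<< Ć")
legacy = parse_rules("& C < č , Č")
```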
Escaping Rules
Most characters can be used as parts of rules. However, whitespace characters are skipped, and all ASCII characters that are not digits or letters are considered to be part of the syntax. To use these characters in rules, they need to be escaped. Escaping can be done in several ways:
- Single characters can be escaped using a backslash \ (U+005C).
- Strings can be escaped by putting them between single quotes 'like this'.
- A single quote can be escaped by writing two single quotes ''.
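
The escaping conventions above can be sketched as a small helper that quotes a string only when it contains characters the rule syntax would otherwise interpret. The function names are hypothetical, and the sketch assumes that a doubled single quote is also accepted inside a quoted string.

```python
import string

_PLAIN = set(string.ascii_letters + string.digits)

def needs_escaping(ch):
    """True for whitespace and for ASCII characters that are not letters or digits."""
    return ch.isspace() or (ord(ch) < 128 and ch not in _PLAIN)

def quote_for_rule(text):
    """Quote `text` so it can appear literally in a rule string."""
    if not any(needs_escaping(ch) for ch in text):
        return text
    # Assumption: inside quotes, a literal single quote is written as two single quotes.
    return "'" + text.replace("'", "''") + "'"
```

Non-ASCII letters such as "č" pass through unchanged, while a string such as "a-b" is wrapped in single quotes.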
The following examples are other tailorings:
Serbian (Latin) or Croatian: & C < č <<< Č < ć <<< Ć
This rule is needed because the UCA normally treats accents as secondary differences from the base character. The tailoring ensures that 'č' and 'ć' are treated as base letters.
UCA | Tailoring: & C < č <<< Č < ć <<< Ć |
---|---|
CUKIĆ RADOJICA, ČUKIĆ SLOBODAN, CUKIĆ SVETOZAR, ČUKIĆ ZORAN, CURIĆ MILOŠ, ĆURIĆ MILOŠ, CVRKALJ ĐURO | CUKIĆ RADOJICA, CUKIĆ SVETOZAR, CURIĆ MILOŠ, CVRKALJ ĐURO, ČUKIĆ SLOBODAN, ČUKIĆ ZORAN, ĆURIĆ MILOŠ |
Serbian (Latin) or Croatian: & Đ < dž <<< Dž <<< DŽ
This rule is an example of a contraction. "D" alone is sorted after "C" and "Ž" is sorted after "Z", but "DŽ", due to the tailoring rule, is treated as a single letter that gets sorted after "Đ" and before "E" ("Đ" sorts as a base letter after "D" in the UCA). Another thing to note in this example is capitalization of the letter "DŽ". There are three versions, since all three can legally appear in text. The fourth version "dŽ" is omitted since it does not occur.
UCA | Tailoring: & Đ < dž <<< Dž <<< DŽ |
---|---|
dan dubok džabe džin Džin DŽIN đak Evropa | dan dubok đak džabe džin Džin DŽIN Evropa |
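
The effect of a contraction can be sketched with a longest-match split into collation units. The alphabets below are hypothetical and heavily reduced, only the primary level is modeled, and "ž" is folded to "z" for the untailored case because the two differ only by accent; none of this reflects ICU's actual data structures.

```python
def collation_units(word, units):
    """Split `word` into collation units, preferring the longest match."""
    word = word.lower()
    longest = max(len(u) for u in units)
    result, i = [], 0
    while i < len(word):
        for size in range(longest, 0, -1):
            piece = word[i:i + size]
            if piece in units:
                result.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no collation unit for {word[i]!r}")
    return result

def primary_key(word, order):
    """Rank each collation unit by its position in `order`."""
    rank = {unit: i for i, unit in enumerate(order)}
    return [rank[u] for u in collation_units(word, rank)]

# Hypothetical reduced alphabets; the tailored one adds "dž" after "đ".
UCA_LIKE = ["a", "b", "d", "đ", "e", "k", "n", "o", "u", "z"]
TAILORED = ["a", "b", "d", "đ", "dž", "e", "k", "n", "o", "u", "z"]

def fold_accent(word):
    # Primary-level stand-in for the UCA: "ž" differs from "z" only by accent.
    return word.lower().replace("ž", "z")

words = ["džabe", "đak", "dubok", "dan"]
uca_order = sorted(words, key=lambda w: primary_key(fold_accent(w), UCA_LIKE))
tailored_order = sorted(words, key=lambda w: primary_key(w, TAILORED))
```

In this model "džabe" sorts before "đak" without the contraction and after it once "dž" is a unit of its own, which matches the relative order shown in the table.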
Danish: &V <<< w <<< W
The letter 'W' is sorted after 'V', but is treated as a tertiary difference similar to the difference between 'v' and 'V'.
UCA | &V <<< w <<< W |
---|---|
va Va VA vb Vb VB vz Vz VZ wa Wa WA wb Wb WB wz Wz WZ | va Va VA wa Wa WA vb Vb VB wb Wb WB vz Vz VZ wz Wz WZ |
Default Options
The tailoring inherits all the attribute values from the UCA unless they are explicitly redefined in the tailoring. The following table summarizes the option settings. UCA default options are in emphasis.
Option | Example | Description |
---|---|---|
alternate | [alternate non-ignorable] [alternate shifted] | Sets the default value of the UCOL_ALTERNATE_HANDLING attribute. If set to shifted, variable code points will be ignored on the primary level. |
backwards | [backwards 2] | Sets the default value of the UCOL_FRENCH_COLLATION attribute. If set to on, secondary level will be reversed. |
variable top | & X < [variable top] | Sets the default value for the variable top. All the code points with primary strengths less than variable top will be considered variable. |
normalization | [normalization off] [normalization on] | Turns on or off the UCOL_NORMALIZATION_MODE attribute. If set to on, a quick check and any necessary normalization will be performed. |
caseLevel | [caseLevel off] [caseLevel on] | Turns on or off the UCOL_CASE_LEVEL attribute. If set to on, a level consisting only of case characteristics will be inserted in front of the tertiary level. To ignore accents but take case into account, set strength to primary and case level to on. |
caseFirst | [caseFirst off] [caseFirst upper] [caseFirst lower] | Sets the value for the UCOL_CASE_FIRST attribute. If set to upper, upper case sorts before lower case. If set to lower, lower case sorts before upper case. Useful for locales that already have a supported ordering but require a different order of cases. Affects the case and tertiary levels. |
strength | [strength 1] [strength 2] [strength 3] [strength 4] [strength I] | Sets the default strength for the collator. |
hiraganaQ | [hiraganaQ off] [hiraganaQ on] | Controls special treatment of Hiragana code points on the quaternary level. If turned on, Hiragana code points will get lower values than all the other non-variable code points. Strength must be greater than or equal to quaternary for this attribute to take effect. |
A tailoring that consists only of options is also a valid tailoring and has the same basic ordering as the UCA. The following examples show tailorings that use options:
The Greek tailoring has option settings only: [normalization on]
The Latvian tailoring reorders uppercase and lowercase and uses backward French ordering:
[casefirst upper] [backwards 2] & C < č , Č & G < ģ , Ģ & I < y, Y & K < ķ , Ķ & L < ļ , Ļ & N < ņ , Ņ & S < š , Š & Z < ž , Ž
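
As an illustration of how option settings sit in front of the rules, the following hypothetical helper separates bracketed "[name value]" settings from the rest of a tailoring string. It assumes every bracketed element is an option with a one-word name; positional elements that use the same bracket form, such as [before 1] or [variable top], would need separate handling and are not covered.

```python
import re

_OPTION = re.compile(r"\[\s*(\w+)\s+([^\]]+?)\s*\]")

def split_options(rules):
    """Return (options, remaining_rules) for a tailoring string.

    Each "[name value]" setting is read as one name word followed by a value.
    """
    options = {name: value for name, value in _OPTION.findall(rules)}
    remaining = _OPTION.sub("", rules).strip()
    return options, remaining

options, rest = split_options("[casefirst upper] [backwards 2] & C < č , Č")
```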
Advanced Syntactical Elements
Several other syntactical elements are needed in more specific situations. These elements are summarized in the following table:
Element | Example | Description |
---|---|---|
[before 1|2|3] | &[before 1] a<?<à<?<á | Enables users to order characters before a given character. In UCA 3.0, the example is equivalent to & ?<?<à<?<á ( ?= \u3029, Hangzhou numeral nine) * and makes accented 'a' letters sort before 'a'. Accents are often used to indicate the intonations in Pinyin. In this case, the non-accented letters sort after the accented letters. |
/ | æ/e | Expansion. Add the collation element for 'e' to the collation element for æ. After a reset, "&ae << æ" is equivalent to "&a << æ/e". See the example below. |
\| | a\|b | Prefix processing. If 'b' is encountered and it follows 'a', output the appropriate collation element. If 'b' follows any other letter, output the normal collation element for 'b'. Collation element for 'a' is not affected. This element is used to speed up sorting under JIS X 4061. See the example below. |
[top] | &[top] < a < b < c … | Deprecated; use indirect positioning instead. Reorders a set of characters 'above' the UCA. [top] is a virtual code point having the biggest primary weight value that will ever be assigned in the UCA. Above top, there is a large number of unassigned primary weights that can be used for a 'large' tailoring, such as the reordering of the CJK characters according to a Far Eastern code page. The first difference after the top is always primary. |

Note: The first base character (primary difference) in the UCA occurs after the Hangzhou numeral nine.
Indirect Positioning of Collation Elements
Since version 2.0, ICU allows indirect positioning of collation elements. Similar to the [top] option, these options position the tailoring relative to significant sections of the UCA table. You can use the [before] option to position before these sections.
Name | Current CE value | Note |
---|---|---|
first tertiary ignorable | [,,] | Start of the UCA table. This value will never change unless CEs are extended with higher level values |
last tertiary ignorable | [,,] | This value will never change unless CEs are extended with higher level values |
first secondary ignorable | [,, 05] | Currently there are no secondary ignorables in the UCA table. |
last secondary ignorable | [,, 05] | Currently there are no secondary ignorables in the UCA table. |
first primary ignorable | [, 87, 05] | Current code point is ̲ (U+0332). |
last primary ignorable | [, E1 B1, 05] | Currently this value points to a non-existing code point, used to facilitate sorting of compatibility characters. |
first variable | [05 07, 05, 05] | Current code point is U+0009. This is the start of the variable section. These are characters that will be ignored on primary level when shifted option is on. |
last variable | [17 9B, 05, 05] | End of variable section. |
first regular | [1A 20, 05, 05] | Current code point is ː (U+02D0). This is the first regular code point. The majority of code points are regular. |
last regular | [78 AA B2, 05, 05] | Current code point is U+10425. Use instead of [top]. This will effectively position your tailoring between regular code points and CJK ideographs and unassigned code points. If you want to rearrange a large number of code points (rearranging CJK ideographs, for example), this is the right place to reset to. |
first implicit | [E0 03 03, 05, 05] | Section of implicitly generated collation elements. CJK ideographs and unassigned code points get implicit values. |
last implicit | [E3 DC 70 C0, 05, 05] | End of implicit section. |
first trailing | [E5, 05, 05] | Start of trailing section. This section is reserved for future use, most probably for non-starting Jamos. |
last trailing | [E5, 05, 05] | End of trailing collation elements section. Tailoring that starts here is guaranteed to sort after all other non-tailored code points. |
Not all of the indirect positioning anchors are useful. Most of the 'first' elements should be used with the [before] directive, to make sure that your tailoring sorts before the section of interest.
Following are several fragments of real tailorings, illustrating some of the advanced syntactical elements:
Expansion Example:
French: & A << æ/e <<< Æ/E
In the UCA, the letter 'Æ' is treated as a separate letter between 'A' and 'B'. However, the French language requires 'Æ' to be treated as a combination of the letters 'A' and 'E' and to sort as an accent variation of this combination. This is an example of an expansion.
UCA | &A << æ/e <<< Æ/E |
---|---|
aa Aa AA ab Ab AB ae Ae AE az Az AZ æ Æ | aa Aa AA ab Ab AB ae Ae æ AE Æ az Az AZ |
Prefix Example:
Prefixes are used in the Japanese tailoring to reduce the number of contractions. A large number of contractions is a performance burden, as their processing is much more complicated than the processing of regular elements. Prefixes should be used only to replace contractions followed by expansions, and only if the expansion part is less frequent than the start of the contraction.
&[before 3]ァ <<< ァ|ー = ァ|ー = ぁ|ー
This could have been written as a series of contractions followed by an expansion:
&[before 3]ァー <<< ァー = ァー = ぁー
However, in that case ァ, ァ and ぁ would be treated as contractions. Since the prolonged sound mark (ー) occurs much less frequently than the other letters of Japanese Katakana and Hiragana, it is more prudent to put the extra processing on it by using prefixes.
Example:
A reset always uses only the base character as the insertion point, even if there is an expansion. So the following rule,
& J <<< K / B & K <<< M
is equivalent to
& J <<< K / B <<< M
This produces the following sort order:
"JA"
"MA"
"KA"
"KC"
"JC"
"MC"
Note: Assuming the letters "J", "K" and "M" have equal primary weights, the second letter contains the differences among these strings. However, the letter "K" is treated as if it always has a letter "B" following it, while the letters "J" and "M" do not.
The following are the collation elements for these strings with the specified rules:
Strings | Collation Elements | | |
---|---|---|---|
"JA" | [005C.00.01] | [0052.00.01] | |
"MA" | [005C.00.03] | [0052.00.01] | |
"KA" | [005C.00.02] | [0053.00.01] | [0052.00.01] |
"KC" | [005C.00.02] | [0053.00.01] | [0054.00.01] |
"JC" | [005C.00.01] | [0054.00.01] | |
"MC" | [005C.00.03] | [0054.00.01] | |
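
The sort order above can be reproduced from the collation elements in the table with a short Python sketch. It builds a simplified three-level key (all primary weights, then all secondary weights, then all tertiary weights). Real sort keys are encoded differently; the `sort_key` helper is hypothetical and only meant to show why the expansion on "K" moves "KA" and "KC" between the "A" and "C" strings.

```python
# Collation elements copied from the table: (primary, secondary, tertiary).
J = [(0x5C, 0x00, 0x01)]
M = [(0x5C, 0x00, 0x03)]
K = [(0x5C, 0x00, 0x02), (0x53, 0x00, 0x01)]  # K / B: K also carries B's element
A = [(0x52, 0x00, 0x01)]
C = [(0x54, 0x00, 0x01)]
ELEMENTS = {"J": J, "M": M, "K": K, "A": A, "C": C}

def sort_key(text):
    """Build a three-level key: all primaries, then secondaries, then tertiaries."""
    ces = [ce for ch in text for ce in ELEMENTS[ch]]
    return tuple([ce[level] for ce in ces] for level in range(3))

strings = ["MC", "JC", "KC", "KA", "MA", "JA"]
ordered = sorted(strings, key=sort_key)
```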
Tailoring Issues
ICU uses canonical closure. This means that for each code point in Unicode, if the canonically composed form of a tailored string produces different collation elements than the canonically decomposed form, then the canonically composed form is effectively added to the ordering. If 'a' is tailored, for example, all of the accented 'a' characters are also tailored. Canonical closure allows collators to process Unicode strings in the FCD form as well as in NFD.
However, compatibility equivalents are NOT automatically added. If the rule "&b < a" is in tailoring, and the order of ⓐ (circled a) is important, it should be explicitly tailored.
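
The difference between canonical and compatibility equivalents can be checked with Python's standard `unicodedata` module. This sketch only inspects Unicode decompositions; it does not call ICU, and the helper names are hypothetical.

```python
import unicodedata

def canonical_base(ch):
    """First code point of the canonical decomposition (NFD) of `ch`."""
    return unicodedata.normalize("NFD", ch)[0]

def compatibility_base(ch):
    """First code point of the compatibility decomposition (NFKD) of `ch`."""
    return unicodedata.normalize("NFKD", ch)[0]

accented = "\u00e1"   # á, canonically equivalent to a + combining acute
circled = "\u24d0"    # ⓐ, only a compatibility equivalent of a
```

Here `canonical_base(accented)` is "a", so a tailoring of "a" carries over to "á" through canonical closure, whereas `canonical_base(circled)` is still "ⓐ" and only `compatibility_base(circled)` yields "a".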
Redundant tailoring rules are removed, with later rules "winning". The strengths around the removed rules are also fixed.
Example:
The following table summarizes effects of different redundant rules.
# | Original | Equivalent |
---|---|---|
1. | & a < b < c < d & r < c | & a < b < d & r < c |
2. | & a < b < c < d & c < m | & a < b < c < m < d |
3. | & a < b < c < d & a < m | & a < m < b < c < d |
4. | & a <<< b << c < d & a < m | & a <<< b << c < m < d |
5. | & a < b < c < d & [before 1] c < m | & a < b < m < c < d |
6. | & a < b <<< c << d <<< e & [before 3] e <<< x | & a < b <<< c << d <<< x <<< e |
7. | & a < b <<< c << d <<< e & [before 2] e <<< x | & a < b <<< c <<< x << d <<< e |
8. | & a < b <<< c << d <<< e & [before 1] e <<< x | & a <<< x < b <<< c << d <<< e |
9. | & a < b <<< c << d <<< e <<< f < g & [before 1] g < x | & a < b <<< c << d <<< e <<< f < x < g |
If two different reset lists use the same character it is removed from the first one (see 1 in the table above). If the second character is a reset, the second list is inserted in the first (see 2). If both are resets, then the same thing happens (see 3). Whenever such an insertion occurs, the second strength "postpones" the position (see 4).
If there is a "[before N]" on the reset, then the reset character is effectively replaced by the item that would be before it, either in a previous tailoring (if the letter occurs in one - see 5) or in the UCA. The N determines the 'distance' before, based on the strength of the difference (see 6-8). However, this is subject to postponement (see 9), so be careful!
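
The insertion and postponement behavior for rows 2 to 4 can be modeled with a short Python sketch. It treats a tailoring as one chain of (strength, text) pairs and inserts a new item just before the next entry of the same or higher strength after the anchor. This is a hypothetical simplification: it handles a single chain only and does not model [before N] or the strength fix-ups around removed rules.

```python
PRIMARY, SECONDARY, TERTIARY = 1, 2, 3
SYMBOL = {PRIMARY: "<", SECONDARY: "<<", TERTIARY: "<<<"}

def insert_after(chain, anchor, strength, item):
    """Insert `item` after `anchor`, postponing past weaker differences.

    `chain` is a list of (strength, text) pairs; the first entry is the
    reset point and has strength 0.
    """
    # A later rule wins: drop any earlier occurrence of the item.
    chain = [entry for entry in chain if entry[1] != item]
    pos = next(i for i, entry in enumerate(chain) if entry[1] == anchor) + 1
    # Skip entries whose difference is weaker (numerically larger) than the new one.
    while pos < len(chain) and chain[pos][0] > strength:
        pos += 1
    chain.insert(pos, (strength, item))
    return chain

def render(chain):
    """Write the chain back in rule notation."""
    head = "& " + chain[0][1]
    return head + "".join(f" {SYMBOL[s]} {text}" for s, text in chain[1:])

plain = [(0, "a"), (PRIMARY, "b"), (PRIMARY, "c"), (PRIMARY, "d")]
mixed = [(0, "a"), (TERTIARY, "b"), (SECONDARY, "c"), (PRIMARY, "d")]
row2 = render(insert_after(plain, "c", PRIMARY, "m"))
row3 = render(insert_after(plain, "a", PRIMARY, "m"))
row4 = render(insert_after(mixed, "a", PRIMARY, "m"))
```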
Reset semantics
The reset semantic in ICU 1.8 is different from the previous ICU releases. Prior to version 1.8, the reset relation modifier was applicable only to the entry immediately following the reset entry. Also, the relation modifier applied to all entries that occurred until the next reset or primary relation.
For example, was equivalent to
Starting with ICU version 1.8, the modifier is equivalent to,
The new semantic produces more intuitive results, especially when the character after the reset is decomposable. Since all rules are converted to NFD before they are interpreted, this can result in contractions that the rule-writer might not be aware of. Expansion propagates only until the next reset or primary relation occurs.
For example, with the following rule: was equivalent to the following prior to ICU 1.8 and in Java,
Starting with 1.8, it is equivalent to,
& a = c / b <<< d / b << e / b <<< f / b < g <<< h
Known Limitations
The following are known limitations of the ICU collation implementation. These are theoretical limitations, since there are no known languages for which they are an issue. For completeness, however, they should be fixed in a future version after 1.8.1. The examples given are designed for simplicity in testing, and do not match any real languages.
Expansion
The goal of expansion is to sort as if the expansion text were inserted right after the character. For example, with the rule
The text "...c..." should sort as if it were right after "...ae..." with a tertiary difference. There are a few cases where this is not currently true.
Recursive Expansion
Given the rules
Expansion should sort the text "...c..." as if it were just after "...ae...", and that should also sort as if it were just after "...agi...". This requires that the compilation of expansions be recursive (and check for loops as well!). ICU currently does not do this.
Rules | Desired Order | Current Order |
---|---|---|
& a = b / c & d = c / e | add b adf | b add adf |
Contractions Spanning Expansions
ICU currently always pre-compiles the expansion into an internal format (a list of one or more collation elements) when the rule is compiled. If there is a contraction that spans the end of the expanded text and the start of the original text, however, that contraction will not match. A test case that illustrates this is:
Rules | Desired Order | Current Order |
---|---|---|
& a <<< c / e & g <<< eh | ad c af g ch h | ad c ch af g h |
Since the pre-compiled expansions are a huge performance gain, we will probably keep the implementation the way it is, but in the future allow additional syntax to indicate those few expansions that need to behave as if the text were inserted because of the existence of another contraction. Note that such expansions need to be recursively expanded (as in #1), but rather than at pre-compile time, these need to be done at runtime.
While it is possible to automatically detect these cases, it would be better to allow explicit control in case spanning is not desired. An example of such syntax might be something like:
Notes: ICU does handle the case where there is a contraction that is completely inside the expansion.
Suppose that someone had the rules:
These do not cause c to sort as if it were ae, nor should they.
Normalization
The goal of normalization is to have all text sort as if it were first normalized (converted into NFD). For performance reasons, the rules are pre-processed so there is no need to perform normalization on strings that are already in the FCD format. The vast majority of strings are in FCD.
Nulls in Contractions
Nulls should not be used in contractions that could invoke normalization.
Rules | Desired Order | Current Order |
---|---|---|
& a <<< '\u0000'^ | a '\u0000'^ | '\u0000'^ a |
Contractions Spanning Normalization
The following rule specifies that a grave accent followed by a b is a contraction, and sorts as if it were an e.
On this basis, "...àb..." should sort as if it were just after "...ae...". Because of the preprocessing, however, the contraction will not match if this text is represented with the pre-composed character à, but will match if given the decomposed sequence a + grave accent. The same thing happens if the contraction spans the start of a normalized sequence.
Rules | Desired Order | Current Order |
---|---|---|
& e <<< ` b | à ad àb af | à àb ad af |
& g <<< ca | f ca cà h | cà f ca h |
Variable Top
ICU lets you set the top of the variable range. This can be done, for example, to allow you to ignore just SPACES, and not punctuation.
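
The idea of moving the top of the variable range can be sketched with a toy primary key that skips every character at or below a chosen variable top. The weights are hypothetical, the variable top is treated as inclusive by assumption, and the quaternary level that shifted handling normally adds is not modeled.

```python
# Hypothetical primary weights: variable characters first, then letters.
PRIMARY = {" ": 1, "-": 2, "a": 10, "b": 11, "c": 12}

def shifted_key(text, variable_top):
    """Primary key that skips characters at or below `variable_top`.

    Toy model of [alternate shifted]; the quaternary level is not modeled.
    """
    return [PRIMARY[ch] for ch in text if PRIMARY[ch] > PRIMARY[variable_top]]

spaces_only = shifted_key("a b-c", " ")
spaces_and_hyphen = shifted_key("a b-c", "-")
```

With the variable top at the space, only the space is ignored and the hyphen still contributes a primary weight. With the variable top at the hyphen, both are ignored.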
Variable Top Exclusion
There is currently a limitation that causes variable top to (perhaps) exclude more characters than it should. This happens if you not only set variable top, but also tailor a number of characters around it with primary differences. The exact number that you can tailor depends on the internal "gaps" between the characters in the pre-compiled UCA table. Normally there is a gap of one. There are larger gaps between scripts (such as between Latin and Greek), and after certain other special characters. For example, if variable top is set to be at SPACE ('\u0020'), then it works correctly with up to 70 characters also tailored after space. However, if variable top is set to be equal to HYPHEN ('\u2010'), only one other value can be accommodated.
Rules | Desired Order (SHIFTED = ON) | Current Order | Comment |
---|---|---|---|
& \u2010 < x < [variable top] < z | - z zb a b -b xb c | - z zb xb a b -b c | The goal is for x to be ignored and z not to be ignored. |

Note: With ICU 1.8.1, the user is advised not to tailor the variable top to customize more than two primary relations (for example, "& x < y < [variable top]"). Starting in ICU 2.0, a new API will be added to allow the user to set the variable top programmatically to a legal single character or a valid contracting sequence. In addition, the string that variable top is set to should not be treated as either inclusive or exclusive in the rules.
Case Level/First/Second
In ICU, it is possible to override the tertiary settings programmatically. This is used to change the default case behavior to be all upper first or all lower first. It can also be used for a separate case level, or to ignore all other tertiary differences (such as between circled and non-circled letters, or between half-width and full-width katakana). The case values are derived directly from the Unicode character properties, and not set by the rules.
Mixed Case Contractions
There is currently a limitation that all contractions of multiple characters can only have three special case values: upper, lower, and mixed. All mixed-case contractions are grouped together, and are not affected by the upper first vs. lower first flag.
Rules | Desired Order (UPPER_FIRST) | Current Order |
---|---|---|
& c < ch <<< cH <<< Ch <<< CH | C CH Ch cH ch | c CH cH Ch ch |
Cautions
The following are not known rule limitations, but rather cautions.
Resets
Since resets always work on the existing state, the user is required to make sure that the rule entries are in the proper order.
Rules | Order | Comment |
---|---|---|
& a < b & a < c | a c b | The rules mean: put b after a, then put c after a (inserting it before the b). |
Postpone Insertion
When using a reset to insert a value X with a certain strength difference after a value Y, it actually is inserted just before the next item of the same strength or higher following Y. Thus, the following are equivalent:
Note that this is different from the Java semantics. In Java, the value is inserted immediately after the reset character.
Jamo Tailoring
If Jamo characters are tailored, that causes the code to go through a slow path, which will have a significant effect on performance.
Compatibility Decompositions
When tailoring a letter, the customization affects all of its canonical equivalents. That is, if a tailoring rule sorts 'a' after 'e', for example, then "à", "á", ... are also sorted after 'e'. This is not true for compatibility equivalents. If the desired sorting order is for a superscript-a ("ª") to be after "e", it is necessary to specify a rule for that.
Case Differences
Similarly, when tailoring "a" to sort after "e", a separate rule is required if "A" should also sort after "e".
Automatic Expansions
ICU will automatically form expansions whenever a reset is to a multi-character value that is not a contraction. For example, & ab <<< c is equivalent to & a <<< c / b. The user may be unaware of this happening, since it may not be obvious that the reset is to a multi-character value. For example, & à <<< d is equivalent to & a <<< d / `.
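
The rewriting described above can be sketched in Python by decomposing the reset string to NFD and treating everything after the first code point as the expansion. The `expand_reset` helper is hypothetical and assumes that the reset string is not itself a contraction.

```python
import unicodedata

def expand_reset(reset, relation, text):
    """Rewrite a reset to a multi-character string as anchor plus expansion."""
    # Rules are interpreted in NFD, so a precomposed letter may become two code points.
    decomposed = unicodedata.normalize("NFD", reset)
    anchor, extra = decomposed[0], decomposed[1:]
    rule = f"& {anchor} {relation} {text}"
    return f"{rule} / {extra}" if extra else rule

two_letters = expand_reset("ab", "<<<", "c")
precomposed = expand_reset("\u00e0", "<<<", "d")  # à decomposes to a + U+0300
```

A precomposed "à" therefore yields an expansion on the combining grave accent (U+0300), even though the rule appears to reset to a single character.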
Copyright (c) 2000 - 2004 IBM and Others - Feedback: icu-issues@oss.software.ibm.com
User Guide for ICU v3.2 Generated 2004-11-22.