
Normalization
Overview
Normalization is used to convert text to a unique, equivalent form. Systems can normalize Unicode-encoded text to one particular sequence, such as normalizing composite character sequences into pre-composed characters.
Normalization makes sorting and searching text easier. Normalizer supports the standard normalization forms, which are described in detail in Unicode Technical Report #15 (Unicode Normalization Forms) and Section 5.7 of the Unicode Standard.
Usage
Normalizer transforms text into the canonical composed and decomposed forms. In addition, you can have it perform compatibility decompositions so that you can treat compatibility characters the same as their equivalents.
Normalizer adds one optional behavior, IGNORE_HANGUL, that differs from the standard Unicode Normalization Forms: when it is set, Korean syllables are not normalized. This option can be passed to the Normalizer constructors and to the static compose and decompose methods. It is turned off by default.
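To make the effect of such an option concrete, here is a minimal sketch of the arithmetic Hangul syllable decomposition defined by the Unicode Standard, with a flag that skips it. The function decomposeHangul and its ignoreHangul parameter are hypothetical names used for illustration only; they are not part of the Normalizer API, and this is not how ICU implements IGNORE_HANGUL.

```cpp
#include <string>

// Constants from the Unicode Standard's Hangul syllable decomposition algorithm.
constexpr char32_t SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
constexpr int LCount = 19, VCount = 21, TCount = 28;
constexpr int NCount = VCount * TCount;  // 588 syllables per leading consonant
constexpr int SCount = LCount * NCount;  // 11172 precomposed syllables

// Hypothetical helper (not an ICU function): appends the decomposition of c to out.
// When ignoreHangul is true, Hangul syllables are copied through unchanged.
void decomposeHangul(char32_t c, bool ignoreHangul, std::u32string& out) {
    if (ignoreHangul || c < SBase || c >= SBase + SCount) {
        out.push_back(c);  // not a Hangul syllable, or decomposition disabled
        return;
    }
    const int sIndex = static_cast<int>(c - SBase);
    out.push_back(static_cast<char32_t>(LBase + sIndex / NCount));             // leading consonant
    out.push_back(static_cast<char32_t>(VBase + (sIndex % NCount) / TCount));  // vowel
    const int tIndex = sIndex % TCount;
    if (tIndex != 0) {
        out.push_back(static_cast<char32_t>(TBase + tIndex));                  // trailing consonant
    }
}
```

With the flag off, U+D55C decomposes into U+1112, U+1161, U+11AB; with the flag on, it is passed through as a single character.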
There are three common usage models for Normalizer:
You can use normalize() to process an entire input string at once.
For example, before converting a Unicode string to the Latin-1 character set (ISO-8859-1), you can compose it so that "a´bc" becomes "ábc".
You can create a Normalizer object and use it to iterate through the normalized form of a string by calling first() and next().
For example, when you are comparing two strings, you want to stop the comparison as soon as a significant difference is found. This way, you avoid the overhead of converting an entire string when only the first characters matter.
You can use setIndex() and getIndex() to perform a random-access iteration.
This is useful, for example, for fast language-sensitive searching with an algorithm such as Boyer-Moore.
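The second usage model, incremental iteration with an early exit, can be sketched as follows. This is an illustration only: LazyDecomposer and equalNormalized are hypothetical names, the two hard-coded mappings stand in for real Unicode data, and the code does not use the ICU Normalizer class.

```cpp
#include <cstddef>
#include <string>

// Hypothetical toy iterator: yields the decomposed form of a string one
// character at a time, so callers can stop before the whole input is processed.
class LazyDecomposer {
public:
    explicit LazyDecomposer(const std::u32string& text) : text_(text) {}

    // Stores the next normalized character in out; returns false at the end.
    bool next(char32_t& out) {
        if (pending_ != 0) { out = pending_; pending_ = 0; return true; }
        if (pos_ >= text_.size()) return false;
        const char32_t c = text_[pos_++];
        if (c == 0x00E1)      { out = U'a'; pending_ = 0x0301; }  // toy rule: á -> a + acute
        else if (c == 0x00E9) { out = U'e'; pending_ = 0x0301; }  // toy rule: é -> e + acute
        else                  { out = c; }
        return true;
    }

    // Number of input characters read so far.
    std::size_t consumed() const { return pos_; }

private:
    const std::u32string& text_;
    std::size_t pos_ = 0;
    char32_t pending_ = 0;  // second half of a decomposition, if any
};

// Compares two strings in decomposed form and stops at the first difference.
// consumedFromA reports how much of a was actually read.
bool equalNormalized(const std::u32string& a, const std::u32string& b,
                     std::size_t& consumedFromA) {
    LazyDecomposer ia(a), ib(b);
    char32_t ca = 0, cb = 0;
    bool equal = true;
    for (;;) {
        const bool ha = ia.next(ca);
        const bool hb = ib.next(cb);
        if (!ha || !hb) { equal = (ha == hb); break; }  // one or both ended
        if (ca != cb)   { equal = false; break; }       // early exit
    }
    consumedFromA = ia.consumed();
    return equal;
}
```

When the two inputs differ in their first character, only one character of each is read, no matter how long the strings are.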
Transformation Methods
normalize()
Normalizes a string using the given normalization operation.
compose()
Composes a string forming the separate Unicode characters into their corresponding user characters.
decompose()
Decomposes a string into its separate Unicode characters.
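The relationship between composing and decomposing can be sketched with a small hand-written table. The names composeToy and decomposeToy and the three-entry table are assumptions made for this illustration; real normalization uses the full Unicode data and also handles canonical ordering of combining marks, which this sketch ignores.

```cpp
#include <string>

// Toy data: three precomposed letters and their canonical decompositions
// (base letter followed by U+0301 COMBINING ACUTE ACCENT).
struct ToyMapping { char32_t composed, base, mark; };
constexpr ToyMapping kToyTable[] = {
    {0x00E1, U'a', 0x0301},  // á
    {0x00E9, U'e', 0x0301},  // é
    {0x00F3, U'o', 0x0301},  // ó
};

// Replaces each precomposed character in the table with base + mark.
std::u32string decomposeToy(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        bool found = false;
        for (const ToyMapping& m : kToyTable) {
            if (m.composed == c) {
                out.push_back(m.base);
                out.push_back(m.mark);
                found = true;
                break;
            }
        }
        if (!found) out.push_back(c);
    }
    return out;
}

// Merges each base + mark pair found in the table into its precomposed character.
std::u32string composeToy(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        bool merged = false;
        if (!out.empty()) {
            for (const ToyMapping& m : kToyTable) {
                if (m.base == out.back() && m.mark == c) {
                    out.back() = m.composed;
                    merged = true;
                    break;
                }
            }
        }
        if (!merged) out.push_back(c);
    }
    return out;
}
```

In this sketch, composing the four characters a, U+0301, b, c yields the three characters á, b, c, and decomposing that result restores the original four.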
Movement Methods
Return characters:
current()
Return the current character in the normalized text.
first()
Return the first character in the normalized text.
last()
Return the last character in the normalized text.
next()
Return the next character in the normalized text and advance the iteration position by one.
previous()
Return the previous character in the normalized text and decrement the iteration position by one.
setIndex()
Set the iteration position in the input text that is being normalized and return the first normalized character at that position.
Return character index values:
endIndex()
Retrieve the index of the end of the input text.
getIndex()
Retrieve the current iteration position in the input text that is being normalized.
startIndex()
Retrieve the index of the start of the input text.
Note: Normalizer objects behave like iterators and have methods such as setIndex(), next(), previous(), etc. You should note that while setIndex() and getIndex() refer to indices in the underlying Unicode input text, the next() and previous() methods iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next() and previous() and the indices passed to and returned from setIndex() and getIndex(). It is for this reason that Normalizer does not implement the CharacterIterator interface.
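The mismatch between input indices and normalized output characters can be seen in a small sketch. NormalizedChar and decomposeWithIndices are hypothetical names, and the single hard-coded rule stands in for real decomposition data; the code only illustrates the bookkeeping and is not the Normalizer implementation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One character of normalized output, tagged with the input index it came from.
struct NormalizedChar {
    char32_t ch;
    std::size_t inputIndex;
};

// Decomposes the input with one toy rule (á -> a + U+0301) and records,
// for every output character, the index of the input character that produced it.
std::vector<NormalizedChar> decomposeWithIndices(const std::u32string& in) {
    std::vector<NormalizedChar> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == 0x00E1) {
            out.push_back({U'a', i});    // base letter
            out.push_back({0x0301, i});  // combining acute, same input index
        } else {
            out.push_back({in[i], i});
        }
    }
    return out;
}
```

For the two-character input á, b, the output has three characters whose input indices are 0, 0, and 1, so stepping through the output once does not always move the input position by one.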
Programming Examples in C and C++
Programming example for normalizing a string.
Copyright (c) 2000 - 2006 IBM and Others - PDF Version - Feedback: http://icu.sourceforge.net/contacts.html
User Guide for ICU v3.6 Generated 2006-08-31.