Normalizer transforms Unicode text into an equivalent composed or decomposed form, allowing for easier sorting and searching of text.
#include <normlzr.h>
Public Members
enum { COMPAT_BIT, DECOMP_BIT, COMPOSE_BIT }
enum { DONE }
    If DONE is returned, then there are no more normalization results available.
enum EMode { NO_OP, COMPOSE, COMPOSE_COMPAT, DECOMP, DECOMP_COMPAT }
    The mode of a Normalizer object.
enum { IGNORE_HANGUL }
    The options for a Normalizer object.
Normalizer (const UnicodeString& str, EMode mode)
    Creates a new Normalizer object for iterating over the normalized form of a given string.
Normalizer (const UnicodeString& str, EMode mode, int32_t opt)
    Creates a new Normalizer object for iterating over the normalized form of a given string.
Normalizer (const UChar* str, int32_t length, EMode mode)
    Creates a new Normalizer object for iterating over the normalized form of a given UChar string.
Normalizer (const UChar* str, int32_t length, EMode mode, int32_t option)
    Creates a new Normalizer object for iterating over the normalized form of a given UChar string.
Normalizer (const CharacterIterator& iter, EMode mode)
    Creates a new Normalizer object for iterating over the normalized form of the given text.
Normalizer (const CharacterIterator& iter, EMode mode, int32_t opt)
    Creates a new Normalizer object for iterating over the normalized form of the given text.
Normalizer (const Normalizer& copy)
    Copy constructor.
~Normalizer ()
    Destructor.
UChar32 current (void) const
    Return the current character in the normalized text.
UChar32 first (void)
    Return the first character in the normalized text.
UChar32 last (void)
    Return the last character in the normalized text.
UChar32 next (void)
    Return the next character in the normalized text and advance the iteration position by one.
UChar32 previous (void)
    Return the previous character in the normalized text and decrement the iteration position by one.
UChar32 setIndex (UTextOffset index)
    Set the iteration position in the input text that is being normalized and return the first normalized character at that position.
void reset (void)
    Reset the iterator so that it is in the same state that it was just after it was constructed.
UTextOffset getIndex (void) const
    Retrieve the current iteration position in the input text that is being normalized.
UTextOffset startIndex (void) const
    Retrieve the index of the start of the input text.
UTextOffset endIndex (void) const
    Retrieve the index of the end of the input text.
UBool operator== (const Normalizer& that) const
    Returns true when both iterators refer to the same character in the same character-storage object.
UBool operator!= (const Normalizer& that) const
Normalizer* clone (void) const
    Returns a pointer to a new Normalizer that is a clone of this one.
int32_t hashCode (void) const
    Generates a hash code for this iterator.
void setMode (EMode newMode)
    Set the normalization mode for this object.
EMode getMode (void) const
    Return the basic operation performed by this Normalizer.
void setOption (int32_t option, UBool value)
    Set options that affect this Normalizer's operation.
UBool getOption (int32_t option) const
    Determine whether an option is turned on or off.
void setText (const UnicodeString& newText, UErrorCode &status)
    Set the input text over which this Normalizer will iterate.
void setText (const CharacterIterator& newText, UErrorCode &status)
    Set the input text over which this Normalizer will iterate.
void setText (const UChar* newText, int32_t length, UErrorCode &status)
    Set the input text over which this Normalizer will iterate.
void getText (UnicodeString& result)
    Copies the text under iteration into the UnicodeString referred to by "result".
const UChar* getText (int32_t& count)
    Returns the text under iteration into the UChar* buffer pointer.
Static Public Members
void normalize (const UnicodeString& source, EMode mode, int32_t options, UnicodeString& result, UErrorCode &status)
    Normalizes a String using the given normalization operation.
void compose (const UnicodeString& source, UBool compat, int32_t options, UnicodeString& result, UErrorCode &status)
    Compose a String.
void decompose (const UnicodeString& source, UBool compat, int32_t options, UnicodeString& result, UErrorCode &status)
    Static method to decompose a String.
Friends
class ComposedCharIter
Normalizer supports the standard normalization forms described in Unicode Technical Report #15.
Characters with accents or other adornments can be encoded in several different ways in Unicode. For example, take the character "Á" (A-acute). In Unicode, this can be encoded as a single character (the "composed" form):
00C1 LATIN CAPITAL LETTER A WITH ACUTE
or as two separate characters (the "decomposed" form):
0041 LATIN CAPITAL LETTER A
0301 COMBINING ACUTE ACCENT
To a user of your program, however, both of these sequences should be treated as the same "user-level" character "Á". When you are searching or comparing text, you must ensure that these two sequences are treated equivalently. In addition, you must handle characters with more than one accent. Sometimes the order of a character's combining accents is significant, while in other cases accent sequences in different orders are really equivalent.
Similarly, the string "ffi" can be encoded as three separate letters:
0066 LATIN SMALL LETTER F
0066 LATIN SMALL LETTER F
0069 LATIN SMALL LETTER I
or as the single character:
FB03 LATIN SMALL LIGATURE FFI
The ffi ligature is not a distinct semantic character, and strictly speaking it shouldn't be in Unicode at all, but it was included for compatibility with existing character sets that already provided it. The Unicode standard identifies such characters by giving them "compatibility" decompositions into the corresponding semantic characters. When sorting and searching, you will often want to use these mappings.
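The two kinds of equivalence described above can be illustrated with a small, self-contained sketch. This is not the Normalizer implementation: `toyDecompose` and `toyEquivalent` are hypothetical helpers, and the mapping table covers only the two characters used as examples in the text, whereas a real normalizer relies on the full Unicode character data.

```cpp
#include <cassert>
#include <string>

// Toy decomposition covering only the two examples from the text.
// Assumption: these helper names are made up for illustration.
static std::u16string toyDecompose(const std::u16string& in, bool compat) {
    std::u16string out;
    for (char16_t c : in) {
        if (c == 0x00C1) {
            // Canonical mapping: A WITH ACUTE -> A + COMBINING ACUTE ACCENT
            out += u"A\u0301";
        } else if (compat && c == 0xFB03) {
            // Compatibility mapping: FFI ligature -> f, f, i
            out += u"ffi";
        } else {
            out += c;
        }
    }
    return out;
}

// Two strings are treated as equivalent when their decompositions match.
static bool toyEquivalent(const std::u16string& a,
                          const std::u16string& b,
                          bool compat) {
    return toyDecompose(a, compat) == toyDecompose(b, compat);
}
```

With this sketch, the composed and decomposed spellings of "Á" compare equal under canonical decomposition, while the ligature only matches "ffi" once compatibility mappings are enabled.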
Normalizer helps solve these problems by transforming text into the canonical composed and decomposed forms as shown in the first example above. In addition, you can have it perform compatibility decompositions so that you can treat compatibility characters the same as their equivalents. Finally, Normalizer rearranges accents into the proper canonical order, so that you do not have to worry about accent rearrangement on your own.
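As a rough illustration of accent rearrangement, the sketch below stably sorts each run of combining marks by canonical combining class. The function names are hypothetical and the lookup table knows only three marks (classes 202, 220 and 230, as assigned by the Unicode Standard); everything else is treated as a base character. It is meant to show the idea, not how Normalizer does it internally.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Toy combining-class lookup for three marks only (assumption: a real
// implementation reads these values from the Unicode character data).
static int toyCombiningClass(char16_t c) {
    switch (c) {
        case 0x0327: return 202;  // COMBINING CEDILLA
        case 0x0323: return 220;  // COMBINING DOT BELOW
        case 0x0301: return 230;  // COMBINING ACUTE ACCENT
        default:     return 0;    // treated as a base character
    }
}

// Put every run of combining marks into ascending combining-class order.
// stable_sort keeps marks of equal class in their original order.
static std::u16string toyCanonicalOrder(std::u16string s) {
    std::size_t i = 0;
    while (i < s.size()) {
        if (toyCombiningClass(s[i]) == 0) { ++i; continue; }
        std::size_t j = i;
        while (j < s.size() && toyCombiningClass(s[j]) != 0) ++j;
        std::stable_sort(s.begin() + i, s.begin() + j,
                         [](char16_t a, char16_t b) {
                             return toyCombiningClass(a) < toyCombiningClass(b);
                         });
        i = j;
    }
    return s;
}
```

After reordering, "a" followed by acute then dot-below and "a" followed by dot-below then acute yield the same sequence, so a plain comparison treats them as equal.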
Normalizer adds one optional behavior, #IGNORE_HANGUL, that differs from the standard Unicode Normalization Forms. This option can be passed to the #Normalizer constructors and to the static #compose and #decompose methods. This option, and any that are added in the future, will be turned off by default.
There are three common usage models for Normalizer. In the first, the static #normalize method is used to process an entire input string at once. Second, you can create a Normalizer object and use it to iterate through the normalized form of a string by calling #first and #next. Finally, you can use the #setIndex and #getIndex methods to perform random-access iteration, which is very useful for searching.
Note: Normalizer objects behave like iterators and have methods such as setIndex, next, previous, etc. You should note that while setIndex and getIndex refer to indices in the underlying input text being processed, the next and previous methods iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next and previous and the indices passed to and returned from setIndex and getIndex. It is for this reason that Normalizer does not implement the CharacterIterator interface.
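The mismatch between input indices and normalized output can be seen in a minimal stand-in iterator. `ToyDecomposingIterator` below is hypothetical and is not the Normalizer class: it decomposes only U+00C1, uses its own `DONE` sentinel value, and exposes just `next` and `getIndex`, under the assumption that the index reports how far the input has been consumed.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for illustration only; it is not the ICU class.
class ToyDecomposingIterator {
public:
    static constexpr int DONE = -1;  // sentinel value assumed for this sketch

    explicit ToyDecomposingIterator(const std::u16string& text) : text_(text) {}

    // Return the next character of the normalized output, or DONE at the end.
    int next() {
        if (bufPos_ == buffer_.size()) {
            if (inputIndex_ == text_.size()) return DONE;
            buffer_.clear();
            bufPos_ = 0;
            char16_t c = text_[inputIndex_++];
            if (c == 0x00C1) {
                buffer_ = u"A\u0301";  // one input unit becomes two output units
            } else {
                buffer_ += c;
            }
        }
        return buffer_[bufPos_++];
    }

    // Position in the input text, not in the normalized output.
    std::size_t getIndex() const { return inputIndex_; }

private:
    std::u16string text_;
    std::u16string buffer_;
    std::size_t bufPos_ = 0;
    std::size_t inputIndex_ = 0;
};
```

Iterating over the two-unit input "Áb" produces three output characters, and the input index stays at 1 while both halves of the decomposed "Á" are returned.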
Note: Normalizer is currently based on version 2.1.8 of the Unicode Standard. It will be updated as later versions of Unicode are released. If you are using this class on a JDK that supports an earlier version of Unicode, it is possible that Normalizer may generate composed or decomposed characters for which your JDK's java.lang.Character class does not have any data.
Definition at line 106 of file normlzr.h.
enum Normalizer::EMode |
The mode of a Normalizer object.
NO_OP |
Null operation for use with the #Normalizer constructors and the static #normalize method.
This value tells the Normalizer to perform no normalization and to return the input characters unprocessed.
|
COMPOSE |
Canonical decomposition followed by canonical composition.
Used with the #Normalizer constructors and the static #normalize method to determine the operation to be performed. If all optional features (e.g. #IGNORE_HANGUL) are turned off, this operation produces output that is in Unicode Normalization Form C.
|
COMPOSE_COMPAT |
Compatibility decomposition followed by canonical composition.
Used with the #Normalizer constructors and the static #normalize method to determine the operation to be performed. If all optional features (e.g. #IGNORE_HANGUL) are turned off, this operation produces output that is in Unicode Normalization Form KC.
|
DECOMP |
Canonical decomposition.
This value is passed to the #Normalizer constructors and the static #normalize method to determine the operation to be performed. If all optional features (e.g. #IGNORE_HANGUL) are turned off, this operation produces output that is in Unicode Normalization Form D.
|
DECOMP_COMPAT |
Compatibility decomposition.
This value is passed to the #Normalizer constructors and the static #normalize method to determine the operation to be performed. If all optional features (e.g. #IGNORE_HANGUL) are turned off, this operation produces output that is in Unicode Normalization Form KD.
|
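The correspondence between modes and normalization forms described above can be summarized in a small lookup. `ToyMode` is a local mirror of the enumerator names and `formName` is a hypothetical helper. The numeric values of the real enumerators are not given on this page, so nothing here depends on them.

```cpp
#include <cassert>
#include <string>

// Local mirror of the EMode enumerator names. The numeric values are
// arbitrary here; the real values are not stated in this documentation.
enum ToyMode { NO_OP, COMPOSE, COMPOSE_COMPAT, DECOMP, DECOMP_COMPAT };

// Name of the Unicode Normalization Form produced by each mode when all
// optional features are off. NO_OP performs no normalization.
const char* formName(ToyMode mode) {
    switch (mode) {
        case COMPOSE:        return "C";
        case COMPOSE_COMPAT: return "KC";
        case DECOMP:         return "D";
        case DECOMP_COMPAT:  return "KD";
        case NO_OP:          return "";
    }
    return "";
}
```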
Normalizer::COMPAT_BIT |
Normalizer::DECOMP_BIT |
Normalizer::COMPOSE_BIT |
Normalizer::DONE |
Normalizer::IGNORE_HANGUL |
Option to disable Hangul/Jamo composition and decomposition.
This option applies to Korean text, which can be represented either in the Jamo alphabet or in Hangul characters, which are really just two or three Jamo combined into one visual glyph. Since Jamo takes up more storage space than Hangul, applications that process only Hangul text may wish to turn this option on when decomposing text.
The Unicode standard treats Hangul to Jamo conversion as a canonical decomposition, so this option must be turned off if you wish to transform strings into one of the standard Unicode Normalization Forms.
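For readers unfamiliar with the Hangul/Jamo relationship, the following sketch shows the arithmetic mapping between precomposed syllables and Jamo that the Unicode Standard defines. The constants come from the standard; the `hangul` namespace and function names are made up for this example, and this page does not say whether Normalizer implements the conversion in the same way.

```cpp
#include <cassert>
#include <string>

namespace hangul {

// Constants from the Unicode Standard's Hangul syllable algorithm.
const char32_t SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
const int LCount = 19, VCount = 21, TCount = 28;
const int NCount = VCount * TCount;  // 588
const int SCount = LCount * NCount;  // 11172

// Decompose one precomposed syllable into two or three Jamo.
// Any other character is returned unchanged.
std::u32string decompose(char32_t s) {
    if (s < SBase || s >= SBase + SCount) return std::u32string(1, s);
    int sIndex = static_cast<int>(s - SBase);
    std::u32string out;
    out += static_cast<char32_t>(LBase + sIndex / NCount);
    out += static_cast<char32_t>(VBase + (sIndex % NCount) / TCount);
    int t = sIndex % TCount;
    if (t != 0) out += static_cast<char32_t>(TBase + t);
    return out;
}

// Compose a leading consonant, a vowel and an optional trailing consonant
// into one syllable. Returns 0 if the input is not such a sequence.
char32_t compose(const std::u32string& j) {
    if (j.size() < 2 || j.size() > 3) return 0;
    if (j[0] < LBase || j[0] >= LBase + LCount) return 0;
    if (j[1] < VBase || j[1] >= VBase + VCount) return 0;
    char32_t t = 0;
    if (j.size() == 3) {
        if (j[2] <= TBase || j[2] >= TBase + TCount) return 0;
        t = j[2] - TBase;
    }
    return SBase + ((j[0] - LBase) * VCount + (j[1] - VBase)) * TCount + t;
}

}  // namespace hangul
```

For example, U+D55C decomposes into the three Jamo U+1112, U+1161 and U+11AB, which is why decomposed Korean text occupies more storage than the precomposed form.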
Normalizer::Normalizer (const UnicodeString & str, EMode mode) |
Creates a new Normalizer object for iterating over the normalized form of a given string.
str |
The string to be normalized. The normalization will start at the beginning of the string.
|
mode | The normalization mode. |
Normalizer::Normalizer (const UnicodeString & str, EMode mode, int32_t opt) |
Creates a new Normalizer object for iterating over the normalized form of a given string.
The opt parameter specifies which optional Normalizer features are to be enabled for this object.
str |
The string to be normalized. The normalization will start at the beginning of the string.
|
mode |
The normalization mode.
|
opt | Any optional features to be enabled. Currently the only available option is #IGNORE_HANGUL. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument. |
Normalizer::Normalizer (const UChar * str, int32_t length, EMode mode) |
Creates a new Normalizer object for iterating over the normalized form of a given UChar string.
str |
The string to be normalized. The normalization will start at the beginning of the string.
|
length | Length of the string. |
mode | The normalization mode. |
Normalizer::Normalizer (const UChar * str, int32_t length, EMode mode, int32_t option) |
Creates a new Normalizer object for iterating over the normalized form of a given UChar string.
str |
The string to be normalized. The normalization will start at the beginning of the string.
|
length | Length of the string. |
mode | The normalization mode. |
option | Any optional features to be enabled. Currently the only available option is #IGNORE_HANGUL. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument. |
Normalizer::Normalizer (const CharacterIterator & iter, EMode mode) |
Creates a new Normalizer object for iterating over the normalized form of the given text.
iter |
The input text to be normalized. The normalization will start at the beginning of the string.
|
mode | The normalization mode. |
Normalizer::Normalizer (const CharacterIterator & iter, EMode mode, int32_t opt) |
Creates a new Normalizer object for iterating over the normalized form of the given text.
iter |
The input text to be normalized. The normalization will start at the beginning of the string.
|
mode |
The normalization mode.
|
opt | Any optional features to be enabled. Currently the only available option is #IGNORE_HANGUL. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument. |
Normalizer::Normalizer (const Normalizer & copy) |
Copy constructor.
Normalizer::~Normalizer () |
Destructor.
UChar32 Normalizer::current (void) const |
Return the current character in the normalized text.
UChar32 Normalizer::first (void) |
Return the first character in the normalized text.
This resets the Normalizer's position to the beginning of the text.
UChar32 Normalizer::last (void) |
Return the last character in the normalized text.
This resets the Normalizer's position to be just before the input text corresponding to that normalized character.
UChar32 Normalizer::next (void) |
Return the next character in the normalized text and advance the iteration position by one.
If the end of the text has already been reached, #DONE is returned.
UChar32 Normalizer::previous (void) |
Return the previous character in the normalized text and decrement the iteration position by one.
If the beginning of the text has already been reached, #DONE is returned.
UChar32 Normalizer::setIndex (UTextOffset index) |
Set the iteration position in the input text that is being normalized and return the first normalized character at that position.
Note: This method sets the position in the input text, while #next and #previous iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next and previous and the indices passed to and returned from setIndex and #getIndex.
index |
the desired index in the input text.
|
void Normalizer::reset (void) |
Reset the iterator so that it is in the same state that it was just after it was constructed.
A subsequent call to next will return the first character in the normalized text. In contrast, calling setIndex(0) followed by next will return the second character in the normalized text, because setIndex itself returns the first character.
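The difference between reset and setIndex(0) can be modeled with a tiny iterator that performs no normalization at all. `ToyIterator` is a hypothetical stand-in, not the Normalizer class. It assumes that setIndex consumes and returns the character at the given position, which is the behavior the paragraph above describes.

```cpp
#include <cassert>
#include <string>

// Hypothetical identity iterator used only to model reset() vs setIndex(0).
class ToyIterator {
public:
    static constexpr int DONE = -1;  // sentinel value assumed for this sketch

    explicit ToyIterator(const std::u16string& text) : text_(text) {}

    // Back to the freshly constructed state: nothing has been returned yet.
    void reset() { pos_ = 0; }

    // Return the next character and advance, or DONE at the end.
    int next() {
        if (pos_ >= text_.size()) return DONE;
        return text_[pos_++];
    }

    // Move to the given position and return the character found there,
    // so a following next() yields the character after it.
    int setIndex(std::size_t index) {
        pos_ = index;
        return next();
    }

private:
    std::u16string text_;
    std::size_t pos_ = 0;
};
```

On the text "abc", reset followed by next yields 'a', whereas setIndex(0) yields 'a' itself and the following next yields 'b'.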
UTextOffset Normalizer::getIndex (void) const |
Retrieve the current iteration position in the input text that is being normalized.
This method is useful in applications such as searching, where you need to be able to determine the position in the input text that corresponds to a given normalized output character.
Note: This method returns the position in the input, while #next and #previous iterate through characters in the output. This means that there is not necessarily a one-to-one correspondence between characters returned by next and previous and the indices passed to and returned from setIndex and #getIndex.
UTextOffset Normalizer::startIndex (void) const |
Retrieve the index of the start of the input text.
This is the begin index of the CharacterIterator or the start (i.e. 0) of the String over which this Normalizer is iterating.
UTextOffset Normalizer::endIndex (void) const |
Retrieve the index of the end of the input text.
This is the end index of the CharacterIterator or the length of the String over which this Normalizer is iterating.
UBool Normalizer::operator== (const Normalizer & that) const |
Returns true when both iterators refer to the same character in the same character-storage object.
UBool Normalizer::operator!= (const Normalizer & other) const [inline]
|
Normalizer * Normalizer::clone (void) const |
Returns a pointer to a new Normalizer that is a clone of this one.
The caller is responsible for deleting the new clone.
int32_t Normalizer::hashCode (void) const |
Generates a hash code for this iterator.
void Normalizer::setMode (EMode newMode) |
Set the normalization mode for this object.
Note: If the normalization mode is changed while iterating over a string, calls to #next and #previous may return previously buffered characters in the old normalization mode until the iteration is able to re-sync at the next base character. It is safest to call #setText, #first, #last, etc. after calling setMode.
newMode |
the new mode for this Normalizer. The supported modes are: NO_OP, COMPOSE, COMPOSE_COMPAT, DECOMP, DECOMP_COMPAT.
|
EMode Normalizer::getMode (void) const |
Return the basic operation performed by this Normalizer.
void Normalizer::setOption (int32_t option, UBool value) |
Set options that affect this Normalizer's operation.
Options do not change the basic composition or decomposition operation that is being performed, but they control whether certain optional portions of the operation are done. Currently the only available option is #IGNORE_HANGUL.
option | the option whose value is to be set. |
value |
the new setting for the option. Use true to turn the option on and false to turn it off.
|
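Since options are passed as an int32_t and are meant to be OR'ed together, one plausible way to picture setOption and getOption is as operations on a bit mask. The sketch below is an assumption about the general technique, not a description of the actual implementation. `ToyOptions` is hypothetical and `TOY_IGNORE_HANGUL` is a made-up value, because the real value of IGNORE_HANGUL is not given on this page.

```cpp
#include <cassert>
#include <cstdint>

// Made-up flag value for illustration; the real IGNORE_HANGUL value is
// not stated in this documentation.
const std::int32_t TOY_IGNORE_HANGUL = 0x0001;

// Hypothetical holder that stores options as bits in a single integer.
class ToyOptions {
public:
    // Turn the given option bit on (value == true) or off (value == false).
    void setOption(std::int32_t option, bool value) {
        if (value) {
            bits_ |= option;
        } else {
            bits_ &= ~option;
        }
    }

    // Report whether the given option bit is currently on.
    bool getOption(std::int32_t option) const {
        return (bits_ & option) != 0;
    }

private:
    std::int32_t bits_ = 0;  // all options off by default
};
```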
UBool Normalizer::getOption (int32_t option) const |
Determine whether an option is turned on or off.
void Normalizer::setText (const UnicodeString & newText, UErrorCode & status) |
Set the input text over which this Normalizer will iterate.
The iteration position is set to the beginning.
void Normalizer::setText (const CharacterIterator & newText, UErrorCode & status) |
Set the input text over which this Normalizer will iterate.
The iteration position is set to the beginning.
void Normalizer::setText (const UChar * newText, int32_t length, UErrorCode & status) |
Set the input text over which this Normalizer will iterate.
The iteration position is set to the beginning.
void Normalizer::getText (UnicodeString & result) |
Copies the text under iteration into the UnicodeString referred to by "result".
result | Receives a copy of the text under iteration. |
const UChar * Normalizer::getText (int32_t & count) |
Returns the text under iteration as a UChar* buffer pointer.
count | Receives the length of the returned text. |
void Normalizer::normalize (const UnicodeString & source, EMode mode, int32_t options, UnicodeString & result, UErrorCode & status) [static]
|
Normalizes a String using the given normalization operation.
The options parameter specifies which optional Normalizer features are to be enabled for this operation. Currently the only available option is #IGNORE_HANGUL. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument.
source |
the input string to be normalized.
|
mode |
the normalization mode.
|
options |
the optional features to be enabled.
|
result |
The normalized string (on output).
|
status | The error code. |
void Normalizer::compose (const UnicodeString & source, UBool compat, int32_t options, UnicodeString & result, UErrorCode & status) [static]
|
Compose a String.
The options parameter specifies which optional Normalizer features are to be enabled for this operation. Currently the only available option is #IGNORE_HANGUL. If you want the default behavior corresponding to Unicode Normalization Form C or KC, use 0 for this argument.
source |
the string to be composed.
|
compat |
Perform compatibility decomposition before composition. If this argument is false, only canonical decomposition will be performed.
|
options |
the optional features to be enabled.
|
result |
The composed string (on output).
|
status | The error code. |
void Normalizer::decompose (const UnicodeString & source, UBool compat, int32_t options, UnicodeString & result, UErrorCode & status) [static]
|
Static method to decompose a String.
The options parameter specifies which optional Normalizer features are to be enabled for this operation. Currently the only available option is #IGNORE_HANGUL. The desired options should be OR'ed together to determine the value of this argument. If you want the default behavior corresponding to Unicode Normalization Form D or KD, use 0 for this argument.
source |
the string to be decomposed.
|
compat |
Perform compatibility decomposition. If this argument is false, only canonical decomposition will be performed.
|
options |
the optional features to be enabled.
|
result |
The decomposed string (on output).
|
status |
The error code.
|
friend class ComposedCharIter [friend]
|