Normalizer
transforms Unicode text into an equivalent composed or decomposed form, allowing for easier sorting and searching of text.
#include <normlzr.h>
Public Types
enum | { COMPAT_BIT = 1, DECOMP_BIT = 2, COMPOSE_BIT = 4 }
enum | { DONE = 0xffff }
If DONE is returned, then there are no more normalization results available.
enum | EMode { NO_OP = 0, COMPOSE = COMPOSE_BIT, COMPOSE_COMPAT = COMPOSE_BIT | COMPAT_BIT, DECOMP = DECOMP_BIT, DECOMP_COMPAT = DECOMP_BIT | COMPAT_BIT }
The mode of a Normalizer object.
enum | { IGNORE_HANGUL = 0x001 }
The options for a Normalizer object.
Public Methods
Normalizer (const UnicodeString& str, EMode mode)
Creates a new Normalizer object for iterating over the normalized form of a given string.
Normalizer (const UnicodeString& str, EMode mode, int32_t opt)
Creates a new Normalizer object for iterating over the normalized form of a given string.
Normalizer (const UChar* str, int32_t length, EMode mode)
Creates a new Normalizer object for iterating over the normalized form of a given UChar string.
Normalizer (const UChar* str, int32_t length, EMode mode, int32_t option)
Creates a new Normalizer object for iterating over the normalized form of a given UChar string.
Normalizer (const CharacterIterator& iter, EMode mode)
Creates a new Normalizer object for iterating over the normalized form of the given text.
Normalizer (const CharacterIterator& iter, EMode mode, int32_t opt)
Creates a new Normalizer object for iterating over the normalized form of the given text.
Normalizer (const Normalizer& copy)
Copy constructor.
~Normalizer ()
Destructor.
UChar32 | current (void) const
Return the current character in the normalized text.
UChar32 | first (void)
Return the first character in the normalized text.
UChar32 | last (void)
Return the last character in the normalized text.
UChar32 | next (void)
Return the next character in the normalized text and advance the iteration position by one.
UChar32 | previous (void)
Return the previous character in the normalized text and decrement the iteration position by one.
UChar32 | setIndex (UTextOffset index)
Set the iteration position in the input text that is being normalized and return the first normalized character at that position.
void | reset (void)
Reset the iterator so that it is in the same state that it was just after it was constructed.
UTextOffset | getIndex (void) const
Retrieve the current iteration position in the input text that is being normalized.
UTextOffset | startIndex (void) const
Retrieve the index of the start of the input text.
UTextOffset | endIndex (void) const
Retrieve the index of the end of the input text.
UBool | operator== (const Normalizer& that) const
Returns true when both iterators refer to the same character in the same character-storage object.
UBool | operator!= (const Normalizer& that) const
Normalizer* | clone (void) const
Returns a pointer to a new Normalizer that is a clone of this one.
int32_t | hashCode (void) const
Generates a hash code for this iterator.
void | setMode (EMode newMode)
Set the normalization mode for this object.
EMode | getMode (void) const
Return the basic operation performed by this Normalizer.
void | setOption (int32_t option, UBool value)
Set options that affect this Normalizer's operation.
UBool | getOption (int32_t option) const
Determine whether an option is turned on or off.
void | setText (const UnicodeString& newText, UErrorCode &status)
Set the input text over which this Normalizer will iterate.
void | setText (const CharacterIterator& newText, UErrorCode &status)
Set the input text over which this Normalizer will iterate.
void | setText (const UChar* newText, int32_t length, UErrorCode &status)
Set the input text over which this Normalizer will iterate.
void | getText (UnicodeString& result)
Copies the text under iteration into the UnicodeString referred to by "result".
const UChar* | getText (int32_t& count)
Returns a pointer to the UChar buffer holding the text under iteration.
Static Public Methods
void | normalize (const UnicodeString& source, EMode mode, int32_t options, UnicodeString& result, UErrorCode &status)
Normalizes a String using the given normalization operation.
void | compose (const UnicodeString& source, UBool compat, int32_t options, UnicodeString& result, UErrorCode &status)
Compose a String.
void | decompose (const UnicodeString& source, UBool compat, int32_t options, UnicodeString& result, UErrorCode &status)
Static method to decompose a String.
UNormalizationMode | getUNormalizationMode (EMode mode, UErrorCode& status)
Converts C++'s Normalizer::EMode to C's UNormalizationMode.
EMode | getNormalizerEMode (UNormalizationMode mode, UErrorCode& status)
Converts C's UNormalizationMode to C++'s Normalizer::EMode.
UNormalizationCheckResult | quickCheck (const UnicodeString& source, EMode mode, UErrorCode& status)
Performs a quick check on a string to determine whether it is in a particular normalization form.
Private Types
enum | { EMPTY = -1, STR_INDEX_SHIFT = 2, STR_LENGTH_MASK = 0x0003 }
enum | { HANGUL_BASE = 0xac00, HANGUL_LIMIT = 0xd7a4, JAMO_LBASE = 0x1100, JAMO_VBASE = 0x1161, JAMO_TBASE = 0x11a7, JAMO_LCOUNT = 19, JAMO_VCOUNT = 21, JAMO_TCOUNT = 28, JAMO_NCOUNT = JAMO_VCOUNT * JAMO_TCOUNT }
Private Methods
UChar | nextCompose (void)
UChar | prevCompose (void)
UChar | nextDecomp (void)
UChar | prevDecomp (void)
UChar | curForward (void)
UChar | curBackward (void)
void | init (CharacterIterator* iter, EMode mode, int32_t option)
void | initBuffer (void)
void | clearBuffer (void)
Private Attributes
EMode | fMode
int32_t | fOptions
int16_t | minDecomp
CharacterIterator* | text
UnicodeString | buffer
UTextOffset | bufferPos
UTextOffset | bufferLimit
UChar | currentChar
UnicodeString | explodeBuf
Static Private Methods
void | bubbleAppend (UnicodeString& target, UChar ch, uint32_t cclass)
uint32_t | getComposeClass (UChar ch)
uint16_t | composeLookup (UChar ch)
uint16_t | composeAction (uint16_t baseIndex, uint16_t comIndex)
void | explode (UnicodeString& target, uint16_t index)
UChar | pairExplode (UnicodeString& target, uint16_t action)
void | fixCanonical (UnicodeString& result)
uint8_t | getClass (UChar ch)
void | doAppend (const UChar source[], uint16_t offset, UnicodeString& dest)
void | doInsert (const UChar source[], uint16_t offset, UnicodeString& dest, UTextOffset pos)
uint16_t | doReplace (const UChar source[], uint16_t offset, UnicodeString& dest, UTextOffset pos)
void | hangulToJamo (UChar ch, UnicodeString& result, uint16_t decompLimit)
void | jamoAppend (UChar ch, uint16_t decompLimit, UnicodeString& dest)
void | jamoToHangul (UnicodeString& buffer, UTextOffset start)
Friends
class | ComposedCharIter
Normalizer supports the standard normalization forms described in Unicode Technical Report #15.
Characters with accents or other adornments can be encoded in several different ways in Unicode. For example, take the character "Á" (A-acute). In Unicode, this can be encoded as a single character (the "composed" form):
00C1 LATIN CAPITAL LETTER A WITH ACUTE
or as two separate characters (the "decomposed" form):
0041 LATIN CAPITAL LETTER A
0301 COMBINING ACUTE ACCENT
To a user of your program, however, both of these sequences should be treated as the same "user-level" character "Á". When you are searching or comparing text, you must ensure that these two sequences are treated equivalently. In addition, you must handle characters with more than one accent. Sometimes the order of a character's combining accents is significant, while in other cases accent sequences in different orders are really equivalent.
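The equivalence of the two encodings above can be illustrated with a small self-contained sketch. It does not use the Normalizer class; the names toyCanonicalTable, toyDecompose and toyEquivalent are hypothetical, and the two-entry table stands in for the full Unicode data that a real normalizer consults.

```cpp
#include <string>
#include <unordered_map>

// Toy canonical decomposition table (hypothetical, two entries only).
// A real normalizer uses the complete Unicode character data instead.
static const std::unordered_map<char32_t, std::u32string>& toyCanonicalTable() {
    static const std::unordered_map<char32_t, std::u32string> table = {
        {U'\u00C1', U"A\u0301"},  // A WITH ACUTE -> A + COMBINING ACUTE ACCENT
        {U'\u00E9', U"e\u0301"},  // e WITH ACUTE -> e + COMBINING ACUTE ACCENT
    };
    return table;
}

// Replace every code point that has a table entry by its decomposition.
std::u32string toyDecompose(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        auto it = toyCanonicalTable().find(c);
        if (it != toyCanonicalTable().end()) {
            out += it->second;
        } else {
            out.push_back(c);
        }
    }
    return out;
}

// Two strings are treated as the same user-level text
// when their decomposed forms are identical.
bool toyEquivalent(const std::u32string& a, const std::u32string& b) {
    return toyDecompose(a) == toyDecompose(b);
}
```

With this sketch, toyEquivalent(U"\u00C1", U"A\u0301") is true even though the two strings differ when compared code point by code point.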
Similarly, the string "ffi" can be encoded as three separate letters:
0066 LATIN SMALL LETTER F
0066 LATIN SMALL LETTER F
0069 LATIN SMALL LETTER I
or as the single character
FB03 LATIN SMALL LIGATURE FFI
The ffi ligature is not a distinct semantic character, and strictly speaking it shouldn't be in Unicode at all, but it was included for compatibility with existing character sets that already provided it. The Unicode standard identifies such characters by giving them "compatibility" decompositions into the corresponding semantic characters. When sorting and searching, you will often want to use these mappings.
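As a rough illustration of why compatibility mappings matter for searching, the sketch below folds the ffi ligature before looking for a pattern. It is a toy with a single hard-coded mapping, not the Normalizer implementation; toyCompatFold and toyContains are hypothetical names.

```cpp
#include <string>

// Toy compatibility mapping (hypothetical, one entry): expand the ffi ligature.
std::u32string toyCompatFold(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        if (c == U'\uFB03') {
            out += U"ffi";  // FB03 -> 0066 0066 0069
        } else {
            out.push_back(c);
        }
    }
    return out;
}

// Search for a pattern after folding both the text and the pattern.
bool toyContains(const std::u32string& text, const std::u32string& pattern) {
    return toyCompatFold(text).find(toyCompatFold(pattern)) != std::u32string::npos;
}
```

A plain find for "ffi" in a string that stores the ligature fails, while toyContains succeeds because both sides are folded first.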
Normalizer helps solve these problems by transforming text into the canonical composed and decomposed forms as shown in the first example above. In addition, you can have it perform compatibility decompositions so that you can treat compatibility characters the same as their equivalents. Finally, Normalizer rearranges accents into the proper canonical order, so that you do not have to worry about accent rearrangement on your own.
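Accent rearrangement can be sketched as a stable sort of each run of combining marks by combining class, which is the canonical ordering step described in Unicode Technical Report #15. The code below is a minimal illustration under that assumption, not the Normalizer implementation; toyCombiningClass covers only four marks and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Toy combining-class lookup (hypothetical subset of the Unicode data).
// 0 marks a base character; nonzero values are combining marks.
int toyCombiningClass(char32_t c) {
    switch (c) {
        case U'\u0327': return 202;  // COMBINING CEDILLA
        case U'\u0323': return 220;  // COMBINING DOT BELOW
        case U'\u0301': return 230;  // COMBINING ACUTE ACCENT
        case U'\u0308': return 230;  // COMBINING DIAERESIS
        default:        return 0;
    }
}

// Stable-sort each maximal run of combining marks by combining class.
// Marks of equal class keep their relative order, because for those
// marks the order is significant.
std::u32string toyCanonicalOrder(std::u32string s) {
    std::size_t i = 0;
    while (i < s.size()) {
        if (toyCombiningClass(s[i]) == 0) {
            ++i;
            continue;
        }
        std::size_t j = i;
        while (j < s.size() && toyCombiningClass(s[j]) != 0) {
            ++j;
        }
        std::stable_sort(s.begin() + i, s.begin() + j,
                         [](char32_t a, char32_t b) {
                             return toyCombiningClass(a) < toyCombiningClass(b);
                         });
        i = j;
    }
    return s;
}
```

Here "a + acute + dot below" and "a + dot below + acute" both come out as "a + dot below + acute", while "a + diaeresis + acute" is left alone because the two marks share a class.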
Normalizer adds one optional behavior, IGNORE_HANGUL, that differs from the standard Unicode Normalization Forms. This option can be passed to the Normalizer constructors and to the static compose and decompose methods. This option, and any that are added in the future, will be turned off by default.
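This page does not spell out what IGNORE_HANGUL changes. The private constants in the member summary (HANGUL_BASE, JAMO_LBASE and so on) match the standard arithmetic mapping between precomposed Hangul syllables and jamo, so the sketch below assumes that this mapping is the step the option is concerned with. The function toyHangulToJamo is hypothetical and is not the private hangulToJamo method.

```cpp
#include <cstdint>
#include <string>

// Values copied from the private enum in the member summary.
constexpr char32_t HANGUL_BASE  = 0xac00;
constexpr char32_t HANGUL_LIMIT = 0xd7a4;
constexpr char32_t JAMO_LBASE   = 0x1100;
constexpr char32_t JAMO_VBASE   = 0x1161;
constexpr char32_t JAMO_TBASE   = 0x11a7;
constexpr std::uint32_t JAMO_VCOUNT = 21;
constexpr std::uint32_t JAMO_TCOUNT = 28;
constexpr std::uint32_t JAMO_NCOUNT = JAMO_VCOUNT * JAMO_TCOUNT;

// Arithmetic decomposition of one precomposed Hangul syllable into jamo.
// Code points outside [HANGUL_BASE, HANGUL_LIMIT) are returned unchanged.
std::u32string toyHangulToJamo(char32_t ch) {
    if (ch < HANGUL_BASE || ch >= HANGUL_LIMIT) {
        return std::u32string(1, ch);
    }
    const std::uint32_t index = ch - HANGUL_BASE;
    std::u32string out;
    // Leading consonant, then vowel.
    out.push_back(static_cast<char32_t>(JAMO_LBASE + index / JAMO_NCOUNT));
    out.push_back(static_cast<char32_t>(JAMO_VBASE + (index % JAMO_NCOUNT) / JAMO_TCOUNT));
    // Trailing consonant; a remainder of 0 means the syllable has none.
    const std::uint32_t t = index % JAMO_TCOUNT;
    if (t != 0) {
        out.push_back(static_cast<char32_t>(JAMO_TBASE + t));
    }
    return out;
}
```

For example, U+D55C decomposes to U+1112 U+1161 U+11AB, and U+AC00 to U+1100 U+1161 with no trailing consonant.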
There are three common usage models for Normalizer. In the first, the static normalize method is used to process an entire input string at once. Second, you can create a Normalizer object and use it to iterate through the normalized form of a string by calling first and next. Finally, you can use the setIndex and getIndex methods to perform random-access iteration, which is very useful for searching.
Note: Normalizer objects behave like iterators and have methods such as setIndex, next, previous, etc. You should note that while setIndex and getIndex refer to indices in the underlying input text being processed, the next and previous methods iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next and previous and the indices passed to and returned from setIndex and getIndex. It is for this reason that Normalizer does not implement the CharacterIterator interface.
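The mismatch between input indices and output characters can be seen in a toy iterator. ToyDecompIter below is a hypothetical stand-in, not the Normalizer class: it knows a single decomposition, and its getIndex is assumed to report how much of the input has been consumed so far.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Toy decomposing iterator (hypothetical; only U+00C1 is decomposed).
class ToyDecompIter {
public:
    enum : char32_t { DONE = 0xffff };  // same sentinel value as in the summary

    explicit ToyDecompIter(std::u32string text) : text_(std::move(text)) {}

    // Next character of the normalized OUTPUT, or DONE at the end.
    char32_t next() {
        if (bufPos_ == buf_.size()) {  // buffer exhausted: refill from the input
            if (index_ == text_.size()) {
                return DONE;
            }
            buf_ = decomposeOne(text_[index_++]);
            bufPos_ = 0;
        }
        return buf_[bufPos_++];
    }

    // Position in the INPUT text, not in the output.
    std::size_t getIndex() const { return index_; }

private:
    static std::u32string decomposeOne(char32_t c) {
        if (c == U'\u00C1') {
            return U"A\u0301";
        }
        return std::u32string(1, c);
    }

    std::u32string text_;
    std::u32string buf_;
    std::size_t index_ = 0;
    std::size_t bufPos_ = 0;
};
```

Iterating over the two-character input U+00C1 "b" yields three output characters, and getIndex stays at 1 while both halves of the decomposed A-acute are returned.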
Note: Normalizer is currently based on version 2.1.8 of the Unicode Standard. It will be updated as later versions of Unicode are released. If you are using this class on a JDK that supports an earlier version of Unicode, it is possible that Normalizer may generate composed or decomposed characters for which your JDK's java.lang.Character class does not have any data.
Definition at line 128 of file normlzr.h.
If DONE is returned, then there are no more normalization results available.
The options for a Normalizer object.
The mode of a Normalizer object.
Creates a new Normalizer object.
Creates a new Normalizer object.
Creates a new Normalizer object.
Creates a new Normalizer object.
Creates a new Normalizer object.
Creates a new Normalizer object.
Copy constructor.
Destructor.
Returns a pointer to a new Normalizer that is a clone of this one. The caller is responsible for deleting the new clone.
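Because the caller owns the returned pointer, one common way to avoid leaks is to wrap it in a smart pointer straight away. The sketch below shows the pattern on a hypothetical stand-in class, ToyIter, rather than on Normalizer itself; cloneOwned is likewise an illustrative helper.

```cpp
#include <memory>

// Hypothetical stand-in for a class whose clone() returns an owning raw pointer.
class ToyIter {
public:
    explicit ToyIter(int pos) : pos_(pos) {}
    ToyIter* clone() const { return new ToyIter(*this); }  // caller owns the result
    int pos() const { return pos_; }
    void advance() { ++pos_; }
private:
    int pos_;
};

// Take ownership of the clone immediately so it is deleted automatically.
std::unique_ptr<ToyIter> cloneOwned(const ToyIter& it) {
    return std::unique_ptr<ToyIter>(it.clone());
}
```

The clone is independent of the original: advancing one does not move the other.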
Compose a String.
Return the current character in the normalized text.
Static method to decompose a String.
Retrieve the index of the end of the input text.
This is the end index of the CharacterIterator or the length of the string over which this Normalizer is iterating.
Return the first character in the normalized text.
This resets the Normalizer's position to the beginning of the text.
Retrieve the current iteration position in the input text that is being normalized. This method is useful in applications such as searching, where you need to be able to determine the position in the input text that corresponds to a given normalized output character.
Note: This method returns a position in the input text, while next and previous iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next and previous and the indices passed to and returned from setIndex and getIndex.
Return the basic operation performed by this Normalizer.
Converts C's UNormalizationMode to C++'s Normalizer::EMode.
Determine whether an option is turned on or off.
Returns a pointer to the UChar buffer holding the text under iteration.
Copies the text under iteration into the UnicodeString referred to by "result".
Converts C++'s Normalizer::EMode to C's UNormalizationMode.
Generates a hash code for this iterator.
Return the last character in the normalized text.
This resets the Normalizer's position to be just before the input text corresponding to that normalized character.
Return the next character in the normalized text and advance the iteration position by one. If the end of the text has already been reached, #DONE is returned.
Normalizes a String using the given normalization operation.
Returns true when both iterators refer to the same character in the same character-storage object.
Referenced by operator!=().
Return the previous character in the normalized text and decrement the iteration position by one. If the beginning of the text has already been reached, #DONE is returned.
Performs a quick check on a string to determine whether it is in a particular normalization form. Three results are possible: UNORM_YES, UNORM_NO, or UNORM_MAYBE. UNORM_YES indicates that the argument string is in the desired normalized form; UNORM_NO indicates that it is not. UNORM_MAYBE indicates that a more thorough check is required: the caller may have to normalize the string and compare the result with the original.
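The three-way result can be sketched with the quick-check loop from Unicode Technical Report #15, here for the composed form. This is a toy under stated assumptions, not the ICU implementation: toyProps is a hypothetical three-entry property table, the TOY_ result names mirror UNORM_YES, UNORM_NO and UNORM_MAYBE, and the caller supplies the normalization function used in the fallback.

```cpp
#include <string>

enum ToyCheckResult { TOY_NO, TOY_YES, TOY_MAYBE };

struct ToyProps {
    int cc;                        // canonical combining class
    ToyCheckResult nfcQuickCheck;  // quick-check property for the composed form
};

// Hypothetical property lookup covering three characters only.
ToyProps toyProps(char32_t c) {
    switch (c) {
        case U'\u0301': return {230, TOY_MAYBE};  // COMBINING ACUTE ACCENT
        case U'\u0323': return {220, TOY_MAYBE};  // COMBINING DOT BELOW
        case U'\u212B': return {0, TOY_NO};       // ANGSTROM SIGN
        default:        return {0, TOY_YES};
    }
}

// Single pass over the string; never normalizes anything.
ToyCheckResult toyQuickCheckComposed(const std::u32string& s) {
    ToyCheckResult result = TOY_YES;
    int lastCC = 0;
    for (char32_t c : s) {
        const ToyProps p = toyProps(c);
        if (p.cc != 0 && lastCC > p.cc) return TOY_NO;  // marks out of canonical order
        if (p.nfcQuickCheck == TOY_NO) return TOY_NO;
        if (p.nfcQuickCheck == TOY_MAYBE) result = TOY_MAYBE;
        lastCC = p.cc;
    }
    return result;
}

// On MAYBE, fall back to normalizing and comparing with the original.
template <typename NormalizeFn>
bool toyIsNormalized(const std::u32string& s, NormalizeFn normalize) {
    switch (toyQuickCheckComposed(s)) {
        case TOY_YES: return true;
        case TOY_NO:  return false;
        default:      return normalize(s) == s;
    }
}
```

The point of the design is that most strings are answered by the cheap pass, and the expensive normalize-and-compare step runs only for the MAYBE case.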
Reset the iterator so that it is in the same state that it was just after it was constructed.
A subsequent call to next will return the first character in the normalized text.
Set the iteration position in the input text that is being normalized and return the first normalized character at that position.
Note: This method sets the position in the input text, while next and previous iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next and previous and the indices passed to and returned from setIndex and getIndex.
Set the normalization mode for this object.
Note: If the normalization mode is changed while iterating over a string, calls to next and previous may return previously buffered characters in the old normalization mode until the iteration is able to re-sync at the next base character. It is safest to call setText, first, last, etc. after calling setMode.
Set options that affect this Normalizer's operation. Options do not change the basic composition or decomposition operation that is being performed, but they control whether certain optional portions of the operation are done. Currently the only available option is IGNORE_HANGUL.
Set the input text over which this Normalizer will iterate. The iteration position is set to the beginning.
Set the input text over which this Normalizer will iterate. The iteration position is set to the beginning.
Set the input text over which this Normalizer will iterate. The iteration position is set to the beginning.
Retrieve the index of the start of the input text.
This is the begin index of the CharacterIterator or the start (i.e. 0) of the string over which this Normalizer is iterating.