com.ibm.text
Class Normalizer

java.lang.Object
  |
  +--com.ibm.text.Normalizer

public final class Normalizer
extends java.lang.Object

Normalizer transforms Unicode text into an equivalent composed or decomposed form, allowing for easier sorting and searching of text. Normalizer supports the standard normalization forms described in Unicode Technical Report #15.

Characters with accents or other adornments can be encoded in several different ways in Unicode. For example, take the character "Á" (A-acute). In Unicode, this can be encoded as a single character (the "composed" form):

      00C1    LATIN CAPITAL LETTER A WITH ACUTE
or as two separate characters (the "decomposed" form):
      0041    LATIN CAPITAL LETTER A
      0301    COMBINING ACUTE ACCENT

To a user of your program, however, both of these sequences should be treated as the same "user-level" character "Á". When you are searching or comparing text, you must ensure that these two sequences are treated equivalently. In addition, you must handle characters with more than one accent. Sometimes the order of a character's combining accents is significant, while in other cases accent sequences in different orders are really equivalent.
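The equivalence described above can be illustrated with a minimal sketch. It uses the JDK's java.text.Normalizer as a stand-in so that it runs without this library. The class name CanonicalEquivalenceSketch and the helper canonicallyEqual are hypothetical names chosen for illustration. With this class, the corresponding call would presumably be Normalizer.normalize(str, Normalizer.DECOMP, 0).

```java
import java.text.Normalizer;

final class CanonicalEquivalenceSketch {
    /** Returns true when both strings have the same canonical decomposition (NFD). */
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFD);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFD);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "\u00C1";    // LATIN CAPITAL LETTER A WITH ACUTE
        String decomposed = "A\u0301"; // LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
        System.out.println(composed.equals(decomposed));            // false: raw code units differ
        System.out.println(canonicallyEqual(composed, decomposed)); // true: same normalized form
    }
}
```

Comparing in decomposed form is one choice; comparing both strings in composed form would work equally well, as long as both sides use the same form.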

Similarly, the string "ffi" can be encoded as three separate letters:

      0066    LATIN SMALL LETTER F
      0066    LATIN SMALL LETTER F
      0069    LATIN SMALL LETTER I
or as the single character
      FB03    LATIN SMALL LIGATURE FFI

The ffi ligature is not a distinct semantic character, and strictly speaking it shouldn't be in Unicode at all, but it was included for compatibility with existing character sets that already provided it. The Unicode standard identifies such characters by giving them "compatibility" decompositions into the corresponding semantic characters. When sorting and searching, you will often want to use these mappings.
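As a rough illustration of a compatibility mapping, the sketch below folds the ffi ligature into its three letters. It relies on the JDK's java.text.Normalizer as a stand-in (form NFKD corresponds to a compatibility decomposition); CompatibilitySketch and foldCompatibility are hypothetical names.

```java
import java.text.Normalizer;

final class CompatibilitySketch {
    /** Applies a compatibility decomposition (NFKD in the JDK stand-in). */
    static String foldCompatibility(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String ligature = "\uFB03"; // LATIN SMALL LIGATURE FFI
        // A canonical decomposition leaves the ligature alone...
        System.out.println(Normalizer.normalize(ligature, Normalizer.Form.NFD).length()); // 1
        // ...while a compatibility decomposition yields the three plain letters.
        System.out.println(foldCompatibility(ligature)); // ffi
    }
}
```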

Normalizer helps solve these problems by transforming text into the canonical composed and decomposed forms as shown in the first example above. In addition, you can have it perform compatibility decompositions so that you can treat compatibility characters the same as their equivalents. Finally, Normalizer rearranges accents into the proper canonical order, so that you do not have to worry about accent rearrangement on your own.
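The accent reordering mentioned above can be sketched as follows, again with the JDK's java.text.Normalizer as a stand-in. A dot below (U+0323) and a dot above (U+0307) do not interact, so either input order normalizes to the same sequence. An acute (U+0301) and a diaeresis (U+0308) both sit above the letter, so their order is significant and is preserved. AccentOrderSketch and decompose are hypothetical names.

```java
import java.text.Normalizer;

final class AccentOrderSketch {
    /** Canonical decomposition, which also puts combining marks in canonical order. */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String aboveThenBelow = "a\u0307\u0323"; // a, dot above, dot below
        String belowThenAbove = "a\u0323\u0307"; // a, dot below, dot above
        // Marks in different positions are reordered into one canonical sequence.
        System.out.println(decompose(aboveThenBelow).equals(decompose(belowThenAbove))); // true

        String acuteFirst = "a\u0301\u0308";     // a, acute, diaeresis
        String diaeresisFirst = "a\u0308\u0301"; // a, diaeresis, acute
        // Marks in the same position keep their order, so these stay distinct.
        System.out.println(decompose(acuteFirst).equals(decompose(diaeresisFirst))); // false
    }
}
```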

Normalizer adds one optional behavior, IGNORE_HANGUL, that differs from the standard Unicode Normalization Forms. This option can be passed to the constructors and to the static compose and decompose methods. This option, and any that are added in the future, will be turned off by default.

There are three common usage models for Normalizer. In the first, the static normalize() method is used to process an entire input string at once. Second, you can create a Normalizer object and use it to iterate through the normalized form of a string by calling first() and next(). Finally, you can use the setIndex() and getIndex() methods to perform random-access iteration, which is very useful for searching.

Note: Normalizer objects behave like iterators and have methods such as setIndex, next, previous, etc. You should note that while setIndex and getIndex refer to indices in the underlying input text being processed, the next and previous methods iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next and previous and the indices passed to and returned from setIndex and getIndex. It is for this reason that Normalizer does not implement the CharacterIterator interface.
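To make the mismatch between input indices and normalized output concrete, here is a minimal sketch. It does not use this class: it normalizes with the JDK's java.text.Normalizer and then walks the result with a StringCharacterIterator, mirroring the first()/next()/DONE loop described above. NormalizedIterationSketch and iterateDecomposed are hypothetical names.

```java
import java.text.CharacterIterator;
import java.text.Normalizer;
import java.text.StringCharacterIterator;

final class NormalizedIterationSketch {
    /** Walks the decomposed form of input one character at a time until DONE. */
    static String iterateDecomposed(String input) {
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFD);
        CharacterIterator it = new StringCharacterIterator(normalized);
        StringBuilder seen = new StringBuilder();
        for (char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
            seen.append(c);
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        String input = "\u00C1b"; // A WITH ACUTE followed by b: two input characters
        String output = iterateDecomposed(input);
        System.out.println(input.length());  // 2
        System.out.println(output.length()); // 3: A, COMBINING ACUTE ACCENT, b
    }
}
```

Here the third output character corresponds to input index 1, which is why output positions cannot be used directly as input indices.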

Note: Normalizer is currently based on version 2.1.8 of the Unicode Standard. It will be updated as later versions of Unicode are released. If you are using this class on a JDK that supports an earlier version of Unicode, it is possible that Normalizer may generate composed or decomposed characters for which your JDK's Character class does not have any data.

Author:
Laura Werner, Mark Davis

Inner Class Summary
static class Normalizer.Mode
          This class represents the mode of a Normalizer object, i.e. the Unicode Normalization Form of the text that the Normalizer produces.
 
Field Summary
static Normalizer.Mode COMPOSE
          Canonical decomposition followed by canonical composition.
static Normalizer.Mode COMPOSE_COMPAT
          Compatibility decomposition followed by canonical composition.
static Normalizer.Mode DECOMP
          Canonical decomposition.
static Normalizer.Mode DECOMP_COMPAT
          Compatibility decomposition.
static char DONE
          Constant indicating that the end of the iteration has been reached.
static int IGNORE_HANGUL
          Option to disable Hangul/Jamo composition and decomposition.
static Normalizer.Mode NO_OP
          Null operation for use with the constructors and the static normalize method.
 
Constructor Summary
Normalizer(java.text.CharacterIterator iter, Normalizer.Mode mode)
          Creates a new Normalizer object for iterating over the normalized form of the given text.
Normalizer(java.text.CharacterIterator iter, Normalizer.Mode mode, int opt)
          Creates a new Normalizer object for iterating over the normalized form of the given text.
Normalizer(java.lang.String str, Normalizer.Mode mode)
          Creates a new Normalizer object for iterating over the normalized form of a given string.
Normalizer(java.lang.String str, Normalizer.Mode mode, int opt)
          Creates a new Normalizer object for iterating over the normalized form of a given string.
 
Method Summary
 java.lang.Object clone()
          Clones this Normalizer object.
static java.lang.String compose(java.lang.String source, boolean compat, int options)
          Compose a String.
 char current()
          Return the current character in the normalized text.
static java.lang.String decompose(java.lang.String source, boolean compat, int options)
          Static method to decompose a String.
 char first()
          Return the first character in the normalized text.
 int getBeginIndex()
          Retrieve the index of the start of the input text.
 int getEndIndex()
          Retrieve the index of the end of the input text.
 int getIndex()
          Retrieve the current iteration position in the input text that is being normalized.
 Normalizer.Mode getMode()
          Return the basic operation performed by this Normalizer.
 boolean getOption(int option)
          Determine whether an option is turned on or off.
 char last()
          Return the last character in the normalized text.
 char next()
          Return the next character in the normalized text and advance the iteration position by one.
static java.lang.String normalize(java.lang.String str, Normalizer.Mode mode, int options)
          Normalizes a String using the given normalization operation.
 char previous()
          Return the previous character in the normalized text and decrement the iteration position by one.
 char setIndex(int index)
          Set the iteration position in the input text that is being normalized and return the first normalized character at that position.
 void setMode(Normalizer.Mode newMode)
          Set the normalization mode for this object.
 void setOption(int option, boolean value)
          Set options that affect this Normalizer's operation.
 void setText(java.text.CharacterIterator newText)
          Set the input text over which this Normalizer will iterate.
 void setText(java.lang.String newText)
          Set the input text over which this Normalizer will iterate.
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DONE

public static final char DONE
Constant indicating that the end of the iteration has been reached. This is guaranteed to have the same value as CharacterIterator.DONE.

NO_OP

public static final Normalizer.Mode NO_OP
Null operation for use with the constructors and the static normalize method. This value tells the Normalizer to do nothing but return unprocessed characters from the underlying String or CharacterIterator. If you have code which requires raw text at some times and normalized text at others, you can use NO_OP for the cases where you want raw text, rather than having a separate code path that bypasses Normalizer altogether.

See Also:
setMode(com.ibm.text.Normalizer.Mode)
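The single-code-path idea behind NO_OP can be sketched with plain JDK types. In this hypothetical stand-in, a "mode" is modeled as a UnaryOperator<String>, and the constants NO_OP and DECOMP below are local illustrations, not the fields of this class.

```java
import java.text.Normalizer;
import java.util.function.UnaryOperator;

final class NoOpModeSketch {
    // Hypothetical stand-ins for modes: each one maps raw text to processed text.
    static final UnaryOperator<String> NO_OP = UnaryOperator.identity();
    static final UnaryOperator<String> DECOMP =
            s -> Normalizer.normalize(s, Normalizer.Form.NFD);

    /** One code path for both raw and normalized text; only the mode differs. */
    static int countChars(String text, UnaryOperator<String> mode) {
        return mode.apply(text).length();
    }

    public static void main(String[] args) {
        String text = "\u00C1"; // LATIN CAPITAL LETTER A WITH ACUTE
        System.out.println(countChars(text, NO_OP));  // 1: raw text
        System.out.println(countChars(text, DECOMP)); // 2: decomposed text
    }
}
```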

COMPOSE

public static final Normalizer.Mode COMPOSE
Canonical decomposition followed by canonical composition. Used with the constructors and the static normalize method to determine the operation to be performed.

If all optional features (e.g. IGNORE_HANGUL) are turned off, this operation produces output that is in Unicode Normalization Form C.

See Also:
setMode(com.ibm.text.Normalizer.Mode)

COMPOSE_COMPAT

public static final Normalizer.Mode COMPOSE_COMPAT
Compatibility decomposition followed by canonical composition. Used with the constructors and the static normalize method to determine the operation to be performed.

If all optional features (e.g. IGNORE_HANGUL) are turned off, this operation produces output that is in Unicode Normalization Form KC.

See Also:
setMode(com.ibm.text.Normalizer.Mode)

DECOMP

public static final Normalizer.Mode DECOMP
Canonical decomposition. This value is passed to the constructors and the static normalize method to determine the operation to be performed.

If all optional features (e.g. IGNORE_HANGUL) are turned off, this operation produces output that is in Unicode Normalization Form D.

See Also:
setMode(com.ibm.text.Normalizer.Mode)

DECOMP_COMPAT

public static final Normalizer.Mode DECOMP_COMPAT
Compatibility decomposition. This value is passed to the constructors and the static normalize method to determine the operation to be performed.

If all optional features (e.g. IGNORE_HANGUL) are turned off, this operation produces output that is in Unicode Normalization Form KD.

See Also:
setMode(com.ibm.text.Normalizer.Mode)
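The four modes above correspond to Forms C, KC, D and KD. As an illustration of how they differ, the following sketch prints all four forms of one input using the JDK's java.text.Normalizer as a stand-in (NFC, NFKC, NFD and NFKD there). NormalizationFormsSketch, hex and describe are hypothetical names.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

final class NormalizationFormsSketch {
    /** Formats each UTF-16 code unit of s as four hex digits, separated by spaces. */
    static String hex(String s) {
        return s.chars()
                .mapToObj(c -> String.format("%04X", c))
                .collect(Collectors.joining(" "));
    }

    /** Normalizes s to the given form and returns the result as hex code units. */
    static String describe(String s, Normalizer.Form form) {
        return hex(Normalizer.normalize(s, form));
    }

    public static void main(String[] args) {
        String input = "\u00C1\uFB03"; // A WITH ACUTE, then the FFI ligature
        for (Normalizer.Form form : Normalizer.Form.values()) {
            System.out.println(form + ": " + describe(input, form));
        }
    }
}
```

For this input the canonical forms keep the ligature (00C1 FB03 composed, 0041 0301 FB03 decomposed), while the compatibility forms replace it with 0066 0066 0069.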

IGNORE_HANGUL

public static final int IGNORE_HANGUL
Option to disable Hangul/Jamo composition and decomposition. This option applies to Korean text, which can be represented either in the Jamo alphabet or in Hangul characters, which are really just two or three Jamo combined into one visual glyph. Since Jamo takes up more storage space than Hangul, applications that process only Hangul text may wish to turn this option on when decomposing text.

The Unicode standard treats Hangul to Jamo conversion as a canonical decomposition, so this option must be turned off if you wish to transform strings into one of the standard Unicode Normalization Forms.

See Also:
setOption(int, boolean)
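The Hangul/Jamo conversion that IGNORE_HANGUL disables can be seen in a small sketch. It uses the JDK's java.text.Normalizer as a stand-in, which always performs the standard conversion and, as far as this sketch assumes, has no equivalent of IGNORE_HANGUL. HangulSketch, toJamo and toHangul are hypothetical names.

```java
import java.text.Normalizer;

final class HangulSketch {
    /** Canonical decomposition: a Hangul syllable becomes its two or three Jamo. */
    static String toJamo(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Canonical composition: conjoining Jamo are recombined into syllables. */
    static String toHangul(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String syllable = "\uD55C"; // HANGUL SYLLABLE HAN
        String jamo = toJamo(syllable);
        System.out.println(syllable.length()); // 1
        System.out.println(jamo.length());     // 3: the decomposed text takes more storage
        System.out.println(toHangul(jamo).equals(syllable)); // true
    }
}
```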
Constructor Detail

Normalizer

public Normalizer(java.lang.String str,
                  Normalizer.Mode mode)
Creates a new Normalizer object for iterating over the normalized form of a given string.

Parameters:
str - The string to be normalized. The normalization will start at the beginning of the string.
mode - The normalization mode.

Normalizer

public Normalizer(java.lang.String str,
                  Normalizer.Mode mode,
                  int opt)
Creates a new Normalizer object for iterating over the normalized form of a given string.

The opt parameter specifies which optional Normalizer features are to be enabled for this object.

Parameters:
str - The string to be normalized. The normalization will start at the beginning of the string.
mode - The normalization mode.
opt - Any optional features to be enabled. Currently the only available option is IGNORE_HANGUL. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument.

Normalizer

public Normalizer(java.text.CharacterIterator iter,
                  Normalizer.Mode mode)
Creates a new Normalizer object for iterating over the normalized form of the given text.

Parameters:
iter - The input text to be normalized. The normalization will start at the beginning of the text.
mode - The normalization mode.

Normalizer

public Normalizer(java.text.CharacterIterator iter,
                  Normalizer.Mode mode,
                  int opt)
Creates a new Normalizer object for iterating over the normalized form of the given text.

Parameters:
iter - The input text to be normalized. The normalization will start at the beginning of the text.
mode - The normalization mode.
opt - Any optional features to be enabled. Currently the only available option is IGNORE_HANGUL. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument.
Method Detail

clone

public java.lang.Object clone()
Clones this Normalizer object. All properties of this object are duplicated in the new object, including the cloning of any CharacterIterator that was passed in to the constructor or to setText. However, the text storage underlying the CharacterIterator is not duplicated unless the iterator's clone method does so.
Overrides:
clone in class java.lang.Object

normalize

public static java.lang.String normalize(java.lang.String str,
                                         Normalizer.Mode mode,
                                         int options)
Normalizes a String using the given normalization operation.

The options parameter specifies which optional Normalizer features are to be enabled for this operation. Currently the only available option is IGNORE_HANGUL. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument.

Parameters:
str - the input string to be normalized.
mode - the normalization mode.
options - the optional features to be enabled.

compose

public static java.lang.String compose(java.lang.String source,
                                       boolean compat,
                                       int options)
Compose a String.

The options parameter specifies which optional Normalizer features are to be enabled for this operation. Currently the only available option is IGNORE_HANGUL. If you want the default behavior corresponding to Unicode Normalization Form C or KC, use 0 for this argument.

Parameters:
source - the string to be composed.
compat - Perform compatibility decomposition before composition. If this argument is false, only canonical decomposition will be performed.
options - the optional features to be enabled.
Returns:
the composed string.

decompose

public static java.lang.String decompose(java.lang.String source,
                                         boolean compat,
                                         int options)
Static method to decompose a String.

The options parameter specifies which optional Normalizer features are to be enabled for this operation. Currently the only available option is IGNORE_HANGUL. The desired options should be OR'ed together to determine the value of this argument. If you want the default behavior corresponding to Unicode Normalization Form D or KD, use 0 for this argument.

Parameters:
source - the string to be decomposed.
compat - Perform compatibility decomposition. If this argument is false, only canonical decomposition will be performed.
options - the optional features to be enabled.
Returns:
the decomposed string.
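The effect of the compat argument on compose and decompose can be sketched with two hypothetical helpers built on the JDK's java.text.Normalizer. They mirror the signatures above except for the options argument, which is omitted because the JDK stand-in has no optional features.

```java
import java.text.Normalizer;

final class ComposeDecomposeSketch {
    /** Decomposes, then composes; compat selects compatibility decomposition first. */
    static String compose(String source, boolean compat) {
        return Normalizer.normalize(source,
                compat ? Normalizer.Form.NFKC : Normalizer.Form.NFC);
    }

    /** Decomposes only; compat selects compatibility rather than canonical mappings. */
    static String decompose(String source, boolean compat) {
        return Normalizer.normalize(source,
                compat ? Normalizer.Form.NFKD : Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String text = "A\u0301\uFB03"; // A, COMBINING ACUTE ACCENT, FFI ligature
        System.out.println(compose(text, false).length());   // 2: accent composed, ligature kept
        System.out.println(compose(text, true).length());    // 4: accent composed, ligature split
        System.out.println(decompose(text, true).length());  // 5: everything split
    }
}
```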

current

public char current()
Return the current character in the normalized text.

first

public char first()
Return the first character in the normalized text. This resets the Normalizer's position to the beginning of the text.

last

public char last()
Return the last character in the normalized text. This resets the Normalizer's position to be just before the input text corresponding to that normalized character.

next

public char next()
Return the next character in the normalized text and advance the iteration position by one. If the end of the text has already been reached, DONE is returned.

previous

public char previous()
Return the previous character in the normalized text and decrement the iteration position by one. If the beginning of the text has already been reached, DONE is returned.

setIndex

public char setIndex(int index)
Set the iteration position in the input text that is being normalized and return the first normalized character at that position.

Parameters:
index - the desired index in the input text.
Returns:
the first normalized character that is the result of iterating forward starting at the given index.
Throws:
java.lang.IllegalArgumentException - if the given index is less than getBeginIndex() or greater than getEndIndex().

getIndex

public final int getIndex()
Retrieve the current iteration position in the input text that is being normalized. This method is useful in applications such as searching, where you need to be able to determine the position in the input text that corresponds to a given normalized output character.
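The searching use case can be approximated without this class. The sketch below is an assumption-laden stand-in, not how getIndex is implemented: it decomposes the input one segment at a time (a character plus any combining marks that follow it) with the JDK's java.text.Normalizer, and records which input index produced each output character. NormalizedSearchSketch, isMark and indexOfNormalized are hypothetical names.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

final class NormalizedSearchSketch {
    /** True for code points in the general categories Mn, Mc or Me. */
    static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /**
     * Finds needle in haystack after decomposing both, and returns the INPUT
     * index of the segment containing the start of the match, or -1 if absent.
     */
    static int indexOfNormalized(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0;
        }
        StringBuilder output = new StringBuilder();
        List<Integer> origin = new ArrayList<>(); // origin.get(k): input index behind output char k
        int start = 0;
        while (start < haystack.length()) {
            // A segment is one code point plus the combining marks that follow it.
            int end = start + Character.charCount(haystack.codePointAt(start));
            while (end < haystack.length() && isMark(haystack.codePointAt(end))) {
                end += Character.charCount(haystack.codePointAt(end));
            }
            String piece = Normalizer.normalize(
                    haystack.substring(start, end), Normalizer.Form.NFD);
            for (int k = 0; k < piece.length(); k++) {
                origin.add(start);
            }
            output.append(piece);
            start = end;
        }
        int hit = output.indexOf(Normalizer.normalize(needle, Normalizer.Form.NFD));
        return hit < 0 ? -1 : origin.get(hit);
    }

    public static void main(String[] args) {
        String text = "\u00C1\u00C9z"; // A WITH ACUTE, E WITH ACUTE, z
        // In the decomposed output z is the fifth character, but it is input index 2.
        System.out.println(indexOfNormalized(text, "z"));
        // A decomposed needle still matches the composed input character at index 1.
        System.out.println(indexOfNormalized(text, "E\u0301"));
    }
}
```

Segmenting on combining marks keeps the mapping simple; a match that begins in the middle of a segment is reported at that segment's starting index.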

getBeginIndex

public final int getBeginIndex()
Retrieve the index of the start of the input text. This is the begin index of the CharacterIterator or the start (i.e. 0) of the String over which this Normalizer is iterating.

getEndIndex

public final int getEndIndex()
Retrieve the index of the end of the input text. This is the end index of the CharacterIterator or the length of the String over which this Normalizer is iterating.

setMode

public void setMode(Normalizer.Mode newMode)
Set the normalization mode for this object.

Note: If the normalization mode is changed while iterating over a string, calls to next() and previous() may return previously buffered characters in the old normalization mode until the iteration is able to re-sync at the next base character. It is safest to call setText(), first(), last(), etc. after calling setMode.

Parameters:
newMode - the new mode for this Normalizer. The supported modes are:
  • COMPOSE - Unicode canonical decomposition followed by canonical composition.
  • COMPOSE_COMPAT - Unicode compatibility decomposition followed by canonical composition.
  • DECOMP - Unicode canonical decomposition.
  • DECOMP_COMPAT - Unicode compatibility decomposition.
  • NO_OP - Do nothing but return characters from the underlying input text.
See Also:
getMode()

getMode

public Normalizer.Mode getMode()
Return the basic operation performed by this Normalizer.
See Also:
setMode(com.ibm.text.Normalizer.Mode)

setOption

public void setOption(int option,
                      boolean value)
Set options that affect this Normalizer's operation. Options do not change the basic composition or decomposition operation that is being performed, but they control whether certain optional portions of the operation are done. Currently the only available option is IGNORE_HANGUL.

Parameters:
option - the option whose value is to be set.
value - the new setting for the option. Use true to turn the option on and false to turn it off.
See Also:
getOption(int)

getOption

public boolean getOption(int option)
Determine whether an option is turned on or off.

See Also:
setOption(int, boolean)

setText

public void setText(java.lang.String newText)
Set the input text over which this Normalizer will iterate. The iteration position will be reset to the beginning.

Parameters:
newText - The new string to be normalized.

setText

public void setText(java.text.CharacterIterator newText)
Set the input text over which this Normalizer will iterate. The iteration position will be reset to the beginning.

Parameters:
newText - The new text to be normalized.


Copyright (c) 2001 IBM Corporation and others.