java.lang.Object | +--com.ibm.text.Normalizer
Normalizer transforms Unicode text into an equivalent composed or decomposed form, allowing for easier sorting and searching of text. Normalizer supports the standard normalization forms described in Unicode Technical Report #15.
Characters with accents or other adornments can be encoded in several different ways in Unicode. For example, take the character "Á" (A-acute). In Unicode, this can be encoded as a single character (the "composed" form):
00C1 LATIN CAPITAL LETTER A WITH ACUTE
or as two separate characters (the "decomposed" form):
0041 LATIN CAPITAL LETTER A
0301 COMBINING ACUTE ACCENT
To a user of your program, however, both of these sequences should be treated as the same "user-level" character "Á". When you are searching or comparing text, you must ensure that these two sequences are treated equivalently. In addition, you must handle characters with more than one accent. Sometimes the order of a character's combining accents is significant, while in other cases accent sequences in different orders are really equivalent.
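The equivalence of the composed and decomposed sequences can be sketched as follows. This example assumes the JDK's java.text.Normalizer as a stand-in for the com.ibm.text.Normalizer class documented here; the class name ComposedVsDecomposed and the helper canonicallyEqual are made up for illustration.

```java
import java.text.Normalizer;

final class ComposedVsDecomposed {
    // U+00C1 LATIN CAPITAL LETTER A WITH ACUTE (the "composed" form)
    static final String COMPOSED = "\u00C1";
    // U+0041 LATIN CAPITAL LETTER A followed by U+0301 COMBINING ACUTE ACCENT
    static final String DECOMPOSED = "A\u0301";

    /** Returns true if both strings have the same canonical decomposition. */
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFD);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFD);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        // A raw comparison sees two different char sequences.
        System.out.println("raw equal:        " + COMPOSED.equals(DECOMPOSED));
        // Comparing after normalization treats them as the same character.
        System.out.println("normalized equal: " + canonicallyEqual(COMPOSED, DECOMPOSED));
    }
}
```

Normalizing both sides to the same form before comparing is the simplest way to get "user-level" equality; which form is used matters less than using the same one for both operands.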
Similarly, the string "ffi" can be encoded as three separate letters:
0066 LATIN SMALL LETTER F
0066 LATIN SMALL LETTER F
0069 LATIN SMALL LETTER I
or as the single character
FB03 LATIN SMALL LIGATURE FFI
The ffi ligature is not a distinct semantic character, and strictly speaking it shouldn't be in Unicode at all, but it was included for compatibility with existing character sets that already provided it. The Unicode standard identifies such characters by giving them "compatibility" decompositions into the corresponding semantic characters. When sorting and searching, you will often want to use these mappings.
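A minimal sketch of using compatibility mappings when searching, assuming the JDK's java.text.Normalizer (form NFKD) in place of the class documented here. The names LigatureSearch, foldCompat, and containsCompat are hypothetical.

```java
import java.text.Normalizer;

final class LigatureSearch {
    /** Replaces compatibility characters such as U+FB03 with their plain equivalents. */
    static String foldCompat(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    /** Searches for needle in haystack after compatibility-decomposing both. */
    static boolean containsCompat(String haystack, String needle) {
        return foldCompat(haystack).contains(foldCompat(needle));
    }

    public static void main(String[] args) {
        String text = "o\uFB03ce"; // "office" written with the ffi ligature
        System.out.println("raw search:    " + text.contains("ffi"));
        System.out.println("compat search: " + containsCompat(text, "ffi"));
    }
}
```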
Normalizer helps solve these problems by transforming text into the canonical composed and decomposed forms as shown in the first example above. In addition, you can have it perform compatibility decompositions so that you can treat compatibility characters the same as their equivalents. Finally, Normalizer rearranges accents into the proper canonical order, so that you do not have to worry about accent rearrangement on your own.
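The accent rearrangement mentioned above can be illustrated with two combining marks. This sketch again assumes java.text.Normalizer as a stand-in; AccentOrder and its method are illustrative names. U+0323 COMBINING DOT BELOW and U+0301 COMBINING ACUTE ACCENT have different combining classes, so their relative order is not significant and normalization puts them into one canonical order. U+0301 and U+0308 COMBINING DIAERESIS share a combining class, so their order is significant and is preserved.

```java
import java.text.Normalizer;

final class AccentOrder {
    /** Canonically decomposes s, which also puts combining marks into canonical order. */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // Different combining classes: both input orders yield the same output.
        System.out.println(decompose("a\u0301\u0323").equals(decompose("a\u0323\u0301")));
        // Same combining class: the two orders stay distinct.
        System.out.println(decompose("a\u0301\u0308").equals(decompose("a\u0308\u0301")));
    }
}
```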
Normalizer adds one optional behavior, IGNORE_HANGUL, that differs from the standard Unicode Normalization Forms. This option can be passed to the constructors and to the static compose and decompose methods. This option, and any that are added in the future, will be turned off by default.
There are three common usage models for Normalizer. In the first, the static normalize() method is used to process an entire input string at once. Second, you can create a Normalizer object and use it to iterate through the normalized form of a string by calling first() and next(). Finally, you can use the setIndex() and getIndex() methods to perform random-access iteration, which is very useful for searching.
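The first two usage models can be approximated with JDK classes only. The sketch below is not the com.ibm.text.Normalizer API: it assumes java.text.Normalizer for the whole-string step and walks the result with a java.text.StringCharacterIterator, whose first()/next()/DONE loop has the same shape as the iteration described above. NormalizedIteration and hexDump are hypothetical names.

```java
import java.text.CharacterIterator;
import java.text.Normalizer;
import java.text.StringCharacterIterator;

final class NormalizedIteration {
    /** Normalizes input in one call, then iterates the result and lists each char in hex. */
    static String hexDump(String input, Normalizer.Form form) {
        // Usage model 1: process the entire string at once.
        String normalized = Normalizer.normalize(input, form);

        // Usage model 2 (approximated): step through the normalized characters.
        CharacterIterator it = new StringCharacterIterator(normalized);
        StringBuilder out = new StringBuilder();
        for (char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%04X", (int) c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexDump("\u00C1b", Normalizer.Form.NFD));
    }
}
```

Unlike the iterator described in this class, this stand-in normalizes the whole string up front, so it does not offer incremental or random-access normalization over the input text.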
Note: Normalizer objects behave like iterators and have methods such as setIndex, next, previous, etc. You should note that while setIndex and getIndex refer to indices in the underlying input text being processed, the next and previous methods iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next and previous and the indices passed to and returned from setIndex and getIndex. It is for this reason that Normalizer does not implement the CharacterIterator interface.
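Why input indices and normalized output do not line up can be seen by counting how many output characters each input character produces. This is a simplified sketch under stated assumptions: it uses java.text.Normalizer, normalizes each char in isolation (so it ignores reordering across characters and surrogate pairs), and the names IndexMismatch and expansionPerInputChar are invented for the example.

```java
import java.text.Normalizer;

final class IndexMismatch {
    /** For each input index, returns the number of chars that input char yields in NFD. */
    static int[] expansionPerInputChar(String input) {
        int[] counts = new int[input.length()];
        for (int i = 0; i < input.length(); i++) {
            String one = String.valueOf(input.charAt(i));
            counts[i] = Normalizer.normalize(one, Normalizer.Form.NFD).length();
        }
        return counts;
    }

    public static void main(String[] args) {
        String input = "\u00C1b"; // two input chars
        int total = 0;
        for (int n : expansionPerInputChar(input)) {
            total += n;
        }
        // The normalized output is longer than the input, so an output position
        // cannot be used directly as an input index.
        System.out.println(input.length() + " input chars -> " + total + " output chars");
    }
}
```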
Note: Normalizer is currently based on version 2.1.8 of the Unicode Standard. It will be updated as later versions of Unicode are released. If you are using this class on a JDK that supports an earlier version of Unicode, it is possible that Normalizer may generate composed or decomposed characters for which your JDK's Character class does not have any data.
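One way to detect the situation described in this note is to ask the running JDK whether it has data for every character in a normalized string. The sketch below relies only on Character.isDefined; the class DefinedCheck and the method allDefined are hypothetical helpers, not part of Normalizer.

```java
final class DefinedCheck {
    /** Returns true if this JDK's Character class has data for every code point in s. */
    static boolean allDefined(String s) {
        return s.codePoints().allMatch(Character::isDefined);
    }

    public static void main(String[] args) {
        // A followed by a combining acute accent: both long-standing characters.
        System.out.println(allDefined("A\u0301"));
        // U+FFFF is a noncharacter, so no JDK has character data for it.
        System.out.println(allDefined("\uFFFF"));
    }
}
```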
Inner Class Summary
static class | Normalizer.Mode | This class represents the mode of a Normalizer object, i.e. the Unicode Normalization Form of the text that the Normalizer produces.
Field Summary
static Normalizer.Mode | COMPOSE | Canonical decomposition followed by canonical composition.
static Normalizer.Mode | COMPOSE_COMPAT | Compatibility decomposition followed by canonical composition.
static Normalizer.Mode | DECOMP | Canonical decomposition.
static Normalizer.Mode | DECOMP_COMPAT | Compatibility decomposition.
static char | DONE | Constant indicating that the end of the iteration has been reached.
static int | IGNORE_HANGUL | Option to disable Hangul/Jamo composition and decomposition.
static Normalizer.Mode | NO_OP | Null operation for use with the constructors and the static normalize method.
Constructor Summary
Normalizer(java.text.CharacterIterator iter, Normalizer.Mode mode) | Creates a new Normalizer object for iterating over the normalized form of the given text.
Normalizer(java.text.CharacterIterator iter, Normalizer.Mode mode, int opt) | Creates a new Normalizer object for iterating over the normalized form of the given text.
Normalizer(java.lang.String str, Normalizer.Mode mode) | Creates a new Normalizer object for iterating over the normalized form of a given string.
Normalizer(java.lang.String str, Normalizer.Mode mode, int opt) | Creates a new Normalizer object for iterating over the normalized form of a given string.
Method Summary
java.lang.Object | clone() | Clones this Normalizer object.
static java.lang.String | compose(java.lang.String source, boolean compat, int options) | Compose a String.
char | current() | Return the current character in the normalized text.
static java.lang.String | decompose(java.lang.String source, boolean compat, int options) | Static method to decompose a String.
char | first() | Return the first character in the normalized text.
int | getBeginIndex() | Retrieve the index of the start of the input text.
int | getEndIndex() | Retrieve the index of the end of the input text.
int | getIndex() | Retrieve the current iteration position in the input text that is being normalized.
Normalizer.Mode | getMode() | Return the basic operation performed by this Normalizer.
boolean | getOption(int option) | Determine whether an option is turned on or off.
char | last() | Return the last character in the normalized text.
char | next() | Return the next character in the normalized text and advance the iteration position by one.
static java.lang.String | normalize(java.lang.String str, Normalizer.Mode mode, int options) | Normalizes a String using the given normalization operation.
char | previous() | Return the previous character in the normalized text and decrement the iteration position by one.
char | setIndex(int index) | Set the iteration position in the input text that is being normalized and return the first normalized character at that position.
void | setMode(Normalizer.Mode newMode) | Set the normalization mode for this object.
void | setOption(int option, boolean value) | Set options that affect this Normalizer's operation.
void | setText(java.text.CharacterIterator newText) | Set the input text over which this Normalizer will iterate.
void | setText(java.lang.String newText) | Set the input text over which this Normalizer will iterate.
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
public static final char DONE
Constant indicating that the end of the iteration has been reached. This is guaranteed to have the same value as CharacterIterator.DONE.
public static final Normalizer.Mode NO_OP
Null operation for use with the constructors and the static normalize method. This value tells the Normalizer to do nothing but return unprocessed characters from the underlying String or CharacterIterator. If you have code which requires raw text at some times and normalized text at others, you can use NO_OP for the cases where you want raw text, rather than having a separate code path that bypasses Normalizer altogether.
See Also: setMode(com.ibm.text.Normalizer.Mode)
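The single-code-path idea behind NO_OP can be sketched with the JDK's java.text.Normalizer, which has no equivalent mode. In this hypothetical helper a null form plays the role of NO_OP; OptionalNormalization and process are illustrative names only.

```java
import java.text.Normalizer;

final class OptionalNormalization {
    /**
     * Returns text unchanged when formOrNull is null (the stand-in for NO_OP),
     * otherwise returns text normalized to the given form.
     */
    static String process(String text, Normalizer.Form formOrNull) {
        return formOrNull == null ? text : Normalizer.normalize(text, formOrNull);
    }

    public static void main(String[] args) {
        String input = "A\u0301";
        // Callers use the same method whether or not they want normalization.
        System.out.println(process(input, null).length());
        System.out.println(process(input, Normalizer.Form.NFC).length());
    }
}
```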
public static final Normalizer.Mode COMPOSE
Canonical decomposition followed by canonical composition. Used with the constructors and the static normalize method to determine the operation to be performed. If all optional features (e.g. IGNORE_HANGUL) are turned off, this operation produces output that is in Unicode Normalization Form C.
See Also: setMode(com.ibm.text.Normalizer.Mode)
public static final Normalizer.Mode COMPOSE_COMPAT
Compatibility decomposition followed by canonical composition. Used with the constructors and the static normalize method to determine the operation to be performed. If all optional features (e.g. IGNORE_HANGUL) are turned off, this operation produces output that is in Unicode Normalization Form KC.
See Also: setMode(com.ibm.text.Normalizer.Mode)
public static final Normalizer.Mode DECOMP
Canonical decomposition. Used with the constructors and the static normalize method to determine the operation to be performed. If all optional features (e.g. IGNORE_HANGUL) are turned off, this operation produces output that is in Unicode Normalization Form D.
See Also: setMode(com.ibm.text.Normalizer.Mode)
public static final Normalizer.Mode DECOMP_COMPAT
Compatibility decomposition. Used with the constructors and the static normalize method to determine the operation to be performed. If all optional features (e.g. IGNORE_HANGUL) are turned off, this operation produces output that is in Unicode Normalization Form KD.
See Also: setMode(com.ibm.text.Normalizer.Mode)
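The four modes above correspond to Normalization Forms C, KC, D, and KD. The sketch below maps the mode names to the JDK's java.text.Normalizer.Form constants, on the assumption that the JDK forms behave like these modes with all options off. ModeTable, modes, and apply are hypothetical names.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

final class ModeTable {
    /** Maps the documented mode names to the JDK forms assumed to match them. */
    static Map<String, Normalizer.Form> modes() {
        Map<String, Normalizer.Form> m = new LinkedHashMap<>();
        m.put("COMPOSE", Normalizer.Form.NFC);         // Form C
        m.put("COMPOSE_COMPAT", Normalizer.Form.NFKC); // Form KC
        m.put("DECOMP", Normalizer.Form.NFD);          // Form D
        m.put("DECOMP_COMPAT", Normalizer.Form.NFKD);  // Form KD
        return m;
    }

    /** Normalizes text using the form registered under modeName. */
    static String apply(String modeName, String text) {
        return Normalizer.normalize(text, modes().get(modeName));
    }

    public static void main(String[] args) {
        // U+FB03 (ffi ligature) followed by U+00C1 (A with acute).
        String sample = "\uFB03\u00C1";
        for (String name : modes().keySet()) {
            System.out.println(name + " -> " + apply(name, sample).length() + " chars");
        }
    }
}
```

The canonical forms leave the ligature alone and only change the accented letter; the compatibility forms additionally expand the ligature into three letters.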
public static final int IGNORE_HANGUL
Option to disable Hangul/Jamo composition and decomposition. The Unicode standard treats Hangul to Jamo conversion as a canonical decomposition, so this option must be turned off if you wish to transform strings into one of the standard Unicode Normalization Forms.
See Also: setOption(int, boolean)
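To make concrete what the Hangul to Jamo conversion does, here is a sketch of the arithmetic Hangul syllable decomposition defined by the Unicode Standard, cross-checked against the JDK's java.text.Normalizer. It is not taken from this class's implementation; HangulDecomposition and decomposeSyllable are illustrative names.

```java
import java.text.Normalizer;

final class HangulDecomposition {
    // Constants from the Unicode Standard's Hangul syllable decomposition algorithm.
    static final int S_BASE = 0xAC00; // first precomposed syllable
    static final int L_BASE = 0x1100; // first leading consonant
    static final int V_BASE = 0x1161; // first vowel
    static final int T_BASE = 0x11A7; // one before the first trailing consonant
    static final int L_COUNT = 19;
    static final int V_COUNT = 21;
    static final int T_COUNT = 28;
    static final int N_COUNT = V_COUNT * T_COUNT; // 588
    static final int S_COUNT = L_COUNT * N_COUNT; // 11172

    /** Decomposes one precomposed Hangul syllable into Jamo; other chars are returned as-is. */
    static String decomposeSyllable(char syllable) {
        int sIndex = syllable - S_BASE;
        if (sIndex < 0 || sIndex >= S_COUNT) {
            return String.valueOf(syllable);
        }
        StringBuilder sb = new StringBuilder();
        sb.append((char) (L_BASE + sIndex / N_COUNT));
        sb.append((char) (V_BASE + (sIndex % N_COUNT) / T_COUNT));
        int tIndex = sIndex % T_COUNT;
        if (tIndex != 0) {
            sb.append((char) (T_BASE + tIndex)); // trailing consonant is optional
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String viaArithmetic = decomposeSyllable('\uD55C');
        String viaNormalizer = Normalizer.normalize("\uD55C", Normalizer.Form.NFD);
        System.out.println(viaArithmetic.equals(viaNormalizer));
    }
}
```

With an option like IGNORE_HANGUL turned on, a normalizer would skip this step and leave the precomposed syllable in place, which is why the result would no longer be a standard Normalization Form.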
Constructor Detail
public Normalizer(java.lang.String str, Normalizer.Mode mode)
Creates a new Normalizer object for iterating over the normalized form of a given string.
Parameters:
str - The string to be normalized. The normalization will start at the beginning of the string.
mode - The normalization mode.
public Normalizer(java.lang.String str, Normalizer.Mode mode, int opt)
Creates a new Normalizer object for iterating over the normalized form of a given string.
The options parameter specifies which optional Normalizer features are to be enabled for this object.
Parameters:
str - The string to be normalized. The normalization will start at the beginning of the string.
mode - The normalization mode.
opt - Any optional features to be enabled. Currently the only available option is IGNORE_HANGUL. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument.
public Normalizer(java.text.CharacterIterator iter, Normalizer.Mode mode)
Creates a new Normalizer object for iterating over the normalized form of the given text.
Parameters:
iter - The input text to be normalized. The normalization will start at the beginning of the string.
mode - The normalization mode.
public Normalizer(java.text.CharacterIterator iter, Normalizer.Mode mode, int opt)
Creates a new Normalizer object for iterating over the normalized form of the given text.
Parameters:
iter - The input text to be normalized. The normalization will start at the beginning of the string.
mode - The normalization mode.
opt - Any optional features to be enabled. Currently the only available option is IGNORE_HANGUL. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument.
Method Detail
public java.lang.Object clone()
Clones this Normalizer object. The clone includes a clone of any CharacterIterator that was passed in to the constructor or to setText. However, the text storage underlying the CharacterIterator is not duplicated unless the iterator's clone method does so.
Overrides: clone in class java.lang.Object
public static java.lang.String normalize(java.lang.String str, Normalizer.Mode mode, int options)
Normalizes a String using the given normalization operation.
The options parameter specifies which optional Normalizer features are to be enabled for this operation. Currently the only available option is IGNORE_HANGUL. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument.
Parameters:
str - the input string to be normalized.
mode - the normalization mode.
options - the optional features to be enabled.
public static java.lang.String compose(java.lang.String source, boolean compat, int options)
Compose a String.
The options parameter specifies which optional Normalizer features are to be enabled for this operation. Currently the only available option is IGNORE_HANGUL. If you want the default behavior corresponding to Unicode Normalization Form C or KC, use 0 for this argument.
Parameters:
source - the string to be composed.
compat - Perform compatibility decomposition before composition. If this argument is false, only canonical decomposition will be performed.
options - the optional features to be enabled.
public static java.lang.String decompose(java.lang.String source, boolean compat, int options)
Static method to decompose a String.
The options parameter specifies which optional Normalizer features are to be enabled for this operation. Currently the only available option is IGNORE_HANGUL. The desired options should be OR'ed together to determine the value of this argument. If you want the default behavior corresponding to Unicode Normalization Form D or KD, use 0 for this argument.
Parameters:
source - the string to be decomposed.
compat - Perform compatibility decomposition. If this argument is false, only canonical decomposition will be performed.
options - the optional features to be enabled.
public char current()
Return the current character in the normalized text.
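The way the compat flag of compose and decompose selects between the canonical and compatibility forms can be sketched with thin wrappers around the JDK's java.text.Normalizer. These wrappers are an assumption-laden approximation: they drop the options argument because the JDK class has no counterpart to it, and ComposeDecompose is a hypothetical name.

```java
import java.text.Normalizer;

final class ComposeDecompose {
    /** compat == false gives Form C, compat == true gives Form KC. */
    static String compose(String source, boolean compat) {
        return Normalizer.normalize(source, compat ? Normalizer.Form.NFKC : Normalizer.Form.NFC);
    }

    /** compat == false gives Form D, compat == true gives Form KD. */
    static String decompose(String source, boolean compat) {
        return Normalizer.normalize(source, compat ? Normalizer.Form.NFKD : Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // Canonical composition merges the base letter and the accent into one char.
        System.out.println(compose("A\u0301", false).length());
        // Only the compatibility variants expand the ffi ligature (U+FB03).
        System.out.println(decompose("\uFB03", false).length());
        System.out.println(decompose("\uFB03", true).length());
    }
}
```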
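public char first()
Return the first character in the normalized text.
public char last()
Return the last character in the normalized text.
public char next()
Return the next character in the normalized text and advance the iteration position by one. If the end of the text has already been reached, DONE is returned.
public char previous()
Return the previous character in the normalized text and decrement the iteration position by one. If the beginning of the text has already been reached, DONE is returned.
public char setIndex(int index)
Set the iteration position in the input text that is being normalized and return the first normalized character at that position.
Parameters:
index - the desired index in the input text.
Throws:
java.lang.IllegalArgumentException - if the given index is less than getBeginIndex() or greater than getEndIndex().
public final int getIndex()
Retrieve the current iteration position in the input text that is being normalized.
public final int getBeginIndex()
Retrieve the index of the start of the input text.
public final int getEndIndex()
Retrieve the index of the end of the input text.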
public void setMode(Normalizer.Mode newMode)
Set the normalization mode for this object.
Note: If the normalization mode is changed while iterating over a string, calls to next() and previous() may return previously buffered characters in the old normalization mode until the iteration is able to re-sync at the next base character. It is safest to call setText(), first(), last(), etc. after calling setMode.
Parameters:
newMode - the new mode for this Normalizer. The supported modes are:
COMPOSE - Unicode canonical decomposition followed by canonical composition.
COMPOSE_COMPAT - Unicode compatibility decomposition followed by canonical composition.
DECOMP - Unicode canonical decomposition.
DECOMP_COMPAT - Unicode compatibility decomposition.
NO_OP - Do nothing but return characters from the underlying input text.
See Also: getMode()
public Normalizer.Mode getMode()
Return the basic operation performed by this Normalizer.
See Also: setMode(com.ibm.text.Normalizer.Mode)
public void setOption(int option, boolean value)
Set options that affect this Normalizer's operation. Currently the only available option is:
IGNORE_HANGUL - Do not decompose Hangul syllables into the Jamo alphabet and vice-versa. This option is off by default (i.e. Hangul processing is enabled) since the Unicode standard specifies that Hangul to Jamo is a canonical decomposition. For any of the standard Unicode Normalization Forms, you should leave this option off.
Parameters:
option - the option whose value is to be set.
value - the new setting for the option. Use true to turn the option on and false to turn it off.
See Also: getOption(int)
public boolean getOption(int option)
Determine whether an option is turned on or off.
See Also: setOption(int, boolean)
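Options that are OR'ed together into an int are usually stored as bit flags. The following is a minimal sketch of how a setOption/getOption pair of this shape could work; it is not this class's implementation, and the flag value 0x0001 is a placeholder because this page does not give the actual value of IGNORE_HANGUL.

```java
final class OptionFlags {
    /** Placeholder flag value chosen for illustration only. */
    static final int IGNORE_HANGUL = 0x0001;

    // All options are off by default, matching the documented default.
    private int options = 0;

    /** Turns the given option bit on (value == true) or off (value == false). */
    void setOption(int option, boolean value) {
        if (value) {
            options |= option;
        } else {
            options &= ~option;
        }
    }

    /** Reports whether the given option bit is currently on. */
    boolean getOption(int option) {
        return (options & option) != 0;
    }

    public static void main(String[] args) {
        OptionFlags flags = new OptionFlags();
        flags.setOption(IGNORE_HANGUL, true);
        System.out.println(flags.getOption(IGNORE_HANGUL));
    }
}
```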
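public void setText(java.lang.String newText)
Set the input text over which this Normalizer will iterate.
Parameters:
newText - The new string to be normalized.
public void setText(java.text.CharacterIterator newText)
Set the input text over which this Normalizer will iterate.
Parameters:
newText - The new text to be normalized.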