com.ibm.text
Class Transliterator

java.lang.Object
  |
  +--com.ibm.text.Transliterator
Direct Known Subclasses:
CompoundTransliterator, HangulJamoTransliterator, HexToUnicodeTransliterator, JamoHangulTransliterator, NullTransliterator, RuleBasedTransliterator, UnicodeToHexTransliterator

public abstract class Transliterator
extends java.lang.Object

Transliterator is an abstract class that transliterates text from one format to another. The most common kind of transliterator is a script, or alphabet, transliterator. For example, a Russian to Latin transliterator changes Russian text written in Cyrillic characters to phonetically equivalent Latin characters. It does not translate Russian to English! Transliteration, unlike translation, operates on characters, without reference to the meanings of words and sentences.

Although script conversion is its most common use, a transliterator can actually perform a more general class of tasks. In fact, Transliterator defines a very general API which specifies only that a segment of the input text is replaced by new text. The particulars of this conversion are determined entirely by subclasses of Transliterator.

Transliterators are stateless

Transliterator objects are stateless; they retain no information between calls to transliterate(). As a result, threads may share transliterators without synchronizing them. This might seem to limit the complexity of the transliteration operation. In practice, subclasses perform complex transliterations by delaying the replacement of text until it is known that no other replacements are possible. In other words, although the Transliterator objects are stateless, the source text itself embodies all the needed information, and delayed operation allows arbitrary complexity.

Batch transliteration

The simplest way to perform transliteration is all at once, on a string of existing text. This is referred to as batch transliteration. For example, given a string input and a transliterator t, the call

String result = t.transliterate(input);
will transliterate it and return the result. Other methods allow the client to specify a substring to be transliterated and to use Replaceable objects instead of strings, in order to preserve out-of-band information (such as text styles).

Keyboard transliteration

Somewhat more involved is keyboard, or incremental transliteration. This is the transliteration of text that is arriving from some source (typically the user's keyboard) one character at a time, or in some other piecemeal fashion.

In keyboard transliteration, a Replaceable buffer stores the text. As text is inserted, as much as possible is transliterated on the fly. This means a GUI that displays the contents of the buffer may show text being modified as each new character arrives.

Consider the simple RuleBasedTransliterator:

th>{theta}
t>{tau}
When the user types 't', nothing will happen, since the transliterator is waiting to see if the next character is 'h'. To remedy this, we introduce the notion of a cursor, marked by a '|' in the output string:
t>|{tau}
{tau}h>{theta}
Now when the user types 't', tau appears, and if the next character is 'h', the tau changes to a theta. This is accomplished by maintaining a cursor position (independent of the insertion point, and invisible in the GUI) across calls to transliterate(). Typically, the cursor will be coincident with the insertion point, but in a case like the one above, it will precede the insertion point.
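The effect of the two cursor rules above can be simulated without the library. What follows is a minimal sketch, not the RuleBasedTransliterator implementation: CursorRuleSketch and its members are hypothetical names, the two rules are hard-coded rather than parsed, and a plain StringBuilder stands in for a Replaceable buffer.

```java
final class CursorRuleSketch {
    private final StringBuilder buffer = new StringBuilder();
    // Hypothetical stand-in for the cursor: text before this index is frozen.
    private int cursor = 0;

    /** Appends one typed character and applies the two hard-coded rules. */
    void type(char c) {
        buffer.append(c);
        int end = buffer.length();
        if (c == 't') {
            // Rule: t > |{tau}. Show tau at once, but leave the cursor before it.
            buffer.setCharAt(end - 1, '\u03C4');
            cursor = end - 1;
        } else if (c == 'h' && end - cursor == 2 && buffer.charAt(cursor) == '\u03C4') {
            // Rule: {tau}h > {theta}. The tau was not frozen, so it can still change.
            buffer.replace(cursor, end, "\u03B8");
            cursor = buffer.length();
        } else {
            // Anything else is committed as typed.
            cursor = end;
        }
    }

    String text() { return buffer.toString(); }

    int cursor() { return cursor; }

    public static void main(String[] args) {
        CursorRuleSketch sketch = new CursorRuleSketch();
        sketch.type('t');
        System.out.println(sketch.text() + " cursor=" + sketch.cursor());
        sketch.type('h');
        System.out.println(sketch.text() + " cursor=" + sketch.cursor());
    }
}
```

After 't' the buffer shows a tau with the cursor still in front of it; after 'h' the tau has been replaced by a theta and the cursor has moved past it.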

Keyboard transliteration methods maintain a set of three indices that are updated with each call to transliterate(): the start, the cursor, and the limit. The method changes these indices, which are passed in and out via a Position object. The start index marks the beginning of the substring that the transliterator will look at. It is advanced as text becomes committed (but it is not the committed index; that is the cursor). The cursor index, described above, marks the point at which the transliterator last stopped, either because it reached the end or because it required more characters to disambiguate between possible inputs. The cursor can also be set explicitly by rules in a RuleBasedTransliterator. Any characters before the cursor index are frozen; future keyboard transliteration calls within this input sequence will not change them. New text is inserted at the limit index, which marks the end of the substring that the transliterator looks at.

Because keyboard transliteration assumes that more characters are to arrive, it is conservative in its operation. It only transliterates when it can do so unambiguously. Otherwise it waits for more characters to arrive. When the client code knows that no more characters are forthcoming, perhaps because the user has performed some input termination operation, then it should call finishTransliteration() to complete any pending transliterations.
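The conservative behavior described above can be illustrated with the first rule set (th > theta, t > tau). This is a hedged sketch only: PendingSketch, insert(), finish(), and the private run() method are hypothetical names, the rules are hard-coded, and the boolean flag merely imitates the idea of an incremental mode.

```java
final class PendingSketch {
    private final StringBuilder buffer = new StringBuilder();
    // Text before this index has been committed and will not change again.
    private int start = 0;

    /** Adds one character and transliterates only what is unambiguous. */
    void insert(char c) {
        buffer.append(c);
        run(true);
    }

    /** Signals that no more input is coming, so pending text is resolved. */
    void finish() {
        run(false);
    }

    private void run(boolean incremental) {
        while (start < buffer.length()) {
            char c = buffer.charAt(start);
            if (c != 't') {
                start++;                                    // no rule applies
                continue;
            }
            boolean hasNext = start + 1 < buffer.length();
            if (hasNext && buffer.charAt(start + 1) == 'h') {
                buffer.replace(start, start + 2, "\u03B8"); // th > theta
                start++;
            } else if (hasNext || !incremental) {
                buffer.setCharAt(start, '\u03C4');          // t > tau
                start++;
            } else {
                return; // ambiguous: an 'h' could still arrive, so wait
            }
        }
    }

    String text() { return buffer.toString(); }
}
```

With this sketch a lone 't' stays untouched after insert('t') and only becomes a tau once another character arrives or finish() is called.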

Inverses

Pairs of transliterators may be inverses of one another. For example, if transliterator A transliterates characters by incrementing their Unicode value (so "abc" -> "def"), and transliterator B decrements character values, then A is an inverse of B and vice versa. If we compose A with B in a compound transliterator, the result is the identity transliterator, that is, a transliterator that does not change its input text. The Transliterator method getInverse() returns a transliterator's inverse, if one exists, or null otherwise. However, the result of getInverse() usually will not be a true mathematical inverse, because true inverse transliterators are difficult to formulate. For example, consider two transliterators: AB, which transliterates the character 'A' to 'B', and BA, which transliterates 'B' to 'A'. It might seem that these are exact inverses, since

"A" x AB -> "B"
"B" x BA -> "A"
where 'x' represents transliteration. However,
"ABCD" x AB -> "BBCD"
"BBCD" x BA -> "AACD"
so AB composed with BA is not the identity. Nonetheless, BA may be usefully considered to be AB's inverse, and it is on this basis that AB.getInverse() could legitimately return BA.
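The AB/BA example can be reproduced with plain string replacement. In this small sketch, InverseSketch, ab(), and ba() are hypothetical stand-ins for the two transliterators; nothing here uses the Transliterator API itself.

```java
final class InverseSketch {
    /** Stand-in for the "AB" transliterator: every 'A' becomes 'B'. */
    static String ab(String text) {
        return text.replace('A', 'B');
    }

    /** Stand-in for the "BA" transliterator: every 'B' becomes 'A'. */
    static String ba(String text) {
        return text.replace('B', 'A');
    }

    public static void main(String[] args) {
        String forward = ab("ABCD");        // the original 'B' is now indistinguishable
        String roundTrip = ba(forward);     // so it is also turned into 'A'
        System.out.println(forward + " " + roundTrip);
    }
}
```

The round trip loses the original 'B' because, after AB has run, the text no longer records which 'B' characters came from an 'A'.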

IDs and display names

A transliterator is designated by a short identifier string or ID. IDs follow the format source-destination, where source describes the entity being replaced, and destination describes the entity replacing source. The entities may be the names of scripts, particular sequences of characters, or whatever else it is that the transliterator converts to or from. For example, a transliterator from Russian to Latin might be named "Russian-Latin". A transliterator from keyboard escape sequences to Latin-1 characters might be named "KeyboardEscape-Latin1". By convention, system entity names are in English, with the initial letters of words capitalized; user entity names may follow any format so long as they do not contain dashes.

In addition to programmatic IDs, transliterator objects have display names for presentation in user interfaces, returned by getDisplayName(java.lang.String).

Factory methods and registration

In general, client code should use the factory method getInstance() to obtain an instance of a transliterator given its ID. Valid IDs may be enumerated using getAvailableIDs(). Since transliterators are stateless, multiple calls to getInstance() with the same ID will return the same object.

In addition to the system transliterators registered at startup, user transliterators may be registered by calling registerInstance() at run time. To register a transliterator subclass without instantiating it (until it is needed), users may call registerClass().
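The difference between registering an instance and registering something to be instantiated later can be pictured with a small registry. This is a sketch under stated assumptions, not the registry used by this class: RegistrySketch, Factory, and registerFactory() are hypothetical names, and plain Object values stand in for transliterators.

```java
final class RegistrySketch {
    /** Hypothetical stand-in for a deferred constructor call. */
    interface Factory {
        Object create();
    }

    private final java.util.Map<String, Object> instances = new java.util.HashMap<>();
    private final java.util.Map<String, Factory> factories = new java.util.HashMap<>();

    /** Registers an object that already exists. */
    void registerInstance(String id, Object transliterator) {
        instances.put(id, transliterator);
    }

    /** Registers a way to build the object; nothing is built yet. */
    void registerFactory(String id, Factory factory) {
        factories.put(id, factory);
    }

    /** Returns the one shared object for the ID, building it on first use. */
    Object getInstance(String id) {
        Object cached = instances.get(id);
        if (cached != null) {
            return cached;
        }
        Factory factory = factories.get(id);
        if (factory == null) {
            throw new IllegalArgumentException("Unregistered ID: " + id);
        }
        Object created = factory.create();
        instances.put(id, created);   // later calls return the same object
        return created;
    }
}
```

Sharing one object per ID is safe in this picture only because the registered objects are assumed to be stateless, as the class documentation requires of transliterators.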

Composed transliterators

In addition to built-in system transliterators like "Latin-Greek", there are also built-in composed transliterators. These are implemented by composing two or more component transliterators. For example, if we have scripts "A", "B", "C", and "D", and we want to transliterate between all pairs of them, then we need to write 12 transliterators: "A-B", "A-C", "A-D", "B-A",..., "D-A", "D-B", "D-C". If it is possible to convert all scripts to an intermediate script "M", then instead of writing 12 rule sets, we only need to write 8: "A~M", "B~M", "C~M", "D~M", "M~A", "M~B", "M~C", "M~D". (This might not seem like a big win, but it's really 2n vs. n² - n, so as n gets larger the gain becomes significant. With 9 scripts, it's 18 vs. 72 rule sets, a big difference.) Note the use of "~" rather than "-" for the script separator here; this indicates that the given transliterator is intended to be composed with others, rather than be used as is.
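The rule-set counts quoted above follow from simple arithmetic, which the short sketch below spells out. RuleSetCount and its method names are hypothetical and exist only for this illustration.

```java
final class RuleSetCount {
    /** One rule set per ordered pair of distinct scripts: n * (n - 1). */
    static int direct(int scripts) {
        return scripts * scripts - scripts;
    }

    /** One rule set to the intermediate script and one back, per script: 2n. */
    static int viaIntermediate(int scripts) {
        return 2 * scripts;
    }

    public static void main(String[] args) {
        for (int n : new int[] {4, 9}) {
            System.out.println(n + " scripts: " + direct(n)
                    + " direct vs. " + viaIntermediate(n) + " via intermediate");
        }
    }
}
```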

Composed transliterators can be instantiated as usual. For example, the system transliterator "Devanagari-Gujarati" is a composed transliterator built internally as "Devanagari~InterIndic;InterIndic~Gujarati". When this transliterator is instantiated, it appears externally to be a standard transliterator (e.g., getID() returns "Devanagari-Gujarati").
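As a rough illustration of how such a composition fits together, the sketch below builds the internal compound ID string and chains two steps. It is not the CompoundTransliterator implementation: PivotCompositionSketch, Step, compoundId(), and compose() are hypothetical names, and the steps are arbitrary string functions.

```java
final class PivotCompositionSketch {
    /** Hypothetical stand-in for one component transliterator. */
    interface Step {
        String apply(String text);
    }

    /** Builds an ID of the form "source~pivot;pivot~target". */
    static String compoundId(String source, String pivot, String target) {
        return source + "~" + pivot + ";" + pivot + "~" + target;
    }

    /** Runs the first step, then feeds its result to the second. */
    static Step compose(Step first, Step second) {
        return text -> second.apply(first.apply(text));
    }

    public static void main(String[] args) {
        System.out.println(compoundId("Devanagari", "InterIndic", "Gujarati"));
    }
}
```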

Subclassing

Subclasses must implement the abstract method handleTransliterate().

Subclasses should override the transliterate() method taking a Replaceable and the transliterate() method taking a String and StringBuffer if the performance of these methods can be improved over the performance obtained by the default implementations in this class.

Copyright © IBM Corporation 1999. All rights reserved.

Version:
$RCSfile: Transliterator.java,v $ $Revision: 1.27 $ $Date: 2001/03/30 22:50:08 $
Author:
Alan Liu

Inner Class Summary
static class Transliterator.Position
          Position structure for incremental transliteration.
 
Field Summary
static int FORWARD
          Direction constant indicating the forward direction in a transliterator, e.g., the forward rules of a RuleBasedTransliterator.
static int REVERSE
          Direction constant indicating the reverse direction in a transliterator, e.g., the reverse rules of a RuleBasedTransliterator.
 
Constructor Summary
protected Transliterator(java.lang.String ID, UnicodeFilter filter)
          Default constructor.
 
Method Summary
protected  char filteredCharAt(Replaceable text, int i)
          Method for subclasses to use to obtain a character in the given string, with filtering.
 void finishTransliteration(Replaceable text, Transliterator.Position index)
          Finishes any pending transliterations that were waiting for more characters.
static java.util.Enumeration getAvailableIDs()
          Returns an enumeration over the programmatic names of registered Transliterator objects.
static java.lang.String getDisplayName(java.lang.String ID)
          Returns a name for this transliterator that is appropriate for display to the user in the default locale.
static java.lang.String getDisplayName(java.lang.String ID, java.util.Locale inLocale)
          Returns a name for this transliterator that is appropriate for display to the user in the given locale.
 UnicodeFilter getFilter()
          Returns the filter used by this transliterator, or null if this transliterator uses no filter.
 java.lang.String getID()
          Returns a programmatic identifier for this transliterator.
static Transliterator getInstance(java.lang.String ID)
           
static Transliterator getInstance(java.lang.String ID, int direction)
          Returns a Transliterator object given its ID.
 Transliterator getInverse()
          Returns this transliterator's inverse.
protected  int getMaximumContextLength()
          Returns the length of the longest context required by this transliterator.
protected abstract  void handleTransliterate(Replaceable text, Transliterator.Position pos, boolean incremental)
          Abstract method that concrete subclasses define to implement keyboard transliteration.
static void registerClass(java.lang.String ID, java.lang.Class transClass, java.lang.String displayName)
          Registers a subclass of Transliterator with the system.
 void setFilter(UnicodeFilter filter)
          Changes the filter used by this transliterator.
protected  void setMaximumContextLength(int a)
          Method for subclasses to use to set the maximum context length.
 void transliterate(Replaceable text)
          Transliterates an entire string in place.
 int transliterate(Replaceable text, int start, int limit)
          Transliterates a segment of a string, with optional filtering.
 void transliterate(Replaceable text, Transliterator.Position index)
          Transliterates the portion of the text buffer that can be transliterated unambiguously.
 void transliterate(Replaceable text, Transliterator.Position index, char insertion)
          Transliterates the portion of the text buffer that can be transliterated unambiguously after a new character has been inserted, typically as a result of a keyboard event.
 void transliterate(Replaceable text, Transliterator.Position index, java.lang.String insertion)
          Transliterates the portion of the text buffer that can be transliterated unambiguously after new text has been inserted, typically as a result of a keyboard event.
 java.lang.String transliterate(java.lang.String text)
          Transliterates an entire string and returns the result.
static java.lang.Object unregister(java.lang.String ID)
          Unregisters a transliterator or class.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

FORWARD

public static final int FORWARD
Direction constant indicating the forward direction in a transliterator, e.g., the forward rules of a RuleBasedTransliterator. An "A-B" transliterator transliterates A to B when operating in the forward direction, and B to A when operating in the reverse direction.
See Also:
RuleBasedTransliterator, CompoundTransliterator

REVERSE

public static final int REVERSE
Direction constant indicating the reverse direction in a transliterator, e.g., the reverse rules of a RuleBasedTransliterator. An "A-B" transliterator transliterates A to B when operating in the forward direction, and B to A when operating in the reverse direction.
See Also:
RuleBasedTransliterator, CompoundTransliterator
Constructor Detail

Transliterator

protected Transliterator(java.lang.String ID,
                         UnicodeFilter filter)
Default constructor.
Parameters:
ID - the string identifier for this transliterator
filter - the filter. Any character for which filter.contains() returns false will not be altered by this transliterator. If filter is null then no filtering is applied.
Method Detail

transliterate

public final int transliterate(Replaceable text,
                               int start,
                               int limit)
Transliterates a segment of a string, with optional filtering.
Parameters:
text - the string to be transliterated
start - the beginning index, inclusive; 0 <= start <= limit.
limit - the ending index, exclusive; start <= limit <= text.length().
filter - the filter. Any character for which filter.contains() returns false will not be altered by this transliterator. If filter is null then no filtering is applied.
Returns:
The new limit index. The text previously occupying [start, limit) has been transliterated, possibly to a string of a different length, at [start, new-limit), where new-limit is the return value.
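The meaning of the returned limit is easiest to see when a replacement changes the length of the segment. The sketch below is an assumption-laden stand-in: SegmentSketch is a hypothetical class, a StringBuilder replaces the Replaceable, and a single hard-coded rule ("th" becomes theta) replaces the real transliteration.

```java
final class SegmentSketch {
    /**
     * Replaces each "th" inside [start, limit) with theta and returns the
     * new limit, so that the transliterated text occupies [start, new limit).
     */
    static int transliterate(StringBuilder text, int start, int limit) {
        int i = start;
        while (i + 1 < limit) {
            if (text.charAt(i) == 't' && text.charAt(i + 1) == 'h') {
                text.replace(i, i + 2, "\u03B8");
                limit -= 1;   // two characters became one, so the segment shrank
            }
            i++;
        }
        return limit;
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder("th-th");
        int newLimit = transliterate(text, 3, 5);   // only the second "th" is in range
        System.out.println(text + " newLimit=" + newLimit);
    }
}
```

Text outside [start, limit) is never examined, which is why the first "th" in the usage example is left alone.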

transliterate

public final void transliterate(Replaceable text)
Transliterates an entire string in place. Convenience method.
Parameters:
text - the string to be transliterated

transliterate

public final java.lang.String transliterate(java.lang.String text)
Transliterates an entire string and returns the result. Convenience method.
Parameters:
text - the string to be transliterated
Returns:
The transliterated text

transliterate

public final void transliterate(Replaceable text,
                                Transliterator.Position index,
                                java.lang.String insertion)
Transliterates the portion of the text buffer that can be transliterated unambiguously after new text has been inserted, typically as a result of a keyboard event. The new text in insertion will be inserted into text at index.contextLimit, advancing index.contextLimit by insertion.length(). Then the transliterator will try to transliterate characters of text between index.start and index.contextLimit. Characters before index.start will not be changed.

Upon return, values in index will be updated. index.contextStart will be advanced to the first character that future calls to this method will read. index.start and index.contextLimit will be adjusted to delimit the range of text that future calls to this method may change.

Typical usage of this method begins with an initial call with index.contextStart and index.contextLimit set to indicate the portion of text to be transliterated, and index.start == index.contextStart. Thereafter, index can be used without modification in future calls, provided that all changes to text are made via this method.

This method assumes that future calls may be made that will insert new text into the buffer. As a result, it only performs unambiguous transliterations. After the last call to this method, there may be untransliterated text that is waiting for more input to resolve an ambiguity. In order to perform these pending transliterations, clients should call finishTransliteration(com.ibm.text.Replaceable, com.ibm.text.Transliterator.Position) after the last call to this method has been made.

Parameters:
text - the buffer holding transliterated and untransliterated text
index - the start and limit of the text, the position of the cursor, and the start and limit of transliteration.
insertion - text to be inserted and possibly transliterated into the translation buffer at index.contextLimit. If null then no text is inserted.
Throws:
java.lang.IllegalArgumentException - if index is invalid
See Also:
handleTransliterate(com.ibm.text.Replaceable, com.ibm.text.Transliterator.Position, boolean)

transliterate

public final void transliterate(Replaceable text,
                                Transliterator.Position index,
                                char insertion)
Transliterates the portion of the text buffer that can be transliterated unambiguously after a new character has been inserted, typically as a result of a keyboard event. This is a convenience method; see transliterate(Replaceable, Transliterator.Position, String) for details.
Parameters:
text - the buffer holding transliterated and untransliterated text
index - the start and limit of the text, the position of the cursor, and the start and limit of transliteration.
insertion - the character to be inserted and possibly transliterated into the translation buffer at index.contextLimit.
See Also:
transliterate(Replaceable, Transliterator.Position, String)

transliterate

public final void transliterate(Replaceable text,
                                Transliterator.Position index)
Transliterates the portion of the text buffer that can be transliterated unambiguously. This is a convenience method; see transliterate(Replaceable, Transliterator.Position, String) for details.
Parameters:
text - the buffer holding transliterated and untransliterated text
index - the start and limit of the text, the position of the cursor, and the start and limit of transliteration.
See Also:
transliterate(Replaceable, Transliterator.Position, String)

finishTransliteration

public final void finishTransliteration(Replaceable text,
                                        Transliterator.Position index)
Finishes any pending transliterations that were waiting for more characters. Clients should call this method as the last call after a sequence of one or more calls to transliterate().
Parameters:
text - the buffer holding transliterated and untransliterated text.
index - the Position object previously passed to transliterate(com.ibm.text.Replaceable, com.ibm.text.Transliterator.Position, java.lang.String)

handleTransliterate

protected abstract void handleTransliterate(Replaceable text,
                                            Transliterator.Position pos,
                                            boolean incremental)
Abstract method that concrete subclasses define to implement keyboard transliteration. This method should transliterate all characters between pos.start and pos.contextLimit that can be unambiguously transliterated, regardless of future insertions of text at pos.contextLimit. pos.start should be advanced past committed characters (those that will not change in future calls to this method). pos.contextLimit should be updated to reflect text replacements that shorten or lengthen the text between pos.start and pos.contextLimit. Upon return, neither pos.start nor pos.contextLimit should be less than the initial value of pos.start. pos.contextStart should not be changed.
Parameters:
text - the buffer holding transliterated and untransliterated text
pos - the start and limit of the text, the position of the cursor, and the start and limit of transliteration.
incremental - if true, assume more text may be coming after pos.contextLimit. Otherwise, assume the text is complete.
See Also:
transliterate(com.ibm.text.Replaceable, int, int)

getMaximumContextLength

protected final int getMaximumContextLength()
Returns the length of the longest context required by this transliterator. This is preceding context. The default value is zero, but subclasses can change this by calling setMaximumContextLength(). For example, if a transliterator transliterates "ddd" (where d is any digit) to "555" when preceded by "(ddd)", then the preceding context length is 5, the length of "(ddd)".
Returns:
The maximum number of preceding context characters this transliterator needs to examine

setMaximumContextLength

protected void setMaximumContextLength(int a)
Method for subclasses to use to set the maximum context length.
See Also:
getMaximumContextLength()

getID

public final java.lang.String getID()
Returns a programmatic identifier for this transliterator. If this identifier is passed to getInstance(), it will return this object, if it has been registered.
See Also:
registerClass(java.lang.String, java.lang.Class, java.lang.String), getAvailableIDs()

getDisplayName

public static final java.lang.String getDisplayName(java.lang.String ID)
Returns a name for this transliterator that is appropriate for display to the user in the default locale. See getDisplayName(String,Locale) for details.

getDisplayName

public static java.lang.String getDisplayName(java.lang.String ID,
                                              java.util.Locale inLocale)
Returns a name for this transliterator that is appropriate for display to the user in the given locale. This name is taken from the locale resource data in the standard manner of the java.text package.

If no localized names exist in the system resource bundles, a name is synthesized using a localized MessageFormat pattern from the resource data. The arguments to this pattern are an integer followed by one or two strings. The integer is the number of strings, either 1 or 2. The strings are formed by splitting the ID for this transliterator at the first '-'. If there is no '-', then the entire ID forms the only string.

Parameters:
inLocale - the Locale in which the display name should be localized.
See Also:
MessageFormat
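The fallback described above, in which the ID is split at the first '-' and the pieces are passed to a MessageFormat pattern along with their count, might look roughly like the sketch below. The pattern string is an assumption made up for this example (the real one comes from locale resource data), and DisplayNameSketch and synthesize() are hypothetical names.

```java
final class DisplayNameSketch {
    // Assumed pattern for illustration only; argument 0 is the number of
    // strings, arguments 1 and 2 are the pieces of the ID.
    static final String PATTERN = "{0,choice,1#{1}|2#{1} to {2}}";

    /** Splits the ID at the first '-' and formats the resulting pieces. */
    static String synthesize(String id) {
        int dash = id.indexOf('-');
        if (dash < 0) {
            // No '-': the entire ID forms the only string.
            return java.text.MessageFormat.format(PATTERN, 1, id);
        }
        String source = id.substring(0, dash);
        String destination = id.substring(dash + 1);
        return java.text.MessageFormat.format(PATTERN, 2, source, destination);
    }

    public static void main(String[] args) {
        System.out.println(synthesize("Russian-Latin"));
        System.out.println(synthesize("Null"));
    }
}
```

Because the split happens at the first '-', any later dashes stay inside the second string.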

getFilter

public UnicodeFilter getFilter()
Returns the filter used by this transliterator, or null if this transliterator uses no filter.

setFilter

public void setFilter(UnicodeFilter filter)
Changes the filter used by this transliterator. If the filter is set to null then no filtering will occur.

Callers must take care if a transliterator is in use by multiple threads. The filter should not be changed by one thread while another thread may be transliterating.


getInstance

public static Transliterator getInstance(java.lang.String ID,
                                         int direction)
Returns a Transliterator object given its ID. The ID must be either a system transliterator ID or an ID registered using registerClass().
Parameters:
ID - a valid ID, as enumerated by getAvailableIDs()
Returns:
A Transliterator object with the given ID
Throws:
java.lang.IllegalArgumentException - if the given ID is invalid.
See Also:
registerClass(java.lang.String, java.lang.Class, java.lang.String), getAvailableIDs(), getID()

getInstance

public static final Transliterator getInstance(java.lang.String ID)

getInverse

public final Transliterator getInverse()
Returns this transliterator's inverse. See the class documentation for details. This implementation simply inverts the two entities in the ID and attempts to retrieve the resulting transliterator. That is, if getID() returns "A-B", then this method will return the result of getInstance("B-A"), or null if that call fails.

This method does not take filtering into account. The returned transliterator will have no filter.

Subclasses with knowledge of their inverse may wish to override this method.

Returns:
a transliterator that is an inverse, not necessarily exact, of this transliterator, or null if no such transliterator is registered.
See Also:
registerClass(java.lang.String, java.lang.Class, java.lang.String)
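The ID inversion that the default getInverse() is described as performing can be sketched as a string operation. InverseIdSketch and inverseId() are hypothetical names, and returning null for an ID without a '-' is an assumption of this sketch rather than documented behavior.

```java
final class InverseIdSketch {
    /** Swaps the two entities of an ID, so that "A-B" becomes "B-A". */
    static String inverseId(String id) {
        int dash = id.indexOf('-');
        if (dash < 0) {
            return null;   // assumption: nothing to swap, so no inverse ID
        }
        String source = id.substring(0, dash);
        String destination = id.substring(dash + 1);
        return destination + "-" + source;
    }

    public static void main(String[] args) {
        System.out.println(inverseId("Russian-Latin"));
    }
}
```

The resulting ID would then be looked up with getInstance(), which is why an inverse is only found when a transliterator with the swapped ID has been registered.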

registerClass

public static void registerClass(java.lang.String ID,
                                 java.lang.Class transClass,
                                 java.lang.String displayName)
Registers a subclass of Transliterator with the system. This subclass must have a public constructor taking no arguments. When that constructor is called, the resulting object must return the ID passed to this method if its getID() method is called.
Parameters:
ID - the result of getID() for this transliterator
transClass - a subclass of Transliterator
See Also:
unregister(java.lang.String)

unregister

public static java.lang.Object unregister(java.lang.String ID)
Unregisters a transliterator or class. This may be either a system transliterator or a user transliterator or class.
Parameters:
ID - the ID of the transliterator or class
Returns:
the Object that was registered with ID, or null if none was
See Also:
registerClass(java.lang.String, java.lang.Class, java.lang.String)

getAvailableIDs

public static final java.util.Enumeration getAvailableIDs()
Returns an enumeration over the programmatic names of registered Transliterator objects. This includes both system transliterators and user transliterators registered using registerClass(). The enumerated names may be passed to getInstance().
Returns:
An Enumeration over String objects
See Also:
getInstance(java.lang.String, int), registerClass(java.lang.String, java.lang.Class, java.lang.String)

filteredCharAt

protected char filteredCharAt(Replaceable text,
                              int i)
Method for subclasses to use to obtain a character in the given string, with filtering. If the character at the given offset is excluded by this transliterator's filter, then U+FFFE is returned.
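A minimal sketch of this filtering behavior follows. It does not use the real UnicodeFilter or Replaceable types: CharFilter is a hypothetical one-method interface modeled on the filter.contains() call mentioned in the constructor documentation, and a CharSequence stands in for the text.

```java
final class FilterSketch {
    /** Hypothetical stand-in for UnicodeFilter. */
    interface CharFilter {
        boolean contains(char c);
    }

    /**
     * Returns the character at index i, or U+FFFE if the filter excludes it.
     * A null filter means that no filtering is applied.
     */
    static char filteredCharAt(CharSequence text, int i, CharFilter filter) {
        char c = text.charAt(i);
        if (filter == null || filter.contains(c)) {
            return c;
        }
        return '\uFFFE';
    }

    public static void main(String[] args) {
        CharFilter lettersOnly = Character::isLetter;
        System.out.println((int) filteredCharAt("a1", 0, lettersOnly));
        System.out.println((int) filteredCharAt("a1", 1, lettersOnly));
    }
}
```

Returning a sentinel lets a subclass treat a filtered character as something no rule will match, without having to consult the filter separately.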


Copyright (c) 1998-2000 IBM Corporation and others.