Transliterator is an abstract class that transliterates text from one format to another.
#include <translit.h>
Inheritance diagram for Transliterator:
Public Methods | |
virtual | ~Transliterator () |
Destructor. More... | |
virtual Transliterator* | clone () const |
Implements Cloneable. More... | |
virtual int32_t | transliterate (Replaceable& text, int32_t start, int32_t limit) const |
Transliterates a segment of a string, with optional filtering. More... | |
virtual void | transliterate (Replaceable& text) const |
Transliterates an entire string in place. More... | |
virtual void | transliterate (Replaceable& text, UTransPosition& index, const UnicodeString& insertion, UErrorCode& status) const |
Transliterates the portion of the text buffer that can be transliterated unambiguously after new text has been inserted, typically as a result of a keyboard event. More... | |
virtual void | transliterate (Replaceable& text, UTransPosition& index, UChar insertion, UErrorCode& status) const |
Transliterates the portion of the text buffer that can be transliterated unambiguously after a new character has been inserted, typically as a result of a keyboard event. More... | |
virtual void | transliterate (Replaceable& text, UTransPosition& index, UErrorCode& status) const |
Transliterates the portion of the text buffer that can be transliterated unambiguously. More... | |
virtual void | finishTransliteration (Replaceable& text, UTransPosition& index) const |
Finishes any pending transliterations that were waiting for more characters. More... | |
int32_t | getMaximumContextLength (void) const |
Returns the length of the longest context required by this transliterator. More... | |
virtual const UnicodeString& | getID (void) const |
Returns a programmatic identifier for this transliterator. More... | |
virtual const UnicodeFilter* | getFilter (void) const |
Returns the filter used by this transliterator, or NULL if this transliterator uses no filter. More... | |
UnicodeFilter* | orphanFilter (void) |
Returns the filter used by this transliterator, or NULL if this transliterator uses no filter. More... | |
virtual void | adoptFilter (UnicodeFilter* adoptedFilter) |
Changes the filter used by this transliterator. More... | |
Transliterator* | createInverse (void) const |
Returns this transliterator's inverse. More... | |
Static Public Methods | |
UnicodeString& | getDisplayName (const UnicodeString& ID, UnicodeString& result) |
Returns a name for this transliterator that is appropriate for display to the user in the default locale. More... | |
UnicodeString& | getDisplayName (const UnicodeString& ID, const Locale& inLocale, UnicodeString& result) |
Returns a name for this transliterator that is appropriate for display to the user in the given locale. More... | |
Transliterator* | createInstance (const UnicodeString& ID, UTransDirection dir = UTRANS_FORWARD, UParseError* parseError = 0) |
Returns a Transliterator object given its ID. More... | |
void | registerInstance (Transliterator* adoptedObj, UErrorCode& status) |
Registers an instance obj of a subclass of Transliterator with the system. More... | |
void | unregister (const UnicodeString& ID) |
Unregisters a transliterator or class. More... | |
int32_t | countAvailableIDs (void) |
Return the number of IDs currently registered with the system. More... | |
const UnicodeString& | getAvailableID (int32_t index) |
Return the index-th available ID. More... | |
Protected Methods | |
Transliterator (const UnicodeString& ID, UnicodeFilter* adoptedFilter) | |
Default constructor. More... | |
Transliterator (const Transliterator&) | |
Copy constructor. | |
Transliterator& | operator= (const Transliterator&) |
Assignment operator. | |
virtual void | handleTransliterate (Replaceable& text, UTransPosition& index, UBool incremental) const = 0 |
Abstract method that concrete subclasses define to implement keyboard transliteration. More... | |
void | setMaximumContextLength (int32_t maxContextLength) |
Method for subclasses to use to set the maximum context length. More... | |
UChar | filteredCharAt (const Replaceable& text, int32_t i) const |
Method for subclasses to use to obtain a character in the given string, with filtering. More... | |
void | setID (const UnicodeString& id) |
Sets the ID of this transliterator. More... | |
Private Methods | |
void | _transliterate (Replaceable& text, UTransPosition& index, const UnicodeString* insertion, UErrorCode &status) const |
This internal method does incremental transliteration. More... | |
Private Attributes | |
UnicodeString | ID |
Programmatic name, e.g., "Latin-Arabic". More... | |
UnicodeFilter* | filter |
This transliterator's filter. More... | |
int32_t | maximumContextLength |
Static Private Methods | |
Transliterator* | _createInstance (const UnicodeString& ID, UnicodeString& aliasReturn, UParseError* parseError = 0) |
Returns a transliterator object given its ID. More... | |
void | _registerInstance (Transliterator* adoptedPrototype, UErrorCode &status) |
This internal method registers a prototype instance in the cache. More... | |
void | _unregister (const UnicodeString& ID) |
Unregisters a transliterator or class. More... | |
UBool | compareIDs (void* a, void* b) |
Comparison function for UVector. More... | |
void | initializeCache (void) |
Static Private Attributes | |
Hashtable* | cache |
Cache of public system transliterators. More... | |
Hashtable* | internalCache |
Like 'cache', but IDs are not public. More... | |
UMTX | cacheMutex |
The mutex controlling access to the caches. More... | |
UBool | cacheInitialized |
When set to TRUE, the cache has been initialized. More... | |
const char* | RB_DISPLAY_NAME_PREFIX |
Prefix for resource bundle key for the display name for a transliterator. More... | |
const char* | RB_SCRIPT_DISPLAY_NAME_PREFIX |
Prefix for resource bundle key for the display name for a transliterator SCRIPT. More... | |
const char* | RB_DISPLAY_NAME_PATTERN |
Resource bundle key for display name pattern. More... | |
const char* | RB_RULE_BASED_IDS |
Resource bundle key for the list of RuleBasedTransliterator IDs. More... | |
const char* | RB_RULE |
Resource bundle key for the RuleBasedTransliterator rule. More... | |
UVector | cacheIDs |
Vector of registered IDs. More... | |
const UChar | ID_SEP |
const UChar | ID_DELIM |
Friends | |
class | CompoundTransliterator |
Transliterator is an abstract class that transliterates text from one format to another.
The most common kind of transliterator is a script, or alphabet, transliterator. For example, a Russian to Latin transliterator changes Russian text written in Cyrillic characters to phonetically equivalent Latin characters. It does not translate Russian to English! Transliteration, unlike translation, operates on characters, without reference to the meanings of words and sentences.
Although script conversion is its most common use, a transliterator can actually perform a more general class of tasks. In fact, Transliterator defines a very general API which specifies only that a segment of the input text is replaced by new text. The particulars of this conversion are determined entirely by subclasses of Transliterator.
Transliterators are stateless
Transliterator objects are stateless; they retain no information between calls to transliterate(). (However, this does not mean that threads may share transliterators without synchronizing them. Transliterators are not immutable, so they must be synchronized when shared between threads.) Statelessness might seem to limit the complexity of the transliteration operation. In practice, subclasses perform complex transliterations by delaying the replacement of text until it is known that no other replacements are possible. In other words, although the Transliterator objects are stateless, the source text itself embodies all the needed information, and delayed operation allows arbitrary complexity.
Batch transliteration
The simplest way to perform transliteration is all at once, on a string of existing text. This is referred to as batch transliteration. For example, given a text buffer input and a transliterator t, the call
t.transliterate(input);
will transliterate input in place. Other methods allow the client to specify a segment to be transliterated. All of these methods operate on Replaceable objects, which can preserve out-of-band information (such as text styles).
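Batch transliteration over a segment can be sketched without ICU. The following is a minimal, self-contained model, not the library's implementation: Rule, transliterateRange, and transliterateAll are hypothetical names, a std::u32string stands in for a Replaceable, and the convention of returning the new limit is an assumption made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical rule type: replace the first string with the second.
using Rule = std::pair<std::u32string, std::u32string>;

// Transliterates text[start, limit) in place and returns the new limit.
// Rules are tried in order at each position; the first match wins.
std::size_t transliterateRange(std::u32string& text, std::size_t start,
                               std::size_t limit,
                               const std::vector<Rule>& rules) {
    assert(start <= limit && limit <= text.size());
    std::size_t pos = start;
    while (pos < limit) {
        bool matched = false;
        for (const Rule& rule : rules) {
            const std::u32string& from = rule.first;
            if (!from.empty() && from.size() <= limit - pos &&
                text.compare(pos, from.size(), from) == 0) {
                text.replace(pos, from.size(), rule.second);
                // The segment grows or shrinks with the replacement.
                limit = limit - from.size() + rule.second.size();
                pos += rule.second.size();
                matched = true;
                break;
            }
        }
        if (!matched) ++pos;  // no rule applies; leave the character alone
    }
    return limit;
}

// Convenience wrapper: transliterate the entire string in place.
void transliterateAll(std::u32string& text, const std::vector<Rule>& rules) {
    transliterateRange(text, 0, text.size(), rules);
}
```

Text outside the requested segment is left untouched, and the returned limit tells the caller where the transliterated segment now ends.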
Keyboard transliteration
Somewhat more involved is keyboard, or incremental transliteration. This is the transliteration of text that is arriving from some source (typically the user's keyboard) one character at a time, or in some other piecemeal fashion.
In keyboard transliteration, a Replaceable buffer stores the text. As text is inserted, as much as possible is transliterated on the fly. This means a GUI that displays the contents of the buffer may show text being modified as each new character arrives.
Consider the simple RuleBasedTransliterator:
th>{theta}
t>{tau}
When the user types 't', nothing will happen, since the transliterator is waiting to see if the next character is 'h'. To remedy this, we introduce the notion of a cursor, marked by a '|' in the output string:
t>|{tau}
{tau}h>{theta}
Now when the user types 't', tau appears, and if the next character is 'h', the tau changes to a theta. This is accomplished by maintaining a cursor position (independent of the insertion point, and invisible in the GUI) across calls to transliterate(). Typically, the cursor will be coincident with the insertion point, but in a case like the one above, it will precede the insertion point.
Keyboard transliteration methods maintain a set of three indices that are updated with each call to transliterate(): the cursor, start, and limit. Since these indices are changed by the method, they are passed by reference in a UTransPosition (the index parameter). The START index marks the beginning of the substring that the transliterator will look at. It is advanced as text becomes committed (but it is not the committed index; that's the CURSOR). The CURSOR index, described above, marks the point at which the transliterator last stopped, either because it reached the end, or because it required more characters to disambiguate between possible inputs. The CURSOR can also be explicitly set by rules in a RuleBasedTransliterator. Any characters before the CURSOR index are frozen; future keyboard transliteration calls within this input sequence will not change them. New text is inserted at the LIMIT index, which marks the end of the substring that the transliterator looks at.
Because keyboard transliteration assumes that more characters are to arrive, it is conservative in its operation. It only transliterates when it can do so unambiguously. Otherwise it waits for more characters to arrive. When the client code knows that no more characters are forthcoming, perhaps because the user has performed some input termination operation, then it should call finishTransliteration() to complete any pending transliterations.
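The conservative behaviour described here can be illustrated with a small standalone model of the first rule set (th>{theta}, t>{tau}). This is only a sketch under stated assumptions: Position, typeChar, and finish are hypothetical names, Position is not the layout of UTransPosition, the START index is omitted for brevity, and "ambiguous" is simplified to mean that the pending text is a proper prefix of some rule's source.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Rule = std::pair<std::u32string, std::u32string>;  // hypothetical

// Hypothetical index pair; text before `cursor` is frozen, new text is
// inserted at `limit`.
struct Position {
    std::size_t cursor;
    std::size_t limit;
};

// True if text[pos, limit) is a proper prefix of some rule's source, so
// more input could still produce a longer match.
static bool mightGrow(const std::u32string& text, std::size_t pos,
                      std::size_t limit, const std::vector<Rule>& rules) {
    std::size_t avail = limit - pos;
    for (const Rule& r : rules) {
        if (r.first.size() > avail &&
            r.first.compare(0, avail, text, pos, avail) == 0) return true;
    }
    return false;
}

// Applies rules from the cursor up to the limit. In incremental mode it
// stops as soon as the pending text is ambiguous.
static void handle(std::u32string& text, Position& pos,
                   const std::vector<Rule>& rules, bool incremental) {
    assert(pos.cursor <= pos.limit && pos.limit <= text.size());
    while (pos.cursor < pos.limit) {
        if (incremental && mightGrow(text, pos.cursor, pos.limit, rules)) return;
        bool matched = false;
        for (const Rule& r : rules) {
            std::size_t n = r.first.size();
            if (n != 0 && n <= pos.limit - pos.cursor &&
                text.compare(pos.cursor, n, r.first) == 0) {
                text.replace(pos.cursor, n, r.second);
                pos.limit = pos.limit - n + r.second.size();
                pos.cursor += r.second.size();
                matched = true;
                break;
            }
        }
        if (!matched) ++pos.cursor;
    }
}

// One keyboard event: insert a character at the limit, then transliterate
// whatever is unambiguous.
void typeChar(std::u32string& text, Position& pos, char32_t c,
              const std::vector<Rule>& rules) {
    text.insert(pos.limit, 1, c);
    ++pos.limit;
    handle(text, pos, rules, true);
}

// End of input: resolve anything that was waiting for more characters.
void finish(std::u32string& text, Position& pos,
            const std::vector<Rule>& rules) {
    handle(text, pos, rules, false);
}
```

In this model, typing 't' leaves the buffer unchanged with the cursor still at 0, typing 'h' then yields theta, and a trailing 't' stays pending until finish() converts it to tau.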
Inverses
Pairs of transliterators may be inverses of one another. For example, if transliterator A transliterates characters by incrementing their Unicode value (so "abc" -> "def"), and transliterator B decrements character values, then A is an inverse of B and vice versa. If we compose A with B in a compound transliterator, the result is the identity transliterator, that is, a transliterator that does not change its input text.
The Transliterator method createInverse() returns a transliterator's inverse, if one exists, or NULL otherwise. However, the result of createInverse() usually will not be a true mathematical inverse. This is because true inverse transliterators are difficult to formulate. For example, consider two transliterators: AB, which transliterates the character 'A' to 'B', and BA, which transliterates 'B' to 'A'. It might seem that these are exact inverses, since
"A" x AB -> "B"
"B" x BA -> "A"
where 'x' represents transliteration. However,
"ABCD" x AB -> "BBCD"
"BBCD" x BA -> "AACD"
so AB composed with BA is not the identity. Nonetheless, BA may be usefully considered to be AB's inverse, and it is on this basis that AB.createInverse() could legitimately return BA.
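The AB/BA example is easy to reproduce in a few lines. The sketch below is purely illustrative and does not use ICU; mapChar, roundTrip, and checkInverseExamples are hypothetical helpers that model a single-character transliterator over std::string.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical single-character transliterator: maps every `from` to `to`.
std::string mapChar(std::string text, char from, char to) {
    std::replace(text.begin(), text.end(), from, to);
    return text;
}

// Applies AB, then BA.
std::string roundTrip(const std::string& text) {
    return mapChar(mapChar(text, 'A', 'B'), 'B', 'A');
}

// Reproduces the examples from the text; aborts if an expectation fails.
void checkInverseExamples() {
    assert(mapChar("A", 'A', 'B') == "B");
    assert(mapChar("B", 'B', 'A') == "A");
    assert(mapChar("ABCD", 'A', 'B') == "BBCD");
    assert(mapChar("BBCD", 'B', 'A') == "AACD");
    assert(roundTrip("ABCD") != "ABCD");  // the composition is not the identity
}
```

The round trip loses information because AB maps two distinct inputs, 'A' and 'B', to the same output.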
IDs and display names
A transliterator is designated by a short identifier string or ID. IDs follow the format source-destination, where source describes the entity being replaced, and destination describes the entity replacing source. The entities may be the names of scripts, particular sequences of characters, or whatever else it is that the transliterator converts to or from. For example, a transliterator from Russian to Latin might be named "Russian-Latin". A transliterator from keyboard escape sequences to Latin-1 characters might be named "KeyboardEscape-Latin1". By convention, system entity names are in English, with the initial letters of words capitalized; user entity names may follow any format so long as they do not contain dashes.
In addition to programmatic IDs, transliterator objects have display names for presentation in user interfaces, returned by getDisplayName().
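Because an ID has the form source-destination, the ID of a candidate inverse can be derived by swapping the two entities. A minimal sketch, assuming entity names contain no dashes; invertId is a hypothetical helper, and returning a dash-free ID unchanged is a choice made for this sketch rather than documented behaviour.

```cpp
#include <cassert>
#include <string>

// Swaps the two entities of a "source-destination" ID.
// An ID without a dash is returned unchanged (an assumption of this sketch).
std::string invertId(const std::string& id) {
    assert(!id.empty());
    std::string::size_type dash = id.find('-');
    if (dash == std::string::npos) return id;
    return id.substr(dash + 1) + "-" + id.substr(0, dash);
}
```

With this helper, "Russian-Latin" becomes "Latin-Russian" and "KeyboardEscape-Latin1" becomes "Latin1-KeyboardEscape".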
Factory methods and registration
In general, client code should use the factory method createInstance() to obtain an instance of a transliterator given its ID. Valid IDs may be enumerated using countAvailableIDs() and getAvailableID(). Since transliterators are mutable, multiple calls to createInstance() with the same ID will return distinct objects.
In addition to the system transliterators registered at startup, user transliterators may be registered by calling registerInstance() at run time. A registered instance acts as a template; future calls to createInstance() with the ID of the registered object return clones of that object. Thus any object passed to registerInstance() must implement clone() properly. To register a transliterator subclass without instantiating it (until it is needed), users may call registerClass(). In this case, the objects are instantiated by invoking the zero-argument public constructor of the class.
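The registered-instance-as-template idea is the prototype pattern. The following standalone sketch shows the shape of it under stated assumptions: Toy and Registry are hypothetical stand-ins, not ICU classes, and the real cache also requires locking, which is omitted here.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a transliterator; only what registration needs.
class Toy {
public:
    explicit Toy(std::string id) : id_(std::move(id)) {}
    virtual ~Toy() = default;
    // Subclasses would override this to copy their own state.
    virtual std::unique_ptr<Toy> clone() const {
        return std::unique_ptr<Toy>(new Toy(*this));
    }
    const std::string& getID() const { return id_; }
private:
    std::string id_;
};

// Hypothetical registry keyed by ID. It owns one prototype per ID.
class Registry {
public:
    // Adopts the prototype; an existing entry with the same ID is replaced.
    void registerInstance(std::unique_ptr<Toy> adopted) {
        assert(adopted != nullptr);
        std::string id = adopted->getID();
        prototypes_[id] = std::move(adopted);
    }
    void unregister(const std::string& id) { prototypes_.erase(id); }
    // Returns a fresh clone of the prototype, or null for an unknown ID.
    std::unique_ptr<Toy> createInstance(const std::string& id) const {
        auto it = prototypes_.find(id);
        if (it == prototypes_.end()) return nullptr;
        return it->second->clone();
    }
private:
    std::map<std::string, std::unique_ptr<Toy>> prototypes_;
};
```

Each createInstance() call hands back a distinct object, so callers can mutate their copy (for example, change its filter) without affecting the registered prototype.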
Subclassing
Subclasses must implement the abstract method handleTransliterate().
Subclasses should override the transliterate() method taking a Replaceable and the transliterate() method taking a String and StringBuffer if the performance of these methods can be improved over the performance obtained by the default implementations in this class.
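The division of labour between the public transliterate() methods and the abstract hook can be sketched as a template-method design. This is a simplified standalone model, not the ICU class: ToyTransliterator, ToyUpper, and Span are hypothetical, and std::string stands in for Replaceable.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical segment descriptor passed to the hook.
struct Span {
    std::size_t start;
    std::size_t limit;
};

class ToyTransliterator {
public:
    virtual ~ToyTransliterator() = default;
    // Default whole-string implementation, written in terms of the hook.
    // A subclass may override it if it can do the job faster.
    virtual void transliterate(std::string& text) const {
        Span span{0, text.size()};
        handleTransliterate(text, span, false);
    }
protected:
    // Concrete subclasses transliterate text[span.start, span.limit)
    // and advance span.start past what they have processed.
    virtual void handleTransliterate(std::string& text, Span& span,
                                     bool incremental) const = 0;
};

// Example subclass: upper-cases ASCII letters in the segment.
class ToyUpper : public ToyTransliterator {
protected:
    void handleTransliterate(std::string& text, Span& span,
                             bool /*incremental*/) const override {
        assert(span.limit <= text.size());
        for (; span.start < span.limit; ++span.start) {
            unsigned char c = static_cast<unsigned char>(text[span.start]);
            text[span.start] = static_cast<char>(std::toupper(c));
        }
    }
};
```

A subclass only has to supply the hook; the inherited transliterate() then works unchanged.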
Definition at line 225 of file translit.h.
|
Default constructor.
|
|
Copy constructor.
|
|
Destructor.
|
|
Returns a transliterator object given its ID. Unlike createInstance(), this method returns NULL if it cannot make use of the given ID.
|
|
This internal method registers a prototype instance in the cache. The CALLER MUST MUTEX using cacheMutex before calling this method. |
|
This internal method does incremental transliteration. If the 'insertion' is non-null then we append it to 'text' before proceeding. This method calls through to the pure virtual framework method handleTransliterate() to do the actual work. |
|
Unregisters a transliterator or class. Internal method. Prerequisites: The cache must be initialized, and the caller must own the cacheMutex. |
|
Changes the filter used by this transliterator.
If the filter is set to
Callers must take care if a transliterator is in use by multiple threads. The filter should not be changed by one thread while another thread may be transliterating.
Reimplemented in CompoundTransliterator. |
|
Implements Cloneable.
All subclasses are encouraged to implement this method if it is possible and reasonable to do so. Subclasses that are to be registered with the system using
Reimplemented in CompoundTransliterator, HangulJamoTransliterator, HexToUnicodeTransliterator, JamoHangulTransliterator, NullTransliterator, RuleBasedTransliterator, RemoveTransliterator, and UnicodeToHexTransliterator.
Definition at line 384 of file translit.h. |
|
Comparison function for UVector. Compares two UnicodeString objects given void* pointers to them. |
|
Return the number of IDs currently registered with the system. To retrieve the actual IDs, call getAvailableID(i) with i from 0 to countAvailableIDs() - 1.
|
|
Returns a
The ID must be either a system transliterator ID or an ID registered using
|
|
Returns this transliterator's inverse.
See the class documentation for details. This implementation simply inverts the two entities in the ID and attempts to retrieve the resulting transliterator. That is, if
This method does not take filtering into account. The returned transliterator will have no filter.
Subclasses with knowledge of their inverse may wish to override this method.
|
|
Method for subclasses to use to obtain a character in the given string, with filtering. If the character at the given offset is excluded by this transliterator's filter, then U+FFFE is returned. |
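The sentinel behaviour can be sketched as follows. This is a hypothetical model rather than the member function itself: the filter is represented by a std::function predicate that returns true for characters the transliterator may process, and an empty function means that no filter is set.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Sentinel returned for characters excluded by the filter.
const char16_t kFilteredChar = 0xFFFE;

// Returns text[i], or U+FFFE if the filter excludes that character.
char16_t filteredCharAt(const std::u16string& text, std::size_t i,
                        const std::function<bool(char16_t)>& filter) {
    assert(i < text.size());
    char16_t c = text[i];
    if (filter && !filter(c)) return kFilteredChar;
    return c;
}
```

Because U+FFFE is a noncharacter, ordinary rules should never match it, so filtered positions act as barriers during matching.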
|
Finishes any pending transliterations that were waiting for more characters.
Clients should call this method as the last call after a sequence of one or more calls to
|
|
Return the index-th available ID. index must be between 0 and countAvailableIDs() - 1, inclusive. If index is out of range, the result of getAvailableID(0) is returned.
|
|
Returns a name for this transliterator that is appropriate for display to the user in the given locale.
This name is taken from the locale resource data in the standard manner of the
If no localized names exist in the system resource bundles, a name is synthesized using a localized
|
|
Returns a name for this transliterator that is appropriate for display to the user in the default locale. See #getDisplayName(Locale) for details.
|
|
Returns the filter used by this transliterator, or
|
|
Returns a programmatic identifier for this transliterator.
If this identifier is passed to
|
|
Returns the length of the longest context required by this transliterator.
This is preceding context. The default implementation supplied by
Definition at line 849 of file translit.h. |
|
Abstract method that concrete subclasses define to implement keyboard transliteration.
This method should transliterate all characters between
Reimplemented in CompoundTransliterator, HangulJamoTransliterator, HexToUnicodeTransliterator, JamoHangulTransliterator, NullTransliterator, RuleBasedTransliterator, RemoveTransliterator, and UnicodeToHexTransliterator. |
|
|
|
Assignment operator.
|
|
Returns the filter used by this transliterator, or
The caller must eventually delete the result. After this call, this transliterator's filter is set to |
|
Registers an instance
When After this call the Transliterator class owns the adoptedObj and will delete it.
|
|
Sets the ID of this transliterator. Subclasses shouldn't do this, unless the underlying script behavior has changed. Definition at line 853 of file translit.h. |
|
Method for subclasses to use to set the maximum context length.
|
|
Transliterates the portion of the text buffer that can be transliterated unambiguously. This is a convenience method; see #transliterate(Replaceable, for details.
|
|
Transliterates the portion of the text buffer that can be transliterated unambiguously after a new character has been inserted, typically as a result of a keyboard event. This is a convenience method; see #transliterate(Replaceable, for details.
|
|
Transliterates the portion of the text buffer that can be transliterated unambiguously after new text has been inserted, typically as a result of a keyboard event.
The new text in
Upon return, values in
Typical usage of this method begins with an initial call with
This method assumes that future calls may be made that will insert new text into the buffer. As a result, it only performs unambiguous transliterations. After the last call to this method, there may be untransliterated text that is waiting for more input to resolve an ambiguity. In order to perform these pending transliterations, clients should call #finishTransliteration after the last call to this method has been made.
|
|
Transliterates an entire string in place. Convenience method.
|
|
Transliterates a segment of a string, with optional filtering.
|
|
Unregisters a transliterator or class. This may be either a system transliterator or a user transliterator or class.
|
|
Definition at line 575 of file translit.h. |
|
Programmatic name, e.g., "Latin-Arabic".
Reimplemented in NullTransliterator, and RemoveTransliterator. Definition at line 232 of file translit.h. |
|
Definition at line 846 of file translit.h. |
|
Definition at line 845 of file translit.h. |
|
Resource bundle key for display name pattern. The resource bundle value should be a String forming a MessageFormat pattern, e.g.: "{0,choice,0#|1#{1} Transliterator|2#{1} to {2} Transliterator}". Definition at line 326 of file translit.h. |
|
Prefix for resource bundle key for the display name for a transliterator. The ID is appended to this to form the key. The resource bundle value should be a String. Definition at line 311 of file translit.h. |
|
Resource bundle key for the RuleBasedTransliterator rule.
Definition at line 339 of file translit.h. |
|
Resource bundle key for the list of RuleBasedTransliterator IDs. The resource bundle value should be a String[] with each element being a valid ID. The ID will be appended to RB_RULE_BASED_PREFIX to obtain the class name in which the RB_RULE key will be sought. Definition at line 334 of file translit.h. |
|
Prefix for resource bundle key for the display name for a transliterator SCRIPT. The ID is appended to this to form the key. The resource bundle value should be a String. Definition at line 318 of file translit.h. |
|
Cache of public system transliterators. Keys are UnicodeString names, values are CacheEntry objects. Definition at line 248 of file translit.h. |
|
Vector of registered IDs.
Definition at line 797 of file translit.h. |
|
When set to TRUE, the cache has been initialized. Any code must check this boolean before accessing the cache, and if the boolean is FALSE, it must call initializeCache(). We do this form of lazy evaluation for two reasons: (1) so we don't initialize if we don't have to (i.e., if no one is using Transliterator, but has included the code as part of a shared library), and (2) to avoid static initialization problems. Definition at line 270 of file translit.h. |
|
The mutex controlling access to the caches.
Definition at line 259 of file translit.h. |
|
This transliterator's filter.
Any character for which Definition at line 240 of file translit.h. |
|
Like 'cache', but IDs are not public. Internal transliterators are combined together and aliased to public IDs. Definition at line 254 of file translit.h. |
|
Definition at line 242 of file translit.h. |