
RuleBasedTransliterator Class Reference

RuleBasedTransliterator is a transliterator that reads a set of rules in order to determine how to perform translations.

#include <rbt.h>

Inherits Transliterator.

Public Types

enum  {
  PARSE_ERROR_BASE = 0x10000, BAD_VARIABLE_DEFINITION, MALFORMED_RULE, MALFORMED_SET,
  MALFORMED_SYMBOL_REFERENCE, MALFORMED_UNICODE_ESCAPE, MALFORMED_VARIABLE_DEFINITION, MALFORMED_VARIABLE_REFERENCE,
  MISMATCHED_SEGMENT_DELIMITERS, MISPLACED_ANCHOR_START, MISPLACED_CURSOR_OFFSET, MISSING_OPERATOR,
  MISSING_SEGMENT_CLOSE, MULTIPLE_ANTE_CONTEXTS, MULTIPLE_CURSORS, MULTIPLE_POST_CONTEXTS,
  TRAILING_BACKSLASH, UNDEFINED_SEGMENT_REFERENCE, UNDEFINED_VARIABLE, UNQUOTED_SPECIAL,
  UNTERMINATED_QUOTE
}
 Parse error codes generated by RuleBasedTransliterator.


Public Methods

 RuleBasedTransliterator (const UnicodeString& id, const UnicodeString& rules, UTransDirection direction, UnicodeFilter* adoptedFilter, UParseError& parseError, UErrorCode& status)
 Constructs a new transliterator from the given rules.

 RuleBasedTransliterator (const UnicodeString& id, const UnicodeString& rules, UTransDirection direction, UnicodeFilter* adoptedFilter, UErrorCode& status)
 Constructs a new transliterator from the given rules.

 RuleBasedTransliterator (const UnicodeString& id, const UnicodeString& rules, UTransDirection direction, UErrorCode& status)
 Convenience constructor with no filter.

 RuleBasedTransliterator (const UnicodeString& id, const UnicodeString& rules, UErrorCode& status)
 Convenience constructor with no filter and FORWARD direction.

 RuleBasedTransliterator (const UnicodeString& id, const UnicodeString& rules, UnicodeFilter* adoptedFilter, UErrorCode& status)
 Convenience constructor with FORWARD direction.

 RuleBasedTransliterator (const UnicodeString& id, const TransliterationRuleData* theData, UnicodeFilter* adoptedFilter = 0)
 Convenience constructor.

 RuleBasedTransliterator (const RuleBasedTransliterator&)
 Copy constructor.

virtual ~RuleBasedTransliterator ()
virtual Transliterator* clone (void) const
 Implements the Transliterator API.

virtual void handleTransliterate (Replaceable& text, UTransPosition& offsets, UBool isIncremental) const
 Implements Transliterator#handleTransliterate.


Private Methods

void _construct (const UnicodeString& rules, UTransDirection direction, UErrorCode& status, UParseError* parseError = 0)

Private Attributes

TransliterationRuleData* data
 The data object is immutable, so we can freely share it with other instances of RBT, as long as we do NOT own this object.

UBool isDataOwned
 If true, we own the data object and must delete it.


Detailed Description

RuleBasedTransliterator is a transliterator that reads a set of rules in order to determine how to perform translations.

Rule sets are stored in resource bundles indexed by name. Rules within a rule set are separated by semicolons (';'). To include a literal semicolon, prefix it with a backslash ('\'). Whitespace, as defined by Character.isWhitespace(), is ignored. If the first non-blank character on a line is '#', the entire line is ignored as a comment.
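The separation conventions just described can be sketched in a few lines of standard C++. This is only an illustration under stated assumptions, not the parser in rbt.h: the function name splitRules is hypothetical, std::isspace stands in for Character.isWhitespace(), a backslash escape is kept as a two-character pair, and quoting is not handled at all (so whitespace inside single quotes would be dropped too).

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits rule source text into individual rule statements.
// Assumptions: '\' escapes the next character, whitespace is dropped,
// and a line whose first non-blank character is '#' is a comment.
std::vector<std::string> splitRules(const std::string& source) {
    std::vector<std::string> rules;
    std::string current;
    std::istringstream lines(source);
    std::string line;
    while (std::getline(lines, line)) {
        const std::size_t first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos || line[first] == '#') {
            continue;  // blank line or comment line
        }
        for (std::size_t i = first; i < line.size(); ++i) {
            const char c = line[i];
            if (c == '\\' && i + 1 < line.size()) {
                current += c;          // keep the escape pair intact,
                current += line[++i];  // so "\;" does not end the rule
            } else if (c == ';') {
                if (!current.empty()) rules.push_back(current);
                current.clear();
            } else if (!std::isspace(static_cast<unsigned char>(c))) {
                current += c;
            }
        }
    }
    if (!current.empty()) rules.push_back(current);  // unterminated last rule
    return rules;
}
```

Calling splitRules on a source containing a comment line followed by "$v = [aeiou];" and "$v > '*';" would return the two statements with whitespace removed.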

Each set of rules consists of two groups, one forward, and one reverse. This is a convention that is not enforced; rules for one direction may be omitted, with the result that translations in that direction will not modify the source text. In addition, bidirectional forward-reverse rules may be specified for symmetrical transformations.

Rule syntax

Rule statements take one of the following forms:

$alefmadda=\u0622;
Variable definition. The name on the left is assigned the text on the right. In this example, after this statement, instances of the left hand name, "$alefmadda", will be replaced by the Unicode character U+0622. Variable names must begin with a letter and consist only of letters, digits, and underscores. Case is significant. Duplicate names cause an exception to be thrown, that is, variables cannot be redefined. The right hand side may contain well-formed text of any length, including no text at all ("$empty=;"). The right hand side may contain embedded UnicodeSet patterns, for example, "$softvowel=[eiyEIY]".
ai>$alefmadda;
Forward translation rule. This rule states that the string on the left will be changed to the string on the right when performing forward transliteration.
ai<$alefmadda;
Reverse translation rule. This rule states that the string on the right will be changed to the string on the left when performing reverse transliteration.

ai<>$alefmadda;
Bidirectional translation rule. This rule states that the string on the left will be changed to the string on the right when performing forward transliteration, and vice versa when performing reverse transliteration.

Translation rules consist of a match pattern and an output string. The match pattern consists of literal characters, optionally preceded by context, and optionally followed by context. Context characters, like literal pattern characters, must be matched in the text being transliterated. However, unlike literal pattern characters, they are not replaced by the output text. For example, the pattern "abc{def}" indicates the characters "def" must be preceded by "abc" for a successful match. If there is a successful match, "def" will be replaced, but not "abc". The final '}' is optional, so "abc{def" is equivalent to "abc{def}". Another example is "{123}456" (or "123}456") in which the literal pattern "123" must be followed by "456".

The output string of a forward or reverse rule consists of characters to replace the literal pattern characters. If the output string contains the character '|', this is taken to indicate the location of the cursor after replacement. The cursor is the point in the text at which the next replacement, if any, will be applied. The cursor is usually placed within the replacement text; however, it can actually be placed into the preceding or following context by using the special character '@'. Examples:

a {foo} z > | @ bar; # foo -> bar, move cursor before a
{foo} xyz > bar @@|; # foo -> bar, cursor between y and z
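One way to picture the cursor markers is as an offset relative to the start of the replacement text. The sketch below is an assumption-laden illustration, not the library's parser: ParsedOutput and parseOutput are hypothetical names, each '@' is assumed to stand for exactly one character of surrounding context, an output without '|' is assumed to leave the cursor just after the replacement, and quoting and escapes are ignored.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type: replacement text plus cursor offset.
struct ParsedOutput {
    std::string text;  // replacement text with '|' and '@' removed
    int cursor = 0;    // offset from the start of text; may be negative
};

// Interprets an output string such as "| @ bar" or "x|y".
ParsedOutput parseOutput(const std::string& spec) {
    ParsedOutput out;
    int logical = 0;         // positions consumed so far ('@' or text chars)
    int leadingAts = 0;      // '@' placeholders seen before any text
    int cursorLogical = -1;  // logical position of '|', or -1 if absent
    for (char c : spec) {
        if (c == ' ') continue;  // whitespace is ignored
        if (c == '|') {
            cursorLogical = logical;
        } else if (c == '@') {
            if (out.text.empty()) ++leadingAts;  // placeholder in preceding context
            ++logical;
        } else {
            out.text += c;
            ++logical;
        }
    }
    // Assumption: with no '|', the cursor lands after the replacement.
    out.cursor = (cursorLogical < 0) ? static_cast<int>(out.text.size())
                                     : cursorLogical - leadingAts;
    return out;
}
```

Under these assumptions, "| @ bar" yields the text "bar" with cursor offset -1 (one character before the replacement), and "x|y" yields "xy" with cursor offset 1.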

UnicodeSet

UnicodeSet patterns may appear anywhere that makes sense. They may appear in variable definitions. Contrariwise, UnicodeSet patterns may themselves contain variable references, such as "$a=[a-z];$not_a=[^$a]", or "$range=a-z;$ll=[$range]".

UnicodeSet patterns may also be embedded directly into rule strings. Thus, the following two rules are equivalent:

$vowel=[aeiou]; $vowel>'*'; # One way to do this
[aeiou]>'*'; # Another way

See UnicodeSet for more documentation and examples.

Segments

Segments of the input string can be matched and copied to the output string. This makes certain sets of rules simpler and more general, and makes reordering possible. For example:

([a-z]) > $1 $1; # double lowercase letters
([:Lu:]) ([:Ll:]) > $2 $1; # reverse order of Lu-Ll pairs

The segment of the input string to be copied is delimited by "(" and ")". Up to nine segments may be defined. Segments may not overlap. In the output string, "$1" through "$9" represent the input string segments, in left-to-right order of definition.
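The effect of the two segment rules can be imitated with std::regex capture groups, which use the same "$1", "$2" notation in the replacement string. This is an analogy only, not how RuleBasedTransliterator is implemented: the function names are hypothetical, and the ASCII ranges [A-Z] and [a-z] are stand-ins for the Unicode categories [:Lu:] and [:Ll:].

```cpp
#include <cassert>
#include <regex>
#include <string>

// Analogue of "([a-z]) > $1 $1;": doubles every lowercase ASCII letter.
std::string doubleLowercase(const std::string& text) {
    static const std::regex pattern("([a-z])");
    return std::regex_replace(text, pattern, "$1$1");
}

// Analogue of "([:Lu:]) ([:Ll:]) > $2 $1;": swaps each upper/lower pair.
// ASCII-only approximation of the Unicode categories.
std::string swapUpperLower(const std::string& text) {
    static const std::regex pattern("([A-Z])([a-z])");
    return std::regex_replace(text, pattern, "$2$1");
}
```

For instance, doubleLowercase("abC") gives "aabbC" and swapUpperLower("AbCd") gives "bAdC".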

Anchors

Patterns can be anchored to the beginning or the end of the text. This is done with the special characters '^' and '$'. For example:

^ a > 'BEG_A'; # match 'a' at start of text
a > 'A'; # match other instances of 'a'
z $ > 'END_Z'; # match 'z' at end of text
z > 'Z'; # match other instances of 'z'
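The combined effect of these four rules can be sketched by hand. The function below is a hypothetical illustration that hard-codes the rules in their listed order; it is not derived from the library source, and it writes to a separate output buffer so that replacement text is never matched again.

```cpp
#include <cassert>
#include <string>

// Hand-coded equivalent of the four anchor rules, tried in order.
std::string applyAnchorRules(const std::string& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const bool atStart = (i == 0);
        const bool atEnd = (i + 1 == text.size());
        const char c = text[i];
        if (c == 'a' && atStart) {
            out += "BEG_A";  // ^ a > 'BEG_A'
        } else if (c == 'a') {
            out += "A";      // a > 'A'
        } else if (c == 'z' && atEnd) {
            out += "END_Z";  // z $ > 'END_Z'
        } else if (c == 'z') {
            out += "Z";      // z > 'Z'
        } else {
            out += c;        // no rule matches; copy unchanged
        }
    }
    return out;
}
```

Because the anchored rules come first, applyAnchorRules("azaz") produces "BEG_AZAEND_Z": only the first 'a' and the last 'z' take the anchored replacements.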

It is also possible to match the beginning or the end of the text using a UnicodeSet. This is done by including a virtual anchor character '$' at the end of the set pattern. Although this is usually the match character for the end anchor, the set will match either the beginning or the end of the text, depending on its placement. For example:

$x = [a-z$]; # match 'a' through 'z' OR anchor
$x 1 > 2; # match '1' after a-z or at the start
3 $x > 4; # match '3' before a-z or at the end

Example

The following example rules illustrate many of the features of the rule language.

Rule 1. abc{def}>x|y
Rule 2. xyz>r
Rule 3. yz>q

Applying these rules to the string "adefabcdefz" yields the following results:

|adefabcdefz Initial state, no rules match. Advance cursor.
a|defabcdefz Still no match. Rule 1 does not match because the preceding context is not present.
ad|efabcdefz Still no match. Keep advancing until there is a match...
ade|fabcdefz ...
adef|abcdefz ...
adefa|bcdefz ...
adefab|cdefz ...
adefabc|defz Rule 1 matches; replace "def" with "xy" and back up the cursor to before the 'y'.
adefabcx|yz Although "xyz" is present, rule 2 does not match because the cursor is before the 'y', not before the 'x'. Rule 3 does match. Replace "yz" with "q".
adefabcxq| The cursor is at the end; transliteration is complete.

The order of rules is significant. If multiple rules may match at some point, the first matching rule is applied.
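The trace above can be reproduced with a small simulation. This is a minimal sketch assuming literal-only rules with an optional preceding context; the Rule struct and the functions matchesAt, transliterate and exampleRules are hypothetical names, not part of rbt.h. A rule whose cursor offset is 0 could loop forever in this sketch, so the example rules all advance the cursor.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical simplified rule: literal context, key, output, cursor offset.
struct Rule {
    std::string anteContext;     // must precede the key; never replaced
    std::string key;             // literal text to be replaced
    std::string output;          // replacement text
    std::size_t cursorInOutput;  // cursor offset within output afterwards
};

// True if the rule's key starts at pos and its context precedes pos.
bool matchesAt(const Rule& r, const std::string& text, std::size_t pos) {
    const std::size_t n = r.anteContext.size();
    if (pos < n) return false;
    if (text.compare(pos - n, n, r.anteContext) != 0) return false;
    return text.compare(pos, r.key.size(), r.key) == 0;
}

// Applies the first matching rule at the cursor, else advances one character.
std::string transliterate(const std::vector<Rule>& rules, std::string text) {
    std::size_t cursor = 0;
    while (cursor < text.size()) {
        bool matched = false;
        for (const Rule& r : rules) {  // rule order is significant
            if (matchesAt(r, text, cursor)) {
                text.replace(cursor, r.key.size(), r.output);
                cursor += r.cursorInOutput;
                matched = true;
                break;
            }
        }
        if (!matched) ++cursor;
    }
    return text;
}

// Rules 1 to 3 from the example: abc{def}>x|y, xyz>r, yz>q.
std::vector<Rule> exampleRules() {
    return {
        {"abc", "def", "xy", 1},  // cursor lands between 'x' and 'y'
        {"", "xyz", "r", 1},
        {"", "yz", "q", 1},
    };
}
```

Calling transliterate(exampleRules(), "adefabcdefz") returns "adefabcxq", matching the final state of the trace: rule 2 never fires because the cursor is already past the 'x' when "xyz" appears.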

Forward and reverse rules may have an empty output string. Otherwise, an empty left or right hand side of any statement is a syntax error.

Single quotes are used to quote any character other than a digit or letter. To specify a single quote itself, inside or outside of quotes, use two single quotes in a row. For example, the rule "'>'>o''clock" changes the string ">" to the string "o'clock".
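The quoting convention can be sketched as a tiny parser for a single forward rule. The names ForwardRule and parseForwardRule are hypothetical, and the sketch assumes only what is stated above: a lone single quote opens or closes a quoted run, two single quotes in a row produce one literal quote, and the first unquoted '>' is the operator. Whitespace, contexts, and other operators are not handled.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type for a parsed forward rule.
struct ForwardRule {
    std::string from;  // text to match, with quoting removed
    std::string to;    // replacement text, with quoting removed
};

// Splits a rule at the first unquoted '>' and removes quoting.
ForwardRule parseForwardRule(const std::string& rule) {
    ForwardRule parsed;
    std::string* side = &parsed.from;  // side currently being filled
    bool inQuote = false;
    for (std::size_t i = 0; i < rule.size(); ++i) {
        const char c = rule[i];
        if (c == '\'') {
            if (i + 1 < rule.size() && rule[i + 1] == '\'') {
                *side += '\'';       // two quotes in a row: literal quote
                ++i;
            } else {
                inQuote = !inQuote;  // lone quote opens or closes a run
            }
        } else if (c == '>' && !inQuote && side == &parsed.from) {
            side = &parsed.to;       // unquoted operator: switch sides
        } else {
            *side += c;
        }
    }
    return parsed;
}
```

Applied to the rule from the text, parseForwardRule("'>'>o''clock") yields from ">" and to "o'clock".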

Notes

While a RuleBasedTransliterator is being built, it checks that the rules are added in proper order. For example, if the rule "a>x" is followed by the rule "ab>y", then the second rule will throw an exception. The reason is that the second rule can never be triggered, since the first rule always matches anything it matches. In other words, the first rule masks the second rule.
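For purely literal rules, the masking condition reduces to a prefix test: an earlier rule masks a later one when the earlier match text is a prefix of the later match text. The sketch below shows that simplified check. The function names are hypothetical, and contexts and UnicodeSet patterns, which the real check would also have to consider, are ignored.

```cpp
#include <cassert>
#include <string>
#include <vector>

// True if a rule matching earlierKey always fires before one matching laterKey.
bool masks(const std::string& earlierKey, const std::string& laterKey) {
    return laterKey.size() >= earlierKey.size() &&
           laterKey.compare(0, earlierKey.size(), earlierKey) == 0;
}

// Returns the index of the first rule masked by an earlier rule, or -1.
int firstMaskedRule(const std::vector<std::string>& keysInOrder) {
    for (std::size_t later = 1; later < keysInOrder.size(); ++later) {
        for (std::size_t earlier = 0; earlier < later; ++earlier) {
            if (masks(keysInOrder[earlier], keysInOrder[later])) {
                return static_cast<int>(later);
            }
        }
    }
    return -1;
}
```

With the keys from the text, firstMaskedRule({"a", "ab"}) reports index 1, while the reverse order {"ab", "a"} reports -1 because the longer key is tried first.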

Author(s):
Alan Liu
Draft:

Definition at line 280 of file rbt.h.


Member Enumeration Documentation

anonymous enum
 

Parse error codes generated by RuleBasedTransliterator.

See parseerr.h.

Enumeration values:
PARSE_ERROR_BASE  
BAD_VARIABLE_DEFINITION  
MALFORMED_RULE  
MALFORMED_SET  
MALFORMED_SYMBOL_REFERENCE  
MALFORMED_UNICODE_ESCAPE  
MALFORMED_VARIABLE_DEFINITION  
MALFORMED_VARIABLE_REFERENCE  
MISMATCHED_SEGMENT_DELIMITERS  
MISPLACED_ANCHOR_START  
MISPLACED_CURSOR_OFFSET  
MISSING_OPERATOR  
MISSING_SEGMENT_CLOSE  
MULTIPLE_ANTE_CONTEXTS  
MULTIPLE_CURSORS  
MULTIPLE_POST_CONTEXTS  
TRAILING_BACKSLASH  
UNDEFINED_SEGMENT_REFERENCE  
UNDEFINED_VARIABLE  
UNQUOTED_SPECIAL  
UNTERMINATED_QUOTE  

Definition at line 382 of file rbt.h.


Constructor & Destructor Documentation

RuleBasedTransliterator::RuleBasedTransliterator ( const UnicodeString & id,
const UnicodeString & rules,
UTransDirection direction,
UnicodeFilter * adoptedFilter,
UParseError & parseError,
UErrorCode & status ) [inline]
 

Constructs a new transliterator from the given rules.

Parameters:
rules   rules, separated by ';'
direction   either FORWARD or REVERSE.
Exceptions:
IllegalArgumentException   if rules are malformed or direction is invalid.
Draft:

Definition at line 421 of file rbt.h.

RuleBasedTransliterator::RuleBasedTransliterator ( const UnicodeString & id,
const UnicodeString & rules,
UTransDirection direction,
UnicodeFilter * adoptedFilter,
UErrorCode & status ) [inline]
 

Constructs a new transliterator from the given rules.

Parameters:
rules   rules, separated by ';'
direction   either FORWARD or REVERSE.
Exceptions:
IllegalArgumentException   if rules are malformed or direction is invalid.

Definition at line 439 of file rbt.h.

RuleBasedTransliterator::RuleBasedTransliterator ( const UnicodeString & id,
const UnicodeString & rules,
UTransDirection direction,
UErrorCode & status ) [inline]
 

Convenience constructor with no filter.

Draft:

Definition at line 452 of file rbt.h.

RuleBasedTransliterator::RuleBasedTransliterator ( const UnicodeString & id,
const UnicodeString & rules,
UErrorCode & status ) [inline]
 

Convenience constructor with no filter and FORWARD direction.

Draft:

Definition at line 464 of file rbt.h.

RuleBasedTransliterator::RuleBasedTransliterator ( const UnicodeString & id,
const UnicodeString & rules,
UnicodeFilter * adoptedFilter,
UErrorCode & status ) [inline]
 

Convenience constructor with FORWARD direction.

Draft:

Definition at line 475 of file rbt.h.

RuleBasedTransliterator::RuleBasedTransliterator ( const UnicodeString & id,
const TransliterationRuleData * theData,
UnicodeFilter * adoptedFilter = 0 )
 

Convenience constructor.

Draft:

RuleBasedTransliterator::RuleBasedTransliterator ( const RuleBasedTransliterator & )
 

Copy constructor.

Draft:

virtual RuleBasedTransliterator::~RuleBasedTransliterator ( ) [virtual]
 


Member Function Documentation

void RuleBasedTransliterator::_construct ( const UnicodeString & rules,
UTransDirection direction,
UErrorCode & status,
UParseError * parseError = 0 ) [private]
 

Referenced by RuleBasedTransliterator().

Transliterator * RuleBasedTransliterator::clone ( void ) const [virtual]
 

Implements the Transliterator API.

Draft:

Reimplemented from Transliterator.

void RuleBasedTransliterator::handleTransliterate ( Replaceable & text,
UTransPosition & offsets,
UBool isIncremental ) const [virtual]
 

Implements Transliterator#handleTransliterate.

Draft:

Reimplemented from Transliterator.


Member Data Documentation

TransliterationRuleData * RuleBasedTransliterator::data [private]
 

The data object is immutable, so we can freely share it with other instances of RBT, as long as we do NOT own this object.

Definition at line 286 of file rbt.h.

UBool RuleBasedTransliterator::isDataOwned [private]
 

If true, we own the data object and must delete it.

Definition at line 291 of file rbt.h.


The documentation for this class was generated from the following file:
rbt.h
Generated at Fri Dec 15 12:13:44 2000 for ICU 1.7 by doxygen1.2.3 written by Dimitri van Heesch, © 1997-2000