
RuleBasedTransliterator Class Reference

A transliterator that reads a set of rules in order to determine how to perform translations.

#include <rbt.h>

Inherits Transliterator.



Public Members

 RuleBasedTransliterator (const UnicodeString& ID, const UnicodeString& rules, Direction direction, UnicodeFilter* adoptedFilter, UErrorCode& status)
Constructs a new transliterator from the given rules.

 RuleBasedTransliterator (const UnicodeString& ID, const UnicodeString& rules, Direction direction, UErrorCode& status)
Convenience constructor with no filter.

 RuleBasedTransliterator (const UnicodeString& ID, const UnicodeString& rules, UErrorCode& status)
Convenience constructor with no filter and FORWARD direction.

 RuleBasedTransliterator (const UnicodeString& ID, const UnicodeString& rules, UnicodeFilter* adoptedFilter, UErrorCode& status)
Convenience constructor with FORWARD direction.

 RuleBasedTransliterator (const UnicodeString& ID, const TransliterationRuleData* theData, UnicodeFilter* adoptedFilter = 0)
Convenience constructor.

 RuleBasedTransliterator (const RuleBasedTransliterator&)
Copy constructor.

virtual ~RuleBasedTransliterator ()
Transliterator* clone (void) const
Implements the Transliterator API.

virtual void handleTransliterate (Replaceable& text, Position& offsets, bool_t isIncremental) const
Implements Transliterator::handleTransliterate.


Detailed Description

A transliterator that reads a set of rules in order to determine how to perform translations.

Rules are stored in resource bundles indexed by name. Rules are separated by semicolons (';'). To include a literal semicolon, prefix it with a backslash ('\;'). Whitespace, as defined by Character.isWhitespace(), is ignored. If the first non-blank character on a line is '#', the entire line is ignored as a comment.
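The separator and comment conventions above can be sketched as a small standalone splitter. This is only an illustration under stated assumptions: it uses std::string instead of UnicodeString, the function name splitRuleStatements is hypothetical and not part of ICU, std::isspace stands in for Character.isWhitespace(), and single-quote quoting is not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not ICU API): splits rule text into statements.
// - a line whose first non-blank character is '#' is skipped as a comment
// - ';' ends a statement, "\;" yields a literal semicolon
// - whitespace is dropped
std::vector<std::string> splitRuleStatements(const std::string& rules) {
    std::vector<std::string> statements;
    std::string current;
    std::istringstream lines(rules);
    std::string line;
    while (std::getline(lines, line)) {
        // Find the first non-blank character on this line.
        std::size_t first = 0;
        while (first < line.size() &&
               std::isspace(static_cast<unsigned char>(line[first]))) {
            ++first;
        }
        if (first < line.size() && line[first] == '#') {
            continue;  // comment line
        }
        for (std::size_t i = first; i < line.size(); ++i) {
            const char c = line[i];
            if (c == '\\' && i + 1 < line.size() && line[i + 1] == ';') {
                current += ';';  // escaped semicolon stays in the statement
                ++i;
            } else if (c == ';') {
                if (!current.empty()) statements.push_back(current);
                current.clear();
            } else if (!std::isspace(static_cast<unsigned char>(c))) {
                current += c;
            }
        }
    }
    if (!current.empty()) statements.push_back(current);
    return statements;
}
```

Because a newline counts as whitespace, a statement may span several lines in this sketch; only a semicolon ends it.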

Each set of rules consists of two groups, one forward, and one reverse. This is a convention that is not enforced; rules for one direction may be omitted, with the result that translations in that direction will not modify the source text.

Rule syntax

Rule statements take one of the following forms:

alefmadda=\u0622
Variable definition. The name on the left is assigned the character or expression on the right. Names may not contain any special characters (see list below). Duplicate names (including duplicates of simple variables or category names) cause an exception to be thrown. If the right hand side consists of one character, then the variable stands for that character. In this example, after this statement, instances of the left hand name surrounded by braces, "{alefmadda}", will be replaced by the Unicode character U+0622. If the right hand side is longer than one character, then it is interpreted as a character category expression; see below for details.
softvowel=[eiyEIY]
Category definition. The name on the left is assigned to stand for a set of characters. The same rules for names of simple variables apply. After this statement, the left hand variable will be interpreted as indicating a set of characters in appropriate contexts. The pattern syntax defining sets of characters is defined by UnicodeSet. Examples of valid patterns are:
[abc] The set containing the characters 'a', 'b', and 'c'.
[^abc] The set of all characters except 'a', 'b', and 'c'.
[A-Z] The set of all characters from 'A' to 'Z' in Unicode order.
[:Lu:] The set of Unicode uppercase letters. See www.unicode.org for a complete list of categories and their two-letter codes.
[^a-z[:Lu:][:Ll:]] The set of all characters except 'a' through 'z' and uppercase or lowercase letters.

See UnicodeSet for more documentation and examples.

ai>{alefmadda}
Forward translation rule. This rule states that the string on the left will be changed to the string on the right when performing forward transliteration.
ai<{alefmadda}
Reverse translation rule. This rule states that the string on the right will be changed to the string on the left when performing reverse transliteration.

ai<>{alefmadda}
Bidirectional translation rule. This rule states that the string on the left will be changed to the string on the right when performing forward transliteration, and vice versa when performing reverse transliteration.
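The four statement forms can be told apart by their operator. The following sketch shows one way to do that; StatementKind, Statement and classifyStatement are hypothetical names introduced for illustration, std::string replaces UnicodeString, and quoting with single quotes is ignored.

```cpp
#include <cstddef>
#include <string>

// Hypothetical types for illustration; not part of the ICU API.
enum class StatementKind { Definition, Forward, Reverse, Bidirectional, Invalid };

struct Statement {
    StatementKind kind;
    std::string left;   // text before the operator
    std::string right;  // text after the operator
};

// Splits a statement at its first operator character ('=', '<' or '>').
// "<>" is recognized as a single two-character operator.
Statement classifyStatement(const std::string& text) {
    const std::size_t pos = text.find_first_of("=<>");
    if (pos == std::string::npos) {
        return {StatementKind::Invalid, text, ""};
    }
    StatementKind kind = StatementKind::Reverse;
    std::size_t opLength = 1;
    if (text[pos] == '=') {
        kind = StatementKind::Definition;
    } else if (text[pos] == '>') {
        kind = StatementKind::Forward;
    } else if (pos + 1 < text.size() && text[pos + 1] == '>') {
        kind = StatementKind::Bidirectional;
        opLength = 2;
    }
    return {kind, text.substr(0, pos), text.substr(pos + opLength)};
}
```

For example, classifyStatement("ai<>{alefmadda}") reports a bidirectional rule with left side "ai" and right side "{alefmadda}".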

Forward and reverse translation rules consist of a match pattern and an output string. The match pattern consists of literal characters, optionally preceded by context, and optionally followed by context. Context characters, like literal pattern characters, must be matched in the text being transliterated. However, unlike literal pattern characters, they are not replaced by the output text. For example, the pattern "(abc)def" indicates the characters "def" must be preceded by "abc" for a successful match. If there is a successful match, "def" will be replaced, but not "abc". The initial '(' is optional, so "abc)def" is equivalent to "(abc)def". Another example is "123(456)" (or "123(456") in which the literal pattern "123" must be followed by "456".
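The context notation described above can be illustrated by splitting a match pattern into its three parts. This is a sketch only: MatchPattern and splitContext are hypothetical names, std::string replaces UnicodeString, and quoted parentheses are not handled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical structure for illustration; not part of the ICU API.
struct MatchPattern {
    std::string before;  // preceding context: must match, is not replaced
    std::string key;     // literal pattern: replaced by the output string
    std::string after;   // following context: must match, is not replaced
};

MatchPattern splitContext(const std::string& pattern) {
    MatchPattern result;
    std::string rest = pattern;

    // A ')' with no '(' before it (other than an optional one at index 0)
    // closes the preceding context.
    const std::size_t close = pattern.find(')');
    if (close != std::string::npos) {
        const std::size_t innerOpen = pattern.find('(', 1);
        if (innerOpen == std::string::npos || innerOpen > close) {
            const std::size_t start = (pattern[0] == '(') ? 1 : 0;
            result.before = pattern.substr(start, close - start);
            rest = pattern.substr(close + 1);
        }
    }

    // In what remains, a '(' opens the following context; its ')' is optional.
    const std::size_t open = rest.find('(');
    if (open == std::string::npos) {
        result.key = rest;
    } else {
        result.key = rest.substr(0, open);
        result.after = rest.substr(open + 1);
        if (!result.after.empty() && result.after.back() == ')') {
            result.after.pop_back();
        }
    }
    return result;
}
```

With this sketch, "(abc)def" and "abc)def" both give the preceding context "abc" and the key "def", while "123(456)" and "123(456" both give the key "123" and the following context "456".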

The output string of a forward or reverse rule consists of characters to replace the literal pattern characters. If the output string contains the character '|', this is taken to indicate the location of the cursor after replacement. The cursor is the point in the text at which the next replacement, if any, will be applied.

In addition to being defined in variables, UnicodeSet patterns may be embedded directly into rule strings. Thus, the following two rules are equivalent:

vowel=[aeiou]; {vowel}>*; # One way to do this
[aeiou]>*; # Another way
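The equivalence of the two rules above follows from textual substitution of the variable reference. A minimal sketch of that substitution is shown below; expandVariables is a hypothetical helper, not an ICU function, and it works on std::string rather than UnicodeString. Unknown names and unmatched braces are left untouched here, which is a choice made for this sketch rather than documented behavior.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper (not ICU API): replaces each {name} in a rule with
// the definition stored for that name.
std::string expandVariables(const std::string& rule,
                            const std::map<std::string, std::string>& variables) {
    std::string out;
    std::size_t i = 0;
    while (i < rule.size()) {
        if (rule[i] == '{') {
            const std::size_t close = rule.find('}', i + 1);
            if (close != std::string::npos) {
                const std::string name = rule.substr(i + 1, close - i - 1);
                const auto it = variables.find(name);
                if (it != variables.end()) {
                    out += it->second;  // substitute the definition
                    i = close + 1;
                    continue;
                }
            }
        }
        out += rule[i];  // ordinary character, or an unresolved reference
        ++i;
    }
    return out;
}
```

After the definition vowel=[aeiou] is recorded in the map, expanding "{vowel}>*" produces "[aeiou]>*", which is the second form of the rule.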

Example

The following example rules illustrate many of the features of the rule language.

Rule 1. (abc)def>x|y
Rule 2. xyz>r
Rule 3. yz>q

Applying these rules to the string "adefabcdefz" yields the following results:

|adefabcdefz   Initial state, no rules match. Advance cursor.
a|defabcdefz   Still no match. Rule 1 does not match because the preceding context is not present.
ad|efabcdefz   Still no match. Keep advancing until there is a match...
ade|fabcdefz   ...
adef|abcdefz   ...
adefa|bcdefz   ...
adefab|cdefz   ...
adefabc|defz   Rule 1 matches; replace "def" with "xy" and back up the cursor to before the 'y'.
adefabcx|yz    Although "xyz" is present, rule 2 does not match because the cursor is before the 'y', not before the 'x'. Rule 3 does match. Replace "yz" with "q".
adefabcxq|     The cursor is at the end; transliteration is complete.

The order of rules is significant. If multiple rules may match at some point, the first matching rule is applied.
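The cursor walk in the example can be reproduced with a small simulation. This is a sketch under stated assumptions, not the ICU implementation: Rule, makeRule, matchesAt and transliterate are hypothetical names, the text is a plain std::string rather than a Replaceable, only preceding context is modeled, and rule keys are assumed to be non-empty.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical rule representation for illustration; not the ICU data model.
struct Rule {
    std::string before;  // preceding context: must match, is not replaced
    std::string key;     // literal pattern: replaced by output
    std::string output;  // replacement text with the '|' marker removed
    std::size_t cursor;  // cursor offset within output after replacement
};

// Builds a rule. A '|' in rawOutput marks the cursor position; without
// one, the cursor is placed at the end of the output.
Rule makeRule(const std::string& before, const std::string& key,
              const std::string& rawOutput) {
    Rule r{before, key, rawOutput, rawOutput.size()};
    const std::size_t bar = rawOutput.find('|');
    if (bar != std::string::npos) {
        r.output = rawOutput.substr(0, bar) + rawOutput.substr(bar + 1);
        r.cursor = bar;
    }
    return r;
}

// True if the key starts at pos and the context ends immediately before pos.
bool matchesAt(const Rule& r, const std::string& text, std::size_t pos) {
    if (r.key.empty() || pos < r.before.size()) return false;
    if (text.compare(pos - r.before.size(), r.before.size(), r.before) != 0) {
        return false;
    }
    return text.compare(pos, r.key.size(), r.key) == 0;
}

// Applies the first matching rule at each cursor position, otherwise
// advances the cursor by one character.
std::string transliterate(std::string text, const std::vector<Rule>& rules) {
    std::size_t pos = 0;
    while (pos < text.size()) {
        bool applied = false;
        for (const Rule& r : rules) {
            if (matchesAt(r, text, pos)) {
                text.replace(pos, r.key.size(), r.output);
                pos += r.cursor;  // may leave the cursor inside the output
                applied = true;
                break;            // rule order decides: first match wins
            }
        }
        if (!applied) ++pos;
    }
    return text;
}
```

Loading rules 1 to 3 as makeRule("abc", "def", "x|y"), makeRule("", "xyz", "r") and makeRule("", "yz", "q") and applying them to "adefabcdefz" gives "adefabcxq", the final state of the trace above.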

Forward and reverse rules may have an empty output string. Otherwise, an empty left or right hand side of any statement is a syntax error.

Single quotes are used to quote the special characters =><{}[]()|. To specify a single quote itself, inside or outside of quotes, use two single quotes in a row. For example, the rule "'>'>o''clock" changes the string ">" to the string "o'clock".
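The quoting convention can be sketched as a single pass that tracks whether the scanner is inside quotes. QuotedRule and parseQuotedRule are hypothetical names used only for this illustration, and std::string replaces UnicodeString. The sketch resolves quotes and finds the operator; it does not interpret other special characters such as '|' or braces.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for illustration; not part of the ICU API.
struct QuotedRule {
    std::string left;   // unquoted text before the operator
    std::string op;     // "=", ">", "<", "<>", or empty if none was found
    std::string right;  // unquoted text after the operator
};

QuotedRule parseQuotedRule(const std::string& text) {
    QuotedRule rule;
    bool inQuote = false;
    bool seenOperator = false;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        std::string& side = seenOperator ? rule.right : rule.left;
        if (c == '\'') {
            if (i + 1 < text.size() && text[i + 1] == '\'') {
                side += '\'';  // two quotes in a row: one literal quote
                ++i;
            } else {
                inQuote = !inQuote;  // a lone quote opens or closes quoting
            }
        } else if (!inQuote && !seenOperator &&
                   (c == '=' || c == '>' || c == '<')) {
            rule.op = std::string(1, c);
            if (c == '<' && i + 1 < text.size() && text[i + 1] == '>') {
                rule.op = "<>";
                ++i;
            }
            seenOperator = true;
        } else {
            side += c;  // quoted specials and ordinary characters are literal
        }
    }
    return rule;
}
```

Applied to the rule '>'>o''clock, the sketch yields the left side ">", the operator ">", and the right side "o'clock".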

Notes

While a RuleBasedTransliterator is being built, it checks that the rules are added in proper order. For example, if the rule "a>x" is followed by the rule "ab>y", then the second rule will throw an exception. The reason is that the second rule can never be triggered, since the first rule always matches anything it matches. In other words, the first rule masks the second rule.
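The masking condition can be approximated by a prefix test on the match patterns. The sketch below assumes rules without context, so that an earlier pattern masks a later one exactly when it is a prefix of it; findMaskedRule is a hypothetical helper and says nothing about how the class performs its own check.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not ICU API). Given the match patterns of rules in
// the order they were added, returns the index of the first pattern that
// is masked by an earlier one, or -1 if no pattern is masked.
// Assumes context-free rules: an earlier pattern that is a prefix of a
// later pattern matches wherever the later pattern would.
int findMaskedRule(const std::vector<std::string>& keys) {
    for (std::size_t later = 1; later < keys.size(); ++later) {
        for (std::size_t earlier = 0; earlier < later; ++earlier) {
            const std::string& prefix = keys[earlier];
            if (keys[later].compare(0, prefix.size(), prefix) == 0) {
                return static_cast<int>(later);
            }
        }
    }
    return -1;
}
```

For the patterns "a" followed by "ab" the function reports index 1, while the reverse order "ab" followed by "a" is accepted.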

Author(s):
Alan Liu
Draft:

Member Function Documentation

RuleBasedTransliterator::RuleBasedTransliterator (const UnicodeString & ID, const UnicodeString & rules, Direction direction, UnicodeFilter * adoptedFilter, UErrorCode & status) [inline]

Constructs a new transliterator from the given rules.

Parameters:
rules   rules, separated by ';'
direction   either FORWARD or REVERSE.
Exceptions:
IllegalArgumentException   if rules are malformed or direction is invalid.
Draft:

RuleBasedTransliterator::RuleBasedTransliterator (const UnicodeString & ID, const UnicodeString & rules, Direction direction, UErrorCode & status) [inline]

Convenience constructor with no filter.

Draft:

RuleBasedTransliterator::RuleBasedTransliterator (const UnicodeString & ID, const UnicodeString & rules, UErrorCode & status) [inline]

Convenience constructor with no filter and FORWARD direction.

Draft:

RuleBasedTransliterator::RuleBasedTransliterator (const UnicodeString & ID, const UnicodeString & rules, UnicodeFilter * adoptedFilter, UErrorCode & status) [inline]

Convenience constructor with FORWARD direction.

Draft:

RuleBasedTransliterator::RuleBasedTransliterator (const UnicodeString & ID, const TransliterationRuleData * theData, UnicodeFilter * adoptedFilter = 0)

Convenience constructor.

Draft:

RuleBasedTransliterator::RuleBasedTransliterator (const RuleBasedTransliterator &)

Copy constructor.

Draft:

virtual RuleBasedTransliterator::~RuleBasedTransliterator () [virtual]

Transliterator * RuleBasedTransliterator::clone (void) const [virtual]

Implements the Transliterator API.

Draft:

Reimplemented from Transliterator.

virtual void RuleBasedTransliterator::handleTransliterate (Replaceable & text, Position & offsets, bool_t isIncremental) const [virtual]

Implements Transliterator::handleTransliterate.

Draft:

Reimplemented from Transliterator.


The documentation for this class was generated from the following file:
Generated at Thu Feb 10 15:30:56 2000 for icu by doxygen 1.0.0 written by Dimitri van Heesch, © 1997-1999