#include <rbbi_bld.h>
Public Methods | |
RuleBasedBreakIteratorBuilder (RuleBasedBreakIterator& iteratorToBuild) | |
The Builder class contains a reference to the iterator it's supposed to build. | |
~RuleBasedBreakIteratorBuilder () | |
Destructor. | |
virtual void | buildBreakIterator (const UnicodeString& description, UErrorCode& err) |
This is the main function for setting up the BreakIterator's tables. More... | |
Protected Methods | |
virtual void | processSubstitution (UnicodeString& description, UTextOffset ruleStart, UTextOffset ruleEnd, UTextOffset startPos, UErrorCode& err) |
This function performs variable-name substitutions. More... | |
virtual void | handleSpecialSubstitution (const UnicodeString& replace, const UnicodeString& replaceWith, int32_t startPos, const UnicodeString& description, UErrorCode& err) |
This function defines a protocol for handling substitution names that are "special," i.e., that have some property beyond just being substitutions. More... | |
virtual void | mungeExpressionList () |
This function provides a hook for subclasses to mess with the character category table. | |
virtual void | buildCharCategories (UErrorCode& err) |
This function builds the character category table. More... | |
virtual void | setUpErrorMessage (const UnicodeString& message, int32_t position, const UnicodeString& context) |
Throws an IllegalArgumentException representing a syntax error in the rule description. More... | |
Protected Attributes | |
RuleBasedBreakIterator& | iterator |
The iterator we're constructing. More... | |
RuleBasedBreakIteratorTables* | tables |
The tables object for the iterator we're constructing. More... | |
UVector | tempRuleList |
A temporary place to hold the rules as they're being processed. More... | |
UVector | categories |
A temporary holding place used for calculating the character categories. More... | |
int32_t | numCategories |
The number of categories (and thus the number of columns in the finished state tables). More... | |
ExpressionList* | expressions |
A table used to map parts of regexp text to lists of character categories, rather than having to figure them out from scratch each time. More... | |
UnicodeSet | ignoreChars |
A temporary holding place for the list of ignore characters. More... | |
UVector | tempStateTable |
A temporary holding place where the forward state table is built. More... | |
UVector | decisionPointList |
A list of all the states that have to be filled in with transitions to the next state that is created. More... | |
UStack | decisionPointStack |
A UStack for holding decision point lists. More... | |
UVector | loopingStates |
A list of states that loop back on themselves. More... | |
UVector | statesToBackfill |
Looping states actually have to be backfilled later in the process than everything else. More... | |
UVector | mergeList |
A list mapping pairs of state numbers for states that are to be combined to the state number of the state representing their combination. More... | |
UBool | clearLoopingStates |
A flag that is used to indicate when the list of looping states can be reset. More... | |
UnicodeString | errorMessage |
A place where an error message can be stored if we get a parse error. More... | |
Static Protected Attributes | |
const int32_t | END_STATE_FLAG |
A bit mask used to indicate a bit in the table's flags column that marks a state as an accepting state. More... | |
const int32_t | DONT_LOOP_FLAG |
A bit mask used to indicate a bit in the table's flags column that marks a state as one the builder shouldn't loop to any looping states. More... | |
const int32_t | LOOKAHEAD_STATE_FLAG |
A bit mask used to indicate a bit in the table's flags column that marks a state as a lookahead state. More... | |
const int32_t | ALL_FLAGS |
A bit mask representing the union of the mask values listed above. More... |
A Builder is constructed by RuleBasedBreakIterator's constructor, which uses it to construct the iterator itself and then throws it away.
The construction logic is separated out into its own class for two primary reasons:
It'd be really nice if this could be an independent class rather than an inner class, because that would shorten the source file considerably, but making Builder an inner class of RuleBasedBreakIterator allows it direct access to RuleBasedBreakIterator's private members, which saves us from having to provide some kind of "back door" to the Builder class that could then also be used by other classes.
Definition at line 44 of file rbbi_bld.h.
RuleBasedBreakIteratorBuilder (RuleBasedBreakIterator& iteratorToBuild)
The Builder class contains a reference to the iterator it's supposed to build.
|
~RuleBasedBreakIteratorBuilder ()
Destructor.
|
virtual void buildBreakIterator (const UnicodeString& description, UErrorCode& err)
This is the main function for setting up the BreakIterator's tables. It just vectors different parts of the job off to other functions. |
virtual void buildCharCategories (UErrorCode& err)
This function builds the character category table. On entry, tempRuleList is a UVector of break rules that has had variable names substituted. On exit, the charCategoryTable data member has been initialized to hold the character category table, and tempRuleList's rules have been munged to contain character category numbers everywhere a literal character or a [] expression originally occurred. |
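virtual void handleSpecialSubstitution (const UnicodeString& replace, const UnicodeString& replaceWith, int32_t startPos, const UnicodeString& description, UErrorCode& err)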
This function defines a protocol for handling substitution names that are "special," i.e., that have some property beyond just being substitutions. At the RuleBasedBreakIterator level, we have one special substitution name, "<ignore>". Subclasses can override this function to add more. Any special processing that has to go on beyond that which is done by the normal substitution-processing code is done here. |
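The override protocol described above can be sketched in isolation. This is a simplified model, not the ICU implementation: it uses `std::string` instead of `UnicodeString`, drops the position and error-code parameters, and the class names `BuilderSketch` and `DictionaryBuilderSketch` as well as the `<dictionary>` name are assumptions made for illustration.

```cpp
#include <string>

// Hypothetical, simplified model of the hook; not the class from rbbi_bld.h.
class BuilderSketch {
public:
    virtual ~BuilderSketch() = default;

    // Called once per substitution. The base level reacts only to the one
    // special name it knows about and ignores everything else.
    virtual void handleSpecialSubstitution(const std::string& name,
                                           const std::string& text) {
        if (name == "<ignore>") {
            ignoreChars = text;
        }
    }

    std::string ignoreChars;  // stand-in for the UnicodeSet member
};

// A subclass adds one more special name (assumed here) and still defers to
// the base class so that "<ignore>" keeps working.
class DictionaryBuilderSketch : public BuilderSketch {
public:
    void handleSpecialSubstitution(const std::string& name,
                                   const std::string& text) override {
        BuilderSketch::handleSpecialSubstitution(name, text);
        if (name == "<dictionary>") {
            dictionaryChars = text;
        }
    }

    std::string dictionaryChars;
};
```

Calling the base implementation first keeps the base-level handling of `<ignore>` intact while the subclass layers its own names on top.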
virtual void mungeExpressionList ()
This function provides a hook for subclasses to mess with the character category table.
|
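virtual void processSubstitution (UnicodeString& description, UTextOffset ruleStart, UTextOffset ruleEnd, UTextOffset startPos, UErrorCode& err)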
This function performs variable-name substitutions. First it does syntax checking on the variable-name definition. If it's syntactically valid, it then goes through the remainder of the description and does a simple find-and-replace of the variable name with its text. (The variable text must be enclosed in either [] or () for this to work.) |
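The find-and-replace step can be illustrated with a small standalone sketch. It assumes `std::string` in place of `UnicodeString`, and `substituteVariable` is a hypothetical helper, not a function from rbbi_bld.h. The only syntax check it models is the bracketing rule stated above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: replaces every occurrence of `name` in `description`,
// searching from `startPos` onward, with `text`. Mirrors the documented
// restriction that the variable text must be enclosed in [] or ().
std::string substituteVariable(std::string description,
                               const std::string& name,
                               const std::string& text,
                               std::size_t startPos) {
    if (name.empty()) {
        throw std::invalid_argument("empty variable name");
    }
    const bool bracketed =
        text.size() >= 2 &&
        ((text.front() == '[' && text.back() == ']') ||
         (text.front() == '(' && text.back() == ')'));
    if (!bracketed) {
        throw std::invalid_argument(
            "variable text must be enclosed in [] or ()");
    }

    std::size_t pos = description.find(name, startPos);
    while (pos != std::string::npos) {
        description.replace(pos, name.size(), text);
        // Resume after the inserted text so it is never rescanned.
        pos = description.find(name, pos + text.size());
    }
    return description;
}
```

For example, substituting `[0-9]` for `<digit>` in `<digit>+;.<digit>;` from position 0 yields `[0-9]+;.[0-9];`. Requiring brackets keeps the substituted text a single unit, so a following operator such as `+` applies to the whole replacement rather than to its last character.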
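virtual void setUpErrorMessage (const UnicodeString& message, int32_t position, const UnicodeString& context)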
Throws an IllegalArgumentException representing a syntax error in the rule description. The exception's message contains some debugging information.
|
const int32_t ALL_FLAGS
A bit mask representing the union of the mask values listed above. Used for clearing or masking off the flag bits.
Definition at line 158 of file rbbi_bld.h. |
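How the flag masks are typically tested and cleared can be shown with a short sketch. The bit values below are assumptions chosen for illustration; the real values are defined in rbbi_bld.h and are not reproduced on this page. The names `kEndStateFlag`, `hasFlag`, and `clearFlags` are likewise hypothetical.

```cpp
#include <cstdint>

// Assumed bit positions, for illustration only.
constexpr std::int32_t kEndStateFlag       = 0x8000;
constexpr std::int32_t kDontLoopFlag       = 0x4000;
constexpr std::int32_t kLookaheadStateFlag = 0x2000;

// Union of the individual masks, analogous to ALL_FLAGS.
constexpr std::int32_t kAllFlags =
    kEndStateFlag | kDontLoopFlag | kLookaheadStateFlag;

// Returns true if `flag` is set in a flags-column entry.
constexpr bool hasFlag(std::int32_t entry, std::int32_t flag) {
    return (entry & flag) != 0;
}

// Clears every flag bit and leaves all other bits untouched.
constexpr std::int32_t clearFlags(std::int32_t entry) {
    return entry & ~kAllFlags;
}
```

With these assumed values, `clearFlags(0xA005)` strips the accepting-state and lookahead bits and leaves `0x0005`.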
const int32_t DONT_LOOP_FLAG
A bit mask used to indicate a bit in the table's flags column that marks a state as one the builder shouldn't loop to any looping states.
Definition at line 145 of file rbbi_bld.h. |
const int32_t END_STATE_FLAG
A bit mask used to indicate a bit in the table's flags column that marks a state as an accepting state.
Definition at line 139 of file rbbi_bld.h. |
const int32_t LOOKAHEAD_STATE_FLAG
A bit mask used to indicate a bit in the table's flags column that marks a state as a lookahead state.
Definition at line 151 of file rbbi_bld.h. |
UVector categories
A temporary holding place used for calculating the character categories. This object contains UnicodeSet objects.
Definition at line 66 of file rbbi_bld.h. |
UBool clearLoopingStates
A flag that is used to indicate when the list of looping states can be reset.
Definition at line 126 of file rbbi_bld.h. |
UVector decisionPointList
A list of all the states that have to be filled in with transitions to the next state that is created. Used when building the state table from the regular expressions.
Definition at line 94 of file rbbi_bld.h. |
UStack decisionPointStack
A UStack for holding decision point lists. This is used to handle nested parentheses and braces in regexps.
Definition at line 100 of file rbbi_bld.h. |
UnicodeString errorMessage
A place where an error message can be stored if we get a parse error. The error message is never displayed anywhere, so this is useful pretty much only in conjunction with a debugger.
Definition at line 133 of file rbbi_bld.h. |
ExpressionList* expressions
A table used to map parts of regexp text to lists of character categories, rather than having to figure them out from scratch each time.
Definition at line 77 of file rbbi_bld.h. |
UnicodeSet ignoreChars
A temporary holding place for the list of ignore characters.
Definition at line 82 of file rbbi_bld.h. |
RuleBasedBreakIterator& iterator
The iterator we're constructing.
Definition at line 50 of file rbbi_bld.h. |
UVector loopingStates
A list of states that loop back on themselves. Used to handle .*?
Definition at line 105 of file rbbi_bld.h. |
UVector mergeList
A list mapping pairs of state numbers for states that are to be combined to the state number of the state representing their combination. Used in the process of making the state table deterministic to prevent infinite recursion.
Definition at line 120 of file rbbi_bld.h. |
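The idea of mapping a pair of states to their combined state can be sketched as a small lookup table. This is an illustration under assumptions, not the ICU data structure: it uses `std::map` instead of `UVector`, treats the pair as unordered, and the names `MergeList` and `combinedState` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>

// Hypothetical stand-in for the builder's mergeList.
class MergeList {
public:
    // Returns the state number that represents the combination of states
    // a and b. A new number is taken from nextFreeState only the first
    // time a given pair is seen; later requests reuse the recorded one.
    int combinedState(int a, int b, int& nextFreeState) {
        // Assumption: (a, b) and (b, a) denote the same combination.
        const std::pair<int, int> key(std::min(a, b), std::max(a, b));
        const auto it = merged_.find(key);
        if (it != merged_.end()) {
            return it->second;
        }
        const int combined = nextFreeState++;
        merged_.emplace(key, combined);
        return combined;
    }

    std::size_t size() const { return merged_.size(); }

private:
    std::map<std::pair<int, int>, int> merged_;
};
```

Recording the pair before the combined state's own transitions are processed is what would stop the recursion: if merging those transitions leads back to the same pair, the lookup returns the existing state number instead of starting another merge.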
int32_t numCategories
The number of categories (and thus the number of columns in the finished state tables).
Definition at line 71 of file rbbi_bld.h. |
UVector statesToBackfill
Looping states actually have to be backfilled later in the process than everything else. This is where the list of states to backfill is accumulated. This is also used to handle .*?
Definition at line 112 of file rbbi_bld.h. |
RuleBasedBreakIteratorTables* tables
The tables object for the iterator we're constructing.
Definition at line 55 of file rbbi_bld.h. |
UVector tempRuleList
A temporary place to hold the rules as they're being processed.
Definition at line 60 of file rbbi_bld.h. |
UVector tempStateTable
A temporary holding place where the forward state table is built.
Definition at line 87 of file rbbi_bld.h. |