RuleBasedBreakIteratorBuilder Class Reference

The Builder class has the job of constructing a RuleBasedBreakIterator from a textual description.

#include <rbbi_bld.h>

Public Methods

 RuleBasedBreakIteratorBuilder (RuleBasedBreakIterator& iteratorToBuild)
 The Builder class contains a reference to the iterator it's supposed to build.

 ~RuleBasedBreakIteratorBuilder ()
 Destructor.

virtual void buildBreakIterator (const UnicodeString& description, UErrorCode& err)
 This is the main function for setting up the BreakIterator's tables.


Protected Methods

virtual void processSubstitution (UnicodeString& description, UTextOffset ruleStart, UTextOffset ruleEnd, UTextOffset startPos, UErrorCode& err)
 This function performs variable-name substitutions.

virtual void handleSpecialSubstitution (const UnicodeString& replace, const UnicodeString& replaceWith, int32_t startPos, const UnicodeString& description, UErrorCode& err)
 This function defines a protocol for handling substitution names that are "special," i.e., that have some property beyond just being substitutions.

virtual void mungeExpressionList ()
 This function provides a hook for subclasses to mess with the character category table.

virtual void buildCharCategories (UErrorCode& err)
 This function builds the character category table.

virtual void setUpErrorMessage (const UnicodeString& message, int32_t position, const UnicodeString& context)
 Throws an IllegalArgumentException representing a syntax error in the rule description.


Protected Attributes

RuleBasedBreakIterator& iterator
 The iterator we're constructing.

RuleBasedBreakIteratorTables* tables
 The tables object for the iterator we're constructing.

UVector tempRuleList
 A temporary place to hold the rules as they're being processed.

UVector categories
 A temporary holding place used for calculating the character categories.

int32_t numCategories
 The number of categories (and thus the number of columns in the finished state tables).

ExpressionList* expressions
 A table used to map parts of regexp text to lists of character categories, rather than having to figure them out from scratch each time.

UnicodeSet ignoreChars
 A temporary holding place for the list of ignore characters.

UVector tempStateTable
 A temporary holding place where the forward state table is built.

UVector decisionPointList
 A list of all the states that have to be filled in with transitions to the next state that is created.

UStack decisionPointStack
 A UStack for holding decision point lists.

UVector loopingStates
 A list of states that loop back on themselves.

UVector statesToBackfill
 Looping states actually have to be backfilled later in the process than everything else.

UVector mergeList
 A list mapping pairs of state numbers for states that are to be combined to the state number of the state representing their combination.

UBool clearLoopingStates
 A flag that is used to indicate when the list of looping states can be reset.

UnicodeString errorMessage
 A place where an error message can be stored if we get a parse error.


Static Protected Attributes

const int32_t END_STATE_FLAG
 A bit mask used to indicate a bit in the table's flags column that marks a state as an accepting state.

const int32_t DONT_LOOP_FLAG
 A bit mask used to indicate a bit in the table's flags column that marks a state as one from which the builder shouldn't loop to any looping states.

const int32_t LOOKAHEAD_STATE_FLAG
 A bit mask used to indicate a bit in the table's flags column that marks a state as a lookahead state.

const int32_t ALL_FLAGS
 A bit mask representing the union of the mask values listed above.


Detailed Description

The Builder class has the job of constructing a RuleBasedBreakIterator from a textual description.

A Builder is constructed by RuleBasedBreakIterator's constructor, which uses it to construct the iterator itself and then throws it away.

The construction logic is separated out into its own class for two primary reasons:

It'd be really nice if this could be an independent class rather than an inner class, because that would shorten the source file considerably, but making Builder an inner class of RuleBasedBreakIterator allows it direct access to RuleBasedBreakIterator's private members, which saves us from having to provide some kind of "back door" to the Builder class that could then also be used by other classes.

Definition at line 44 of file rbbi_bld.h.
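
The lifecycle described above (the iterator's constructor creates a Builder, lets it fill in the iterator, then discards it) can be sketched roughly as follows. This is a minimal illustration, not the ICU implementation: ToyIterator, ToyBuilder, build() and ruleCount() are hypothetical stand-ins, std::string replaces UnicodeString, and splitting the description on ';' is an assumption made only to give the builder something to do. In C++, a friend declaration plays the role that the inner-class access plays in the paragraph above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class ToyIterator;  // hypothetical stand-in for RuleBasedBreakIterator

// Hypothetical stand-in for the builder: holds construction-only state.
class ToyBuilder {
public:
    explicit ToyBuilder(ToyIterator& iteratorToBuild) : iterator(iteratorToBuild) {}
    void build(const std::string& description);

private:
    ToyIterator& iterator;                  // the iterator being constructed
    std::vector<std::string> tempRuleList;  // scratch space, discarded with the builder
};

class ToyIterator {
public:
    explicit ToyIterator(const std::string& description) {
        ToyBuilder builder(*this);   // the builder lives only inside this constructor
        builder.build(description);
    }                                // builder and its scratch state are destroyed here
    std::size_t ruleCount() const { return rules.size(); }

private:
    friend class ToyBuilder;         // lets the builder write private members directly
    std::vector<std::string> rules;
};

void ToyBuilder::build(const std::string& description) {
    assert(tempRuleList.empty());    // a builder is used for exactly one build
    std::string current;
    for (char c : description) {     // assumption: rules are separated by ';'
        if (c == ';') {
            if (!current.empty()) tempRuleList.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    if (!current.empty()) tempRuleList.push_back(current);
    iterator.rules = tempRuleList;   // hand the finished result to the iterator
}
```

Constructing ToyIterator("a;b;c") leaves the iterator holding three rules, while the builder's temporary list no longer exists once the constructor returns.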


Constructor & Destructor Documentation

RuleBasedBreakIteratorBuilder::RuleBasedBreakIteratorBuilder ( RuleBasedBreakIterator & iteratorToBuild )
 

The Builder class contains a reference to the iterator it's supposed to build.

RuleBasedBreakIteratorBuilder::~RuleBasedBreakIteratorBuilder ( )
 

Destructor.


Member Function Documentation

void RuleBasedBreakIteratorBuilder::buildBreakIterator ( const UnicodeString & description,
UErrorCode & err ) [virtual]
 

This is the main function for setting up the BreakIterator's tables.

It just vectors different parts of the job off to other functions.

void RuleBasedBreakIteratorBuilder::buildCharCategories ( UErrorCode & err ) [protected, virtual]
 

This function builds the character category table.

On entry, tempRuleList is a UVector of break rules that has had variable names substituted. On exit, the charCategoryTable data member has been initialized to hold the character category table, and tempRuleList's rules have been munged to contain character category numbers everywhere a literal character or a [] expression originally occurred.
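
To illustrate why a category table is useful, the sketch below splits overlapping character sets into disjoint categories, so that every character belongs to exactly one category (one column of the state table). It is only an assumption about the general technique, not the algorithm used by buildCharCategories: CharSet and buildCategories are hypothetical names, and std::set<char> stands in for UnicodeSet.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

using CharSet = std::set<char>;  // hypothetical stand-in for UnicodeSet

// Splits possibly overlapping sets into pairwise disjoint categories.
std::vector<CharSet> buildCategories(const std::vector<CharSet>& expressions) {
    std::vector<CharSet> categories;
    for (CharSet remaining : expressions) {            // work on a copy of each set
        const std::size_t existing = categories.size();
        for (std::size_t i = 0; i < existing && !remaining.empty(); ++i) {
            CharSet common;
            std::set_intersection(categories[i].begin(), categories[i].end(),
                                  remaining.begin(), remaining.end(),
                                  std::inserter(common, common.begin()));
            if (common.empty()) continue;
            for (char c : common) remaining.erase(c);  // these characters are now placed
            if (common.size() != categories[i].size()) {
                // Partial overlap: split the old category in two.
                assert(categories[i].size() > common.size());
                for (char c : common) categories[i].erase(c);
                categories.push_back(common);
            }
        }
        if (!remaining.empty()) categories.push_back(remaining);
    }
    return categories;
}
```

For the sets {a, b, c} and {b, c, d}, this yields three categories: {a}, {b, c} and {d}. A rule can then refer to category numbers instead of literal characters or [] expressions.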

void RuleBasedBreakIteratorBuilder::handleSpecialSubstitution ( const UnicodeString & replace,
const UnicodeString & replaceWith,
int32_t startPos,
const UnicodeString & description,
UErrorCode & err ) [protected, virtual]
 

This function defines a protocol for handling substitution names that are "special," i.e., that have some property beyond just being substitutions.

At the RuleBasedBreakIterator level, we have one special substitution name, "<ignore>". Subclasses can override this function to add more. Any special processing that has to go on beyond that which is done by the normal substitution-processing code is done here.
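
The override protocol described above might look like the following sketch. All names here are hypothetical (ToyBuilder, ToyDictionaryBuilder, define(), and the "<dictionary>" name are illustrative assumptions), the signature is simplified to two std::string parameters, and storing the replacement text in ignoreChars is a guess at what "special processing" could mean for "<ignore>".

```cpp
#include <cassert>
#include <string>

// Hypothetical base builder that knows one special name, "<ignore>".
class ToyBuilder {
public:
    virtual ~ToyBuilder() = default;

    // Stand-in for the normal substitution path, which then calls the hook.
    void define(const std::string& name, const std::string& text) {
        assert(!name.empty());
        handleSpecialSubstitution(name, text);
    }
    const std::string& ignored() const { return ignoreChars; }

protected:
    virtual void handleSpecialSubstitution(const std::string& replace,
                                           const std::string& replaceWith) {
        if (replace == "<ignore>") ignoreChars = replaceWith;
        // Any other name needs no special handling at this level.
    }

private:
    std::string ignoreChars;
};

// Hypothetical subclass that adds a second special name.
class ToyDictionaryBuilder : public ToyBuilder {
public:
    const std::string& dictionaryChars() const { return dictChars; }

protected:
    void handleSpecialSubstitution(const std::string& replace,
                                   const std::string& replaceWith) override {
        if (replace == "<dictionary>") {
            dictChars = replaceWith;
        } else {
            // Defer to the base class so "<ignore>" keeps working.
            ToyBuilder::handleSpecialSubstitution(replace, replaceWith);
        }
    }

private:
    std::string dictChars;
};
```

The subclass handles only the names it adds and forwards everything else, so the base-class behavior is preserved.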

void RuleBasedBreakIteratorBuilder::mungeExpressionList ( ) [protected, virtual]
 

This function provides a hook for subclasses to mess with the character category table.

void RuleBasedBreakIteratorBuilder::processSubstitution ( UnicodeString & description,
UTextOffset ruleStart,
UTextOffset ruleEnd,
UTextOffset startPos,
UErrorCode & err ) [protected, virtual]
 

This function performs variable-name substitutions.

First it does syntax checking on the variable-name definition. If it's syntactically valid, it then goes through the remainder of the description and does a simple find-and-replace of the variable name with its text. (The variable text must be enclosed in either [] or () for this to work.)
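
The two steps above (validate, then find-and-replace through the rest of the description) can be sketched as below. This is a simplified illustration under stated assumptions: isBracketed and substituteVariable are hypothetical helpers, std::string replaces UnicodeString, the only syntax check performed is the [] or () enclosure, and errors are reported with a C++ exception rather than a UErrorCode.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// True if the text is enclosed in [] or ().
bool isBracketed(const std::string& text) {
    if (text.size() < 2) return false;
    return (text.front() == '[' && text.back() == ']') ||
           (text.front() == '(' && text.back() == ')');
}

// Replaces every occurrence of `name` at or after `startPos` with `text`.
void substituteVariable(std::string& description, const std::string& name,
                        const std::string& text, std::size_t startPos) {
    assert(startPos <= description.size());
    if (name.empty() || !isBracketed(text)) {
        throw std::invalid_argument("variable text must be enclosed in [] or ()");
    }
    std::size_t pos = description.find(name, startPos);
    while (pos != std::string::npos) {
        description.replace(pos, name.size(), text);
        // Resume after the inserted text so it is never rescanned.
        pos = description.find(name, pos + text.size());
    }
}
```

With the description "x<v>;<v>y;", the name "<v>", the text "[aeiou]" and a start position of 0, the description becomes "x[aeiou];[aeiou]y;". Because the replacement text is always bracketed, it stays a single unit wherever it is pasted in.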

void RuleBasedBreakIteratorBuilder::setUpErrorMessage ( const UnicodeString & message,
int32_t position,
const UnicodeString & context ) [protected, virtual]
 

Throws an IllegalArgumentException representing a syntax error in the rule description.

The exception's message contains some debugging information.

Parameters:
message   A message describing the problem
position   The position in the description where the problem was discovered
context   The string containing the error


Member Data Documentation

const int32_t RuleBasedBreakIteratorBuilder::ALL_FLAGS [static, protected]
 

A bit mask representing the union of the mask values listed above.

Used for clearing or masking off the flag bits.

Definition at line 158 of file rbbi_bld.h.
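
As a sketch of how such masks are typically used for "clearing or masking off the flag bits", consider the following. The numeric values are illustrative assumptions, not taken from this page (the real constants are defined in rbbi_bld.h), and the toy namespace and helper functions are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

namespace toy {

// Illustrative values only; see rbbi_bld.h for the real definitions.
constexpr std::int32_t END_STATE_FLAG       = 0x8000;
constexpr std::int32_t DONT_LOOP_FLAG       = 0x4000;
constexpr std::int32_t LOOKAHEAD_STATE_FLAG = 0x2000;
constexpr std::int32_t ALL_FLAGS =
    END_STATE_FLAG | DONT_LOOP_FLAG | LOOKAHEAD_STATE_FLAG;

// True if the flags cell marks an accepting state.
constexpr bool isEndState(std::int32_t cell) {
    return (cell & END_STATE_FLAG) != 0;
}

// Clears every flag bit, leaving only the non-flag part of the cell.
constexpr std::int32_t clearFlags(std::int32_t cell) {
    return cell & ~ALL_FLAGS;
}

// Masks off everything except the flag bits.
constexpr std::int32_t flagsOnly(std::int32_t cell) {
    return cell & ALL_FLAGS;
}

// Sets one flag; the argument must be made up of known flag bits only.
inline std::int32_t withFlag(std::int32_t cell, std::int32_t flag) {
    assert((flag & ~ALL_FLAGS) == 0);
    return cell | flag;
}

}  // namespace toy
```

Defining ALL_FLAGS as the union of the individual masks means that adding a new flag requires updating only one expression for the clearing and masking helpers to stay correct.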

const int32_t RuleBasedBreakIteratorBuilder::DONT_LOOP_FLAG [static, protected]
 

A bit mask used to indicate a bit in the table's flags column that marks a state as one from which the builder shouldn't loop to any looping states.

Definition at line 145 of file rbbi_bld.h.

const int32_t RuleBasedBreakIteratorBuilder::END_STATE_FLAG [static, protected]
 

A bit mask used to indicate a bit in the table's flags column that marks a state as an accepting state.

Definition at line 139 of file rbbi_bld.h.

const int32_t RuleBasedBreakIteratorBuilder::LOOKAHEAD_STATE_FLAG [static, protected]
 

A bit mask used to indicate a bit in the table's flags column that marks a state as a lookahead state.

Definition at line 151 of file rbbi_bld.h.

UVector RuleBasedBreakIteratorBuilder::categories [protected]
 

A temporary holding place used for calculating the character categories.

This object contains UnicodeSet objects.

Definition at line 66 of file rbbi_bld.h.

UBool RuleBasedBreakIteratorBuilder::clearLoopingStates [protected]
 

A flag that is used to indicate when the list of looping states can be reset.

Definition at line 126 of file rbbi_bld.h.

UVector RuleBasedBreakIteratorBuilder::decisionPointList [protected]
 

A list of all the states that have to be filled in with transitions to the next state that is created.

Used when building the state table from the regular expressions.

Definition at line 94 of file rbbi_bld.h.

UStack RuleBasedBreakIteratorBuilder::decisionPointStack [protected]
 

A UStack for holding decision point lists.

This is used to handle nested parentheses and braces in regexps.

Definition at line 100 of file rbbi_bld.h.

UnicodeString RuleBasedBreakIteratorBuilder::errorMessage [protected]
 

A place where an error message can be stored if we get a parse error.

The error message is never displayed anywhere, so this is useful pretty much only in conjunction with a debugger.

Definition at line 133 of file rbbi_bld.h.

ExpressionList * RuleBasedBreakIteratorBuilder::expressions [protected]
 

A table used to map parts of regexp text to lists of character categories, rather than having to figure them out from scratch each time.

Definition at line 77 of file rbbi_bld.h.

UnicodeSet RuleBasedBreakIteratorBuilder::ignoreChars [protected]
 

A temporary holding place for the list of ignore characters.

Definition at line 82 of file rbbi_bld.h.

RuleBasedBreakIterator & RuleBasedBreakIteratorBuilder::iterator [protected]
 

The iterator we're constructing.

Definition at line 50 of file rbbi_bld.h.

UVector RuleBasedBreakIteratorBuilder::loopingStates [protected]
 

A list of states that loop back on themselves.

Used to handle .*?

Definition at line 105 of file rbbi_bld.h.

UVector RuleBasedBreakIteratorBuilder::mergeList [protected]
 

A list mapping pairs of state numbers for states that are to be combined to the state number of the state representing their combination.

Used in the process of making the state table deterministic to prevent infinite recursion.

Definition at line 120 of file rbbi_bld.h.
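
One way to picture such a list is as a lookup table keyed by a pair of state numbers. The sketch below is an assumption about the general technique, not the ICU data layout: ToyMergeList and combinedState are hypothetical names, std::map replaces UVector, and treating the pair as unordered is an added simplification. Checking the table before creating a new combined state is what stops a recursive merge from revisiting the same pair indefinitely.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical memo of "states a and b combine into state n".
class ToyMergeList {
public:
    explicit ToyMergeList(int firstFreeState) : nextState(firstFreeState) {}

    // Returns the state number representing the combination of a and b.
    // `created` is set to true only when a new state number was allocated,
    // which tells the caller whether it still has to fill in that state.
    int combinedState(int a, int b, bool& created) {
        assert(a >= 0 && b >= 0);
        if (a > b) std::swap(a, b);          // assumption: the pair is unordered
        const std::pair<int, int> key(a, b);
        const auto it = merged.find(key);
        if (it != merged.end()) {
            created = false;                 // already merged: reuse, do not recurse
            return it->second;
        }
        created = true;
        merged.emplace(key, nextState);
        return nextState++;
    }

private:
    std::map<std::pair<int, int>, int> merged;
    int nextState;
};
```

A caller that merges states recursively would descend only when created is true, so each pair is expanded at most once.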

int32_t RuleBasedBreakIteratorBuilder::numCategories [protected]
 

The number of categories (and thus the number of columns in the finished state tables).

Definition at line 71 of file rbbi_bld.h.

UVector RuleBasedBreakIteratorBuilder::statesToBackfill [protected]
 

Looping states actually have to be backfilled later in the process than everything else.

This is where the list of states to backfill is accumulated. This is also used to handle .*?

Definition at line 112 of file rbbi_bld.h.

RuleBasedBreakIteratorTables * RuleBasedBreakIteratorBuilder::tables [protected]
 

The tables object for the iterator we're constructing.

Definition at line 55 of file rbbi_bld.h.

UVector RuleBasedBreakIteratorBuilder::tempRuleList [protected]
 

A temporary place to hold the rules as they're being processed.

Definition at line 60 of file rbbi_bld.h.

UVector RuleBasedBreakIteratorBuilder::tempStateTable [protected]
 

A temporary holding place where the forward state table is built.

Definition at line 87 of file rbbi_bld.h.


The documentation for this class was generated from the following file:

rbbi_bld.h

Generated at Tue Dec 5 17:56:18 2000 for ICU by doxygen1.2.3 written by Dimitri van Heesch, © 1997-2000