
DictionaryBasedBreakIterator Class Reference

A subclass of RuleBasedBreakIterator that adds the ability to use a dictionary to further subdivide ranges of text beyond what is possible using just the state-table-based algorithm.

#include <dbbi.h>

Inheritance diagram for DictionaryBasedBreakIterator:

DictionaryBasedBreakIterator inherits RuleBasedBreakIterator, which inherits BreakIterator.

Public Methods

virtual ~DictionaryBasedBreakIterator ()
 Destructor.

DictionaryBasedBreakIterator& operator= (const DictionaryBasedBreakIterator& that)
 Assignment operator.

virtual BreakIterator* clone (void) const
 Returns a newly-constructed RuleBasedBreakIterator with the same behavior, and iterating over the same text, as this one.

virtual int32_t previous (void)
 Advances the iterator backwards, to the last boundary preceding this one.

virtual int32_t following (int32_t offset)
 Sets the iterator to refer to the first boundary position following the specified position.

virtual int32_t preceding (int32_t offset)
 Sets the iterator to refer to the last boundary position before the specified position.

virtual UClassID getDynamicClassID (void) const
 Returns a unique class ID POLYMORPHICALLY.


Static Public Methods

UClassID getStaticClassID (void)
 Returns the class ID for this class.


Protected Methods

virtual int32_t handleNext (void)
 This method is the actual implementation of the next() method.

virtual void reset (void)
 Dumps the cache of break positions (usually in response to a change in position of some sort).

virtual BreakIterator* createBufferClone (void *stackBuffer, int32_t &BufferSize, UErrorCode &status)
 Thread-safe, client-buffer-based cloning operation. Do NOT call delete on a safeclone, since 'new' is not used to create it.


Private Methods

 DictionaryBasedBreakIterator (UDataMemory* tablesImage, char* dictionaryFilename, UErrorCode& status)
 Create a dictionary based break boundary detection iterator.

void divideUpDictionaryRange (int32_t startPos, int32_t endPos)
 This is the function that actually implements the dictionary-based algorithm.

void bumpDictionaryCharCount (void)
 Used by the tables object to increment the count of dictionary characters during iteration.


Private Attributes

int32_t dictionaryCharCount
 A temporary hiding place for the number of dictionary characters in the last range passed over by next().

int32_t* cachedBreakPositions
 When a range of characters is divided up using the dictionary, the break positions that are discovered are stored here, preventing us from having to use either the dictionary or the state table again until the iterator leaves this range of text.

int32_t numCachedBreakPositions
 The number of elements in cachedBreakPositions.

int32_t positionInCache
 If cachedBreakPositions is not null, this indicates which item in the cache the current iteration position refers to.


Static Private Attributes

char fgClassID
 Class ID.


Friends

class  DictionaryBasedBreakIteratorTables
class  BreakIterator

Detailed Description

A subclass of RuleBasedBreakIterator that adds the ability to use a dictionary to further subdivide ranges of text beyond what is possible using just the state-table-based algorithm.

This is necessary, for example, to handle word and line breaking in Thai, which doesn't use spaces between words. The state-table-based algorithm used by RuleBasedBreakIterator is used to divide up text as far as possible, and then contiguous ranges of letters are repeatedly compared against a list of known words (i.e., the dictionary) to divide them up into words.
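
The second stage described above, matching a contiguous run of letters against a list of known words, can be sketched as follows. This is only an illustration under stated assumptions, not the ICU implementation: the function name `divideRange` is hypothetical, the dictionary is a plain `std::set`, and the text is an ASCII `std::string` rather than Unicode text.

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: split text[start, end) into dictionary words.
// Returns the boundary offsets, including start and end. If the range cannot
// be covered completely by dictionary words, only {start, end} is returned.
std::vector<int32_t> divideRange(const std::string& text, int32_t start, int32_t end,
                                 const std::set<std::string>& dictionary) {
    const int32_t n = end - start;
    // prev[i] is the relative offset where the word ending at i begins,
    // or -1 when no sequence of dictionary words ends exactly at i.
    std::vector<int32_t> prev(n + 1, -1);
    prev[0] = 0;
    for (int32_t i = 1; i <= n; ++i) {
        for (int32_t j = 0; j < i; ++j) {
            if (prev[j] >= 0 && dictionary.count(text.substr(start + j, i - j)) > 0) {
                prev[i] = j;
                break;
            }
        }
    }
    std::vector<int32_t> boundaries;
    if (prev[n] < 0) {               // no full segmentation: keep the range whole
        boundaries.push_back(start);
        boundaries.push_back(end);
        return boundaries;
    }
    // Walk the chain of word starts backwards from the end of the range.
    for (int32_t i = n; i > 0; i = prev[i]) {
        boundaries.push_back(start + i);
    }
    boundaries.push_back(start);
    std::reverse(boundaries.begin(), boundaries.end());
    return boundaries;
}
```

With the dictionary {"the", "cat", "sat"}, calling `divideRange("thecatsat", 0, 9, dictionary)` yields the offsets 0, 3, 6 and 9. A range that contains an unknown word is left undivided in this sketch; a production algorithm would need a better fallback.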

DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator, but adds one more special substitution name: <dictionary>. This substitution name is used to identify characters in words in the dictionary. The idea is that if the iterator passes over a chunk of text that includes two or more characters in a row that are included in <dictionary>, it goes back through that range and derives additional break positions (if possible) using the dictionary.
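
The trigger condition in the paragraph above (two or more dictionary characters in a row within the range just passed over) can be sketched as a simple scan. The function name `needsDictionaryPass` and the `isDictionaryChar` predicate are hypothetical stand-ins for illustration; ICU derives the character set from the rules rather than from a callback.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical check: does text[start, end) contain at least two consecutive
// characters for which isDictionaryChar returns true?
bool needsDictionaryPass(const std::string& text, int32_t start, int32_t end,
                         const std::function<bool(char)>& isDictionaryChar) {
    int32_t run = 0;  // length of the current run of dictionary characters
    for (int32_t i = start; i < end; ++i) {
        run = isDictionaryChar(text[i]) ? run + 1 : 0;
        if (run >= 2) {
            return true;
        }
    }
    return false;
}
```

Only ranges for which this returns true would be handed to the more expensive dictionary lookup.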

DictionaryBasedBreakIterator is also constructed with the filename of a dictionary file. It follows a prescribed search path to locate the dictionary (right now, it looks for it in /com/ibm/text/resources in each directory in the classpath, and won't find it in JAR files, but this location is likely to change). The dictionary file is in a serialized binary format. We have a very primitive (and slow) BuildDictionaryFile utility for creating dictionary files, but aren't currently making it public. Contact us for help.

NOTE: The DictionaryBasedBreakIterator class is still under development. The APIs are not in stable condition yet.

Definition at line 47 of file dbbi.h.


Constructor & Destructor Documentation

DictionaryBasedBreakIterator::DictionaryBasedBreakIterator ( UDataMemory * tablesImage,
char * dictionaryFilename,
UErrorCode & status ) [private]
 

Create a dictionary based break boundary detection iterator.

Parameters:
tablesImage   The location for the dictionary to be loaded into memory
dictionaryFilename   The name of the dictionary file
status   the error code status
Returns:
A dictionary based break detection iterator. The UErrorCode& status parameter is used to return status information to the user. To check whether the construction succeeded, check the value of U_SUCCESS(status). For more detailed information, you can check for informational error results which still indicate success. For example, U_FILE_ACCESS_ERROR will be returned if the file does not exist. The caller owns the returned object and is responsible for deleting it.

DictionaryBasedBreakIterator::~DictionaryBasedBreakIterator ( ) [virtual]
 

Destructor.


Member Function Documentation

void DictionaryBasedBreakIterator::bumpDictionaryCharCount ( void ) [inline, private]
 

Used by the tables object to increment the count of dictionary characters during iteration.

Definition at line 224 of file dbbi.h.

BreakIterator * DictionaryBasedBreakIterator::clone ( void ) const [virtual]
 

Returns a newly-constructed RuleBasedBreakIterator with the same behavior, and iterating over the same text, as this one.

Reimplemented from RuleBasedBreakIterator.

virtual BreakIterator* DictionaryBasedBreakIterator::createBufferClone ( void * stackBuffer,
int32_t & BufferSize,
UErrorCode & status ) [protected, virtual]
 

Thread-safe, client-buffer-based cloning operation. Do NOT call delete on a safeclone, since 'new' is not used to create it.

Parameters:
stackBuffer   User-allocated space for the new clone. If NULL, new memory will be allocated. If the buffer is not large enough, new memory will be allocated.
BufferSize   Reference to the size of the allocated space. If BufferSize == 0, a sufficient size for use in cloning will be returned ('pre-flighting'). If BufferSize is not enough for a stack-based safe clone, new memory will be allocated.
status   Indicates whether the operation went smoothly or there were errors. An informational status value, U_SAFECLONE_ALLOCATED_ERROR, is used if any allocations were necessary.
Returns:
pointer to the new clone

Draft:
API 1.8 freeze

Reimplemented from RuleBasedBreakIterator.
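
The buffer-based cloning contract described above (pre-flighting with a size of 0, cloning into the caller's buffer when it is large enough, and falling back to the heap otherwise) can be sketched with placement new. Everything here is a stand-in for illustration: `Widget` and `bufferClone` are hypothetical names, and a `bool` flag replaces the informational UErrorCode value. The sketch also assumes the caller's buffer is suitably aligned for the cloned type.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

struct Widget {  // hypothetical clonable type
    int32_t value;
};

// Clone src into stackBuffer when possible, otherwise onto the heap.
// bufferSize == 0 only reports the required size ("pre-flighting").
// usedHeap tells the caller whether the result must eventually be deleted.
Widget* bufferClone(const Widget& src, void* stackBuffer,
                    int32_t& bufferSize, bool& usedHeap) {
    const int32_t needed = static_cast<int32_t>(sizeof(Widget));
    usedHeap = false;
    if (bufferSize == 0) {           // pre-flight: report the size, clone nothing
        bufferSize = needed;
        return nullptr;
    }
    if (stackBuffer == nullptr || bufferSize < needed) {
        usedHeap = true;             // buffer missing or too small
        return new Widget(src);
    }
    // Placement new constructs the clone inside the caller's buffer,
    // which is why such a clone must not be passed to delete.
    return new (stackBuffer) Widget(src);
}
```

A caller would typically pre-flight once, reserve a buffer of the reported size, and then clone into it, deleting the result only when the heap fallback was taken.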

void DictionaryBasedBreakIterator::divideUpDictionaryRange ( int32_t startPos,
int32_t endPos ) [private]
 

This is the function that actually implements the dictionary-based algorithm.

Given the endpoints of a range of text, it uses the dictionary to determine the positions of any boundaries in this range. It stores all the boundary positions it discovers in cachedBreakPositions so that we only have to do this work once for each time we enter the range.
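
The caching behaviour described here can be sketched as a small structure that serves boundaries from the cache until the iterator leaves the range. The member names mirror the private attributes documented on this page, but the `BreakCache` type and its methods are hypothetical and use a `std::vector` rather than a raw array.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct BreakCache {
    std::vector<int32_t> cachedBreakPositions;  // empty means "no cache"
    std::size_t positionInCache = 0;

    // Install the boundaries found for one dictionary range. The iterator is
    // assumed to be sitting on the first of them.
    void fill(std::vector<int32_t> positions) {
        cachedBreakPositions = std::move(positions);
        positionInCache = 0;
    }

    // Discard the cache, for example after the position changes arbitrarily.
    void reset() {
        cachedBreakPositions.clear();
        positionInCache = 0;
    }

    // Try to advance within the cache. Returns true and writes the next
    // boundary when one is cached. Returns false (and drops the cache) when
    // the caller has to consult the state table again.
    bool next(int32_t& boundary) {
        if (positionInCache + 1 < cachedBreakPositions.size()) {
            boundary = cachedBreakPositions[++positionInCache];
            return true;
        }
        reset();
        return false;
    }
};
```

After `fill({0, 3, 6, 9})`, three calls to `next` produce 3, 6 and 9 without touching the dictionary, and the fourth call reports that the range has been exhausted.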

int32_t DictionaryBasedBreakIterator::following ( int32_t offset ) [virtual]
 

Sets the iterator to refer to the first boundary position following the specified position.

Parameters:
offset   The position from which to begin searching for a break position.

Returns:
The position of the first break after the specified position.

Reimplemented from RuleBasedBreakIterator.

UClassID DictionaryBasedBreakIterator::getDynamicClassID ( void ) const [inline, virtual]
 

Returns a unique class ID POLYMORPHICALLY.

Pure virtual override. This method is to implement a simple version of RTTI, since not all C++ compilers support genuine RTTI. Polymorphic operator==() and clone() methods call this method.

Returns:
The class ID for this object. All objects of a given class have the same class ID. Objects of other classes have different class IDs.

Reimplemented from RuleBasedBreakIterator.

Definition at line 216 of file dbbi.h.

UClassID DictionaryBasedBreakIterator::getStaticClassID ( void ) [inline, static]
 

Returns the class ID for this class.

This is useful only for comparing to a return value from getDynamicClassID(). For example:

    Base* polymorphic_pointer = createPolymorphicObject();
    if (polymorphic_pointer->getDynamicClassID() == Derived::getStaticClassID()) ...

Returns:
The class ID for all objects of this class.

Reimplemented from RuleBasedBreakIterator.

Definition at line 220 of file dbbi.h.
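
The class ID scheme can be sketched in a self-contained form: each class owns a static `char`, and its address serves as the class's unique ID. The `Base` and `Derived` classes, the `ClassID` typedef (standing in for UClassID) and the `isDerived` helper are hypothetical names used only for this illustration.

```cpp
typedef void* ClassID;  // stand-in for ICU's UClassID

class Base {
public:
    virtual ~Base() {}
    // The address of a per-class static member is unique to that class.
    static ClassID getStaticClassID() { return &fgClassID; }
    virtual ClassID getDynamicClassID() const { return getStaticClassID(); }
private:
    static char fgClassID;
};
char Base::fgClassID = 0;

class Derived : public Base {
public:
    static ClassID getStaticClassID() { return &fgClassID; }
    virtual ClassID getDynamicClassID() const { return getStaticClassID(); }
private:
    static char fgClassID;
};
char Derived::fgClassID = 0;

// Poor man's RTTI: compare the dynamic ID of the object with the static ID
// of the class of interest.
bool isDerived(const Base* object) {
    return object->getDynamicClassID() == Derived::getStaticClassID();
}
```

Because the comparison is on exact IDs, an object of a further subclass of `Derived` that overrides `getDynamicClassID()` would not match; the check identifies the precise class rather than an inheritance relationship.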

int32_t DictionaryBasedBreakIterator::handleNext ( void ) [protected, virtual]
 

This method is the actual implementation of the next() method.

All iteration is routed through this method. It initializes the state machine to state 1 and advances through the text character by character until we reach the end of the text or the state machine transitions to state 0. We update our return value every time the state machine passes through a possible end state.

Reimplemented from RuleBasedBreakIterator.
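
The loop described above (start in state 1, stop on state 0 or at the end of the text, and remember the last position at which an end state was seen) can be sketched with a toy transition table. The table, the two character classes and the function name `nextBoundary` are invented for illustration and have nothing to do with the tables ICU actually builds from its rules.

```cpp
#include <cctype>
#include <cstdint>
#include <string>

// Toy state machine: a boundary follows each run of letters and each single
// non-letter character. Returns the next boundary at or after pos.
int32_t nextBoundary(const std::string& text, int32_t pos) {
    // Rows are states 0..3; columns are character classes
    // (0 = letter, 1 = anything else).
    static const int transitions[4][2] = {
        {0, 0},  // state 0: stop
        {2, 3},  // state 1: start
        {2, 0},  // state 2: inside a run of letters
        {0, 0},  // state 3: consumed one non-letter
    };
    static const bool accepting[4] = {false, false, true, true};

    const int32_t length = static_cast<int32_t>(text.size());
    int state = 1;
    int32_t result = pos;
    int32_t i = pos;
    while (i < length) {
        const int cls = std::isalpha(static_cast<unsigned char>(text[i])) ? 0 : 1;
        state = transitions[state][cls];
        if (state == 0) {
            break;               // the machine stopped: keep the last result
        }
        ++i;
        if (accepting[state]) {
            result = i;          // passed through a possible end state
        }
    }
    return result;
}
```

On the text "ab cd", successive calls starting from offset 0 return 2, 3 and 5.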

DictionaryBasedBreakIterator & DictionaryBasedBreakIterator::operator= ( const DictionaryBasedBreakIterator & that )
 

Assignment operator.

Sets this iterator to have the same behavior, and iterate over the same text, as the one passed in.

int32_t DictionaryBasedBreakIterator::preceding ( int32_t offset ) [virtual]
 

Sets the iterator to refer to the last boundary position before the specified position.

Parameters:
offset   The position from which to begin searching for a break.

Returns:
The position of the last boundary before the starting position.

Reimplemented from RuleBasedBreakIterator.
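
The semantics of following() and preceding() can be illustrated over a sorted list of known boundaries, where they reduce to two binary searches. This is a sketch of the contract only, not of how the iterator computes boundaries. The function names are hypothetical, and `kDone` is a stand-in value for "no such boundary".

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

const int32_t kDone = -1;  // hypothetical "no boundary" marker

// First boundary strictly after offset, or kDone if there is none.
int32_t followingBoundary(const std::vector<int32_t>& boundaries, int32_t offset) {
    std::vector<int32_t>::const_iterator it =
        std::upper_bound(boundaries.begin(), boundaries.end(), offset);
    return it == boundaries.end() ? kDone : *it;
}

// Last boundary strictly before offset, or kDone if there is none.
int32_t precedingBoundary(const std::vector<int32_t>& boundaries, int32_t offset) {
    std::vector<int32_t>::const_iterator it =
        std::lower_bound(boundaries.begin(), boundaries.end(), offset);
    return it == boundaries.begin() ? kDone : *(it - 1);
}
```

Note that an offset which is itself a boundary is skipped in both directions: with boundaries 0, 3, 6 and 9, the boundary following 3 is 6 and the boundary preceding 3 is 0.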

int32_t DictionaryBasedBreakIterator::previous ( void ) [virtual]
 

Advances the iterator backwards, to the last boundary preceding this one.

Returns:
The position of the last boundary preceding this one.

Reimplemented from RuleBasedBreakIterator.

void DictionaryBasedBreakIterator::reset ( void ) [protected, virtual]
 

Dumps the cache of break positions (usually in response to a change in position of some sort).

Reimplemented from RuleBasedBreakIterator.


Friends And Related Function Documentation

class BreakIterator [friend]
 

Reimplemented from RuleBasedBreakIterator.

Definition at line 213 of file dbbi.h.

class DictionaryBasedBreakIteratorTables [friend]
 

Definition at line 212 of file dbbi.h.


Member Data Documentation

int32_t * DictionaryBasedBreakIterator::cachedBreakPositions [private]
 

When a range of characters is divided up using the dictionary, the break positions that are discovered are stored here, preventing us from having to use either the dictionary or the state table again until the iterator leaves this range of text.

Definition at line 62 of file dbbi.h.

int32_t DictionaryBasedBreakIterator::dictionaryCharCount [private]
 

A temporary hiding place for the number of dictionary characters in the last range passed over by next().

Definition at line 54 of file dbbi.h.

char DictionaryBasedBreakIterator::fgClassID [static, private]
 

Class ID.

Reimplemented from RuleBasedBreakIterator.

Definition at line 78 of file dbbi.h.

int32_t DictionaryBasedBreakIterator::numCachedBreakPositions [private]
 

The number of elements in cachedBreakPositions.

Definition at line 67 of file dbbi.h.

int32_t DictionaryBasedBreakIterator::positionInCache [private]
 

If cachedBreakPositions is not null, this indicates which item in the cache the current iteration position refers to.

Definition at line 73 of file dbbi.h.


The documentation for this class was generated from the following file: dbbi.h
Generated at Tue Jun 12 14:04:30 2001 for ICU 1.8.1 by doxygen1.2.3 written by Dimitri van Heesch, © 1997-2000