com.ibm.text
Class DictionaryBasedBreakIterator

java.lang.Object
  |
  +--com.ibm.text.BreakIterator
        |
        +--com.ibm.text.RuleBasedBreakIterator
              |
              +--com.ibm.text.DictionaryBasedBreakIterator
All Implemented Interfaces:
java.lang.Cloneable

public class DictionaryBasedBreakIterator
extends RuleBasedBreakIterator

A subclass of RuleBasedBreakIterator that adds the ability to use a dictionary to further subdivide ranges of text beyond what is possible using just the state-table-based algorithm. This is necessary, for example, to handle word and line breaking in Thai, which doesn't use spaces between words. The state-table-based algorithm used by RuleBasedBreakIterator first divides the text as far as it can; contiguous ranges of letters are then repeatedly compared against a list of known words (i.e., the dictionary) to divide them into words.

DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator, but adds one more special substitution name: _dictionary_. This substitution name identifies the characters that occur in dictionary words. If the iterator passes over a chunk of text that includes two or more consecutive characters belonging to _dictionary_, it goes back through that range and derives additional break positions (if possible) using the dictionary.

DictionaryBasedBreakIterator is also constructed with a dictionary, supplied to the constructor as an InputStream over the dictionary file; a caller can locate that file with Class.getResource(), for example. The dictionary file is in a serialized binary format. We have a very primitive (and slow) BuildDictionaryFile utility for creating dictionary files, but aren't currently making it public. Contact us for help.
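
The dictionary step described above can be illustrated with a small, self-contained sketch. It is not this class's implementation: the class name DictionarySplitSketch, the method breaks(), the use of a plain Set of strings as the dictionary, and the fallback of returning only the two end offsets when no complete division exists are all assumptions made for illustration. It takes one contiguous run of letters and looks for break offsets such that every piece is a known word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Toy illustration only; hypothetical names, not the library's algorithm. */
final class DictionarySplitSketch {

    /**
     * Returns break offsets (starting with 0 and ending with run.length())
     * that divide the run into dictionary words. If the run cannot be
     * divided completely, only the two end offsets are returned.
     */
    static List<Integer> breaks(String run, Set<String> dictionary) {
        int n = run.length();
        // prev[i] is the start of a word ending at i on some complete
        // division of run[0..i), or -1 if no such division exists.
        int[] prev = new int[n + 1];
        Arrays.fill(prev, -1);
        prev[0] = 0; // the empty prefix is trivially divisible
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (prev[start] >= 0
                        && dictionary.contains(run.substring(start, end))) {
                    prev[end] = start;
                    break;
                }
            }
        }
        List<Integer> result = new ArrayList<>();
        if (prev[n] < 0) {
            // Assumed fallback: leave the run undivided.
            result.add(0);
            result.add(n);
            return result;
        }
        // Walk the chain of word starts back from the end of the run.
        for (int pos = n; pos > 0; pos = prev[pos]) {
            result.add(pos);
        }
        result.add(0);
        Collections.reverse(result);
        return result;
    }
}
```

With a dictionary containing "text", "break" and "iterator", the run "textbreakiterator" yields the offsets 0, 4, 9 and 17. English words are used here only to keep the example readable; the same idea applies to scripts such as Thai.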


Inner Class Summary
protected  class DictionaryBasedBreakIterator.Builder
          The Builder class for DictionaryBasedBreakIterator inherits almost all of its functionality from the Builder class for RuleBasedBreakIterator, but extends it with extra logic to handle the DICTIONARY_VAR token.
 
Inner classes inherited from class com.ibm.text.RuleBasedBreakIterator
RuleBasedBreakIterator.Builder
 
Fields inherited from class com.ibm.text.RuleBasedBreakIterator
IGNORE
 
Fields inherited from class com.ibm.text.BreakIterator
DONE
 
Constructor Summary
DictionaryBasedBreakIterator(java.lang.String description, java.io.InputStream dictionaryStream)
          Constructs a DictionaryBasedBreakIterator.
 
Method Summary
 int first()
          Sets the current iteration position to the beginning of the text.
 int following(int offset)
          Sets the current iteration position to the first boundary position after the specified position.
protected  int handleNext()
          This is the implementation function for next().
 int last()
          Sets the current iteration position to the end of the text.
protected  int lookupCategory(char c)
          Looks up a character category for a character.
protected  RuleBasedBreakIterator.Builder makeBuilder()
          Returns a Builder that is customized to build a DictionaryBasedBreakIterator.
 int preceding(int offset)
          Sets the current iteration position to the last boundary position before the specified position.
 int previous()
          Advances the iterator one step backwards.
 void setText(java.text.CharacterIterator newText)
          Set the iterator to analyze a new piece of text.
 void writeTablesToFile(java.io.FileOutputStream file, boolean littleEndian)
           
 
Methods inherited from class com.ibm.text.RuleBasedBreakIterator
checkOffset, clone, current, debugPrintln, equals, getText, handlePrevious, hashCode, isBoundary, lookupBackwardState, lookupState, next, next, toString, writeSwappedInt, writeSwappedShort
 
Methods inherited from class com.ibm.text.BreakIterator
getAvailableLocales, getCharacterInstance, getCharacterInstance, getLineInstance, getLineInstance, getSentenceInstance, getSentenceInstance, getWordInstance, getWordInstance, setText
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

DictionaryBasedBreakIterator

public DictionaryBasedBreakIterator(java.lang.String description,
                                    java.io.InputStream dictionaryStream)
                             throws java.io.IOException
Constructs a DictionaryBasedBreakIterator.
Parameters:
description - Same as the description parameter on RuleBasedBreakIterator, except for the special meaning of DICTIONARY_VAR. This parameter is just passed through to RuleBasedBreakIterator's constructor.
dictionaryStream - An input stream over the dictionary file to use
Method Detail

makeBuilder

protected RuleBasedBreakIterator.Builder makeBuilder()
Returns a Builder that is customized to build a DictionaryBasedBreakIterator. This is the same as RuleBasedBreakIterator.Builder, except for the extra code to handle the DICTIONARY_VAR tag.
Overrides:
makeBuilder in class RuleBasedBreakIterator

writeTablesToFile

public void writeTablesToFile(java.io.FileOutputStream file,
                              boolean littleEndian)
                       throws java.io.IOException
Overrides:
writeTablesToFile in class RuleBasedBreakIterator

setText

public void setText(java.text.CharacterIterator newText)
Description copied from class: RuleBasedBreakIterator
Set the iterator to analyze a new piece of text. This function resets the current iteration position to the beginning of the text.
Overrides:
setText in class RuleBasedBreakIterator
Following copied from class: com.ibm.text.RuleBasedBreakIterator
Parameters:
newText - An iterator over the text to analyze.

first

public int first()
Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).
Overrides:
first in class RuleBasedBreakIterator
Returns:
The offset of the beginning of the text.

last

public int last()
Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).
Overrides:
last in class RuleBasedBreakIterator
Returns:
The text's past-the-end offset.
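
The interaction of setText(), first() and last() can be sketched as follows. This example uses the JDK's java.text.BreakIterator as a stand-in, on the assumption that it behaves analogously for these calls; the class name SetTextSketch and the sample strings are hypothetical. It shows that setText() resets the position to the start of the new text, and that first() and last() report the CharacterIterator's own begin and end offsets rather than those of the underlying string.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.Locale;

/** Illustration with the JDK's BreakIterator; hypothetical class name. */
final class SetTextSketch {

    /** Returns { last() on the old text, current() after setText(), first(), last() }. */
    static int[] demo() {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText("one two");
        int oldEnd = words.last(); // move away from the start: offset 7

        // Analyze only "Hello world", which occupies offsets 3..14 of the string.
        String padded = "xx Hello world yy";
        CharacterIterator range = new StringCharacterIterator(padded, 3, 14, 3);
        words.setText(range);

        int afterSetText = words.current(); // reset to the range's starting offset
        int first = words.first();          // the range's starting offset
        int last = words.last();            // the range's past-the-end offset
        return new int[] { oldEnd, afterSetText, first, last };
    }
}
```

Here demo() returns 7, 3, 3 and 14.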

previous

public int previous()
Advances the iterator one step backwards.
Overrides:
previous in class RuleBasedBreakIterator
Returns:
The position of the last boundary before the current iteration position

preceding

public int preceding(int offset)
Sets the current iteration position to the last boundary position before the specified position.
Overrides:
preceding in class RuleBasedBreakIterator
Parameters:
offset - The position to begin searching from
Returns:
The position of the last boundary before "offset"

following

public int following(int offset)
Sets the current iteration position to the first boundary position after the specified position.
Overrides:
following in class RuleBasedBreakIterator
Parameters:
offset - The position to begin searching forward from
Returns:
The position of the first boundary after "offset"
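
A brief sketch of how following(), preceding() and previous() relate to one another, again using the JDK's java.text.BreakIterator as a stand-in (an assumption; a dictionary-based iterator may report additional boundaries inside runs of dictionary characters). The class name BoundarySearchSketch is hypothetical. For the text "Hello world" the word boundaries are 0, 5, 6 and 11.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Illustration with the JDK's BreakIterator; hypothetical class name. */
final class BoundarySearchSketch {

    /** Returns { following(2), following(5), preceding(5), last(), previous() }. */
    static int[] demo() {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText("Hello world");

        int after2 = words.following(2);  // first boundary after offset 2
        int after5 = words.following(5);  // strictly after 5, even though 5 is a boundary
        int before5 = words.preceding(5); // strictly before 5
        int end = words.last();           // move to the end of the text
        int back = words.previous();      // step back one boundary from the end
        return new int[] { after2, after5, before5, end, back };
    }
}
```

Here demo() returns 5, 6, 0, 11 and 6. Both searches are exclusive: an offset that is itself a boundary is never returned by following() or preceding() for that same offset.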

handleNext

protected int handleNext()
This is the implementation function for next().
Overrides:
handleNext in class RuleBasedBreakIterator

lookupCategory

protected int lookupCategory(char c)
Looks up a character category for a character.
Overrides:
lookupCategory in class RuleBasedBreakIterator


Copyright (c) 1998-2000 IBM Corporation and others.