com.ibm.text
Class DictionaryBasedBreakIterator
java.lang.Object
  |
  +--com.ibm.text.BreakIterator
        |
        +--com.ibm.text.RuleBasedBreakIterator
              |
              +--com.ibm.text.DictionaryBasedBreakIterator
- All Implemented Interfaces:
- java.lang.Cloneable
- public class DictionaryBasedBreakIterator
- extends RuleBasedBreakIterator
A subclass of RuleBasedBreakIterator that adds the ability to use a dictionary
to further subdivide ranges of text beyond what is possible using just the
state-table-based algorithm. This is necessary, for example, to handle
word and line breaking in Thai, which doesn't use spaces between words. The
state-table-based algorithm used by RuleBasedBreakIterator is used to divide
up text as far as possible, and then contiguous ranges of letters are
repeatedly compared against a list of known words (i.e., the dictionary)
to divide them up into words.
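The second pass described above can be pictured with a small toy example. This is only a sketch under stated assumptions: it is not the algorithm or the dictionary data structure the class actually uses, the class and method names (DictionarySplitSketch, split) are hypothetical, the dictionary is a plain in-memory Set, and Latin text without spaces stands in for Thai. It takes one contiguous range of letters and looks for break positions such that every piece is a known word, backtracking when a choice leads to a dead end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy illustration only; not the implementation used by DictionaryBasedBreakIterator. */
final class DictionarySplitSketch {

    /**
     * Returns the boundary offsets inside text[start, end) such that every
     * piece between consecutive boundaries is a dictionary word. Returns an
     * empty list if the range cannot be covered completely by known words.
     */
    static List<Integer> split(String text, int start, int end, Set<String> dictionary) {
        List<Integer> boundaries = new ArrayList<>();
        return search(text, start, end, dictionary, boundaries) ? boundaries : new ArrayList<>();
    }

    // Depth-first search with backtracking; longer candidate words are tried first.
    private static boolean search(String text, int pos, int end,
                                  Set<String> dictionary, List<Integer> out) {
        if (pos == end) {
            return true; // the whole range has been consumed
        }
        for (int wordEnd = end; wordEnd > pos; wordEnd--) {
            if (dictionary.contains(text.substring(pos, wordEnd))) {
                out.add(wordEnd);
                if (search(text, wordEnd, end, dictionary, out)) {
                    return true;
                }
                out.remove(out.size() - 1); // dead end: undo this choice and try a shorter word
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("the", "cats", "cat", "sat"));
        // "cats" is tried first but leaves "at", which is not a word, so the search backs up to "cat".
        System.out.println(split("thecatsat", 0, 9, dictionary)); // [3, 6, 9]
    }
}
```

The sketch has exponential worst-case cost and is meant only to show why a word list lets an iterator find breaks that character-class rules alone cannot.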
DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator,
but adds one more special substitution name: _dictionary_. This substitution
name is used to identify characters in words in the dictionary. The idea is that
if the iterator passes over a chunk of text that includes two or more characters
in a row that are included in _dictionary_, it goes back through that range and
derives additional break positions (if possible) using the dictionary.
DictionaryBasedBreakIterator is also constructed with the filename of a dictionary
file. It uses Class.getResource() to locate the dictionary file. The
dictionary file is in a serialized binary format. We have a very primitive (and
slow) BuildDictionaryFile utility for creating dictionary files, but aren't
currently making it public. Contact us for help.
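Note that the paragraph above speaks of a filename and Class.getResource(), while the constructor listed on this page takes a java.io.InputStream. One way a caller might bridge the two is to open the dictionary as a classpath resource and hand the resulting stream to the constructor. The helper below is a sketch, not part of this API: the class name DictionaryResourceSketch, the method open, and the resource name "hypothetical.dict" are all assumptions made for illustration. It uses Class.getResourceAsStream(), the stream-returning relative of Class.getResource().

```java
import java.io.FileNotFoundException;
import java.io.InputStream;

/** Hypothetical helper; not part of com.ibm.text. */
final class DictionaryResourceSketch {

    /**
     * Opens a classpath resource relative to the given class and fails with
     * an exception (instead of returning null) when the resource is absent.
     */
    static InputStream open(Class<?> anchor, String resourceName) throws FileNotFoundException {
        InputStream in = anchor.getResourceAsStream(resourceName);
        if (in == null) {
            throw new FileNotFoundException("Dictionary resource not found: " + resourceName);
        }
        return in;
    }

    public static void main(String[] args) throws Exception {
        // "hypothetical.dict" is a made-up name; substitute a real dictionary resource.
        try (InputStream in = open(DictionaryResourceSketch.class, "hypothetical.dict")) {
            // The stream would be passed on here, for example:
            // new DictionaryBasedBreakIterator(description, in)
            System.out.println("dictionary stream opened");
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```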
Inner Class Summary
protected class DictionaryBasedBreakIterator.Builder
    The Builder class for DictionaryBasedBreakIterator inherits almost all of
    its functionality from the Builder class for RuleBasedBreakIterator, but
    extends it with extra logic to handle the DICTIONARY_VAR token.
Constructor Summary
DictionaryBasedBreakIterator(java.lang.String description, java.io.InputStream dictionaryStream)
    Constructs a DictionaryBasedBreakIterator.
Method Summary
int first()
    Sets the current iteration position to the beginning of the text.
int following(int offset)
    Sets the current iteration position to the first boundary position after
    the specified position.
protected int handleNext()
    This is the implementation function for next().
int last()
    Sets the current iteration position to the end of the text.
protected int lookupCategory(char c)
    Looks up a character category for a character.
protected RuleBasedBreakIterator.Builder makeBuilder()
    Returns a Builder that is customized to build a DictionaryBasedBreakIterator.
int preceding(int offset)
    Sets the current iteration position to the last boundary position
    before the specified position.
int previous()
    Advances the iterator one step backwards.
void setText(java.text.CharacterIterator newText)
    Set the iterator to analyze a new piece of text.
void writeTablesToFile(java.io.FileOutputStream file, boolean littleEndian)
Methods inherited from class com.ibm.text.RuleBasedBreakIterator
checkOffset, clone, current, debugPrintln, equals, getText, handlePrevious, hashCode, isBoundary, lookupBackwardState, lookupState, next, next, toString, writeSwappedInt, writeSwappedShort
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
DictionaryBasedBreakIterator
public DictionaryBasedBreakIterator(java.lang.String description,
java.io.InputStream dictionaryStream)
throws java.io.IOException
- Constructs a DictionaryBasedBreakIterator.
- Parameters:
description
- Same as the description parameter on RuleBasedBreakIterator,
except for the special meaning of DICTIONARY_VAR. This parameter is just
passed through to RuleBasedBreakIterator's constructor.
dictionaryStream
- A stream from which the dictionary file to use is read.
makeBuilder
protected RuleBasedBreakIterator.Builder makeBuilder()
- Returns a Builder that is customized to build a DictionaryBasedBreakIterator.
This is the same as RuleBasedBreakIterator.Builder, except for the extra code
to handle the DICTIONARY_VAR tag.
- Overrides:
makeBuilder
in class RuleBasedBreakIterator
writeTablesToFile
public void writeTablesToFile(java.io.FileOutputStream file,
boolean littleEndian)
throws java.io.IOException
- Overrides:
writeTablesToFile
in class RuleBasedBreakIterator
setText
public void setText(java.text.CharacterIterator newText)
- Description copied from class:
RuleBasedBreakIterator
- Set the iterator to analyze a new piece of text. This function resets
the current iteration position to the beginning of the text.
- Overrides:
setText
in class RuleBasedBreakIterator
- Following copied from class:
com.ibm.text.RuleBasedBreakIterator
- Parameters:
newText
- An iterator over the text to analyze.
first
public int first()
- Sets the current iteration position to the beginning of the text
(i.e., the CharacterIterator's starting offset).
- Overrides:
first
in class RuleBasedBreakIterator
- Returns:
- The offset of the beginning of the text.
last
public int last()
- Sets the current iteration position to the end of the text
(i.e., the CharacterIterator's ending offset).
- Overrides:
last
in class RuleBasedBreakIterator
- Returns:
- The text's past-the-end offset.
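The remarks about the CharacterIterator's starting and ending offsets matter when the iterator is given only a sub-range of a larger string. The sketch below uses java.text.BreakIterator from the JDK as a stand-in, on the assumption that com.ibm.text.BreakIterator follows the same contract for setText(), first(), and last(); the class name SubrangeSketch and its method are hypothetical.

```java
import java.text.BreakIterator;
import java.text.StringCharacterIterator;
import java.util.Locale;

/** Uses the JDK's java.text.BreakIterator as a stand-in (an assumption, see above). */
final class SubrangeSketch {

    /** Returns {first(), last()} for an iterator restricted to text[begin, end). */
    static int[] firstAndLast(String text, int begin, int end) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        // The CharacterIterator covers only [begin, end) of the underlying string.
        it.setText(new StringCharacterIterator(text, begin, end, begin));
        return new int[] { it.first(), it.last() };
    }

    public static void main(String[] args) {
        // Offsets 3..14 select "hello world" inside the larger string.
        int[] ends = firstAndLast("xx hello world yy", 3, 14);
        System.out.println(ends[0] + " " + ends[1]); // 3 14
    }
}
```

The returned offsets are positions in the underlying string, not positions relative to the start of the sub-range.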
previous
public int previous()
- Advances the iterator one step backwards.
- Overrides:
previous
in class RuleBasedBreakIterator
- Returns:
- The position of the last boundary before the
current iteration position
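A typical use of first(), last(), previous(), and the inherited next() is to walk every boundary in one direction. The following sketch again relies on java.text.BreakIterator from the JDK as a stand-in, assuming the com.ibm.text classes share its iteration contract, including the DONE sentinel; the class BoundaryWalkSketch and its methods are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Uses the JDK's java.text.BreakIterator as a stand-in (an assumption, see above). */
final class BoundaryWalkSketch {

    /** Collects every boundary, starting at first() and stepping with next(). */
    static List<Integer> forward(BreakIterator it) {
        List<Integer> out = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            out.add(b);
        }
        return out;
    }

    /** Collects every boundary, starting at last() and stepping with previous(). */
    static List<Integer> backward(BreakIterator it) {
        List<Integer> out = new ArrayList<>();
        for (int b = it.last(); b != BreakIterator.DONE; b = it.previous()) {
            out.add(b);
        }
        return out;
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText("Hello world");
        System.out.println(forward(words));  // [0, 5, 6, 11]
        System.out.println(backward(words)); // [11, 6, 5, 0]
    }
}
```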
preceding
public int preceding(int offset)
- Sets the current iteration position to the last boundary position
before the specified position.
- Overrides:
preceding
in class RuleBasedBreakIterator
- Parameters:
offset
- The position to begin searching from
- Returns:
- The position of the last boundary before "offset"
following
public int following(int offset)
- Sets the current iteration position to the first boundary position after
the specified position.
- Overrides:
following
in class RuleBasedBreakIterator
- Parameters:
offset
- The position to begin searching forward from
- Returns:
- The position of the first boundary after "offset"
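Taken together, following() and preceding() can locate the segment that surrounds an arbitrary offset, for example to select the word under a cursor. The sketch below is written against java.text.BreakIterator from the JDK as a stand-in, on the assumption that the methods documented here behave the same way (both searches are strict, so a boundary exactly at the given offset is not returned); FollowingPrecedingSketch and span are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Uses the JDK's java.text.BreakIterator as a stand-in (an assumption, see above). */
final class FollowingPrecedingSketch {

    /**
     * Returns {start, end} of the segment that contains the character at
     * offset, or null if there is no boundary after offset.
     */
    static int[] span(BreakIterator it, int offset) {
        int end = it.following(offset);   // first boundary strictly after offset
        if (end == BreakIterator.DONE) {
            return null;
        }
        int start = it.preceding(end);    // last boundary strictly before end
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText("Hello world");
        int[] s = span(words, 8);         // offset 8 lies inside "world"
        System.out.println(s[0] + ".." + s[1]); // 6..11
    }
}
```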
handleNext
protected int handleNext()
- This is the implementation function for next().
- Overrides:
handleNext
in class RuleBasedBreakIterator
lookupCategory
protected int lookupCategory(char c)
- Looks up a character category for a character.
- Overrides:
lookupCategory
in class RuleBasedBreakIterator
Copyright (c) 1998-2000 IBM Corporation and others.