:: |
Within a nested [] expression, a pair of colons containing a one- or
two-letter code matches all characters in the corresponding Unicode category.
The :: expression has to be the only thing inside the [] expression. The two-letter codes
are the same as the two-letter codes in the Unicode database (for example,
"[[:Sc:][:Sm:]]" matches all currency symbols and all math symbols).
Specifying a one-letter code is the same as specifying all two-letter codes that begin
with that letter (for example, "[[:L:]]" matches all letters, and is equivalent
to "[[:Lu:][:Ll:][:Lo:][:Lm:][:Lt:]]"). Anything other than a valid
two-letter Unicode category code or a single letter that begins a valid Unicode category
code is illegal within the colons. |
| |
Two nested [] expressions juxtaposed or separated only by a | character
are merged together into a single [] expression matching all the characters in either
of the original [] expressions (e.g., "[[ab][bc]]" is equivalent to "[abc]", and so
is "[[ab]|[bc]]"). NOTE: "[ab][bc]" is NOT the same thing as "[[ab][bc]]".
The first expression will match two characters: an a or b followed by either another
b or a c. The second expression will match a single character, which may be a, b, or c.
The nesting is required for the expressions to merge together. |
& |
Two nested [] expressions with only & between them will match any
character that appears in both nested [] expressions (this is a set intersection).
(e.g., "[[ab]&[bc]]" will only match the letter b.) |
- |
Two nested [] expressions with - between them will match any
character that appears in the first nested [] expression but not the
second one (this is an asymmetrical set difference). (e.g., "[[:Sc:]-[$]]"
matches any currency symbol except the dollar sign. "[[ab]-[bc]]" will match
only the letter a. This has exactly the same effect as "[[ab]&[^bc]]".) |
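The merge, intersection, and difference rules above follow ordinary set semantics. The following is a minimal sketch of those semantics using java.util sets and Character.getType, not the rule parser itself; the class and method names (SetExpressionSketch, chars, union, intersection, difference, isCurrencyExceptDollar) are hypothetical and exist only for illustration.

```java
import java.util.TreeSet;

final class SetExpressionSketch {
    // Builds the set of characters matched by a simple [] expression such as "[ab]".
    static TreeSet<Character> chars(String members) {
        TreeSet<Character> set = new TreeSet<>();
        for (char c : members.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    // Models "[[x][y]]" and "[[x]|[y]]": characters in either set.
    static TreeSet<Character> union(TreeSet<Character> a, TreeSet<Character> b) {
        TreeSet<Character> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // Models "[[x]&[y]]": characters in both sets.
    static TreeSet<Character> intersection(TreeSet<Character> a, TreeSet<Character> b) {
        TreeSet<Character> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Models "[[x]-[y]]": characters in the first set but not the second.
    static TreeSet<Character> difference(TreeSet<Character> a, TreeSet<Character> b) {
        TreeSet<Character> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    // Models "[[:Sc:]-[$]]": any currency symbol except the dollar sign.
    static boolean isCurrencyExceptDollar(char c) {
        return Character.getType(c) == Character.CURRENCY_SYMBOL && c != '$';
    }

    public static void main(String[] args) {
        System.out.println(union(chars("ab"), chars("bc")));        // [a, b, c]
        System.out.println(intersection(chars("ab"), chars("bc"))); // [b]
        System.out.println(difference(chars("ab"), chars("bc")));   // [a]
        System.out.println(isCurrencyExceptDollar('\u20AC'));       // true (euro sign)
        System.out.println(isCurrencyExceptDollar('$'));            // false
    }
}
```

The difference case also shows why "[[ab]-[bc]]" and "[[ab]&[^bc]]" agree: removing every member of the second set is the same as keeping only characters outside it.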
For a more complete explanation, see http://www.ibm.com/java/education/boundaries/boundaries.html.
For examples, see the resource data (which is annotated).
- Author:
- Richard Gillam
$RCSfile: RuleBasedBreakIterator.java,v $ $Revision: 1.11 $ $Date: 2001/02/06 22:37:30 $
Inner Class Summary |
protected class |
RuleBasedBreakIterator.Builder
The Builder class has the job of constructing a RuleBasedBreakIterator from a
textual description. |
Field Summary |
protected static byte |
IGNORE
A token used as a character-category value to identify ignore characters |
Constructor Summary |
RuleBasedBreakIterator(java.lang.String description)
Constructs a RuleBasedBreakIterator according to the description
provided. |
Method Summary |
protected static void |
checkOffset(int offset,
java.text.CharacterIterator text)
Throw IllegalArgumentException unless begin <= offset < end. |
java.lang.Object |
clone()
Clones this iterator. |
int |
current()
Returns the current iteration position. |
static void |
debugPrintln(java.lang.String s)
|
boolean |
equals(java.lang.Object that)
Returns true if both BreakIterators are of the same class, have the same
rules, and iterate over the same text. |
int |
first()
Sets the current iteration position to the beginning of the text. |
int |
following(int offset)
Sets the iterator to refer to the first boundary position following
the specified position. |
java.text.CharacterIterator |
getText()
Return a CharacterIterator over the text being analyzed. |
protected int |
handleNext()
This method is the actual implementation of the next() method. |
protected int |
handlePrevious()
This method backs the iterator up to a "safe position" in the text. |
int |
hashCode()
Computes a hash code for this BreakIterator. |
boolean |
isBoundary(int offset)
Returns true if the specified position is a boundary position. |
int |
last()
Sets the current iteration position to the end of the text. |
protected int |
lookupBackwardState(int state,
int category)
Given a current state and a character category, looks up the
next state to transition to in the backwards state table. |
protected int |
lookupCategory(char c)
Looks up a character's category (i.e., its category for breaking purposes,
not its Unicode category) |
protected int |
lookupState(int state,
int category)
Given a current state and a character category, looks up the
next state to transition to in the state table. |
protected RuleBasedBreakIterator.Builder |
makeBuilder()
Creates a Builder. |
int |
next()
Advances the iterator to the next boundary position. |
int |
next(int n)
Advances the iterator either forward or backward the specified number of steps. |
int |
preceding(int offset)
Sets the iterator to refer to the last boundary position before the
specified position. |
int |
previous()
Advances the iterator backwards, to the last boundary preceding this one. |
void |
setText(java.text.CharacterIterator newText)
Set the iterator to analyze a new piece of text. |
java.lang.String |
toString()
Returns the description used to create this iterator |
protected void |
writeSwappedInt(int x,
java.io.DataOutputStream out,
boolean littleEndian)
|
protected void |
writeSwappedShort(short x,
java.io.DataOutputStream out,
boolean littleEndian)
|
void |
writeTablesToFile(java.io.FileOutputStream file,
boolean littleEndian)
|
Methods inherited from class java.lang.Object |
finalize, getClass, notify, notifyAll, wait, wait, wait |
IGNORE
protected static final byte IGNORE
- A token used as a character-category value to identify ignore characters
RuleBasedBreakIterator
public RuleBasedBreakIterator(java.lang.String description)
- Constructs a RuleBasedBreakIterator according to the description
provided. If the description is malformed, throws an
IllegalArgumentException. Normally, instead of constructing a
RuleBasedBreakIterator directly, you'll use the factory methods
on BreakIterator to create one indirectly from a description
in the framework's resource files. You'd use this when you want
special behavior not provided by the built-in iterators.
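The factory route mentioned above can be sketched as follows. This example uses the JDK's java.text.BreakIterator as a stand-in, which is an assumption: the BreakIterator class this page refers to may live in a different package and may differ in detail. The class name FactoryUsageSketch and the helper wordBoundaries are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class FactoryUsageSketch {
    // Collects every boundary offset reported by a factory-created word iterator.
    static List<Integer> wordBoundaries(String text) {
        // Assumption: the JDK factory method stands in for the framework's own factory.
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        // first() moves to the start of the text; next() returns DONE past the last boundary.
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            boundaries.add(b);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        System.out.println(wordBoundaries("Hello world")); // [0, 5, 6, 11]
    }
}
```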
makeBuilder
protected RuleBasedBreakIterator.Builder makeBuilder()
- Creates a Builder.
clone
public java.lang.Object clone()
- Clones this iterator.
- Overrides:
clone
in class BreakIterator
- Returns:
- A newly-constructed RuleBasedBreakIterator with the same
behavior as this one.
equals
public boolean equals(java.lang.Object that)
- Returns true if both BreakIterators are of the same class, have the same
rules, and iterate over the same text.
- Overrides:
equals
in class java.lang.Object
toString
public java.lang.String toString()
- Returns the description used to create this iterator
- Overrides:
toString
in class java.lang.Object
hashCode
public int hashCode()
- Computes a hash code for this BreakIterator.
- Overrides:
hashCode
in class java.lang.Object
- Returns:
- A hash code
writeTablesToFile
public void writeTablesToFile(java.io.FileOutputStream file,
boolean littleEndian)
throws java.io.IOException
writeSwappedShort
protected void writeSwappedShort(short x,
java.io.DataOutputStream out,
boolean littleEndian)
throws java.io.IOException
writeSwappedInt
protected void writeSwappedInt(int x,
java.io.DataOutputStream out,
boolean littleEndian)
throws java.io.IOException
first
public int first()
- Sets the current iteration position to the beginning of the text.
(i.e., the CharacterIterator's starting offset).
- Overrides:
first
in class BreakIterator
- Returns:
- The offset of the beginning of the text.
last
public int last()
- Sets the current iteration position to the end of the text.
(i.e., the CharacterIterator's ending offset).
- Overrides:
last
in class BreakIterator
- Returns:
- The text's past-the-end offset.
next
public int next(int n)
- Advances the iterator either forward or backward the specified number of steps.
Negative values move backward, and positive values move forward. This is
equivalent to repeatedly calling next() or previous().
- Overrides:
next
in class BreakIterator
- Parameters:
n
- The number of steps to move. The sign indicates the direction
(negative is backwards, and positive is forwards).
- Returns:
- The character offset of the boundary position n boundaries away from
the current one.
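The equivalence between next(n) and repeated single steps can be checked with a short sketch. It uses the JDK's java.text.BreakIterator as a stand-in for the class documented here, which is an assumption; NextStepsSketch, wordsOver, and stepOneAtATime are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class NextStepsSketch {
    // Returns a word iterator over the text, positioned at the first boundary.
    static BreakIterator wordsOver(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        it.first();
        return it;
    }

    // Moves |n| boundaries by calling next() or previous() one step at a time.
    static int stepOneAtATime(BreakIterator it, int n) {
        int pos = it.current();
        for (int i = 0; i < Math.abs(n) && pos != BreakIterator.DONE; i++) {
            pos = (n > 0) ? it.next() : it.previous();
        }
        return pos;
    }

    public static void main(String[] args) {
        // Word boundaries in this text fall at 0, 3, 4, 7, 8, 13.
        BreakIterator a = wordsOver("one two three");
        BreakIterator b = wordsOver("one two three");
        System.out.println(a.next(2) == stepOneAtATime(b, 2));   // true
        System.out.println(a.next(-1) == stepOneAtATime(b, -1)); // true
    }
}
```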
next
public int next()
- Advances the iterator to the next boundary position.
- Overrides:
next
in class BreakIterator
- Returns:
- The position of the first boundary after this one.
previous
public int previous()
- Advances the iterator backwards, to the last boundary preceding this one.
- Overrides:
previous
in class BreakIterator
- Returns:
- The position of the last boundary position preceding this one.
checkOffset
protected static final void checkOffset(int offset,
java.text.CharacterIterator text)
- Throw IllegalArgumentException unless begin <= offset < end.
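The documented condition can be illustrated with a minimal re-implementation. This is a sketch that follows the stated rule (begin <= offset < end) literally; it is not the class's actual source, and CheckOffsetSketch and accepts are hypothetical names.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

final class CheckOffsetSketch {
    // Throws unless begin <= offset < end, as the description above states.
    static void checkOffset(int offset, CharacterIterator text) {
        if (offset < text.getBeginIndex() || offset >= text.getEndIndex()) {
            throw new IllegalArgumentException("offset out of bounds: " + offset);
        }
    }

    // Convenience wrapper for the demo: true if checkOffset does not throw.
    static boolean accepts(int offset, CharacterIterator text) {
        try {
            checkOffset(offset, text);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        CharacterIterator text = new StringCharacterIterator("abc"); // begin = 0, end = 3
        System.out.println(accepts(0, text));  // true
        System.out.println(accepts(2, text));  // true
        System.out.println(accepts(3, text));  // false: the end offset itself is rejected
        System.out.println(accepts(-1, text)); // false
    }
}
```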
following
public int following(int offset)
- Sets the iterator to refer to the first boundary position following
the specified position.
- Overrides:
following
in class BreakIterator
- Returns:
- The position of the first break after the specified position.
preceding
public int preceding(int offset)
- Sets the iterator to refer to the last boundary position before the
specified position.
- Overrides:
preceding
in class BreakIterator
- Returns:
- The position of the last boundary before the starting position.
isBoundary
public boolean isBoundary(int offset)
- Returns true if the specified position is a boundary position. As a side
effect, leaves the iterator pointing to the first boundary position at
or after "offset".
- Overrides:
isBoundary
in class BreakIterator
- Parameters:
offset
- The offset to check.
- Returns:
- True if "offset" is a boundary position.
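The random-access methods following(), preceding(), and isBoundary() can be tried together in a short sketch. It uses the JDK's java.text.BreakIterator as a stand-in, which is an assumption, and it deliberately avoids relying on where the iterator is left after isBoundary(). RandomAccessSketch and wordsOver are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class RandomAccessSketch {
    // Returns a word iterator over the given text.
    static BreakIterator wordsOver(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        return it;
    }

    public static void main(String[] args) {
        // Word boundaries in "Hello world" fall at 0, 5, 6, 11.
        BreakIterator it = wordsOver("Hello world");
        System.out.println(it.following(2));  // 5: first boundary after offset 2
        System.out.println(it.preceding(5));  // 0: last boundary before offset 5
        System.out.println(it.isBoundary(5)); // true
        System.out.println(it.isBoundary(2)); // false: offset 2 is inside "Hello"
    }
}
```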
current
public int current()
- Returns the current iteration position.
- Overrides:
current
in class BreakIterator
- Returns:
- The current iteration position.
getText
public java.text.CharacterIterator getText()
- Return a CharacterIterator over the text being analyzed. This version
of this method returns the actual CharacterIterator we're using internally.
Changing the state of this iterator can have undefined consequences. If
you need to change it, clone it first.
- Overrides:
getText
in class BreakIterator
- Returns:
- An iterator over the text being analyzed.
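The advice to clone before changing the returned iterator might look like this in practice. The sketch uses the JDK's java.text.BreakIterator as a stand-in, which is an assumption; GetTextCloneSketch and independentCopy are hypothetical names.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

final class GetTextCloneSketch {
    // Returns a copy of the iterator's text that can be repositioned safely.
    static CharacterIterator independentCopy(BreakIterator it) {
        // CharacterIterator declares clone(), so the copy has its own index.
        return (CharacterIterator) it.getText().clone();
    }

    public static void main(String[] args) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText("Hello world");
        it.first();

        CharacterIterator copy = independentCopy(it);
        copy.setIndex(3); // moves only the copy

        System.out.println(copy.getIndex()); // 3
        System.out.println(it.current());    // 0: the break iterator is undisturbed
    }
}
```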
setText
public void setText(java.text.CharacterIterator newText)
- Set the iterator to analyze a new piece of text. This function resets
the current iteration position to the beginning of the text.
- Overrides:
setText
in class BreakIterator
- Parameters:
newText
- An iterator over the text to analyze.
handleNext
protected int handleNext()
- This method is the actual implementation of the next() method. All iteration
is routed through here. This method initializes the state machine to state 1
and advances through the text character by character until we reach the end
of the text or the state machine transitions to state 0. We update our return
value every time the state machine passes through a possible end state.
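The loop described above can be sketched with a toy state table. Everything in this example is an assumption made for illustration: the two categories, the four-state table, and the end-state flags are invented here and are not the tables the Builder generates. Only the overall shape (start in state 1, stop on state 0, remember the last end state passed) comes from the description.

```java
final class StateMachineSketch {
    static final int DONE = -1;  // returned when there is no text left
    static final int STOP = 0;   // state 0 ends the scan
    static final int START = 1;  // state 1 is the initial state

    // Hypothetical character categories for breaking purposes.
    static final int LETTER = 0;
    static final int OTHER = 1;

    // Hypothetical table: NEXT_STATE[state][category].
    static final int[][] NEXT_STATE = {
        {STOP, STOP}, // state 0: stop (never consulted)
        {2, 3},       // state 1: start
        {2, STOP},    // state 2: inside a run of letters
        {STOP, STOP}, // state 3: just consumed a single non-letter
    };

    // States in which the text consumed so far may end at a boundary.
    static final boolean[] IS_END_STATE = {false, false, true, true};

    static int lookupCategory(char c) {
        return Character.isLetter(c) ? LETTER : OTHER;
    }

    // Returns the next boundary after 'start', or DONE at the end of the text.
    static int handleNext(String text, int start) {
        if (start >= text.length()) {
            return DONE;
        }
        int state = START;
        int result = start;
        int pos = start;
        while (pos < text.length()) {
            state = NEXT_STATE[state][lookupCategory(text.charAt(pos))];
            if (state == STOP) {
                break; // the current character belongs to the next unit
            }
            pos++;
            if (IS_END_STATE[state]) {
                result = pos; // remember the latest possible boundary
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "ab, cd";
        for (int b = 0; b != DONE; b = handleNext(text, b)) {
            System.out.print(b + " "); // prints: 0 2 3 4 6
        }
        System.out.println();
    }
}
```

With this toy table a run of letters forms one unit and every other character forms a unit of its own, which is enough to show how the return value trails the scan position.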
handlePrevious
protected int handlePrevious()
- This method backs the iterator up to a "safe position" in the text.
This is a position that we know, without any context, must be a break position.
The various calling methods then iterate forward from this safe position to
the appropriate position to return. (For more information, see the description
of buildBackwardsStateTable() in RuleBasedBreakIterator.Builder.)
lookupCategory
protected int lookupCategory(char c)
- Looks up a character's category (i.e., its category for breaking purposes,
not its Unicode category)
lookupState
protected int lookupState(int state,
int category)
- Given a current state and a character category, looks up the
next state to transition to in the state table.
lookupBackwardState
protected int lookupBackwardState(int state,
int category)
- Given a current state and a character category, looks up the
next state to transition to in the backwards state table.
debugPrintln
public static void debugPrintln(java.lang.String s)
Copyright (c) 1998-2000 IBM Corporation and others.