java.lang.Object
  |
  +--com.ibm.text.UnicodeSet
A mutable set of Unicode characters. Objects of this class represent character classes used in regular expressions. Such classes specify a subset of the set of all Unicode characters, which in this implementation is the characters from U+0000 to U+FFFF, ignoring surrogates.
UnicodeSet supports two APIs. The first is the operand API that allows the caller to modify the value of a UnicodeSet object. It conforms to Java 2's java.util.Set interface, although UnicodeSet cannot actually implement that interface. All methods of Set are supported, with the modification that they take a character range or single character instead of an Object, and they take a UnicodeSet instead of a Collection. The operand API may be thought of in terms of boolean logic: a boolean OR is implemented by add, a boolean AND is implemented by retain, a boolean XOR is implemented by complement taking an argument, and a boolean NOT is implemented by complement with no argument. In terms of traditional set theory function names, add is a union, retain is an intersection, remove is an asymmetric difference, and complement with no argument is a set complement with respect to the superset range MIN_VALUE-MAX_VALUE.
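
The boolean-logic reading of the operand API can be illustrated without UnicodeSet itself. The following is a minimal sketch that uses java.util.BitSet as a stand-in over the 65536 char values; the class name OperandApiSketch and the helpers range() and demo() are hypothetical names introduced for illustration, and the comments only map each BitSet call to the UnicodeSet operation it is meant to model.

```java
import java.util.BitSet;

public class OperandApiSketch {
    // Assumed universe: the char values U+0000..U+FFFF.
    static final int UNIVERSE = 0x10000;

    // Builds a stand-in set holding the inclusive range start..end.
    static BitSet range(char start, char end) {
        BitSet s = new BitSet(UNIVERSE);
        if (start <= end) {
            s.set(start, end + 1); // BitSet.set takes an exclusive upper bound
        }
        return s;
    }

    // Applies one stand-in call per operand API operation and returns the result.
    static BitSet demo() {
        BitSet set = range('a', 'z');
        set.or(range('0', '9'));     // models add: boolean OR, union
        set.and(range('0', 'm'));    // models retain: boolean AND, intersection
        set.andNot(range('a', 'c')); // models remove: asymmetric difference
        set.xor(range('l', 'n'));    // models complement with an argument: boolean XOR
        set.flip(0, UNIVERSE);       // models complement with no argument: boolean NOT
        return set;
    }

    public static void main(String[] args) {
        BitSet result = demo();
        System.out.println("contains 'a': " + result.get('a'));
        System.out.println("size: " + result.cardinality());
    }
}
```

BitSet is used here only because it ships with the JDK and offers the same four logical operations; it says nothing about how UnicodeSet stores its contents.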
The second API is the applyPattern()/toPattern() API from the java.text.Format-derived classes. Unlike the methods that add characters, add categories, and control the logic of the set, the method applyPattern() sets all attributes of a UnicodeSet at once, based on a string pattern.
In addition, the set complement operation is supported through the complement() method.
Pattern syntax
Patterns are accepted by the constructors and the applyPattern() methods and returned by the toPattern() method. These patterns follow a syntax similar to that employed by version 8 regular expression character classes:
Any character may be preceded by a backslash in order to remove any special meaning. White space characters, as defined by Character.isWhitespace(), are ignored, unless they are escaped. Patterns specify individual characters, ranges of characters, and Unicode character categories. When elements are concatenated, they specify their union. To complement a set, place a '^' immediately after the opening '[' or '[:'. In any other location, '^' has no special meaning.
pattern :=       ('[' '^'? item* ']') | ('[:' '^'? category ':]')
item :=          char | (char '-' char) | pattern-expr
pattern-expr :=  pattern | pattern-expr pattern | pattern-expr op pattern
op :=            '&' | '-'
special :=       '[' | ']' | '-'
char :=          any character that is not special
                 | ('\' any character)
                 | ('\u' hex hex hex hex)
hex :=           any character for which Character.digit(c, 16) returns a non-negative result
category :=      'M' | 'N' | 'Z' | 'C' | 'L' | 'P' | 'S' | 'Mn' | 'Mc' | 'Me' | 'Nd' | 'Nl' | 'No' | 'Zs' | 'Zl' | 'Zp' | 'Cc' | 'Cf' | 'Cs' | 'Co' | 'Cn' | 'Lu' | 'Ll' | 'Lt' | 'Lm' | 'Lo' | 'Pc' | 'Pd' | 'Ps' | 'Pe' | 'Po' | 'Sm' | 'Sc' | 'Sk' | 'So'
Legend:
a := b    a may be replaced by b
a?        zero or one instance of a
a*        zero or more instances of a
a | b     either a or b
'a'       the literal string between the quotes
Ranges are indicated by placing a '-' between two characters, as in "a-z". This specifies the range of all characters from the left to the right, in Unicode order. If the left and right characters are the same, then the range consists of just that character. If the left character is greater than the right character, it is a syntax error. If a '-' occurs as the first character after the opening '[' or '[^', or if it occurs as the last character before the closing ']', then it is taken as a literal. Thus "[a\-b]", "[-ab]", and "[ab-]" all indicate the same set of three characters, 'a', 'b', and '-'.
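
The range and literal '-' rules can be made concrete with a small parser. The following is a minimal sketch under stated assumptions, not the parser used by UnicodeSet: it handles only flat patterns (no nested sets, categories, \u escapes, or whitespace skipping), it stores the result in a java.util.BitSet, and it treats an unescaped '-' that has no character on one side as a syntax error, which is a choice made for this sketch. The names FlatPatternSketch and parse() are hypothetical.

```java
import java.util.BitSet;

public class FlatPatternSketch {
    static final int UNIVERSE = 0x10000; // char values U+0000..U+FFFF

    // Parses a flat pattern such as "[a-z]", "[^a-z]" or "[a\-b]" into a BitSet.
    static BitSet parse(String p) {
        int end = p.length() - 1; // index of the closing ']'
        if (end < 1 || p.charAt(0) != '[' || p.charAt(end) != ']') {
            throw new IllegalArgumentException("pattern must be enclosed in [ ]");
        }
        int i = 1;
        boolean invert = i < end && p.charAt(i) == '^';
        if (invert) i++;
        int first = i; // position right after '[' or '[^'
        BitSet set = new BitSet(UNIVERSE);
        while (i < end) {
            char lo = p.charAt(i);
            if (lo == '\\') {
                // A backslash removes any special meaning from the next character.
                if (++i >= end) throw new IllegalArgumentException("dangling backslash");
                lo = p.charAt(i);
            } else if (lo == '-' && i != first && i != end - 1) {
                // Sketch assumption: '-' is literal only in first or last position.
                throw new IllegalArgumentException("'-' needs a character on each side");
            }
            i++;
            // A '-' followed by at least one more character marks a range.
            if (i < end - 1 && p.charAt(i) == '-') {
                i++; // skip the '-'
                char hi = p.charAt(i);
                if (hi == '\\') {
                    if (++i >= end) throw new IllegalArgumentException("dangling backslash");
                    hi = p.charAt(i);
                }
                i++;
                if (lo > hi) throw new IllegalArgumentException("range start exceeds range end");
                set.set(lo, hi + 1); // exclusive upper bound
            } else {
                set.set(lo);
            }
        }
        if (invert) set.flip(0, UNIVERSE);
        return set;
    }

    public static void main(String[] args) {
        // The three spellings from the text should describe the same three characters.
        System.out.println(parse("[a\\-b]").equals(parse("[-ab]")));
        System.out.println(parse("[-ab]").equals(parse("[ab-]")));
    }
}
```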
Sets may be intersected using the '&' operator, or the asymmetric set difference may be taken using the '-' operator. For example, "[[:L:]&[\u0000-\u0FFF]]" indicates the set of all Unicode letters with values less than 4096. Operators ('&' and '-') have equal precedence and bind left-to-right. Thus "[[:L:]-[a-z]-[\u0100-\u01FF]]" is equivalent to "[[[:L:]-[a-z]]-[\u0100-\u01FF]]". This only really matters for difference; intersection is commutative.
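
To see why grouping matters for difference, the following minimal sketch compares the two possible groupings of a chained difference. It again uses java.util.BitSet as a stand-in for the sets; the class name DifferenceOrderSketch and the helpers range() and minus() are hypothetical, and the three ranges are chosen for illustration only.

```java
import java.util.BitSet;

public class DifferenceOrderSketch {
    // Builds a stand-in set holding the inclusive range start..end.
    static BitSet range(char start, char end) {
        BitSet s = new BitSet(0x10000);
        if (start <= end) {
            s.set(start, end + 1);
        }
        return s;
    }

    // Returns a new set holding left minus right; the operands are not modified.
    static BitSet minus(BitSet left, BitSet right) {
        BitSet result = (BitSet) left.clone();
        result.andNot(right);
        return result;
    }

    public static void main(String[] args) {
        BitSet a = range('a', 'z');
        BitSet b = range('a', 'm');
        BitSet c = range('a', 'c');

        // Left-to-right grouping, as described for "[[a-z]-[a-m]-[a-c]]".
        BitSet leftToRight = minus(minus(a, b), c);
        // The other grouping, which the pattern syntax does not use.
        BitSet rightFirst = minus(a, minus(b, c));

        System.out.println(leftToRight.cardinality());
        System.out.println(rightFirst.cardinality());
    }
}
```

With left-to-right grouping only 'n' through 'z' remain, whereas the other grouping would also keep 'a' through 'c'.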
[a] | The set containing 'a' |
[a-z] | The set containing 'a' through 'z' and all letters in between, in Unicode order |
[^a-z] | The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+FFFF |
[[pat1][pat2]] | The union of sets specified by pat1 and pat2 |
[[pat1]&[pat2]] | The intersection of sets specified by pat1 and pat2 |
[[pat1]-[pat2]] | The asymmetric difference of sets specified by pat1 and pat2 |
[:Lu:] | The set of characters belonging to the given Unicode category, as defined by Character.getType(); in this case, Unicode uppercase letters |
[:L:] | The set of characters belonging to all Unicode categories starting with 'L', that is, [[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]] |
Character categories.
Character categories are specified using the POSIX-like syntax '[:Lu:]'. The complement of a category is specified by inserting '^' after the opening '[:'. The following category names are recognized. Actual determination of category data uses Character.getType(), so it reflects the underlying implementation used by Character. As of Java 2 and JDK 1.1.8, this is Unicode 2.1.2.
Normative
  Mn = Mark, Non-Spacing
  Mc = Mark, Spacing Combining
  Me = Mark, Enclosing
  Nd = Number, Decimal Digit
  Nl = Number, Letter
  No = Number, Other
  Zs = Separator, Space
  Zl = Separator, Line
  Zp = Separator, Paragraph
  Cc = Other, Control
  Cf = Other, Format
  Cs = Other, Surrogate
  Co = Other, Private Use
  Cn = Other, Not Assigned
Informative
  Lu = Letter, Uppercase
  Ll = Letter, Lowercase
  Lt = Letter, Titlecase
  Lm = Letter, Modifier
  Lo = Letter, Other
  Pc = Punctuation, Connector
  Pd = Punctuation, Dash
  Ps = Punctuation, Open
  Pe = Punctuation, Close
  *Pi = Punctuation, Initial quote
  *Pf = Punctuation, Final quote
  Po = Punctuation, Other
  Sm = Symbol, Math
  Sc = Symbol, Currency
  Sk = Symbol, Modifier
  So = Symbol, Other
* Unsupported by Java (and hence unsupported by UnicodeSet).
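
Since category membership is delegated to Character.getType(), the lookup can be tried directly against the JDK. The following is a minimal sketch: the class name CategorySketch and both helper methods are hypothetical, while Character.getType(char) and the category constants are standard java.lang.Character members. Results reflect the Unicode version of whichever JDK runs the code, which may differ from the version named above.

```java
public class CategorySketch {
    // True if c belongs to the single category given by a Character type constant,
    // which is the test a pattern such as [:Lu:] relies on.
    static boolean inCategory(char c, int type) {
        return Character.getType(c) == type;
    }

    // True if c belongs to any category starting with 'L', mirroring [:L:],
    // that is, the union of Lu, Ll, Lt, Lm and Lo.
    static boolean isLetterCategory(char c) {
        switch (Character.getType(c)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(inCategory('A', Character.UPPERCASE_LETTER));
        System.out.println(inCategory('5', Character.DECIMAL_DIGIT_NUMBER));
        System.out.println(isLetterCategory('a'));
        System.out.println(isLetterCategory('5'));
    }
}
```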
Field Summary
static char | MAX_VALUE | Maximum value that can be stored in a UnicodeSet. |
static char | MIN_VALUE | Minimum value that can be stored in a UnicodeSet. |
Constructor Summary
UnicodeSet() | Constructs an empty set. |
UnicodeSet(char start, char end) | Constructs a set containing the given range. |
UnicodeSet(int category) | Constructs a set from the given Unicode character category. |
UnicodeSet(java.lang.String pattern) | Constructs a set from the given pattern. |
UnicodeSet(java.lang.String pattern, boolean ignoreWhitespace) | Constructs a set from the given pattern. |
UnicodeSet(java.lang.String pattern, java.text.ParsePosition pos, SymbolTable symbols) | Constructs a set from the given pattern. |
UnicodeSet(UnicodeSet other) | Constructs a copy of an existing set. |
Method Summary
void | add(char c) | Adds the specified character to this set if it is not already present. |
void | add(char start, char end) | Adds the specified range to this set if it is not already present. |
void | addAll(UnicodeSet c) | Adds all of the elements in the specified set to this set if they're not already present. |
void | applyPattern(java.lang.String pattern) | Modifies this set to represent the set specified by the given pattern. |
void | applyPattern(java.lang.String pattern, boolean ignoreWhitespace) | Modifies this set to represent the set specified by the given pattern, optionally ignoring whitespace. |
void | clear() | Removes all of the elements from this set. |
void | compact() | Reallocates this object's internal structures to take up the least possible space, without changing this object's value. |
void | complement() | Inverts this set. |
void | complement(char c) | Complements the specified character in this set. |
void | complement(char start, char end) | Complements the specified range in this set. |
void | complementAll(UnicodeSet c) | Complements in this set all elements contained in the specified set. |
boolean | contains(char c) | Returns true if this set contains the specified char. |
boolean | contains(char start, char end) | Returns true if this set contains every character in the specified range of chars. |
boolean | containsAll(UnicodeSet c) | Returns true if the specified set is a subset of this set. |
boolean | containsIndexValue(int v) | Returns true if this set contains any character whose low byte is the given value. |
boolean | equals(java.lang.Object o) | Compares the specified object with this set for equality. |
int | getRangeCount() | Iteration method that returns the number of ranges contained in this set. |
char | getRangeEnd(int index) | Iteration method that returns the last character in the specified range of this set. |
char | getRangeStart(int index) | Iteration method that returns the first character in the specified range of this set. |
int | hashCode() | Returns the hash code value for this set. |
boolean | isEmpty() | Returns true if this set contains no elements. |
void | remove(char c) | Removes the specified character from this set if it is present. |
void | remove(char start, char end) | Removes the specified range from this set if it is present. |
void | removeAll(UnicodeSet c) | Removes from this set all of its elements that are contained in the specified set. |
void | retain(char c) | Retains only the specified character in this set, if it is present. |
void | retain(char start, char end) | Retains only the elements in this set that are contained in the specified range. |
void | retainAll(UnicodeSet c) | Retains only the elements in this set that are contained in the specified set. |
void | set(char start, char end) | Makes this object represent the range start - end. |
void | set(UnicodeSet other) | Makes this object represent the same set as other. |
int | size() | Returns the number of elements in this set (its cardinality), n, where 0 <= n <= 65536. |
java.lang.String | toPattern() | Returns a string representation of this set. |
java.lang.String | toString() | Returns a programmer-readable string representation of this object. |
Methods inherited from class java.lang.Object |
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Field Detail

public static final char MIN_VALUE
Minimum value that can be stored in a UnicodeSet.

public static final char MAX_VALUE
Maximum value that can be stored in a UnicodeSet.

Constructor Detail

public UnicodeSet()
Constructs an empty set.

public UnicodeSet(UnicodeSet other)
Constructs a copy of an existing set.

public UnicodeSet(char start, char end)
Constructs a set containing the given range. If end < start then an empty set is created.
Parameters:
start - first character, inclusive, of range
end - last character, inclusive, of range

public UnicodeSet(java.lang.String pattern)
Constructs a set from the given pattern.
Parameters:
pattern - a string specifying what characters are in the set
Throws:
java.lang.IllegalArgumentException - if the pattern contains a syntax error.

public UnicodeSet(java.lang.String pattern, boolean ignoreWhitespace)
Constructs a set from the given pattern.
Parameters:
pattern - a string specifying what characters are in the set
ignoreWhitespace - if true, ignore characters for which Character.isWhitespace() returns true
Throws:
java.lang.IllegalArgumentException - if the pattern contains a syntax error.

public UnicodeSet(java.lang.String pattern, java.text.ParsePosition pos, SymbolTable symbols)
Constructs a set from the given pattern.
Parameters:
pattern - a string specifying what characters are in the set
pos - on input, the position in pattern at which to start parsing. On output, the position after the last character parsed.
symbols - a symbol table mapping variables to char[] arrays and chars to UnicodeSets
Throws:
java.lang.IllegalArgumentException - if the pattern contains a syntax error.

public UnicodeSet(int category)
Constructs a set from the given Unicode character category.
Parameters:
category - an integer indicating the character category as returned by Character.getType().
Throws:
java.lang.IllegalArgumentException - if the given category is invalid.

Method Detail

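public void set(char start, char end)
Makes this object represent the range start - end. If end < start then this object is set to an empty range.
Parameters:
start - first character in the set, inclusive
end - last character in the set, inclusive

public void set(UnicodeSet other)
Makes this object represent the same set as other.
Parameters:
other - a UnicodeSet whose value will be copied to this object

public final void applyPattern(java.lang.String pattern)
Modifies this set to represent the set specified by the given pattern.
Parameters:
pattern - a string specifying what characters are in the set
Throws:
java.lang.IllegalArgumentException - if the pattern contains a syntax error.

public void applyPattern(java.lang.String pattern, boolean ignoreWhitespace)
Modifies this set to represent the set specified by the given pattern, optionally ignoring whitespace.
Parameters:
pattern - a string specifying what characters are in the set
ignoreWhitespace - if true then characters for which Character.isWhitespace() returns true are ignored
Throws:
java.lang.IllegalArgumentException - if the pattern contains a syntax error.

public java.lang.String toPattern()
Returns a string representation of this set.

public int size()
Returns the number of elements in this set (its cardinality), n, where 0 <= n <= 65536.

public boolean isEmpty()
Returns true if this set contains no elements.

public boolean contains(char start, char end)
Returns true if this set contains every character in the specified range of chars. If end < start then the results of this method are undefined.

public boolean contains(char c)
Returns true if this set contains the specified char.
Specified by:
contains in interface UnicodeFilter

public boolean containsIndexValue(int v)
Returns true if this set contains any character whose low byte is the given value.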
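
The "low byte" test can be read as: is there any member c of the set with (c & 0xFF) == v. The following is a minimal sketch of that reading, assuming a java.util.BitSet stand-in for the set and assuming that values of v outside 0..255 never match; the class name IndexValueSketch is hypothetical and the probing strategy is chosen for this sketch, not taken from UnicodeSet.

```java
import java.util.BitSet;

public class IndexValueSketch {
    // Returns true if the stand-in set holds any char value whose low byte equals v.
    // Sketch assumption: v outside 0..255 can never match.
    static boolean containsIndexValue(BitSet set, int v) {
        if (v < 0 || v > 0xFF) {
            return false;
        }
        // Probe the 256 char values that share the low byte v: v, v + 0x100, v + 0x200, ...
        for (int c = v; c < 0x10000; c += 0x100) {
            if (set.get(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        BitSet set = new BitSet(0x10000);
        set.set(0x0461); // high byte 0x04, low byte 0x61
        System.out.println(containsIndexValue(set, 0x61));
        System.out.println(containsIndexValue(set, 0x04));
    }
}
```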
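
public void add(char start, char end)
Adds the specified range to this set if it is not already present. If end < start then an empty range is added, leaving the set unchanged.
Parameters:
start - first character, inclusive, of range to be added to this set.
end - last character, inclusive, of range to be added to this set.

public final void add(char c)
Adds the specified character to this set if it is not already present.

public void retain(char start, char end)
Retains only the elements in this set that are contained in the specified range. If end < start then an empty range is retained, leaving the set empty.
Parameters:
start - first character, inclusive, of range to be retained in this set.
end - last character, inclusive, of range to be retained in this set.

public final void retain(char c)
Retains only the specified character in this set, if it is present.

public void remove(char start, char end)
Removes the specified range from this set if it is present. If end < start then an empty range is removed, leaving the set unchanged.
Parameters:
start - first character, inclusive, of range to be removed from this set.
end - last character, inclusive, of range to be removed from this set.

public final void remove(char c)
Removes the specified character from this set if it is present.

public void complement(char start, char end)
Complements the specified range in this set. If end < start then an empty range is complemented, leaving the set unchanged.
Parameters:
start - first character, inclusive, of range to be complemented in this set.
end - last character, inclusive, of range to be complemented in this set.

public final void complement(char c)
Complements the specified character in this set.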
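
public void complement()
Inverts this set. This is equivalent to complement(MIN_VALUE, MAX_VALUE).

public boolean containsAll(UnicodeSet c)
Returns true if the specified set is a subset of this set.
Parameters:
c - set to be checked for containment in this set.

public void addAll(UnicodeSet c)
Adds all of the elements in the specified set to this set if they're not already present.
Parameters:
c - set whose elements are to be added to this set.
See Also:
add(char, char)

public void retainAll(UnicodeSet c)
Retains only the elements in this set that are contained in the specified set.
Parameters:
c - set that defines which elements this set will retain.

public void removeAll(UnicodeSet c)
Removes from this set all of its elements that are contained in the specified set.
Parameters:
c - set that defines which elements will be removed from this set.

public void complementAll(UnicodeSet c)
Complements in this set all elements contained in the specified set.
Parameters:
c - set that defines which elements will be complemented in this set.

public void clear()
Removes all of the elements from this set.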
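
public int getRangeCount()
Iteration method that returns the number of ranges contained in this set.
See Also:
getRangeStart(int), getRangeEnd(int)

public char getRangeStart(int index)
Iteration method that returns the first character in the specified range of this set.
Throws:
ArrayIndexOutOfBoundsException - if index is outside the range 0..getRangeCount()-1
See Also:
getRangeCount(), getRangeEnd(int)

public char getRangeEnd(int index)
Iteration method that returns the last character in the specified range of this set.
Throws:
ArrayIndexOutOfBoundsException - if index is outside the range 0..getRangeCount()-1
See Also:
getRangeStart(int), getRangeEnd(int)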
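
The three iteration methods describe a set as an ordered list of inclusive ranges. The following is a minimal sketch of that view, assuming a java.util.BitSet stand-in and assuming that a "range" means a maximal run of consecutive members; the class name RangeIterationSketch and its helpers are hypothetical. It also shows how a cardinality such as the one returned by size() can be derived from the ranges.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class RangeIterationSketch {
    // Returns the maximal runs of consecutive members as {start, end} pairs,
    // both inclusive, in ascending order.
    static List<int[]> ranges(BitSet set) {
        List<int[]> out = new ArrayList<>();
        int start = set.nextSetBit(0);
        while (start >= 0) {
            int end = set.nextClearBit(start) - 1; // last member of this run
            out.add(new int[] {start, end});
            start = set.nextSetBit(end + 1);       // -1 when no run follows
        }
        return out;
    }

    // Sums the lengths of the inclusive ranges to obtain the number of members.
    static int size(List<int[]> ranges) {
        int n = 0;
        for (int[] r : ranges) {
            n += r[1] - r[0] + 1;
        }
        return n;
    }

    public static void main(String[] args) {
        BitSet set = new BitSet(0x10000);
        set.set('a', 'z' + 1);
        set.set('0', '9' + 1);
        set.set('_');

        List<int[]> ranges = ranges(set);
        // Loop shaped like getRangeCount() / getRangeStart(i) / getRangeEnd(i).
        for (int i = 0; i < ranges.size(); i++) {
            System.out.println((char) ranges.get(i)[0] + ".." + (char) ranges.get(i)[1]);
        }
        System.out.println(size(ranges));
    }
}
```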
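
public void compact()
Reallocates this object's internal structures to take up the least possible space, without changing this object's value.

public boolean equals(java.lang.Object o)
Compares the specified object with this set for equality.
Overrides:
equals in class java.lang.Object
Parameters:
o - Object to be compared for equality with this set.

public int hashCode()
Returns the hash code value for this set.
Overrides:
hashCode in class java.lang.Object
See Also:
Object.hashCode()

public java.lang.String toString()
Returns a programmer-readable string representation of this object.
Overrides:
toString in class java.lang.Object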