
UnicodeSet
Overview
A UnicodeSet is an object that matches a set of Unicode characters. The contents of that object can be specified either by patterns or by building them programmatically.
UnicodeSet Patterns
Patterns are a series of characters bounded by square brackets that contain lists of characters and Unicode property sets. A list is a sequence of characters that may include ranges, indicated by a '-' between two characters, as in "a-z". A range denotes all characters from the left character to the right character, in Unicode order. For example, [a c d-f m] is equivalent to [a c d e f m]. Whitespace can be used freely for clarity: [a c d-f m] means the same as [acd-fm].
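To make the list and range rules concrete, here is a minimal sketch in Java. It is a toy expander written for this illustration, not ICU's parser: the class name `ListPatternSketch` and the method `expand` are hypothetical, and it assumes a single bracketed list of plain characters with no properties, escapes, quoting, or nested sets.

```java
import java.util.TreeSet;

// Toy illustration only: expands one bracketed list of literal characters and ranges.
// It is not ICU's parser and ignores properties, escapes, quoting, and nesting.
class ListPatternSketch {
    static TreeSet<Character> expand(String pattern) {
        // Drop the enclosing brackets and all whitespace, which a list ignores.
        String body = pattern.substring(1, pattern.length() - 1).replaceAll("\\s+", "");
        TreeSet<Character> result = new TreeSet<>();
        int i = 0;
        while (i < body.length()) {
            char start = body.charAt(i);
            char end = start;
            // A '-' between two characters marks a range in Unicode order.
            if (i + 2 < body.length() && body.charAt(i + 1) == '-') {
                end = body.charAt(i + 2);
                i += 3;
            } else {
                i += 1;
            }
            for (int c = start; c <= end; c++) {
                result.add((char) c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(expand("[a c d-f m]"));
        // Whitespace does not change the set.
        System.out.println(expand("[a c d-f m]").equals(expand("[acd-fm]")));
    }
}
```

Under these assumptions, both spellings of the pattern expand to the same six characters.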
Unicode property sets are specified by a Unicode property, such as [:Letter:]. ICU version 2.0 supports General Category, Script, and Numeric Value properties (ICU will support additional properties in the future). For a list of the property names, see the end of this section. The syntax for specifying the property names is an extension of either POSIX or Perl syntax with the addition of "=value". For example, you can match letters by using the POSIX syntax [:Letter:], or by using the Perl-style syntax \p{Letter}. The type can be omitted for the Category and Script properties, but is required for other properties.
The table below shows the two kinds of syntax, POSIX style and Perl style. It also shows the "Negative" form, which matches all characters that do not have the given property value. For example, [:^Letter:] matches all characters that are not in [:Letter:].
Syntax | Positive | Negative |
---|---|---|
POSIX-style Syntax | [:type=value:] | [:^type=value:] |
Perl-style Syntax | \p{type=value} | \P{type=value} |
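The Perl-style positive and negative forms can be tried out with the JDK's own regular expression engine. This is a stand-in chosen so the example runs without ICU: `java.util.regex` is a different implementation from UnicodeSet, it does not accept the POSIX-style `[:type=value:]` form, and it uses its own property data. The class name `PropertySyntaxSketch` and its helper `matches` are hypothetical.

```java
import java.util.regex.Pattern;

// java.util.regex is a stand-in here; it is a different engine from ICU's UnicodeSet.
class PropertySyntaxSketch {
    // True if the single code point, taken as a one-character string, matches the regex.
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        System.out.println(matches("\\p{Lu}", 'A'));          // positive form, type omitted
        System.out.println(matches("\\P{Lu}", 'A'));          // negative form
        System.out.println(matches("\\p{gc=Lu}", 'A'));       // explicit type=value
        System.out.println(matches("\\p{sc=Greek}", 0x03B1)); // Script property, U+03B1
    }
}
```

The negative form `\P{...}` accepts exactly the characters that the positive form `\p{...}` rejects.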
These low-level lists and property sets can then be freely combined with the normal set operations (union, inverse, difference, and intersection):
To union two sets, simply concatenate them. For example, [[:letter:] [:number:]]
To intersect two sets, use the '&' operator. For example, [[:letter:] & [a-z]]
To take the set-difference of two sets, use the '-' operator. For example, [[:letter:] - [a-z]]
To invert a set, place a '^' immediately after the opening '['. For example, [^a-z]. In any other location, the '^' does not have a special meaning.
The binary operators '&' and '-' have equal precedence and bind left-to-right. Thus [[:letter:]-[a-z]-[\u0100-\u01FF]] is equivalent to [[[:letter:]-[a-z]]-[\u0100-\u01FF]]. As another example, the set [[ace][bdf] - [abc][def]] is not the empty set, but instead the set [def], because [def] is added back after the difference has been taken. This only really matters for the difference operation, as the intersection operation is commutative.
Another caveat with the '&' and '-' operators is that they operate between sets. That is, they must be immediately preceded and immediately followed by a set. For example, the pattern [[:Lu:]-A] is illegal, since it is interpreted as the set [:Lu:] followed by the incomplete range -A. To specify the set of uppercase letters except for 'A', enclose the 'A' in a set: [[:Lu:]-[A]].
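The left-to-right evaluation described above can be traced with ordinary Java sets. This is a sketch that models the set algebra with `java.util.TreeSet`; it does not use or imitate ICU's API, and the class `SetOperatorSketch` and its methods are hypothetical names.

```java
import java.util.Set;
import java.util.TreeSet;

// Models the set algebra with plain java.util sets; this is not ICU's UnicodeSet.
class SetOperatorSketch {
    static Set<Character> of(String chars) {
        Set<Character> s = new TreeSet<>();
        for (char c : chars.toCharArray()) {
            s.add(c);
        }
        return s;
    }

    static Set<Character> union(Set<Character> a, Set<Character> b) {
        Set<Character> r = new TreeSet<>(a);
        r.addAll(b);
        return r;
    }

    static Set<Character> difference(Set<Character> a, Set<Character> b) {
        Set<Character> r = new TreeSet<>(a);
        r.removeAll(b);
        return r;
    }

    // [[ace][bdf] - [abc][def]] read left to right, as the text describes.
    static Set<Character> leftToRight() {
        Set<Character> acc = union(of("ace"), of("bdf")); // [abcdef]
        acc = difference(acc, of("abc"));                 // [def]
        acc = union(acc, of("def"));                      // still [def]
        return acc;
    }

    // The reading the text warns against: ([ace][bdf]) - ([abc][def]).
    static Set<Character> groupedReading() {
        return difference(union(of("ace"), of("bdf")), union(of("abc"), of("def")));
    }

    public static void main(String[] args) {
        System.out.println(leftToRight());
        System.out.println(groupedReading());
    }
}
```

The two readings give different results, which is why extra brackets are worth adding whenever a difference is combined with other operations.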
Pattern | Description |
---|---|
[a] | The set containing 'a' |
[a-z] | The set containing 'a' through 'z' and all letters in between, in Unicode order |
[^a-z] | The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF |
[[pat1][pat2]] | The union of sets specified by pat1 and pat2 |
[[pat1]&[pat2]] | The intersection of sets specified by pat1 and pat2 |
[[pat1]-[pat2]] | The asymmetric difference of sets specified by pat1 and pat2 |
[:Lu:] | The set of characters belonging to the given Unicode category, as defined by Character.getType(); in this case, Unicode uppercase letters. The long form for this is [:UppercaseLetter:]. |
[:L:] | The set of characters belonging to all Unicode categories starting with 'L', that is, [[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]. The long form for this is [:Letter:]. |
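Since the table defines category membership in terms of Character.getType(), a membership test for [:Lu:] and [:L:] can be sketched directly. The example uses the JDK's `java.lang.Character`, whose Unicode data may differ in version from ICU's; the class `CategorySketch` and its methods are hypothetical names.

```java
// Uses java.lang.Character from the JDK; its Unicode version may differ from ICU's.
class CategorySketch {
    // Membership test corresponding to [:Lu:].
    static boolean isLu(int codePoint) {
        return Character.getType(codePoint) == Character.UPPERCASE_LETTER;
    }

    // Membership test corresponding to [:L:], the union of the five letter categories.
    static boolean isL(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER: // Lu
            case Character.LOWERCASE_LETTER: // Ll
            case Character.TITLECASE_LETTER: // Lt
            case Character.MODIFIER_LETTER:  // Lm
            case Character.OTHER_LETTER:     // Lo
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isLu('A'));
        System.out.println(isLu('a'));
        System.out.println(isL('a'));
        System.out.println(isL('1'));
    }
}
```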
Character Quoting and Escaping in Unicode Set Patterns
SINGLE QUOTE
Two single quotes represent a single quote, either inside or outside single quotes.
Text within single quotes is not interpreted in any way (except for two adjacent single quotes). It is taken as literal text (special characters become non-special).
Enclosing a run of characters in single quotes may imply grouping. For example, in regular-expression-like environments, the single-quoted text is treated as a unit with regard to trailing quantifiers. The pattern "a'bc'*" matches each of the following: "a", "abc", "abcbc", but not "abcc".
BACKSLASH ESCAPES
Outside of single quotes, certain backslashed characters have special meaning:
Escape | Meaning |
---|---|
\uhhhh | Exactly 4 hex digits; h in [0-9A-Fa-f] |
\Uhhhhhhhh | Exactly 8 hex digits |
\xhh | 1-2 hex digits |
\ooo | 1-3 octal digits; o in [0-7] |
\a | U+0007 (BELL) |
\b | U+0008 (BACKSPACE) |
\t | U+0009 (HORIZONTAL TAB) |
\n | U+000A (LINE FEED) |
\v | U+000B (VERTICAL TAB) |
\f | U+000C (FORM FEED) |
\r | U+000D (CARRIAGE RETURN) |
\\ | U+005C (BACKSLASH) |
Anything else following a backslash is mapped to itself, except in an environment where it is defined to have some special meaning. For example, \p{Lu} is the set of uppercase letters in UnicodeSet.
Any character formed as the result of a backslash escape loses any special meaning and is treated as a literal. In particular, note that \u and \U escapes create literal characters. (In contrast, javac treats Unicode escapes as just a way to represent arbitrary characters in an ASCII source file, and any resulting characters are _not_ tagged as literals.)
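The escape table can be read as a small decoding procedure. The sketch below decodes a single escape into its code point. It is an illustration written from the table, not ICU's implementation: the class `EscapeSketch` and its methods are hypothetical, it does not track the "literal" status of the result, and it does not treat environment-specific escapes such as \p specially.

```java
// Decodes one backslash escape according to the table; not ICU's implementation.
class EscapeSketch {
    // Value of an ASCII digit in the given radix, or -1 if the character is not one.
    static int digitValue(char ch, int radix) {
        int d;
        if (ch >= '0' && ch <= '9') d = ch - '0';
        else if (ch >= 'a' && ch <= 'f') d = ch - 'a' + 10;
        else if (ch >= 'A' && ch <= 'F') d = ch - 'A' + 10;
        else return -1;
        return d < radix ? d : -1;
    }

    // Reads between min and max digits of the given radix, starting at index from.
    static int digits(String s, int from, int min, int max, int radix) {
        int value = 0;
        int count = 0;
        while (count < max && from + count < s.length()) {
            int d = digitValue(s.charAt(from + count), radix);
            if (d < 0) break;
            value = value * radix + d;
            count++;
        }
        if (count < min) {
            throw new IllegalArgumentException("too few digits in " + s);
        }
        return value;
    }

    // Returns the code point denoted by the escape that starts at the beginning of s.
    static int unescape(String s) {
        if (s.length() < 2 || s.charAt(0) != '\\') {
            throw new IllegalArgumentException("not an escape: " + s);
        }
        char c = s.charAt(1);
        switch (c) {
            case 'u': return digits(s, 2, 4, 4, 16); // exactly 4 hex digits
            case 'U': return digits(s, 2, 8, 8, 16); // exactly 8 hex digits
            case 'x': return digits(s, 2, 1, 2, 16); // 1-2 hex digits
            case 'a': return 0x07;
            case 'b': return 0x08;
            case 't': return 0x09;
            case 'n': return 0x0A;
            case 'v': return 0x0B;
            case 'f': return 0x0C;
            case 'r': return 0x0D;
            default:
                if (c >= '0' && c <= '7') {
                    return digits(s, 1, 1, 3, 8);    // 1-3 octal digits
                }
                return c; // anything else, including a backslash, maps to itself
        }
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(unescape("\\u0041")));
        System.out.println(Integer.toHexString(unescape("\\U0001F600")));
        System.out.println(Integer.toHexString(unescape("\\101")));
    }
}
```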
WHITESPACE
Whitespace (as defined by our API) is ignored unless it is quoted or backslashed.
Note: The rules for quoting and whitespace handling are common to most ICU APIs that process rule or expression strings, including UnicodeSet, Transliteration, and (coming soon) Break Iterators.
Programmatically Building UnicodeSets
ICU users can programmatically build a UnicodeSet by adding or removing ranges of characters or by using the retain (intersection), remove (difference), and add (union) operations. The following shows some examples:
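As a minimal sketch of these operations, the example below uses `java.util.BitSet` as a stand-in for a set of code points, so that it runs without ICU. It is not the UnicodeSet API. The class `BuildSetSketch` and its methods are hypothetical; the comments name the operation from the text that each BitSet call models.

```java
import java.util.BitSet;

// java.util.BitSet stands in for a set of code points; it is not ICU's UnicodeSet.
class BuildSetSketch {
    // Builds the set of all code points from start to end, inclusive.
    static BitSet range(int start, int end) {
        BitSet s = new BitSet();
        s.set(start, end + 1); // BitSet ranges are end-exclusive
        return s;
    }

    static BitSet consonants() {
        BitSet set = range('a', 'z');  // add a range: [a-z]
        set.or(range('A', 'Z'));       // add (union): [a-zA-Z]
        set.and(range('a', 'z'));      // retain (intersection): back to [a-z]
        BitSet vowels = new BitSet();
        for (char c : "aeiou".toCharArray()) {
            vowels.set(c);
        }
        set.andNot(vowels);            // remove (difference): [a-z] without the vowels
        return set;
    }

    public static void main(String[] args) {
        BitSet set = consonants();
        System.out.println(set.get('b'));
        System.out.println(set.get('a'));
        System.out.println(set.cardinality());
    }
}
```

With the real library, the same steps would be expressed with the add, retain, and remove operations named above; check the UnicodeSet API reference for the exact method signatures.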
Property Values
The following property value variants are recognized:
Style | Description |
---|---|
short | omits the type (used to prevent ambiguity and only allowed with the Category and Script properties) |
medium | uses an abbreviated type and value |
long | uses a full type and value |
If the type or value is omitted, then the equals sign is also omitted. The short style is only used for Category and Script properties because these properties are very common and their omission is unambiguous.
In practice, you can mix type names and values that are omitted, abbreviated, or full. For example, for Category=Unassigned you could use, besides the forms in the table, \p{gc=Unassigned}, \p{Category=Cn}, or \p{Unassigned}.
When these are processed, case and whitespace are ignored so you may use them for clarity, if desired. For example, \p{Category = Uppercase Letter} or \p{Category = uppercase letter}.
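The effect of ignoring case and whitespace can be sketched as a normalization step applied before names are compared. This models only what the paragraph above states; ICU's actual matching rules may ignore more than this. The class `LooseNameSketch` and its methods are hypothetical.

```java
import java.util.Locale;

// Models only the rule stated in the text: case and whitespace are ignored.
class LooseNameSketch {
    // Removes all whitespace and lowercases the rest, independent of the default locale.
    static String normalize(String name) {
        return name.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    static boolean sameName(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        System.out.println(normalize("Category = Uppercase Letter"));
        System.out.println(sameName("Category = Uppercase Letter",
                                    "Category = uppercase letter"));
    }
}
```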
Note: The Category property is already supported by UnicodeSet in ICU 1.6, but only in the short form. There are also the following special values in the Category:
For a list of supported properties, see the Properties section.
Copyright (c) 2000 - 2005 IBM and Others - Feedback: http://icu.sourceforge.net/contacts.html
User Guide for ICU v3.4 Generated 2005-07-27.