Properties

Overview

Text processing requires that a program treat text appropriately. If text is exchanged between several systems, it is important for them to process the text consistently. This is done by assigning each character, or a range of characters, attributes or properties used for text processing, and by defining standard algorithms for at least the basic text operations.

Traditionally, such attributes and algorithms have not been well-defined for most character sets, and text processing had to rely on ad-hoc solutions. Over time, standards were created for querying properties of the system codepage. However, the set of these properties was limited. Their data was not coordinated among implementations, and standard algorithms were not available.

It is one of the strengths of Unicode that it not only defines a very large character set, but also assigns a comprehensive set of properties and usage notes to all characters. It defines standard algorithms for critical text processing, and the data is publicly provided and kept up-to-date. See http://www.unicode.org/ for more information.

Sample code is available in the ICU source code library at icu/source/samples/props/props.cpp. See also the source code for the Unicode browser demo application, which can be used online to browse Unicode characters with their properties.

Unicode Character Database properties in ICU APIs

The following table shows all Unicode Character Database properties (except for purely "extracted" ones and Unihan properties) and the corresponding ICU APIs. Most of the time, ICU4C provides functions in icu/source/common/unicode/uchar.h and ICU4J provides parallel functions in the com.ibm.icu.lang.UCharacter class. Properties of a single Unicode character are accessed by its 21-bit code point value (type: UChar32=int32_t in C/C++, int in Java). Most properties are also available via UnicodeSet APIs and patterns.

See the Unicode Character Database itself for comparison. PropertyAliases.txt lists all properties by name and type.

Most properties that use binary, integer, or enumerated values are available via the functions u_hasBinaryProperty and u_getIntPropertyValue, which take UProperty enum constants to select the property. (ICU4J UCharacter member functions do not have the "u_" prefix.) The constant names include the long property name according to PropertyAliases.txt, e.g., UCHAR_LINE_BREAK. Corresponding property value enum constant names often contain the short property name and the long value name, e.g., U_LB_LINE_FEED. For enumeration/integer type properties, the enumeration result type is also listed in the table below.

Some UnicodeSet APIs use the same UProperty constants. Other UnicodeSet APIs and UnicodeSet and regular expression patterns use the long or short property aliases and property value aliases (see PropertyAliases.txt and PropertyValueAliases.txt).

There is one pseudo-property, UCHAR_GENERAL_CATEGORY_MASK, for which the APIs do not use a single value but a bit-set (a mask) of zero or more values, with each bit corresponding to one UCHAR_GENERAL_CATEGORY value. This allows ICU to represent property value aliases for multiple general categories, like "Letters" (which stands for "Uppercase Letters", "Lowercase Letters", etc.). In other words, there are two ICU properties for the same Unicode property: one delivers single values (for per-code point lookup) and the other delivers sets of values (for use with value aliases and UnicodeSet).
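The single-value versus bit-set idea can be sketched in a few lines of C++. This is only an illustration and does not call ICU: the enum GeneralCategory, its numeric values, and the helpers categoryMask, inCategorySet, and demoCategoryMask are hypothetical names invented for this example. In real code the category values and masks would come from uchar.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a general category enum.
// The numeric values are illustrative assumptions, not taken from ICU headers.
enum GeneralCategory : std::uint32_t {
    kUnassigned      = 0,
    kUppercaseLetter = 1,
    kLowercaseLetter = 2,
    kTitlecaseLetter = 3,
    kModifierLetter  = 4,
    kOtherLetter     = 5,
    kDecimalNumber   = 9
};

// Single value -> mask: each category value owns exactly one bit.
constexpr std::uint32_t categoryMask(GeneralCategory gc) {
    return std::uint32_t{1} << static_cast<std::uint32_t>(gc);
}

// A value alias such as "Letters" is the union of several single-category bits.
constexpr std::uint32_t kLetterMask =
    categoryMask(kUppercaseLetter) | categoryMask(kLowercaseLetter) |
    categoryMask(kTitlecaseLetter) | categoryMask(kModifierLetter) |
    categoryMask(kOtherLetter);

// Tests whether a single category value belongs to a set of categories.
constexpr bool inCategorySet(GeneralCategory gc, std::uint32_t mask) {
    return (categoryMask(gc) & mask) != 0;
}

// Usage sketch: a per-code point lookup yields one value, which is then
// tested against a multi-category mask.
inline void demoCategoryMask() {
    assert(inCategorySet(kLowercaseLetter, kLetterMask));
    assert(!inCategorySet(kDecimalNumber, kLetterMask));
}
```

Because at most 32 general category values exist, a 32-bit integer is wide enough to hold one bit per value.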

UCD Name (see PropertyAliases.txt) | Type | ICU4C uchar.h / ICU4J UCharacter | UCD File (.txt)
--- | --- | --- | ---
Age | Unicode version (U) | C: u_charAge fills in UVersionInfo; Java: getAge returns a VersionInfo reference | DerivedAge
Alphabetic | binary (U) | u_isUAlphabetic, UCHAR_ALPHABETIC | DerivedCoreProperties
ASCII_Hex_Digit | binary (U) | UCHAR_ASCII_HEX_DIGIT | PropList
Bidi_Class | enum UCharDirection (U) | u_charDirection, UCHAR_BIDI_CLASS | UnicodeData
Bidi_Control | binary (U) | UCHAR_BIDI_CONTROL | PropList
Bidi_Mirrored | binary (U) | u_isMirrored, UCHAR_BIDI_MIRRORED | UnicodeData
Bidi_Mirroring_Glyph | code point | u_charMirror | BidiMirroring
Block | enum UBlockCode (growing) (U) | ublock_getCode, UCHAR_BLOCK | Blocks
Canonical_Combining_Class | 0..255 (U) | u_getCombiningClass, UCHAR_CANONICAL_COMBINING_CLASS | UnicodeData
Case_Folding | Unicode string | u_strFoldCase (ustring.h) | CaseFolding
Composition_Exclusion | binary (c) | contributes to Full_Composition_Exclusion | CompositionExclusions
Dash | binary (U) | UCHAR_DASH | PropList
Decomposition_Mapping | Unicode string | available via normalization API | UnicodeData
Decomposition_Type | enum UDecompositionType (U) | UCHAR_DECOMPOSITION_TYPE | UnicodeData
Default_Ignorable_Code_Point | binary (U) | UCHAR_DEFAULT_IGNORABLE_CODE_POINT | DerivedCoreProperties
Deprecated | binary (U) | UCHAR_DEPRECATED | PropList
Diacritic | binary (U) | UCHAR_DIACRITIC | PropList
East_Asian_Width | enum UEastAsianWidth (U) | UCHAR_EAST_ASIAN_WIDTH | EastAsianWidth
Expands_On_NF* | binary | available via normalization API (unorm.h) | DerivedNormalizationProps
Extender | binary (U) | UCHAR_EXTENDER | PropList
FC_NFKC_Closure | Unicode string | u_getFC_NFKC_Closure | DerivedNormalizationProps
Full_Composition_Exclusion | binary (U) | UCHAR_FULL_COMPOSITION_EXCLUSION | DerivedNormalizationProps
General_Category | enum (<= 32 values) (U) | u_charType, UCHAR_GENERAL_CATEGORY, UCHAR_GENERAL_CATEGORY_MASK, UCharCategory | UnicodeData
Grapheme_Base | binary (U) | UCHAR_GRAPHEME_BASE | DerivedCoreProperties
Grapheme_Extend | binary (U) | UCHAR_GRAPHEME_EXTEND | DerivedCoreProperties
Grapheme_Link | binary (U) | UCHAR_GRAPHEME_LINK | DerivedCoreProperties
Hex_Digit | binary (U) | UCHAR_HEX_DIGIT | PropList
Hyphen | binary (U) | UCHAR_HYPHEN | PropList
ID_Continue | binary (U) | UCHAR_ID_CONTINUE | DerivedCoreProperties
ID_Start | binary (U) | UCHAR_ID_START | DerivedCoreProperties
Ideographic | binary (U) | UCHAR_IDEOGRAPHIC | PropList
IDS_Binary_Operator | binary (U) | UCHAR_IDS_BINARY_OPERATOR | PropList
IDS_Trinary_Operator | binary (U) | UCHAR_IDS_TRINARY_OPERATOR | PropList
ISO_Comment | ASCII string | u_getISOComment | UnicodeData
Join_Control | binary (U) | UCHAR_JOIN_CONTROL | PropList
Joining_Group | enum UJoiningGroup (U) | UCHAR_JOINING_GROUP | ArabicShaping
Joining_Type | enum UJoiningType (U) | UCHAR_JOINING_TYPE | ArabicShaping
Line_Break | enum ULineBreak (U) | UCHAR_LINE_BREAK | LineBreak
Logical_Order_Exception | binary (U) | UCHAR_LOGICAL_ORDER_EXCEPTION | PropList
Lowercase | binary (U) | u_isULowercase, UCHAR_LOWERCASE | DerivedCoreProperties
Lowercase_Mapping | Unicode string + conditions | available via u_strToLower (ustring.h) | UnicodeData + SpecialCasing
Math | binary (U) | UCHAR_MATH | DerivedCoreProperties
Name | ASCII string (U) | u_charName(U_UNICODE_CHAR_NAME or U_EXTENDED_CHAR_NAME) | UnicodeData
NF*_QuickCheck | no/maybe/yes | available via unorm_quickCheck (unorm.h) | DerivedNormalizationProps
Noncharacter_Code_Point | binary (U) | UCHAR_NONCHARACTER_CODE_POINT, U_IS_UNICODE_NONCHAR (utf.h) | PropList
Numeric_Type | enum UNumericType (U) | UCHAR_NUMERIC_TYPE | UnicodeData
Numeric_Value | double (U) | u_getNumericValue; Java/UnicodeSet: only non-negative integers, no fractions | UnicodeData
Other_Alphabetic | binary (c) | contributes to Alphabetic | PropList
Other_Default_Ignorable_Code_Point | binary (c) | contributes to Default_Ignorable_Code_Point | PropList
Other_Grapheme_Extend | binary (c) | contributes to Grapheme_Extend | PropList
Other_Lowercase | binary (c) | contributes to Lowercase | PropList
Other_Math | binary (c) | contributes to Math | PropList
Other_Uppercase | binary (c) | contributes to Uppercase | PropList
Quotation_Mark | binary (U) | UCHAR_QUOTATION_MARK | PropList
Radical | binary (U) | UCHAR_RADICAL | PropList
Script | enum UScriptCode (growing) (U) | uscript_getCode (uscript.h), UCHAR_SCRIPT | Scripts
Simple_Case_Folding | code point | u_foldCase | CaseFolding
Simple_Lowercase_Mapping | code point | u_tolower | UnicodeData
Simple_Titlecase_Mapping | code point | u_totitle | UnicodeData
Simple_Uppercase_Mapping | code point | u_toupper | UnicodeData
Soft_Dotted | binary (U) | UCHAR_SOFT_DOTTED | PropList
Special_Case_Condition | conditions | available via u_strToLower etc. (ustring.h) | SpecialCasing
Terminal_Punctuation | binary (U) | UCHAR_TERMINAL_PUNCTUATION | PropList
Titlecase_Mapping | Unicode string + conditions | u_strToTitle (ustring.h) | UnicodeData + SpecialCasing
Unicode_1_Name | ASCII string (U) | u_charName(U_UNICODE_10_CHAR_NAME or U_EXTENDED_CHAR_NAME) | UnicodeData
Unified_Ideograph | binary (U) | UCHAR_UNIFIED_IDEOGRAPH | PropList
Uppercase | binary (U) | u_isUUppercase, UCHAR_UPPERCASE | DerivedCoreProperties
Uppercase_Mapping | Unicode string + conditions | u_strToUpper (ustring.h) | UnicodeData + SpecialCasing
White_Space | binary (U) | u_isUWhiteSpace, UCHAR_WHITE_SPACE | PropList
XID_Continue | binary (U) | UCHAR_XID_CONTINUE | DerivedCoreProperties
XID_Start | binary (U) | UCHAR_XID_START | DerivedCoreProperties

Notes:

Customization

ICU does not provide the means to modify properties at runtime. The properties are provided exactly as specified by a recent version of the Unicode Standard (as published in the Character Database). However, if an application requires custom properties (for example, for Private Use characters), then it is possible to change or add them at build time. This is done by modifying the Character Database files copied into the ICU source tree at icu/source/data/unidata. For the most common properties, the file to modify is UnicodeData.txt.

To add a character to such a file, insert a line in the format used by that file (see the online documentation on the Unicode site for more information). These files are processed by ICU tools at build time. For example, the genprops tool reads several of the files and writes the binary file uprops.dat, which is then packaged into the common ICU data file. These tools require that the entries appear in ascending order of Unicode code point (gaps are allowed). Any available Unicode code point (0 to 10ffff hexadecimal) can be used. Code point values should be written with 4, 5, or 6 hex digits, using the minimum number of digits possible (but no fewer than 4). Note that the Unicode Standard specifies that the 32 code points U+fdd0..U+fdef and the 34 code points U+...fffe and U+...ffff are not characters; they should therefore not be added to any of the character database files.
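Before editing a data file by hand, the rules above can be checked with a small helper. The following is a minimal sketch in standard C++ that does not use ICU or its build tools. The function names isNoncharacter, formatCodePoint, and isValidEntryOrder are hypothetical, and treating "ascending" as strictly ascending (no duplicate code points) is an assumption of this example.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// True for the 66 noncharacters: U+FDD0..U+FDEF, plus the last two
// code points (xxFFFE and xxFFFF) of each of the 17 planes.
inline bool isNoncharacter(std::uint32_t cp) {
    return (cp >= 0xFDD0 && cp <= 0xFDEF) ||
           (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

// Writes a code point with the minimum number of hex digits, but never
// fewer than 4 (so 4, 5, or 6 digits for valid code points).
inline std::string formatCodePoint(std::uint32_t cp) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(cp));
    return std::string(buf);
}

// Checks a proposed sequence of entries: every code point must be in
// range, must not be a noncharacter, and must be greater than the
// previous one (gaps are allowed; duplicates are rejected by assumption).
inline bool isValidEntryOrder(const std::vector<std::uint32_t>& cps) {
    bool first = true;
    std::uint32_t prev = 0;
    for (std::uint32_t cp : cps) {
        if (cp > 0x10FFFF || isNoncharacter(cp)) return false;
        if (!first && cp <= prev) return false;
        prev = cp;
        first = false;
    }
    return true;
}
```

For example, formatCodePoint(0x41) yields "0041" and formatCodePoint(0x10FFFF) yields "10FFFF", while isValidEntryOrder rejects a list such as {0xE001, 0xE000} because it is not ascending.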

After modifying one of these files, the ICU data needs to be rebuilt. The makefiles should detect the modifications and run the necessary tools automatically.



Copyright (c) 2000 - 2004 IBM and Others - Feedback: icu-issues@oss.software.ibm.com

User Guide for ICU v3.2 Generated 2004-11-22.