
UText

Overview

UText is a text abstraction facility for ICU.

The intent is to make it possible to extend ICU to work with text data that is in formats above and beyond those that are native to ICU.  

UText makes it possible to extend ICU to work with text that

If ICU does not directly support a desired text format, it is possible for application developers themselves to extend UText, and in that way gain the ability to use their text with ICU.

UText for ICU 3.4

UText in ICU 3.4 is a technology preview, and supports only a very limited set of formats and ICU services.

 Storage Forms supported by UText in ICU 3.4

The only ICU service supporting UText based input for ICU 3.4 is boundary analysis (break iteration).

In the future, the supported services may be extended to include string search, regular expressions, and possibly others.  The supported string formats could be extended to include UTF-32, strings in some non-Unicode code pages, or file-based text that is too large to reasonably fit in memory.

Using UText

There are three fairly distinct  classes of use of UText.  These are

UText compared with CharacterIterator

CharacterIterator is an abstract base class that defines a protocol for accessing characters in a text-storage object. This class has methods for iterating forward and backward over Unicode characters to return either the individual Unicode characters or their corresponding index values.

UText and CharacterIterator both provide an abstraction for accessing text while hiding details of the actual storage format.  UText is the more flexible of the two, however, with these advantages:

At this time, more ICU services support CharacterIterator than UText, but this situation will improve over time.  ICU services that can operate on text represented by a CharacterIterator are

Example: Counting the Words in a UTF-8 String

Here is a function that uses UText and an ICU break iterator to count the number of words in a nul-terminated UTF-8 string.  The use of UText only adds two lines of code over what a similar function operating on normal UTF-16 strings would require.

#include <assert.h>
#include "unicode/ubrk.h"
#include "unicode/utext.h"

int  countWords(const char *utf8String) {
    UText          *ut        = NULL;
    UBreakIterator *bi        = NULL;
    int             wordCount = 0;
    UErrorCode      status    = U_ZERO_ERROR;

    ut = utext_openUTF8(ut, utf8String, -1, &status);
    bi = ubrk_open(UBRK_WORD, "en_us", NULL, 0, &status);

    ubrk_setUText(bi, ut, &status);
    /* Stop here if anything failed; bi or ut may be unusable. */
    assert(U_SUCCESS(status));
    while (ubrk_next(bi) != UBRK_DONE) {
        if (ubrk_getRuleStatus(bi) != UBRK_WORD_NONE) {
            /* Count only words and numbers, not spaces or punctuation */
            wordCount++;
        }
    }
    /* Close the break iterator with ubrk_close() and the UText with utext_close(). */
    ubrk_close(bi);
    utext_close(ut);
    return wordCount;
}

UText API Functions

Opening and Closing.

Normal usage of UText by an application consists of opening a UText to wrap some existing text, then passing the UText to ICU functions for processing.  For this kind of usage, all that is needed is the appropriate utext_open and close functions.

Function Description
utext_openUChars() Open a UText over a standard ICU (UChar *) string.  The string consists of a UTF-16 array in memory, either nul-terminated or with an explicit length.
utext_openUnicodeString() Open a UText over an instance of an ICU C++ UnicodeString.
utext_openConstUnicodeString() Open a UText over a read-only UnicodeString.  Disallows UText APIs that modify the text.
utext_openReplaceable() Open a UText over an instance of an ICU C++ Replaceable.
utext_openUTF8() Open a UText over a UTF-8 encoded C string.  May be either nul-terminated or have an explicit length.
utext_close() Close an open UText.  Frees any allocated memory; required to prevent memory leaks.

Here are some suggestions and techniques for efficient use of UText.

Minimizing Heap Usage

UText's open functions include features to allow applications to minimize the number of heap memory allocations that will be needed.  Specifically,

Minimizing heap allocations is important in code that has critical performance requirements, and is doubly important for code that must scale well in multithreaded, multiprocessor environments.  

Stack Allocation

Here is code for stack-allocating a UText:

    UText   myText = UTEXT_INITIALIZER;
    utext_openUChars(&myText, ...

The first parameter to all utext_open functions is a pointer to a UText.  If it is non-null, the supplied UText will be used; if it is null, a new UText will be heap allocated.

Stack allocated UText objects must be initialized with  UTEXT_INITIALIZER.  An uninitialized instance will fail to open.

Heap Allocation

Here is code for creating a heap allocated UText:

   UText *mytext = utext_openUChars(NULL, ...

This is slightly smaller and more convenient to write than the stack allocated code, and there is no reason not to use heap allocated UText objects in the vast majority of code that does not have extreme performance constraints.

Reuse

To reuse an existing UText, simply pass it as the first parameter to any of the UText open functions.  There is no need to close the UText first, and it may actually be more efficient not to close it first.

Here is an example of a function that iterates over an array of UTF-8 strings, wrapping each in a UText and passing it off to another function.  On the first time through the loop the utext open function will heap allocate a UText.  On each subsequent iteration the existing UText will be reused.

void  f(char **strings, int numStrings) {
    UText      *ut = NULL;
    UErrorCode  status;

    for (int i=0; i<numStrings; i++) {
        status = U_ZERO_ERROR;
        /* NULL on the first pass allocates a UText; later passes reuse it. */
        ut = utext_openUTF8(ut, strings[i], -1, &status);
        assert(U_SUCCESS(status));
        do_something(ut);
    }
    utext_close(ut);
}

Close

Closing a  UText frees any storage associated with it, including the UText itself for those that are heap allocated.  Stack allocated UTexts should also be closed because in some cases there may be additional heap allocated storage associated with them, depending on the type of the underlying text storage.

Accessing the Text

For accessing the underlying text, UText provides functions both for iterating over the characters, and for direct random access by index.  Here are the conventions that apply for all of the access functions:

Here are the functions for accessing the actual text data represented by a UText.  The primary use of these functions will be in the implementation of ICU services that accept input in the form of a UText, although application code may also use them if the need arises.

For more detailed descriptions of each, see the API reference.

Function Description
utext_nativeLength Get the length of the text string in terms of the underlying native storage – bytes for UTF-8, for example.
utext_isLengthExpensive Indicate whether determining the length of the string would require scanning the string.
utext_char32At Get the code point at the specified index.
utext_current32 Get the code point at the current iteration position.  Does not advance the position.
utext_next32 Get the next code point, iterating forwards.
utext_previous32 Get the previous code point, iterating backwards.
utext_next32From Begin a forwards iteration at a specified index.
utext_previous32From Begin a reverse iteration at a specified index.
utext_getNativeIndex Get the current iteration index.
utext_setNativeIndex Set the iteration index.
utext_moveIndex32 Move the current index forwards or backwards by the specified number of code points.  
utext_extract Retrieve a range of text, placing it into a UTF-16 buffer.
UTEXT_NEXT32 Inline (high performance) version of utext_next32.
UTEXT_PREVIOUS32 Inline (high performance) version of utext_previous32.

Modifying the Text

UText provides API functions for modifying or editing the text.

Function Description
utext_replace() Replace a range of the original text with a replacement string.
utext_copy() Copy or Move a range of the text to a new position.
utext_isWritable() Test whether a UText supports writing operations.
utext_hasMetaData() Test whether the text includes metadata.  See class Replaceable for more information on metadata.

Certain conventions must be followed when modifying text using these functions:

Cloning

UText instances may be cloned.  The clone function,

UText * utext_clone(UText *dest,
                    const UText *src,
                    UBool deep,
                    UErrorCode *status)

behaves very much like the UText open functions, with the source of the text being another UText rather than some other form of a string.

A shallow clone creates a new UText  that maintains its own iteration state, but does not clone the underlying text itself.

A deep clone copies the underlying text in addition to the UText state.  This would be appropriate if you wished to modify the text without the changes being reflected back to the original source string.  Not all text providers support deep clone, so checking for error status returns from utext_clone() is important.

Thread Safety

UText follows the usual ICU conventions for thread safety: concurrent calls to functions accessing the same non-const UText are not supported.  If concurrent access to the text is required, the UText can be cloned, allowing each thread access via a separate UText.  So long as the underlying text is not being modified, a shallow clone is sufficient.

Text Providers

A text provider is a set of functions that let UText  support a specific text storage format.

ICU includes several UText text provider implementations, and applications can provide additional ones if needed.

To implement a new UText text provider, it is necessary to have an understanding of how UText is designed.  Underneath the covers, UText is a struct that includes

If a text access function (one of those described in the previous section) can complete its work using only the information maintained in the UText struct, it will.  If not, it will call out to one of the provider functions (below) to do the work, or to update the UText.
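The division of labor described above can be modelled in a few lines of plain C.  The sketch below is a toy and does not use the real UText struct or any ICU API: every name in it (ToyText, stringAccess, toyNext, TOY_CHUNK and so on) is invented for illustration.  It shows an access function that serves characters from a cached chunk and calls a provider callback only when the chunk is exhausted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct ToyText ToyText;

/* Provider callback: make the chunk cover `index`.
 * Returns 0 if `index` is at or past the end of the text. */
typedef int ToyAccessFn(ToyText *t, size_t index);

struct ToyText {
    const char  *chunk;        /* current window onto the text         */
    size_t       chunkStart;   /* index of chunk[0] in the whole text  */
    size_t       chunkLength;  /* number of chars in the chunk         */
    size_t       chunkOffset;  /* iteration position within the chunk  */
    ToyAccessFn *access;       /* provider function                    */
    const void  *context;      /* provider's private data              */
    int          accessCalls;  /* instrumentation for this example     */
};

enum { TOY_CHUNK = 4 };        /* deliberately tiny chunk size */

/* A provider over a C string that exposes it TOY_CHUNK chars at a time. */
static int stringAccess(ToyText *t, size_t index) {
    const char *s   = (const char *)t->context;
    size_t      len = strlen(s);
    t->accessCalls++;
    if (index >= len) {
        return 0;
    }
    t->chunkStart  = index - index % TOY_CHUNK;
    t->chunk       = s + t->chunkStart;
    t->chunkLength = (len - t->chunkStart < TOY_CHUNK) ? len - t->chunkStart
                                                       : TOY_CHUNK;
    t->chunkOffset = index - t->chunkStart;
    return 1;
}

/* Access function: fast path from the chunk, slow path via the provider.
 * Returns the next char, or -1 at the end of the text. */
int toyNext(ToyText *t) {
    if (t->chunkOffset >= t->chunkLength) {
        if (!t->access(t, t->chunkStart + t->chunkLength)) {
            return -1;
        }
    }
    return (unsigned char)t->chunk[t->chunkOffset++];
}

/* Iterate over the whole string; report chars seen and provider calls. */
static void toyScan(const char *s, int *chars, int *calls) {
    ToyText t = { NULL, 0, 0, 0, stringAccess, s, 0 };
    *chars = 0;
    while (toyNext(&t) >= 0) {
        (*chars)++;
    }
    *calls = t.accessCalls;
}

int toyCount(const char *s)         { int n, c; toyScan(s, &n, &c); return n; }
int toyProviderCalls(const char *s) { int n, c; toyScan(s, &n, &c); return c; }
```

For a ten-character string this toy makes four provider calls (three to load chunks and one that reports the end of the text), while the other iteration steps are served directly from the cached chunk.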

The best way to really understand what is required of a UText provider is to study the implementations that are included with ICU, and to borrow as much as possible.

Here is the list of text provider functions.

Function Description
UTextAccess Set up the Text Chunk associated with this UText  so that it includes a requested index position.
UTextNativeLength Return the full length of the text.
UTextClone Clone the UText.
UTextExtract Extract a range of text into a caller-supplied buffer.
UTextReplace Replace a range of text with a caller-supplied replacement.  May expand or shrink the overall text.
UTextCopy Move or copy a range of text to a new position.
UTextMapOffsetToNative Within the current text chunk, translate a UTF-16 buffer offset to an absolute native index.  
UTextMapNativeIndexToUTF16 Translate an absolute native index to a UTF-16 buffer offset within the current text.
UTextClose Provider specific close.  Free storage as required.

Not every provider type requires all of the functions.  If the text type is read-only, no implementation for Replace or Copy is required.  If the text is in UTF-16 format, no implementation of the native to UTF-16 index conversions is required.

To fully understand what is required to support a new string type with UText, it will be necessary to study both the provider function declarations from utext.h and the existing text provider implementations in utext.cpp.



Copyright (c) 2000 - 2006 IBM and Others

User Guide for ICU v3.6 Generated 2006-08-31.