
ICU Search String Service
String searching, also known as string matching, is a very important subject in the wider domain of text processing and analysis. Many software applications use the basic string search algorithm in the implementations on most operating systems. With the popularity of internet, the quantity of available data from different parts of the world has increased dramatically within a short time. Therefore, a string search algorithm that is language-aware has become more important. A bitwise match that uses the u_strstr (C), UnicodeString::indexOf (C++) or String.indexOf (Java) APIs will not yield the correct result specific to a particular language's requirements. The APIs will not yield the correct result because all the issues that are important to language-sensitive collation are also applicable to text searching. The following lists those issues which are applicable to text searching:
-
The accented letters
In English, accents are treated as minor variations of a letter. In French, accented letters have much more significance as they can actually change the meaning of a word. Very often, an accented letter is actually a distinct letter. For example, letter 'Å' (\u00c5) may be just a letter 'A' followed by an accent symbol to English speakers. However, it is actually a distinct letter in Danish. In some cases, such as in traditional German, an accented letter is short-hand for something longer. In sorting, an 'ä' (\u00e4) is treated as 'ae'. -
The conjoined letters
Special handling is required when a single letter is treated equivalent to two distinct letters and vice versa. For example, in German, the letter 'ß' (\u00df) is treated as 'ss' in sorting. Also, in most languages, 'æ' (\u00e6) is considered equivalent to the letter 'a' followed by the letter 'e'. Also, the ligatures are often treated as distinct letters by themselves. For example, 'ch' is treated as a distinct letter between the letter 'c' and the letter 'd' in Spanish. -
Ignorable punctuation
As in collation, it is important that the user is able to choose to ignore punctuation symbols while the user searches for a pattern in the string. For example, a user may search for "blackbird" and want to include entries such as "black-bird".
Though the brute force algorithm works well in locating a match without error, many improvements can be made to provide better performance. A new set of APIs is available that provides a language-sensitive string search service. The ICU string search service uses the Boyer-Moore searching algorithm based on automata or combinatorial properties of strings and pre-processes the pattern.
ICU String Search Model
The ICU string search service provides similar APIs to the other text iterating services. Allowing users to specify the starting position and direction within the text string to be searched. For more information, please see BreakIterator . The user can locate one or all occurrences of a pattern in a string. For a given collator, a pattern match is located at the offsets <start, end> in a string if the collator finds that the sub-string between the start and end is equal.
The string search service provides two options to handle accent matching as described below:
Let S' be the sub-string of a text string S between the offsets start and end <start, end>.
A pattern string P matches a text string S at the offsets <start, end> if
-
option 1. P matches some canonical equivalent string of S'. Suppose the collator used for searching has a tertiary collation strength, all accents are non-ignorable. If the pattern "a\u0300" is searched in the target text "a\u0325\u0300", a match will be found, since the target text is canonically equivalent to "a\u0300\u0325"
-
option 2. P matches S' and if P starts or ends with a combining mark, there exists no non-ignorable combining mark before or after S' in S respectively. Following the example above, the pattern "a\u0300" will not find a match in "a\u0325\u0300", since there exists a non-ignorable accent '\u0325' in the middle of 'a' and '\u0300'. Even with a target text of "a\u0300\u0325" a match will not be found because of the non-ignorable trailing accent \u0325.
One restriction is to be noted for option 1. Currently there are no composite characters that consists of a character with combining class greater than 0 before a character with combining class equals to 0. However, if such a character exists in the future, the string search service may not work correctly with option 1 when such characters are encountered.
Furthermore, option 1 could generate more than one "encompassing" matches. For example, in Danish, 'å' (\u00e5) and 'aa' are considered equivalent. So the pattern "baad" will match "a--båd--man" (a--b\u00e5d--man). However, "baad" will match "a--båd--man" (a--b\u00e5d--man) both at starting offset 3 but also at starting offset 1 and 2. The end offset can be either 5, 6, or 7. To be more exact, the string search added a "tightest" match condition. In other words, if the pattern matches at offsets <start, end> as well as offsets <start + 1, end>, the offsets <start, end> are not considered a match. Likewise, if the pattern matches at offsets <start, end> as well as offsets <start, end + 1>, the offsets <start, end + 1> are not considered a match. Therefore, when the option 1 is chosen in Danish collator, 'baad' will match in the string "a--båd--man" (a--b\u00e5d--man) ONLY at offsets <3,5>.
As in other iterator interfaces, the string search service provides APIs to perform string matching for the first pattern occurrence, immediate next, previous match, and the last pattern occurrence. There are also options to allow for overlapping matching. For example, in English, if the string is "ababab" and the pattern is "abab", overlapping matching produces results of offsets <0, 3> and <2, 5>. Otherwise, the mutually exclusive matching produces the result offset <0, 3> only. To find a whole word match, the user can provide a locale-specific BreakIterator object to a StringSearch instance to correctly locate the word boundaries. For example, if "c" exists in the string "abc", a match is returned. However, the behavior can be overwritten by supplying a word BreakIterator.
Both a locale or collator can be used to specify the language-sensitive rules for searches. When a locale is specified, a collator will be created internally and the StringSearch instance that is created is responsible for the ownership of the collator. All the collation attributes will be considered during the string search operation. However, the users only can set the collator attributes using the collator APIs. Normalization is usually done within collation and the process is outside the scope of the string search service. Therefore, the result offsets may contain extra combining characters at either the beginning or the end of the match. If the start of the match lies within a range of normalized characters, the start offset returned will be one character after the immediate preceding base letter. If the end of the match lies within a range of normalized characters, the end offset returned will be one character before the immediate following base letter. For example, the pattern "´¸" (\u00b4\u00b8) is considered a match in string "A´¨¸B" (A\u00b4\u00a8\u00b8B) at offsets <1, 3>. It is important to note that the pre-composed characters are treated equivalent to their decomposed counterparts. For example, if the user searches for the pattern "ˋ" (\u02cb) in the string "ÀBC", (\u00c0BC) a match will be found at offsets <0, 1>. Currently, there is no existing pre-composed character that decomposes in NFD to a character sequence with accents before a base letter. The string search service incorporates decomposition and optimizes it for boundary checking.
When there are contractions in the collation sequence and the contraction happens to span across the boundary of a match, it is not considered a match. For example, in traditional Spanish where 'ch' is a contraction, the "har" pattern will not match in the string "uno charo". Boundaries that are discontiguous contractions will yield a match result similar to those described above, where the end of the match returned will be one character before the immediate following base letter. In addition, only the first match will be located if a pattern contains only combining marks and the search string contains more than one occurrences of the pattern consecutively. For example, if the user searches for the pattern "´" (\u00b4) in the string "A´´B", (A\u00b4\u00b4B) the result will be offsets <1, 2>.
Example
In C:
char *tgtstr = "A quick brown fox jumped over the lazy dog."; char *patstr = "FoX"; UChar target[64]; UChar pattern[16]; int pos = 0; UErrorCode status = U_ZERO_ERROR; u_uastrcpy(target, tgtstr); u_uastrcpy(pattern, patstr); UStringSearch *search = usearch_open(pattern, -1, target, -1, "en_US", &status); if (U_FAILURE(status)) return; while(TRUE) { pos = usearch_next(search); if (pos = USEARCH_DONE) { fprintf(stdout, "No match found for pattern.\n"); break; } else { fprintf(stdout, "Match found for pattern at position %d.\n", pos); } pos = usearch_next(search); } } |
In C++:
UErrorCode status = U_ZERO_ERROR; UnicodeString target("A quick fox jumped over the lazy dog.", ""); UnicodeString easyPatterns = "FoX"; int pos = 0; StringSearch *searchIter = new StringSearch(easyPatterns, target, Locale::US, status); If (U_FAILURE(status)) return; while (TRUE) { status = U_ZERO_ERROR; pos = searchIter->next(); if (pos == U_SEARCH_DONE) fprintf(stdout, "No match found for pattern.\n"); break; } else { fprintf(stdout, "Match found for pattern at position %d.\n", pos); } } |
In Java:
StringCharacterIterator target = new StringCharacterIterator( "A quick fox jumped over the lazy dog."); String easyPatterns = "FoX"; try { StringSearch searchIter = new StringSearch(easyPatterns, target, Locale.US); while (true) { int pos = searchIter.next(); if (pos == StringSearch.DONE) System.out.println("No match found for pattern"); break; } else { System.out.println("Match found for pattern at position " + pos); } } } catch (Exception e) { System.err.println("StringSearch failure"); e.printStackTrace(); } |
Performance and Other Implications
The ICU string search service is designed to be on top of the ICU collation service. Therefore, all the performance implications that apply to a collator are also applicable to the string search service. To obtain the best performance, use the default collator attributes described in the Performance and Storage Implications on Attributes . In addition, users need to be aware of the following StringSearch specific considerations:
Change Iterating Direction
The ICU string search service provides a set of very dynamic APIs that allow users to change the iterating direction randomly. For example, users can search for a particular word going forward by calling the usearch_next (C), StringSearch::next (C++) or StringSearch.next (Java) APIs and then search backwards at any point of the search operation by calling the usearch_previous (C), StringSearch::previous (C++) or StringSearch.previous (Java) APIs. Another way to change the iterating direction is by calling the usearch_reset (C), StringSearch::previous (C++) or StringSearch.previous (Java) APIs. Though the direction change can occur without calling the reset APIs first, this operation comes with a reduction in speed.
Roundtripping Results
The matching results in the forward direction will, in general, match the results in the backwards direction in the reverse order. However, this match is not guaranteed. For example, if the pattern consists of prefix accents and a match with a starting discontinguous boundary is found, the resulting start offset of the match includes the initial base letter in the discontiguous contraction or does not depend on the direction of the search. Assuming that we are searching for the accent "¨" (\u00a8) in "X´¨¸" (X\u00b4\u00a8\u00b8) and that "X´¸" (X\u00b4\u00b8) is a contraction sequence, the string search service will provide a match result at offsets <0, 4> during a forward search but offsets <1,3> during a backward search.
Thai and Lao Character Boundaries
In collation, certain Thai and Lao vowels are swapped with the next character. For example, the text string "A ขเ" (A \u0e02\u0e40) is processed internally in collation as "A เข" (A \u0e40\u0e02). Therefore, if the user searches for the pattern "Aเ" (A\u0e40) in "A ขเ" (A \u0e02\u0e40) the string search service will match starting at offset 0. Since this normalization process is internal to collation, there is no notification that the swapping has happened. The return result offsets in this example will be <0, 2> even though the range would encompass one extra character.
Canonical Equivalence
In collation process, if normalization is on, any string will be compared as if it is canonically equivalent. However, FCD (fast C or D form) text is guaranteed to sort correctly regardless of the normalization. This process works as long as the pattern is within the interior of the search string. However, if the pattern matches at the boundaries of the search string, the matching may be confusing. For example, if the user searches for the pattern "¸c´" (\u00b8c\u00b4) in the string "a¨¸c´e" (a\u00a8\u00b8c\u00b4e), the match is located at offsets <2, 4>. If the search string is normalized, the normalized search string will be "a¨¸c´e" (a\u00b8\u00a8c\u00b4e). Without further processing, a match cannot be located. In order to ensure canonical equivalence, the user is provided with two search options presented at the beginning of this document to ensure that the same result should be returned in either case
Accents refer to characters that have a non-zero canonical combining order and have non-zero collation elements.
![]() | Not all non-zero canonical combining order characters are ignored and vice versa. A discontiguous match might occur in option 1. In this case, the match offsets <start, end> may encompass more accents at the end of the match than is expected. For example, when the user searches for the "¨c¸" (\u00a8c\u00b8) pattern in "a¨c´¸e" (a\u00a8c\u00b4\u00b8e) with the normalization mode turned on, a match is found. Although option 2 is more restrictive, it allows users to search for Arabic consonants. Using option 2, the match is located against "consonant + vowel". However, if a user searches for "consonant + vowel1", it will not match against "consonant + vowel1 + vowel2". |
Copyright (c) 2000 - 2005 IBM and Others - PDF Version - Feedback: http://icu.sourceforge.net/contacts.html
User Guide for ICU v3.4 Generated 2005-07-27.