com.ibm.text
Class UTF16

java.lang.Object
  |
  +--com.ibm.text.UTF16

public final class UTF16
extends java.lang.Object

Standalone utility class providing UTF16 character conversions and indexing conversions.

Code that uses strings alone rarely needs modification. By design, UTF-16 does not allow overlap, so searching for strings is a safe operation. Similarly, concatenation is always safe. Substringing is safe if the start and end are both on UTF-32 boundaries. In normal code, the values for start and end are on those boundaries, since they arose from operations like searching. If not, the nearest UTF-32 boundaries can be determined using bounds().

The following examples illustrate use of some of these methods.

 // iteration forwards: Original
 for (int i = 0; i < s.length(); ++i) {
   char ch = s.charAt(i);
   doSomethingWith(ch);
 }

 // iteration forwards: Changes for UTF-32
 int ch;
 for (int i = 0; i < s.length(); i+=UTF16.getCharCount(ch)) {
   ch = UTF16.charAt(s,i);
   doSomethingWith(ch);
 }

 // iteration backwards: Original
 for (int i = s.length()-1; i >= 0; --i) {
   char ch = s.charAt(i);
   doSomethingWith(ch);
 }
  
 // iteration backwards: Changes for UTF-32
 int ch;
 for (int i = s.length()-1; i >= 0; i-=UTF16.getCharCount(ch)) {
   ch = UTF16.charAt(s,i);
   doSomethingWith(ch);
 }
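
The same two loops can be sketched with JDK calls only, for readers who want to try the pattern without this class on the classpath. This is an illustration under assumptions, not this library's implementation: CodePointLoops, forward and backward are hypothetical names, and String.codePointAt, String.codePointBefore and Character.charCount stand in for UTF16.charAt and UTF16.getCharCount. The backward loop must also visit index 0.

```java
final class CodePointLoops {
    // Visits code points first to last and returns them in visiting order.
    static int[] forward(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int ch = s.codePointAt(i);        // stands in for UTF16.charAt(s, i)
            out[n++] = ch;
            i += Character.charCount(ch);     // stands in for UTF16.getCharCount(ch)
        }
        return out;
    }

    // Visits code points last to first; the code point at index 0 is included.
    static int[] backward(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = s.length(); i > 0; ) {
            int ch = s.codePointBefore(i);    // code point ending just before index i
            out[n++] = ch;
            i -= Character.charCount(ch);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";          // 'a', U+1F600, 'b'
        for (int ch : backward(s)) {
            System.out.println(Integer.toHexString(ch));
        }
    }
}
```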
 
Notes:

Since:
Nov2400
Author:
Mark Davis, with help from Markus Scherer

Inner Class Summary
static class UTF16.StringComparator
          Compare strings using Unicode code point order, instead of UTF-16 code unit order.
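
The difference between the two orders only shows up when a supplementary character is compared with a BMP character above the surrogate range. A minimal JDK-only sketch of code point order follows; CodePointOrder and its compare method are hypothetical names, not the API of UTF16.StringComparator.

```java
final class CodePointOrder {
    // Compares two strings by code point value rather than by UTF-16 code unit.
    static int compare(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; ) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(i);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);     // equal code points have equal width
        }
        return Integer.compare(a.length(), b.length());
    }

    public static void main(String[] args) {
        String bmp = "\uFF61";                // U+FF61, a BMP character
        String supp = "\uD800\uDC00";         // U+10000, encoded as a surrogate pair
        System.out.println(bmp.compareTo(supp) > 0);   // code unit order: 0xFF61 > 0xD800
        System.out.println(compare(bmp, supp) < 0);    // code point order: 0xFF61 < 0x10000
    }
}
```

Because the method takes two strings and returns an int, it can be passed as CodePointOrder::compare wherever a Comparator of String is expected.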
 
Field Summary
static int LEAD_SURROGATE_BOUNDARY
          Value returned in bounds().
static int SINGLE_CHAR_BOUNDARY
          Value returned in bounds().
static int TRAIL_SURROGATE_BOUNDARY
          Value returned in bounds().
 
Method Summary
static java.lang.StringBuffer append(java.lang.StringBuffer target, int char32)
          Append a single UTF-32 value to the end of a StringBuffer.
static int bounds(java.lang.String source, int offset16)
          Returns the type of the boundaries around the char at offset16.
static int boundsAtCodePointOffset(java.lang.String source, int offset32)
          Returns the type of the boundaries around the char at offset32.
static int charAt(java.lang.String source, int offset16)
          Extract a single UTF-32 value from a string.
static int charAtCodePointOffset(java.lang.String source, int offset32)
          Extract a single UTF-32 value from a string.
static int countCodePoint(java.lang.String s)
          Number of codepoints in a UTF16 string.
static int findCodePointOffset(java.lang.String source, int offset16)
          Returns the UTF-32 offset corresponding to the first UTF-32 boundary at or after the given UTF-16 offset.
static int findOffsetFromCodePoint(java.lang.String source, int offset32)
          Returns the UTF-16 offset that corresponds to a UTF-32 offset.
static int getCharCount(int char32)
          Determines how many chars this char32 requires.
static int getLeadSurrogate(int char32)
          Returns the lead surrogate.
static int getTrailSurrogate(int char32)
          Returns the trail surrogate.
static boolean isLeadSurrogate(char char16)
          Determines whether the character is a lead surrogate.
static boolean isSurrogate(char char16)
          Determines whether the code value is a surrogate.
static boolean isTrailSurrogate(char char16)
          Determines whether the character is a trail surrogate.
static void setCharAt(java.lang.StringBuffer source, int offset16, int char32)
          Set a code point into a UTF16 position.
static void setCharAtCodePointOffset(java.lang.StringBuffer str, int offset32, int char32)
          Sets a code point into a UTF32 position.
static java.lang.String valueOf(int char32)
          Convenience method corresponding to String.valueOf(char).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SINGLE_CHAR_BOUNDARY

public static final int SINGLE_CHAR_BOUNDARY
Value returned in bounds(). The value is chosen so that it encodes the bounds of the character: [offset16 - (value >> 2), offset16 + (value & 3)].

LEAD_SURROGATE_BOUNDARY

public static final int LEAD_SURROGATE_BOUNDARY
Value returned in bounds(). The value is chosen so that it encodes the bounds of the character: [offset16 - (value >> 2), offset16 + (value & 3)].

TRAIL_SURROGATE_BOUNDARY

public static final int TRAIL_SURROGATE_BOUNDARY
Value returned in bounds(). The value is chosen so that it encodes the bounds of the character: [offset16 - (value >> 2), offset16 + (value & 3)].

Method Detail

charAt

public static int charAt(java.lang.String source,
                         int offset16)
Extract a single UTF-32 value from a string. Used when iterating forwards or backwards (with UTF16.getCharCount()), as well as for random access. If a validity check is required, use UCharacter.isLegal() on the return value. If the char retrieved is part of a surrogate pair, its supplementary character will be returned. If a complete supplementary character is not found, the incomplete character will be returned.
Parameters:
source - string of UTF-16 chars
offset16 - UTF-16 offset to the start of the character.
Returns:
The UTF-32 value of the codepoint that contains the char at offset16, or -1 if there is an error. The boundaries of that codepoint are the same as those reported by bounds().
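
The lookup rule described above (either half of a surrogate pair yields the whole supplementary character, and an unpaired surrogate is returned as-is) can be sketched with JDK calls. This is an assumption-laden illustration: CharAtSketch and codePointContaining are hypothetical names, and the sketch throws on a bad offset rather than modelling the -1 error return.

```java
final class CharAtSketch {
    // Returns the code point whose UTF-16 encoding covers index offset16.
    static int codePointContaining(String source, int offset16) {
        char c = source.charAt(offset16);     // throws if offset16 is out of range
        if (Character.isHighSurrogate(c)) {
            // Lead surrogate: complete the pair with the following char, if any.
            if (offset16 + 1 < source.length()) {
                char next = source.charAt(offset16 + 1);
                if (Character.isLowSurrogate(next)) {
                    return Character.toCodePoint(c, next);
                }
            }
        } else if (Character.isLowSurrogate(c)) {
            // Trail surrogate: complete the pair with the preceding char, if any.
            if (offset16 > 0) {
                char prev = source.charAt(offset16 - 1);
                if (Character.isHighSurrogate(prev)) {
                    return Character.toCodePoint(prev, c);
                }
            }
        }
        return c;                             // BMP char or unpaired surrogate
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00";           // 'a' followed by U+1F600
        System.out.println(Integer.toHexString(codePointContaining(s, 1)));
        System.out.println(Integer.toHexString(codePointContaining(s, 2)));
    }
}
```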

charAtCodePointOffset

public static int charAtCodePointOffset(java.lang.String source,
                                        int offset32)
Extract a single UTF-32 value from a string. If a validity check is required, use UCharacter.isLegal() on the return value. If the char retrieved is part of a surrogate pair, its supplementary character will be returned. If a complete supplementary character is not found, the incomplete character will be returned.
Parameters:
source - string of UTF-16 chars
offset32 - UTF-32 offset to the start of the character.
Returns:
The UTF-32 value of the codepoint at offset32. The boundaries of that codepoint are the same as those reported by boundsAtCodePointOffset().

getCharCount

public static int getCharCount(int char32)
Determines how many chars this char32 requires. If a validity check is required, use isLegal() on char32 before calling.
Parameters:
char32 - the input character.
Returns:
2 if char32 is in surrogate space (that is, it requires a surrogate pair), otherwise 1.

bounds

public static int bounds(java.lang.String source,
                         int offset16)
Returns the type of the boundaries around the char at offset16. Used for random access.
Parameters:
source - text to analyse
offset16 - UTF-16 offset
Returns:
  • SINGLE_CHAR_BOUNDARY : a single char; the bounds are [offset16, offset16+1]
  • LEAD_SURROGATE_BOUNDARY : a surrogate pair starting at offset16; the bounds are [offset16, offset16 + 2]
  • TRAIL_SURROGATE_BOUNDARY : a surrogate pair starting at offset16 - 1; the bounds are [offset16 - 1, offset16 + 1]
For bit-twiddlers, the return values for these are chosen so that the boundaries can be gotten by: [offset16 - (value >> 2), offset16 + (value & 3)].
Throws:
java.lang.StringIndexOutOfBoundsException - if offset16 is out of bounds.
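
The bit-twiddling formula can be checked with a small JDK-only sketch. The constant values below are an assumption: they are the smallest values that satisfy the formula for the three bounds listed above, and this page does not state the actual constants. BoundsSketch and its methods are hypothetical names.

```java
final class BoundsSketch {
    // Assumed values, derived from [offset16 - (value >> 2), offset16 + (value & 3)].
    static final int SINGLE_CHAR_BOUNDARY = 1;     // [offset16, offset16 + 1]
    static final int LEAD_SURROGATE_BOUNDARY = 2;  // [offset16, offset16 + 2]
    static final int TRAIL_SURROGATE_BOUNDARY = 5; // [offset16 - 1, offset16 + 1]

    // Classifies the char at offset16; unpaired surrogates count as single chars.
    static int bounds(String source, int offset16) {
        char c = source.charAt(offset16);     // throws if offset16 is out of range
        if (Character.isHighSurrogate(c)) {
            if (offset16 + 1 < source.length()
                    && Character.isLowSurrogate(source.charAt(offset16 + 1))) {
                return LEAD_SURROGATE_BOUNDARY;
            }
        } else if (Character.isLowSurrogate(c)) {
            if (offset16 > 0
                    && Character.isHighSurrogate(source.charAt(offset16 - 1))) {
                return TRAIL_SURROGATE_BOUNDARY;
            }
        }
        return SINGLE_CHAR_BOUNDARY;
    }

    // Start and end of the character, recovered from the boundary value.
    static int start(int offset16, int value) { return offset16 - (value >> 2); }
    static int end(int offset16, int value) { return offset16 + (value & 3); }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";
        int value = bounds(s, 2);             // index 2 is the trail surrogate
        System.out.println("[" + start(2, value) + ", " + end(2, value) + "]");
    }
}
```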

boundsAtCodePointOffset

public static int boundsAtCodePointOffset(java.lang.String source,
                                          int offset32)
Returns the type of the boundaries around the char at offset32. Used for random access.
Parameters:
source - text to analyse
offset32 - UTF-32 offset
Returns:
  • SINGLE_CHAR_BOUNDARY : a single char; the bounds are [offset32, offset32 + 1]
  • LEAD_SURROGATE_BOUNDARY : a surrogate pair starting at offset32; the bounds are [offset32, offset32 + 2]
  • TRAIL_SURROGATE_BOUNDARY : a surrogate pair starting at offset32 - 1; the bounds are [offset32 - 1, offset32 + 1]
For bit-twiddlers, the return values for these are chosen so that the boundaries can be gotten by: [offset32 - (value >> 2), offset32 + (value & 3)].
Throws:
java.lang.StringIndexOutOfBoundsException - if offset32 is out of bounds.

isSurrogate

public static boolean isSurrogate(char char16)
Determines whether the code value is a surrogate.
Parameters:
char16 - the input character.
Returns:
true iff the input character is a surrogate.

isTrailSurrogate

public static boolean isTrailSurrogate(char char16)
Determines whether the character is a trail surrogate.
Parameters:
char16 - the input character.
Returns:
true iff the input character is a trail surrogate.

isLeadSurrogate

public static boolean isLeadSurrogate(char char16)
Determines whether the character is a lead surrogate.
Parameters:
char16 - the input character.
Returns:
true iff the input character is a lead surrogate.

getLeadSurrogate

public static int getLeadSurrogate(int char32)
Returns the lead surrogate. If a validity check is required, use isLegal() on char32 before calling.
Parameters:
char32 - the input character.
Returns:
the lead surrogate if getCharCount(char32) is 2;
and 0 otherwise (note: 0 is not a valid lead surrogate).

getTrailSurrogate

public static int getTrailSurrogate(int char32)
Returns the trail surrogate. If a validity check is required, use isLegal() on char32 before calling.
Parameters:
char32 - the input character.
Returns:
the trail surrogate if getCharCount(char32) is 2;
otherwise the character itself.
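
The arithmetic behind getCharCount, getLeadSurrogate and getTrailSurrogate is the standard UTF-16 encoding of a supplementary code point. A minimal sketch follows, using the return conventions documented above; SurrogateMath and its method names are hypothetical, and like the documented methods it performs no validity check on char32.

```java
final class SurrogateMath {
    // 2 for supplementary code points (U+10000 and above), otherwise 1.
    static int charCount(int char32) {
        return char32 >= 0x10000 ? 2 : 1;
    }

    // Lead surrogate, or 0 when char32 fits in a single char.
    static int leadSurrogate(int char32) {
        // 0xD7C0 is 0xD800 - (0x10000 >> 10), which folds in the offset subtraction.
        return char32 >= 0x10000 ? 0xD7C0 + (char32 >> 10) : 0;
    }

    // Trail surrogate, or char32 itself when it fits in a single char.
    static int trailSurrogate(int char32) {
        return char32 >= 0x10000 ? 0xDC00 + (char32 & 0x3FF) : char32;
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        System.out.println(Integer.toHexString(leadSurrogate(cp)));
        System.out.println(Integer.toHexString(trailSurrogate(cp)));
    }
}
```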

valueOf

public static java.lang.String valueOf(int char32)
Convenience method corresponding to String.valueOf(char). Returns a one or two char string containing the UTF-32 value in UTF16 format. If the input value can't be converted, it substitutes REPLACEMENT_CHAR. If a validity check is required, use isLegal() on char32 before calling.
Parameters:
char32 - the input character.
Returns:
string value of char32 in UTF16 format
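
The substitution behaviour can be sketched with JDK calls. ValueOfSketch is a hypothetical name, and the value U+FFFD for REPLACEMENT_CHAR is an assumption (it is the Unicode replacement character; this page does not state the constant's value).

```java
final class ValueOfSketch {
    static final int REPLACEMENT_CHAR = 0xFFFD;    // assumed value: U+FFFD

    // Returns a one- or two-char string for char32, substituting on bad input.
    static String valueOf(int char32) {
        if (!Character.isValidCodePoint(char32)) {
            char32 = REPLACEMENT_CHAR;             // outside U+0000..U+10FFFF
        }
        return new String(Character.toChars(char32));
    }

    public static void main(String[] args) {
        System.out.println(valueOf(0x1F600).length());
        System.out.println(valueOf(0x110000).equals("\uFFFD"));
    }
}
```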

findOffsetFromCodePoint

public static int findOffsetFromCodePoint(java.lang.String source,
                                          int offset32)
Returns the UTF-16 offset that corresponds to a UTF-32 offset. Used for random access. See the class description for notes on roundtripping.
Parameters:
source - the UTF-16 string
offset32 - UTF-32 offset
Returns:
UTF-16 offset
Throws:
java.lang.StringIndexOutOfBoundsException - if offset32 is out of bounds.

findCodePointOffset

public static int findCodePointOffset(java.lang.String source,
                                      int offset16)
Returns the UTF-32 offset corresponding to the first UTF-32 boundary at or after the given UTF-16 offset. Used for random access. See the class description for notes on roundtripping.
Note: If the UTF-16 offset is into the middle of a surrogate pair, then the UTF-32 offset of the lead of the pair is returned.

To find the UTF-32 length of a string, use:

     len32 = UTF16.findCodePointOffset(source, source.length());
   

Parameters:
source - text to analyse
offset16 - UTF-16 offset < source text length.
Returns:
UTF-32 offset
Throws:
java.lang.StringIndexOutOfBoundsException - if offset16 is out of bounds.
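
Both offset conversions can be sketched with JDK calls. OffsetSketch and its method names are hypothetical. The sketch follows the note above by mapping an offset in the middle of a surrogate pair to the pair's lead, and it assumes that offset16 equal to the string length is accepted so that the length idiom works.

```java
final class OffsetSketch {
    // UTF-16 offset to UTF-32 offset; mid-pair offsets map to the pair's lead.
    static int toCodePointOffset(String source, int offset16) {
        if (offset16 < 0 || offset16 > source.length()) {
            throw new StringIndexOutOfBoundsException(offset16);
        }
        int start = offset16;
        if (start > 0 && start < source.length()
                && Character.isLowSurrogate(source.charAt(start))
                && Character.isHighSurrogate(source.charAt(start - 1))) {
            start--;                          // back up to the lead surrogate
        }
        return source.codePointCount(0, start);
    }

    // UTF-32 offset to UTF-16 offset; throws IndexOutOfBoundsException if out of range.
    static int toUtf16Offset(String source, int offset32) {
        return source.offsetByCodePoints(0, offset32);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";          // 4 chars, 3 code points
        System.out.println(toCodePointOffset(s, s.length()));
        System.out.println(toUtf16Offset(s, 2));
    }
}
```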

append

public static java.lang.StringBuffer append(java.lang.StringBuffer target,
                                            int char32)
Append a single UTF-32 value to the end of a StringBuffer. If a validity check is required, use isLegal() on char32 before calling.
Parameters:
target - the buffer to append to
char32 - value to append. If out of bounds, substitutes UTF32.REPLACEMENT_CHAR.
Returns:
the updated StringBuffer

countCodePoint

public static int countCodePoint(java.lang.String s)
Number of codepoints in a UTF16 string.
Parameters:
s - UTF16 string
Returns:
number of codepoints in the string

setCharAtCodePointOffset

public static void setCharAtCodePointOffset(java.lang.StringBuffer str,
                                            int offset32,
                                            int char32)
Sets a code point into a UTF32 position.
Parameters:
str - stringbuffer
offset32 - UTF32 position to insert into
char32 - code point

setCharAt

public static void setCharAt(java.lang.StringBuffer source,
                             int offset16,
                             int char32)
Set a code point into a UTF16 position.
Parameters:
source - stringbuffer
offset16 - UTF16 position to insert into
char32 - code point
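
Setting a code point can change the buffer length, because the old and new characters may need different numbers of chars. The sketch below shows one way to handle that with JDK calls. It assumes the whole code point covering offset16 is replaced, which this page does not state explicitly; SetCharSketch and setCodePointAt are hypothetical names.

```java
final class SetCharSketch {
    // Replaces the code point covering offset16 with char32.
    static void setCodePointAt(StringBuffer target, int offset16, int char32) {
        int start = offset16;
        int end = offset16 + 1;
        char c = target.charAt(offset16);     // throws if offset16 is out of range
        if (Character.isHighSurrogate(c)) {
            if (end < target.length() && Character.isLowSurrogate(target.charAt(end))) {
                end++;                        // widen to include the trail surrogate
            }
        } else if (Character.isLowSurrogate(c)) {
            if (start > 0 && Character.isHighSurrogate(target.charAt(start - 1))) {
                start--;                      // widen to include the lead surrogate
            }
        }
        // Character.toChars throws IllegalArgumentException for an invalid char32.
        target.replace(start, end, new String(Character.toChars(char32)));
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer("abc");
        setCodePointAt(sb, 1, 0x1F600);       // 'b' becomes a surrogate pair
        System.out.println(sb.length());
    }
}
```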


Copyright (c) 1998-2000 IBM Corporation and others.