Tuesday, March 27, 2007

SNIPPET: java.lang.Character(1)

The Character class is final, which means it cannot be subclassed by any other class. Its direct superclass is the Object class.

The Character class is a wrapper for the primitive type char; a single char value is the only instance field in a Character object.
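As a small illustration of the wrapper relationship, the sketch below boxes a char into a Character and unboxes it again, then checks the final modifier and the superclass by reflection. The class name CharacterWrapperDemo and the method roundTrip are made up for this example.

```java
import java.lang.reflect.Modifier;

public class CharacterWrapperDemo {
    // Hypothetical helper: wrap a primitive char, then unwrap it again.
    static char roundTrip(char c) {
        Character boxed = Character.valueOf(c); // explicit boxing
        return boxed.charValue();               // explicit unboxing
    }

    public static void main(String[] args) {
        Character boxed = 'A';   // autoboxing
        char primitive = boxed;  // auto-unboxing
        System.out.println(primitive);
        System.out.println(roundTrip('z'));

        // Character is declared final and extends Object directly.
        System.out.println(Modifier.isFinal(Character.class.getModifiers()));
        System.out.println(Character.class.getSuperclass() == Object.class);
    }
}
```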

The Character class was designed in conformity with Unicode, Version 4.0[1]. According to the Unicode document,

  1. The original Unicode specification defined characters as fixed-width 16-bit entities.
  2. The standard has since been changed to allow for characters whose representation requires more than 16 bits.
  3. The range of legal code points[2] is now U+0000 to U+10FFFF, known as Unicode scalar values[3]. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP)[4].
  4. Characters whose code points are above U+FFFF are called supplementary characters, i.e. any character outside the BMP is a supplementary character.
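The ranges in the list above can be checked with methods that Character has offered since Java 5. The following is only a sketch: the class name CodePointRanges and the helpers classify and plane are invented for illustration, and plane assumes it is handed a valid code point.

```java
public class CodePointRanges {
    // Hypothetical helper: label an int as "BMP", "supplementary", or "invalid".
    static String classify(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return "invalid";        // outside U+0000..U+10FFFF
        }
        if (Character.isSupplementaryCodePoint(codePoint)) {
            return "supplementary";  // U+10000..U+10FFFF
        }
        return "BMP";                // U+0000..U+FFFF
    }

    // Hypothetical helper: plane number = code point divided by 65,536.
    // Only meaningful for valid code points.
    static int plane(int codePoint) {
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0xFFFF, 0x10000, 0x10FFFF, 0x110000};
        for (int cp : samples) {
            System.out.println(Integer.toHexString(cp) + " -> " + classify(cp));
        }
        System.out.println("plane of 0x10FFFF = " + plane(0x10FFFF));
    }
}
```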

In the Java 2 Platform, the UTF-16 representation is used in char arrays and in the String and StringBuffer classes. In this representation, a supplementary character is represented as a pair of char values, the first from the high-surrogate[6] range (\uD800-\uDBFF), the second from the low-surrogate range (\uDC00-\uDFFF).
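To make the surrogate-pair encoding concrete, here is a minimal sketch that splits one supplementary code point into its two char values and joins them back. U+1D11E (MUSICAL SYMBOL G CLEF) is just a sample value; the class name SurrogatePairDemo and the helpers encode and decode are assumptions of this example, while Character.toChars and Character.toCodePoint are the real API calls doing the work.

```java
public class SurrogatePairDemo {
    // Hypothetical helper: code point -> UTF-16 code units (one or two chars).
    static char[] encode(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Hypothetical helper: surrogate pair -> supplementary code point.
    static int decode(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        char[] units = encode(0x1D11E);
        System.out.println(units.length);                        // number of chars used
        System.out.println(Character.isHighSurrogate(units[0])); // first char of the pair
        System.out.println(Character.isLowSurrogate(units[1]));  // second char of the pair
        System.out.println(Integer.toHexString(units[0]) + " " + Integer.toHexString(units[1]));

        String s = new String(units);
        // length() counts chars; codePointCount() counts characters.
        System.out.println(s.length() + " chars, " + s.codePointCount(0, s.length()) + " code point");
        System.out.println(Integer.toHexString(decode(units[0], units[1])));
    }
}
```

A BMP code point such as 0x41 comes back from encode as a single char, so callers should not assume the array always has two elements.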

Note that a char value therefore represents a BMP code point, including the surrogate code points[5], or a code unit of the UTF-16 encoding.

An int value represents any Unicode code point, including supplementary code points. The lower (least significant) 21 bits of the int hold the code point, and the upper (most significant) 11 bits must be zero.
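The sketch below shows one way to read a String as int code points instead of char values, together with a check of the "upper 11 bits are zero" condition. The names CodePointInts, fitsIn21Bits and toCodePoints are invented for this example. Note that fitting in 21 bits is necessary but not sufficient: 0x1FFFFF passes the bit check, yet it lies above U+10FFFF and is not a valid code point.

```java
public class CodePointInts {
    // Hypothetical helper: true when the upper 11 bits of the int are zero.
    static boolean fitsIn21Bits(int value) {
        return (value >>> 21) == 0;
    }

    // Hypothetical helper: walk a String one code point at a time.
    static int[] toCodePoints(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int next = 0;
        for (int index = 0; index < s.length(); ) {
            int cp = s.codePointAt(index);
            result[next++] = cp;
            index += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', then U+1D11E written as a surrogate pair, then 'b'
        String s = "a" + new String(Character.toChars(0x1D11E)) + "b";
        for (int cp : toCodePoints(s)) {
            System.out.println(Integer.toHexString(cp) + " fits in 21 bits: " + fitsIn21Bits(cp));
        }
        System.out.println(Character.isValidCodePoint(0x1FFFFF));
    }
}
```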

Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:

a. Methods that accept only a char value cannot support supplementary characters; they treat char values from the surrogate ranges as undefined characters.

b. Methods that accept an int value support all Unicode characters, including supplementary characters.
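The difference between the two overload families can be seen with Character.isLetter. This is a rough sketch: the class name CharVersusIntMethods and both helper methods are hypothetical, and U+2F81A (a CJK compatibility ideograph) is used only as a sample supplementary letter.

```java
public class CharVersusIntMethods {
    // Hypothetical helper: tests each char on its own, so the two halves
    // of a surrogate pair are seen separately and are not letters.
    static boolean allLettersByChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetter(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical helper: tests whole code points via the int overload.
    static boolean allLettersByCodePoint(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isLetter(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String supplementary = new String(Character.toChars(0x2F81A));
        System.out.println(allLettersByChar(supplementary));      // char overload
        System.out.println(allLettersByCodePoint(supplementary)); // int overload
    }
}
```

For text that stays within the BMP both helpers agree, so the int-based loop is the safer default when the input is not known in advance.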

Footnotes:

[1]. According to the Unicode Consortium, some features of version 4 may have been deprecated since the release of version 5.

[2]. According to the Java SE API documentation, Unicode code points are character values in the range U+0000 to U+10FFFF, while Unicode code units are the 16-bit char values that are code units of the UTF-16 encoding.

[3]. Unicode scalar value: any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF and E000 to 10FFFF (hexadecimal), inclusive. As of this writing, the Unicode glossary is at http://www.unicode.org/glossary/.

[4]. BMP: the BMP is Plane 0, where a plane is a range of 65,536 (10000 hexadecimal) contiguous Unicode code points whose first code point is an integer multiple of 65,536. Planes are numbered from 0 to 16, the number being the first code point of the plane divided by 65,536. Thus Plane 0 is U+0000…U+FFFF, Plane 1 is U+10000…U+1FFFF, … , Plane 16 (10 hexadecimal) is U+100000…U+10FFFF.

[5]. Surrogate code point: a Unicode code point in the range U+D800 through U+DFFF. These are reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stands in” for a supplementary code point.

[6]. Surrogate pair: a representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second is a low-surrogate code unit.
