Unicode

From Wikibooks, open books for an open world


Most Java program text consists of ASCII characters, but Unicode letters and digits can be used in identifiers, and any Unicode character can appear in comments and in character and string literals. For example, π (the Greek small letter pi) is a valid Java identifier:

Example Code section 3.100: Pi.
double π = Math.PI;

and in a string literal:

Example Code section 3.101: Pi literal.
String pi = "π";

Unicode escape sequences

Unicode characters can also be expressed through Unicode Escape Sequences. Unicode escape sequences may appear anywhere in a Java source file (including inside identifiers, comments, and string literals).

Unicode escape sequences consist of

  1. a backslash '\' (ASCII character 92, hex 0x5c),
  2. a 'u' (ASCII character 117, hex 0x75),
  3. optionally one or more additional 'u' characters, and
  4. four hexadecimal digits (the characters '0' through '9' or 'a' through 'f' or 'A' through 'F').
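The rules above can be checked with a few character constants. This is a minimal sketch; the class and field names are hypothetical and chosen only for illustration.

```java
class EscapeSyntaxDemo {
    // A plain ASCII literal and the same character written as an escape.
    static final char PLAIN = 'a';
    static final char ESCAPED = '\u0061';

    // Additional 'u' characters are allowed after the backslash.
    static final char EXTRA_U = '\uuu0061';

    // The four hexadecimal digits may be lower case or upper case.
    static final char LOWER_HEX = '\u03c0';
    static final char UPPER_HEX = '\u03C0';

    public static void main(String[] args) {
        System.out.println(PLAIN == ESCAPED);       // true
        System.out.println(PLAIN == EXTRA_U);       // true
        System.out.println(LOWER_HEX == UPPER_HEX); // true
    }
}
```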

Each such sequence represents one UTF-16 code unit. For example, 'a' is equivalent to '\u0061'. A single escape cannot express a character beyond U+FFFF; such a character must be written as a surrogate pair, that is, two consecutive escape sequences.[1]
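As an illustration of surrogate pairs, the sketch below writes U+1D70B (mathematical italic small pi, a character beyond U+FFFF) as two escapes and compares the result with the string built from the code point. The class, field, and method names are hypothetical.

```java
class SurrogatePairDemo {
    // A code point beyond U+FFFF: U+1D70B MATHEMATICAL ITALIC SMALL PI.
    static final int CODE_POINT = 0x1D70B;

    // The same character written as a high surrogate followed by a low surrogate.
    static final String FROM_ESCAPES = "\uD835\uDF0B";

    // Builds a String from a single code point using the standard library.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // Number of UTF-16 code units (char values) in the string.
        System.out.println(FROM_ESCAPES.length());
        // Number of Unicode characters (code points) in the string.
        System.out.println(FROM_ESCAPES.codePointCount(0, FROM_ESCAPES.length()));
        // The two spellings denote the same string.
        System.out.println(FROM_ESCAPES.equals(fromCodePoint(CODE_POINT)));
    }
}
```

The string holds two char values but only one Unicode character, which is why length() and codePointCount() disagree for such text.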

Every character in a program may be written as a Unicode escape sequence, but the result is neither compact nor readable to anyone but the Java compiler.

A full list of the characters is available in the Unicode Consortium's code charts.

π may also be represented in Java as the Unicode escape sequence \u03C0. Thus, the following is a valid, but not very readable, declaration and assignment:

Example Code section 3.102: Unicode escape sequences for Pi.
double \u03C0 = Math.PI;

The following demonstrates the use of Unicode escape sequences in other Java syntax:

Example Code section 3.103: Unicode escape sequences in a string literal.
// Declare Strings pi and quote which contain \u03C0 and \u0027 respectively:
String pi = "\u03C0";
String quote = "\u0027";

Note that a Unicode escape sequence functions just like the character it represents, because the compiler translates escapes before it reads the rest of the source. E.g., \u0022 (double quote, ") must be escaped with a backslash in a string literal just like " itself.

Example Code section 3.104: Double quote.
// Declare Strings doubleQuote1 and doubleQuote2 which both contain " (double quote):
String doubleQuote1 = "\"";
String doubleQuote2 = "\u005c\u0022"; // \u005c is a backslash; "\u0022" alone doesn't work since """ doesn't work.
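The interaction between Unicode escapes and backslash escapes can be checked directly. The following is a small sketch with hypothetical names. It relies on the rule in the Java Language Specification that a backslash preceded by an odd number of backslashes does not begin a Unicode escape.

```java
class EscapeOrderDemo {
    // The ordinary backslash escape for a double quote.
    static final String QUOTE = "\"";

    // U+005C (backslash) and U+0022 (double quote), both written as Unicode
    // escapes. They are translated before the literal is parsed, so the
    // compiler sees the same text as in QUOTE.
    static final String QUOTE_FROM_ESCAPES = "\u005c\u0022";

    // Here the second backslash follows another backslash, so no Unicode
    // escape starts. The literal is an escaped backslash followed by the
    // five ordinary characters u, 0, 0, 2, 2.
    static final String NOT_AN_ESCAPE = "\\u0022";

    public static void main(String[] args) {
        System.out.println(QUOTE.equals(QUOTE_FROM_ESCAPES)); // true
        System.out.println(NOT_AN_ESCAPE.length());           // 6
    }
}
```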

International language support

The language distinguishes between bytes and characters. A char is a 16-bit value: characters were originally stored internally as UCS-2, and as of J2SE 5.0 the language also supports UTF-16 with surrogate pairs for characters beyond U+FFFF. Java program source may therefore contain any Unicode character.
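To make the distinction between bytes and characters concrete, the sketch below counts both for the one-character string π. The class and helper names are hypothetical. The literal is written as an escape so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

class BytesVersusCharsDemo {
    // The Greek small letter pi, written as an escape.
    static final String PI = "\u03C0";

    // Number of bytes needed to encode the string as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to encode the string as UTF-16 (big endian, no byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(PI.length());     // 1 char
        System.out.println(utf8Length(PI));  // 2 bytes
        System.out.println(utf16Length(PI)); // 2 bytes
        System.out.println(utf8Length("a")); // 1 byte
    }
}
```

The number of bytes depends on the chosen encoding, whereas length() always counts char values.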

The following is thus perfectly valid Java code; it contains Chinese characters in the class and variable names as well as in a string literal:

Code listing 3.50: 哈嘍世界.java
public class 哈嘍世界 {
    private String 文本 = "哈嘍世界";
}

References

  1. "3.1 Unicode", The Java™ Language Specification, Java SE 7 Edition, pp. 15-16.