Unicode

Most Java program text consists of ASCII characters, but Unicode letters and digits can be used in identifier names, and any Unicode character can appear in comments and in character and string literals. For example, π (the Greek small letter pi, U+03C0) is a valid Java identifier:

Code section 3.100: Pi.

double π = Math.PI;

and in a string literal:

Code section 3.101: Pi literal.

String pi = "π";

Unicode escape sequences

Unicode characters can also be expressed through Unicode Escape Sequences. Unicode escape sequences may appear anywhere in a Java source file (including inside identifiers, comments, and string literals).

Unicode escape sequences consist of

a backslash '\' (ASCII character 92, hex 0x5c),
a 'u' (ASCII character 117, hex 0x75),
optionally one or more additional 'u' characters, and
four hexadecimal digits (the characters '0' through '9', 'a' through 'f', or 'A' through 'F').

Each such sequence represents one 16-bit UTF-16 code unit. For example, '\u0061' is equivalent to 'a'. Because only four hexadecimal digits are allowed, a single escape cannot express a character beyond U+FFFF; such a character has to be written as a surrogate pair, that is, two consecutive escapes.[1]
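The rules above can be checked with a minimal sketch. The class name EscapeForms and its methods are illustrative only, and U+1F600 is simply one example of a character beyond U+FFFF; its surrogate pair is D83D followed by DE00.

```java
public class EscapeForms {
    // An escape and the character it stands for are the same char value.
    static boolean sameLetter() {
        return '\u0061' == 'a';
    }

    // Additional 'u' characters are allowed and do not change the meaning.
    static boolean extraUs() {
        return '\uuu0061' == 'a';
    }

    // U+1F600 does not fit in four hex digits, so it is written as a
    // high surrogate followed by a low surrogate.
    static String beyondBmp() {
        return "\uD83D\uDE00";
    }

    public static void main(String[] args) {
        String s = beyondBmp();
        System.out.println(sameLetter());                          // true
        System.out.println(extraUs());                             // true
        System.out.println(s.length());                            // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));       // 1 (Unicode character)
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f600
    }
}
```

The string holds two char values but only one Unicode character, which is why length() and codePointCount() disagree.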

Every character in a program can be written as a Unicode escape sequence, but such a program is neither compact nor readable to anyone but the Java compiler.

A full list of the characters is available in the Unicode Consortium's code charts.

π may also be represented in Java as the Unicode escape sequence \u03C0. Thus, the following is a valid, but not very readable, declaration and assignment:

Code section 3.102: Unicode escape sequences for Pi.

double \u03C0 = Math.PI;
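As a small sketch of how escaped identifiers behave, the constant below is declared with one spelling of the escape and read with two others. The class EscapedIdentifier and its methods are illustrative names, not part of the original text. All three spellings denote the same identifier, because hexadecimal digits are case-insensitive and additional u characters are permitted.

```java
public class EscapedIdentifier {
    // Declares a constant whose name is the single letter U+03C0.
    static final double \u03C0 = Math.PI;

    // Lowercase hex digits spell the same identifier.
    static double circleArea(double radius) {
        return \u03c0 * radius * radius;
    }

    // So does an escape with a doubled 'u'.
    static double circumference(double radius) {
        return 2 * \uu03C0 * radius;
    }

    public static void main(String[] args) {
        System.out.println(circleArea(1.0));
        System.out.println(circumference(1.0));
    }
}
```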

The following demonstrates the use of Unicode escape sequences in other Java syntax:

Code section 3.103: Unicode escape sequences in a string literal.

// Declare Strings pi and quote which contain \u03C0 and \u0027 respectively:
String pi = "\u03C0";
String quote = "\u0027";

Note that the compiler translates Unicode escape sequences before it analyses the rest of the source, so an escape behaves exactly like the character it stands for. For example, \u0022 (double quote, ") must be escaped inside a string literal just like a literal ".

Code section 3.104: Double quote.

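// Declare Strings doubleQuote1 and doubleQuote2 which both contain " (double quote):
String doubleQuote1 = "\"";
// \u005c is a backslash, so the next line is read as "\"".
// "\u0022" doesn't work since """ doesn't work, and "\\u0022" is not an escape:
// it is the six characters backslash, u, 0, 0, 2, 2.
String doubleQuote2 = "\u005c\u0022";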
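The spellings of a double quote are easy to confuse, so here is a minimal sketch that measures the string each one produces. The class and method names are illustrative. A backslash that is preceded by an odd number of backslashes cannot begin a Unicode escape, which is why the spelling with two backslashes yields six ordinary characters rather than a quote.

```java
public class DoubleQuoteDemo {
    // Ordinary escape: a one-character string holding a double quote.
    static String plain() {
        return "\"";
    }

    // Backslash and quote both written as Unicode escapes; after translation
    // the compiler sees the same text as in plain().
    static String escaped() {
        return "\u005c\u0022";
    }

    // Two backslashes followed by u0022: an escaped backslash and then the
    // five ordinary characters u, 0, 0, 2, 2.
    static String notAnEscape() {
        return "\\u0022";
    }

    public static void main(String[] args) {
        System.out.println(plain().length());          // 1
        System.out.println(escaped().equals(plain())); // true
        System.out.println(notAnEscape().length());    // 6
    }
}
```

In practice the ordinary escape "\"" is the readable choice; the escaped form is shown only to illustrate the translation order.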

International language support

The language distinguishes between bytes and characters. A char is a 16-bit value; characters were originally stored internally as UCS-2, and as of J2SE 5.0 the language treats them as UTF-16 code units, so characters beyond U+FFFF are represented by surrogate pairs. Java program source may contain any Unicode character.
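The difference between bytes, char values and Unicode characters can be made concrete with a short sketch. The class BytesVersusChars and its helper methods are illustrative names, and UTF-8 is chosen here only as an example encoding. The sample strings are written with escapes so that the source stays pure ASCII.

```java
import java.nio.charset.StandardCharsets;

public class BytesVersusChars {
    // Number of 16-bit char values (UTF-16 code units) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode characters (code points) in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes when the same text is encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String pi = "\u03C0";         // Greek small letter pi
        String face = "\uD83D\uDE00"; // U+1F600 as a surrogate pair

        System.out.println(charCount(pi) + " " + codePointCount(pi) + " " + utf8ByteCount(pi));       // 1 1 2
        System.out.println(charCount(face) + " " + codePointCount(face) + " " + utf8ByteCount(face)); // 2 1 4
    }
}
```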

The following is thus perfectly valid Java code; it contains Chinese characters in the class and variable names as well as in a string literal:

Code listing 3.50: 哈嘍世界.java

public class 哈嘍世界 {
    private String 文本 = "哈嘍世界";
}

References

[1] "3.1 Unicode", The Java™ Language Specification, Java SE 7 Edition, pp. 15-16.
