Windows Programming/Unicode

For a reference on the Unicode standard, see Unicode.

Introduction to Unicode

Unicode is an industry standard whose goal is to let text in every writing system and language be encoded for use by computers. Originally, computers represented text with byte-wide data: each printable character (and many non-printing, or "control", characters) was stored in a single byte, which allowed for 256 characters in total. However, globalization has created a need for computers to accommodate many different alphabets from around the world.

The old codes were known as ASCII and EBCDIC, but neither was capable of handling all the different characters and alphabets from around the world. Unicode was created to solve this problem. Windows NT implements many of its core functions with a "wide" 16-bit character set that is close to the Unicode standard, although it also provides a series of functions that are compatible with the standard ASCII characters.

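In Windows programming, Unicode characters are frequently called "wide characters". The terms "generic characters" and "T characters" refer to the TCHAR type, which is a wide character only when the UNICODE macro is defined. This book may use these terms loosely.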

Variable-Width Characters

Before Unicode, there was an internationalization attempt that introduced character strings with variable-width characters. Some characters, such as the standard ASCII characters, were 1 byte long. Other characters, such as those in extended character sets, were 2 bytes long. These formats fell out of favor with the advent of Unicode because code that handles them is harder to write and much harder to read. Windows still maintains some functionality for dealing with variable-width strings, but we won't discuss it here.

Unfortunately, the advantages of fixed-width wide characters were lost because the number of characters needed quickly exceeded the 65,536 possible 16-bit values. Windows actually stores text as UTF-16, in which many characters take two 16-bit words; these are called "surrogate pairs". This development came after much of the Windows API documentation was written, so parts of that documentation are now out of date. You should not treat string data as an "array of characters"; treat it as a null-terminated block instead. For instance, always pass the entire string to a function that draws it on the screen rather than drawing it character by character. Code that indexes into an LPSTR or LPWSTR to pick out a single "character" is almost always wrong.

At the same time, variable-width strings made a big comeback in the multi-platform standard called UTF-8, which is much the same idea as UTF-16 but with 8-bit units. Its primary advantage is that there is no need for two APIs: the "A" and "W" APIs could have been one and the same if UTF-8 had been used, and since UTF-16 is variable-width as well, UTF-8 gives up nothing in that respect. Although most Windows programmers are unfamiliar with UTF-8, you may see increasing references to using the non-UNICODE API for this reason.

Windows Implementation

The Win32 API provides most functions that take text input in two versions. One version has an "A" suffix (for ANSI, meaning 8-bit characters), and the other has a "W" suffix (for wide characters, or Unicode). The unsuffixed name is a macro that selects between the two depending on whether the macro "UNICODE" is defined:

#ifdef UNICODE
#define MessageBox MessageBoxW
#else
#define MessageBox MessageBoxA
#endif

Because of this mapping, a compiler error that involves such a function will mention "MessageBoxW" (or "MessageBoxA") instead of simply "MessageBox". The compiler is not broken in these cases; it is simply reporting the name that the macro expanded to.

Unicode Environment

All Windows functions that take character strings are defined in this manner. If you want to use Unicode in your program, you need to define the UNICODE macro before you include the windows.h file:

#define UNICODE
#include <windows.h>

Also, some functions in other libraries require you to define the macro _UNICODE. The C runtime library is one of them: when _UNICODE is defined, including the <tchar.h> file makes the generic versions of the standard library functions resolve to their Unicode forms. So, to use Unicode in your project, you need to make the following declarations:

#define UNICODE
#define _UNICODE
#include <windows.h>
#include <tchar.h>

Some header files include a mechanism like the following, so that when one of the two UNICODE macros is defined, the other is automatically defined as well:

#ifdef UNICODE
  #ifndef _UNICODE
    #define _UNICODE
  #endif
#endif
#ifdef _UNICODE
  #ifndef UNICODE
    #define UNICODE
  #endif
#endif

If you are writing a library that utilizes UNICODE, it might be worthwhile for you to include this mechanism in your header files as well, so that other programmers don't need to worry about including both macros.

TEXT macro

In C, to make a string of wide characters, you need to prefix the string with the letter "L". Here is an example:

const char *asciimessage = "This is an ASCII string.";
const wchar_t *unicodemessage = L"This is a Wide Unicode string.";

The data type "TCHAR" is defined as a char type if UNICODE is not defined, and as a wide character type (wchar_t) if UNICODE is defined. The Windows headers choose according to UNICODE, and tchar.h according to _UNICODE. To make string literals portable between Unicode and non-Unicode builds, we can use the TEXT() macro, which produces a wide or narrow literal to match:

const TCHAR *automessage = TEXT("This message can be either ASCII or UNICODE!");

Using the TCHAR data type and the TEXT macro are important steps in making your code portable between different environments.

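The TEXT macro also has shorter spellings. The _T macro is provided by <tchar.h>; T is not defined by the standard Windows headers, but some projects define it themselves:

TEXT("This is a generic string");
_T("This is also a generic string");
T("This is also a generic string");

All three of these expressions are equivalent, provided that UNICODE and _UNICODE are either both defined or both undefined (TEXT follows UNICODE, while _T follows _UNICODE).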

The TEXT macro is typically defined along these lines (simplified):

#ifdef UNICODE
#define TEXT(t) L##t
#define _T(t) L##t
#define T(t) L##t
#else
#define TEXT(t) t
#define _T(t) t
#define T(t) t
#endif

Unicode Reference

Control Characters

Unicode characters 0 to 31 (U+0000 to U+001F) are part of the C0 Controls and Basic Latin block. They are all control characters. These characters correspond to the first 32 characters of the ASCII set.

Code point   Decimal equivalent   Name
U+0000       0                    null character
U+0001       1                    start of heading
U+0002       2                    start of text
U+0003       3                    end of text
U+0004       4                    end of transmission
U+0005       5                    inquiry
U+0006       6                    acknowledgment
U+0007       7                    bell
U+0008       8                    backspace
U+0009       9                    horizontal tab
U+000A       10                   line feed
U+000B       11                   vertical tab
U+000C       12                   form feed
U+000D       13                   carriage return
U+000E       14                   shift out
U+000F       15                   shift in
U+0010       16                   data link escape
U+0011       17                   device control 1
U+0012       18                   device control 2
U+0013       19                   device control 3
U+0014       20                   device control 4
U+0015       21                   negative acknowledgment
U+0016       22                   synchronous idle
U+0017       23                   end of transmission block
U+0018       24                   cancel
U+0019       25                   end of medium
U+001A       26                   substitute
U+001B       27                   escape
U+001C       28                   file separator
U+001D       29                   group separator
U+001E       30                   record separator
U+001F       31                   unit separator
