title=Fundamentals of Data Representation: Unicode

The problem with ASCII is that it only allows you to represent a small number of characters (128, or 256 for Extended ASCII). This might be OK if you are living in an English-speaking country, but what happens if you live in a country that uses a different character set, for example Chinese, Hindi or Russian?

You can see that we quickly run into trouble, as ASCII can't possibly store the many thousands of extra characters these languages need in just 7 bits. What we use instead is Unicode.
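A minimal sketch of this limitation, using Python's built-in `ascii` codec. The helper name `fits_in_ascii` is made up for illustration; it is not part of any library.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character in text has a 7-bit ASCII code."""
    try:
        text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII characters
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_ascii("Hello"))  # True
print(fits_in_ascii("水"))      # False
```

Any character with a code above 127 is rejected, which is why a larger scheme is needed.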

Each Unicode character can be encoded on a computer using one of three standard encodings, which differ in the minimum number of bits used per character:

Name | Description
UTF-8 | 8-bit, variable-width encoding and the most common Unicode format. Characters can take as little as 8 bits, maximizing compatibility with ASCII, but can expand to 16, 24 or 32 bits for characters outside the ASCII range.
UTF-16 | 16-bit, variable-width encoding; can expand to 32 bits.
UTF-32 | 32-bit, fixed-width encoding. Each character takes exactly 32 bits.
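The widths of the three encodings can be checked with a short sketch using Python's built-in codecs. The function name `bits_per_encoding` is an illustrative assumption, and the big-endian codec variants are used only so that no byte order mark is added to the count.

```python
def bits_per_encoding(ch: str) -> dict:
    """Return how many bits one character occupies in each Unicode encoding."""
    return {
        "UTF-8": len(ch.encode("utf-8")) * 8,
        "UTF-16": len(ch.encode("utf-16-be")) * 8,  # -be: no byte order mark
        "UTF-32": len(ch.encode("utf-32-be")) * 8,
    }


# One ASCII letter, one accented Latin letter, one CJK ideograph,
# and one character beyond U+FFFF.
for ch in ["A", "é", "水", "\U00010000"]:
    print(f"U+{ord(ch):04X}", bits_per_encoding(ch))
```

UTF-8 grows from 8 to 32 bits across these four characters, UTF-16 grows from 16 to 32 bits only for the last one, and UTF-32 stays at 32 bits throughout.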

With over a million possible code points (1,114,112 in total), Unicode should be able to store every character from every language on the planet. Take a look at these examples:

code point | glyph | character | UTF-16 code units (hex)
U+6C34 | 水 | CJK UNIFIED IDEOGRAPH-6C34 (water) | 6C34
U+10000 | 𐀀 | LINEAR B SYLLABLE B008 A | D800, DC00
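The two code units shown for U+10000 are a surrogate pair. The sketch below shows one way to derive UTF-16 code units from a code point; the function name `utf16_code_units` is hypothetical, and the sketch assumes its input is a single valid character that is not itself a surrogate.

```python
def utf16_code_units(ch: str) -> list:
    """Return the UTF-16 code units for one character as a list of integers."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                     # fits in a single 16-bit code unit
    offset = cp - 0x10000               # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
    return [high, low]


print([f"{u:04X}" for u in utf16_code_units("水")])          # ['6C34']
print([f"{u:04X}" for u in utf16_code_units("\U00010000")])  # ['D800', 'DC00']
```

The results agree with Python's own `utf-16-be` codec, which encodes U+10000 as the bytes D8 00 DC 00.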

Exercise: ASCII and Unicode

Without using the crib table (you won't get it in the exam!) answer the following questions:

The ASCII code for the letter 'D' is 100 0100. What is the letter 'G' stored as?


100 0111 - as it is 3 characters further on in the alphabet

The ASCII code for the letter 's' is 111 0011. What is the letter 'm' stored as?


110 1101 - as it is 6 characters earlier in the alphabet
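Both letter answers above can be checked by treating the ASCII code as a number and adding or subtracting the distance between the letters. This is only a sketch: `shifted_ascii_bits` is a made-up helper name, and it assumes the result stays within the 7-bit range.

```python
def shifted_ascii_bits(letter: str, offset: int) -> str:
    """Return the 7-bit ASCII pattern of the letter `offset` places away."""
    code = ord(letter) + offset   # ASCII letters are numbered consecutively
    return format(code, "07b")    # zero-padded 7-bit binary string


print(shifted_ascii_bits("D", 3))   # 1000111  ('G')
print(shifted_ascii_bits("s", -6))  # 1101101  ('m')
```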

Give a benefit of using ASCII:


Each character takes up only 7 bits (usually stored in one 8-bit byte), meaning that storing data in ASCII may take up less memory than a Unicode encoding such as UTF-16 or UTF-32
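As a rough illustration of the memory difference, the sketch below counts the bytes needed to store a short English string in each encoding. The function name `storage_bytes` and the sample text are assumptions chosen for the example.

```python
def storage_bytes(text: str) -> dict:
    """Return the number of bytes needed to store text in each encoding."""
    return {
        "ASCII": len(text.encode("ascii")),
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-be")),  # -be: no byte order mark
        "UTF-32": len(text.encode("utf-32-be")),
    }


print(storage_bytes("Hello, world"))
# {'ASCII': 12, 'UTF-8': 12, 'UTF-16': 24, 'UTF-32': 48}
```

For text made up only of ASCII characters, UTF-8 is just as compact as ASCII, so the saving applies when comparing against UTF-16 or UTF-32.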

Give a benefit of using Unicode over ASCII:


Unicode has a much larger character set than ASCII, so you can represent characters from other languages instead of being limited to the basic Latin character set.

How many different characters can 7-bit ASCII represent?


2^7 = 128

You are designing a computer system for use worldwide. What character encoding scheme should you use, and why?


Unicode, as it would allow you to display non-Latin scripts such as Devanagari (used for Hindi) and Cyrillic