# Fundamentals of Data Representation: Unicode


The problem with ASCII is that it can only represent a small number of characters: 128 for standard 7-bit ASCII, or 256 for 8-bit Extended ASCII. This might be OK if you live in an English-speaking country, but what happens if you live in a country that uses a different character set? For example, Hindi and Russian (Cyrillic) characters have no ASCII codes at all.

We quickly run into trouble: ASCII can't possibly store the hundreds of thousands of characters used by the world's writing systems in just 7 bits. What we use instead is Unicode. There are several Unicode encodings, each using a different number of bits to store data:

| Name | Description |
| --- | --- |
| UTF-8 | 8-bit, variable-width encoding and the most common Unicode format. Characters can take as little as 8 bits, maximizing compatibility with ASCII, and expand to 16, 24 or 32 bits for larger sets of characters. |
| UTF-16 | 16-bit, variable-width encoding; can expand to 32 bits. |
| UTF-32 | 32-bit, fixed-width encoding. Each character takes exactly 32 bits. |
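As a rough illustration of how the three encodings differ in size, the sketch below uses Python's built-in `str.encode` to count the bits each one needs for a few sample characters. The helper name `encoded_bits` and the choice of sample characters are assumptions made for this example only.

```python
# Sample characters picked for illustration: one ASCII letter, one accented
# Latin letter, one CJK ideograph and one symbol outside the 16-bit range.
samples = ["z", "é", "水", "𝄞"]

def encoded_bits(ch):
    """Return how many bits ch occupies in UTF-8, UTF-16 and UTF-32."""
    return {
        "UTF-8": len(ch.encode("utf-8")) * 8,
        # The "-be" codecs are used so Python does not prepend a byte-order mark.
        "UTF-16": len(ch.encode("utf-16-be")) * 8,
        "UTF-32": len(ch.encode("utf-32-be")) * 8,
    }

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_bits(ch))
```

Running it shows 'z' taking 8 bits in UTF-8 but 32 bits in UTF-32, while the musical symbol takes 32 bits in all three encodings.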

With over a million possible code points, Unicode should be able to store every character from every language on the planet. Take a look at these examples:

| code point | glyph | character | UTF-16 code units (hex) |
| --- | --- | --- | --- |
| U+007A | z | LATIN SMALL LETTER Z | 007A |
| U+6C34 | 水 | CJK UNIFIED IDEOGRAPH-6C34 (water) | 6C34 |
| U+10000 | 𐀀 | LINEAR B SYLLABLE B008 A | D800, DC00 |
| U+1D11E | 𝄞 | MUSICAL SYMBOL G CLEF | D834, DD1E |
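The last two rows need two 16-bit code units each because their code points do not fit in 16 bits. UTF-16 handles this with a "surrogate pair": subtract 0x10000 from the code point, then split the remaining 20 bits into two halves. The following is a minimal sketch of that calculation; the function name `utf16_code_units` is hypothetical, and it assumes a valid code point (it does not reject the reserved range D800 to DFFF).

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units for a code point as a list of ints."""
    if code_point < 0x10000:
        return [code_point]           # fits in a single 16-bit unit
    offset = code_point - 0x10000     # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go in the low surrogate
    return [high, low]

for cp in (0x007A, 0x6C34, 0x10000, 0x1D11E):
    units = ", ".join(f"{u:04X}" for u in utf16_code_units(cp))
    print(f"U+{cp:04X} -> {units}")
# U+007A -> 007A
# U+6C34 -> 6C34
# U+10000 -> D800, DC00
# U+1D11E -> D834, DD1E
```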

You can find out more about Unicode encoding on Wikipedia.

Exercise: ASCII and Unicode

Without using the crib table (you won't get it in the exam!) answer the following questions:

The ASCII code for the letter 'D' is 100 0100. What is the letter 'G' stored as?

Answer: 100 0111, as it is 3 characters further on in the alphabet.

The ASCII code for the letter 's' is 111 0011. What is the letter 'm' stored as?

Answer: 110 1101, as it is 6 characters earlier in the alphabet.

Give a benefit of using ASCII.

Answer: Each character takes up only 7 bits (8 in Extended ASCII), meaning that storing data in ASCII may take up less memory than storing it in a Unicode encoding such as UTF-16 or UTF-32.

Give a benefit of using Unicode over ASCII.

Answer: Unicode stores a much larger character set than ASCII, so you are not limited to the Latin character set and can represent characters from other languages.

How many different characters can 7-bit ASCII represent?

Answer: 2^7 = 128

You are designing a computer system for use worldwide. What character encoding scheme should you use and why?

Answer: Unicode, as it would allow you to display non-Latin character sets such as Hindi and Cyrillic.
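Once you have worked the answers out by hand, the arithmetic can be checked with a short Python sketch. It relies only on the built-in `ord` and `format` functions; the helper name `ascii_bits` is an assumption for this example.

```python
def ascii_bits(ch):
    """Return the 7-bit ASCII code of ch as a binary string."""
    return format(ord(ch), "07b")

# 'G' is 3 letters after 'D', so its code is 3 higher.
print(ascii_bits("D"), ascii_bits("G"))   # 1000100 1000111
print(ord("G") - ord("D"))                # 3

# 'm' is 6 letters before 's', so its code is 6 lower.
print(ascii_bits("s"), ascii_bits("m"))   # 1110011 1101101
print(ord("s") - ord("m"))                # 6

# Number of distinct values that fit in 7 bits.
print(2 ** 7)                             # 128
```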