Fundamentals of Data Representation: Unicode
The problem with ASCII is that it can represent only a small number of characters (128, or 256 for Extended ASCII). This might be fine if you live in an English-speaking country, but what happens if you live in a country that uses a different character set? For example:
- Chinese characters 汉字
- Japanese characters 漢字
- Cyrillic Кири́ллица
- Gujarati ગુજરાતી
- Urdu اردو
You can see that we quickly run into trouble, as ASCII can't possibly store these hundreds of thousands of extra characters in just 7 bits. What we use instead is Unicode.
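To see the limitation in practice, here is a minimal Python sketch that tries to encode some text as ASCII. The helper name `fits_in_ascii` is made up for this illustration; it relies only on Python's built-in `str.encode`, which raises `UnicodeEncodeError` when a character has no ASCII code.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character in `text` has a 7-bit ASCII code."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # At least one character falls outside the 128 ASCII codes
        return False


print(fits_in_ascii("hello"))  # True
print(fits_in_ascii("汉字"))    # False
print(fits_in_ascii("اردو"))   # False
```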
Each Unicode character can be encoded on a computer using one of three standard encodings, which differ in the minimum number of bits used per character:
Name | Description |
---|---|
UTF-8 | 8-bit, variable-width encoding and the most common Unicode format. Characters can take as little as 8 bits, maximizing compatibility with ASCII, and expand to 16, 24 or 32 bits when dealing with larger sets of characters. |
UTF-16 | 16-bit, variable-width encoding. Characters take 16 bits and can expand to 32 bits. |
UTF-32 | 32-bit, fixed-width encoding. Each character takes exactly 32 bits. |
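The widths in the table can be checked with a short Python sketch. The function name `encoded_bits` and the sample characters are chosen for this illustration; the big-endian codec names (`utf-16-be`, `utf-32-be`) are used so that Python does not add a byte order mark to the count.

```python
def encoded_bits(ch: str) -> dict:
    """Return how many bits the character `ch` occupies in each encoding."""
    return {
        "UTF-8": len(ch.encode("utf-8")) * 8,
        "UTF-16": len(ch.encode("utf-16-be")) * 8,
        "UTF-32": len(ch.encode("utf-32-be")) * 8,
    }


# A Latin letter, an accented letter, a CJK ideograph and a musical symbol
for ch in ["z", "é", "水", "\U0001D11E"]:
    print(f"U+{ord(ch):04X}", encoded_bits(ch))
```

For example, 水 needs 24 bits in UTF-8 but only 16 bits in UTF-16, while every character needs 32 bits in UTF-32.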
With over a million possible code points, Unicode should be able to store every character from every language on the planet. Take a look at these examples:
code point | glyph | character | UTF-16 code units (hex) |
---|---|---|---|
U+007A | z | LATIN SMALL LETTER Z | 007A |
U+6C34 | 水 | CJK UNIFIED IDEOGRAPH-6C34 (water) | 6C34 |
U+10000 | 𐀀 | LINEAR B SYLLABLE B008 A | D800, DC00 |
U+1D11E | 𝄞 | MUSICAL SYMBOL G CLEF | D834, DD1E |
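The last two rows need two 16-bit code units (a surrogate pair) because their code points do not fit in 16 bits. As a rough sketch of how the pair is derived, the hypothetical helper `utf16_code_units` below subtracts 0x10000 from the code point and splits the remaining 20 bits into two halves. It does no validation (for example, it does not reject the reserved range D800 to DFFF), so treat it as an illustration rather than a full encoder.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units for a code point as 4-digit hex strings."""
    if code_point < 0x10000:
        # Fits in 16 bits: the single code unit equals the code point
        return [f"{code_point:04X}"]
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the first unit
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the second unit
    return [f"{high:04X}", f"{low:04X}"]


for cp in [0x007A, 0x6C34, 0x10000, 0x1D11E]:
    print(f"U+{cp:04X}", utf16_code_units(cp))
```

The results can be compared against Python's own encoder, for instance `"\U0001D11E".encode("utf-16-be").hex()`.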
You can find out more about Unicode encoding on Wikipedia.
Exercise: ASCII and Unicode
Without using the crib table (you won't get it in the exam!), answer the following questions:
The ASCII code for the letter 'D' is 100 0100. What is the letter 'G' stored as?
Answer: 100 0111, as it is 3 characters further on in the alphabet.
The ASCII code for the letter 's' is 111 0011. What is the letter 'm' stored as?
Answer: 110 1101, as it is 6 characters earlier in the alphabet.
Give a benefit of using ASCII.
Answer: Each character takes up only 7 bits (usually stored in one 8-bit byte), meaning that storing data in ASCII may take up less memory than Unicode encodings such as UTF-16 or UTF-32.
Give a benefit of using Unicode over ASCII.
Answer: Unicode has a much larger character set than ASCII, so you are not limited to the Latin character set and can represent characters from other languages.
How many different characters can 7-bit ASCII represent?
Answer: 2^7 = 128
You are designing a computer system for use worldwide. What character encoding scheme should you use and why?
Answer: Unicode, as it would allow you to display non-Latin scripts such as Devanagari (used for Hindi) and Cyrillic.
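The arithmetic behind the first two answers can be checked with a small Python sketch. The helper names `ascii_bits` and `shift_letter` are invented for this example; they use only the built-ins `ord`, `chr` and `format`.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit ASCII code of `ch` as a binary string."""
    return format(ord(ch), "07b")


def shift_letter(ch: str, steps: int) -> str:
    """Return the character `steps` places after `ch` (negative moves back)."""
    return chr(ord(ch) + steps)


print(ascii_bits("D"))                    # 1000100
print(ascii_bits(shift_letter("D", 3)))   # 1000111, the code for 'G'
print(ascii_bits(shift_letter("s", -6)))  # 1101101, the code for 'm'
print(2 ** 7)                             # 128 possible 7-bit codes
```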