Fundamentals of Data Representation: Unicode

From Wikibooks, open books for an open world



The problem with ASCII is that it can only represent a small number of characters: 128, or 256 for Extended ASCII. This might be fine if you live in an English-speaking country, but what happens if you live in a country that uses a different character set, such as those used to write Chinese, Hindi or Russian?

ASCII quickly runs into trouble, as it can't possibly store the tens of thousands of extra characters these writing systems need in just 7 bits. What we use instead is Unicode.
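
The limitation can be seen directly in a short Python sketch. The helper name `fits_in_ascii` is made up for this illustration; it simply asks Python's built-in ASCII codec to encode a string and reports whether that worked.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character in text has a 7-bit ASCII code."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # At least one character has a code point above 127.
        return False


print(fits_in_ascii("water"))         # True
print(fits_in_ascii("\u6c34"))        # False (CJK ideograph for water)
print(fits_in_ascii("\u0432\u043e\u0434\u0430"))  # False (Cyrillic text)
```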

Each Unicode character can be encoded on a computer using one of three standard encodings, which differ in the minimum number of bits used per character:

Name      Description
UTF-8     8-bit, variable-width encoding and the most common Unicode format. Characters can take as little as 8 bits, maximizing compatibility with ASCII, and the encoding expands to 16, 24 or 32 bits for characters outside the ASCII range.
UTF-16    16-bit, variable-width encoding. Can expand to 32 bits.
UTF-32    32-bit, fixed-width encoding. Each character takes exactly 32 bits.
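
As a rough check of these widths, the sketch below counts how many bits a single character occupies in each encoding, using Python's built-in codecs. The function name `encoded_bits` and the sample characters are chosen only for this example; the big-endian codec names (`utf-16-be`, `utf-32-be`) are used so that no byte order mark is added to the count.

```python
def encoded_bits(char: str) -> dict:
    """Return the number of bits used to store char in each encoding."""
    return {
        "UTF-8": len(char.encode("utf-8")) * 8,
        "UTF-16": len(char.encode("utf-16-be")) * 8,
        "UTF-32": len(char.encode("utf-32-be")) * 8,
    }


# z, e with acute accent, CJK ideograph for water, musical G clef
for ch in ["z", "\u00e9", "\u6c34", "\U0001D11E"]:
    print(f"U+{ord(ch):04X}", encoded_bits(ch))
```

For these samples UTF-8 grows from 8 to 32 bits, UTF-16 uses 16 bits until the G clef needs 32, and UTF-32 always uses 32.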

With over a million possible code points, Unicode should be able to store every character from every language on the planet. Take a look at these examples:

code point   glyph   character                             UTF-16 code units (hex)
U+007A       z       LATIN SMALL LETTER Z                  007A
U+6C34       水      CJK UNIFIED IDEOGRAPH-6C34 (water)    6C34
U+10000      𐀀      LINEAR B SYLLABLE B008 A              D800, DC00
U+1D11E      𝄞      MUSICAL SYMBOL G CLEF                 D834, DD1E
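
The two-unit entries in the last column come from UTF-16's surrogate pair scheme: code points from U+10000 upwards have 0x10000 subtracted, and the remaining 20 bits are split across two 16-bit units. A minimal sketch of that arithmetic follows. The function name `utf16_code_units` is invented for this example, and it does not validate its input (for instance, it does not reject the reserved surrogate range).

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units for a code point as 4-digit hex strings."""
    if code_point < 0x10000:
        # Fits in a single 16-bit code unit.
        return [f"{code_point:04X}"]
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [f"{high:04X}", f"{low:04X}"]


print(utf16_code_units(0x007A))   # ['007A']
print(utf16_code_units(0x6C34))   # ['6C34']
print(utf16_code_units(0x10000))  # ['D800', 'DC00']
print(utf16_code_units(0x1D11E))  # ['D834', 'DD1E']
```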

You can find out more about Unicode encoding on Wikipedia.

Exercise: ASCII and Unicode

Without using the crib table (you won't get it in the exam!) answer the following questions:

The ASCII code for the letter 'D' is 100 0100. What is the letter 'G' stored as?

Answer:

100 0111 - as it is 3 characters further on in the alphabet

The ASCII code for the letter 's' is 111 0011. What is the letter 'm' stored as?

Answer:

110 1101 - as it is 6 characters earlier in the alphabet
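
Both answers rely on letters having consecutive ASCII codes, so moving through the alphabet just adds to or subtracts from the code. The sketch below checks the working; the helper names `ascii_bits` and `shift_letter` are made up for this illustration.

```python
def ascii_bits(char: str) -> str:
    """Return the 7-bit ASCII code of char, written as 3 bits + 4 bits."""
    bits = format(ord(char), "07b")
    return bits[:3] + " " + bits[3:]


def shift_letter(char: str, offset: int) -> str:
    """Return the character whose code is offset places away from char."""
    return chr(ord(char) + offset)


print(ascii_bits("D"))                    # 100 0100
print(ascii_bits(shift_letter("D", 3)))   # 100 0111 (the letter G)
print(ascii_bits("s"))                    # 111 0011
print(ascii_bits(shift_letter("s", -6)))  # 110 1101 (the letter m)
```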

Give a benefit of using ASCII:

Answer:

Each character takes up only 7 bits (usually stored in a single 8-bit byte), meaning that storing data in ASCII may take up less memory than a wider Unicode encoding such as UTF-16 or UTF-32

Give a benefit of using Unicode over ASCII:

Answer:

Unicode covers a much larger character set than ASCII. With ASCII you are limited to the basic Latin character set, whereas Unicode can represent characters from other languages.

How many different characters can 7-bit ASCII represent?

Answer:

2^7 = 128

You are designing a computer system for use worldwide. What character encoding scheme should you use, and why?

Answer:

Unicode, as it would allow you to display non-Latin scripts such as Devanagari (used to write Hindi) and Cyrillic