Fundamentals of Data Representation: Unicode

The problem with ASCII is that it can represent only a small number of characters: 128 in standard 7-bit ASCII, or 256 in the 8-bit Extended ASCII variants. This might be OK if you only ever write in English, but what happens if your language uses a different writing system? For example:

Chinese characters 汉字
Japanese characters 漢字
Cyrillic Кири́ллица
Gujarati ગુજરાતી
Urdu اردو

You can see that we quickly run into trouble, as ASCII can't possibly store the many thousands of extra characters these scripts need in just 7 bits. What we use instead is Unicode, which assigns every character a number called a code point. There are several Unicode encodings, each using a different number of bits to store a code point:

Name	Description
UTF-8	8-bit, variable-width encoding and the most common Unicode format. ASCII characters take just 8 bits and keep their ASCII values, maximizing compatibility with ASCII. Other characters take 16, 24 or 32 bits. (The original design allowed sequences of up to 48 bits, but the current standard stops at 32.)
UTF-16	16-bit, variable-width encoding. Most characters in common use take 16 bits; the rest take 32 bits.
UTF-32	32-bit, fixed-width encoding. Each character takes exactly 32 bits.
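
The sizes in the table can be checked with a short Python sketch. It is only an illustration: the sample characters and the helper name `encoded_bits` are chosen here, not taken from any specification, and the `-le` codec variants are used so that Python does not add a byte order mark to the output.

```python
# Compare how many bits a character needs in each Unicode encoding.
# The sample characters below are arbitrary choices for illustration.
samples = ["z", "é", "水", "\U0001D11E"]  # last one is MUSICAL SYMBOL G CLEF


def encoded_bits(ch: str, encoding: str) -> int:
    """Return the number of bits `ch` occupies in the given encoding."""
    return len(ch.encode(encoding)) * 8


for ch in samples:
    sizes = {
        enc: encoded_bits(ch, enc)
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }
    print(f"U+{ord(ch):04X}", sizes)
```

Running it should show UTF-8 growing from 8 to 32 bits across the four samples, UTF-16 using 16 bits for the first three and 32 for the last, and UTF-32 always using 32 bits.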

With room for over a million code points (U+0000 to U+10FFFF), Unicode should be able to store every character from every language on the planet. Take a look at these examples:

Code point	Glyph	Character	UTF-16 code units (hex)
U+007A	z	LATIN SMALL LETTER Z	007A
U+6C34	水	CJK UNIFIED IDEOGRAPH-6C34 (water)	6C34
U+10000	𐀀	LINEAR B SYLLABLE B008 A	D800, DC00
U+1D11E	𝄞	MUSICAL SYMBOL G CLEF	D834, DD1E
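
The last two rows need two 16-bit code units each, known as a surrogate pair, because their code points do not fit in 16 bits. The following is a minimal sketch of how such pairs are worked out; the function name `utf16_code_units` is made up for this example, and in practice Python's built-in `str.encode("utf-16")` does the same job.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units for a single Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode character code point")
    if code_point < 0x10000:
        return [code_point]              # fits in one 16-bit code unit
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]


for cp in (0x007A, 0x6C34, 0x10000, 0x1D11E):
    units = ", ".join(f"{u:04X}" for u in utf16_code_units(cp))
    print(f"U+{cp:04X} -> {units}")
```

The printed code units should match the right-hand column of the table.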

You can find out more about Unicode encoding on Wikipedia.

Exercise: ASCII and Unicode

Without using the crib table (you won't get it in the exam!) answer the following questions:

The ASCII code for the letter 'D' is 100 0100. What is the letter 'G' stored as?

Answer:

100 0111, as 'G' is 3 characters further on in the alphabet (100 0100 + 11 = 100 0111)

The ASCII code for the letter 's' is 111 0011. What is the letter 'm' stored as?

Answer:

110 1101, as 'm' is 6 characters earlier in the alphabet (111 0011 - 110 = 110 1101)

Give a benefit of using ASCII:

Answer:

Each character takes up only 7 bits (usually stored in one 8-bit byte), meaning that storing data in ASCII may take up less memory than a wider Unicode encoding such as UTF-16 or UTF-32

Give a benefit of using unicode over ASCII:

Answer:

Unicode covers a far larger character set than ASCII. ASCII limits you to unaccented Latin letters, digits and basic punctuation, so it cannot represent characters from most other languages.

How many different characters can 7-bit ASCII represent?

Answer:

2^7 = 128

You are designing a computer system for use worldwide. What character encoding scheme should you use, and why?

Answer:

Unicode, as it would allow you to display non-Latin scripts such as Devanagari (used for Hindi) and Cyrillic
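
The ASCII arithmetic in the first two questions, and the 2^7 count, can be checked after you have attempted them by hand. This is a small Python sketch; the helper names `ascii_bits` and `shift_letter` are invented for this example.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit ASCII code of `ch` as a binary string."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    return format(code, "07b")


def shift_letter(ch: str, steps: int) -> str:
    """Return the character `steps` positions after `ch` (negative = before)."""
    return chr(ord(ch) + steps)


print(ascii_bits(shift_letter("D", 3)))   # code for 'G'
print(ascii_bits(shift_letter("s", -6)))  # code for 'm'
print(2 ** 7)                             # characters in 7-bit ASCII
```

This works because ASCII stores each run of letters in alphabetical order with consecutive codes, so moving along the alphabet is just adding to or subtracting from the code.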
