Fundamentals of Data Representation: ASCII and Unicode

ASCII

The 104-key PC US English QWERTY keyboard layout evolved from the standard typewriter keyboard, with extra keys for computing.

ASCII characters are normally stored in 8 bits (1 byte). However, the 8th bit is traditionally used as a check bit (a parity bit), meaning that only 7 bits are available to store each character. This gives ASCII the ability to store a total of

2^7 = 128 different values.
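
The arithmetic above can be sketched in a few lines of Python. The text only says that the 8th bit is used for checking; the even-parity scheme below is an assumption chosen for illustration, and `with_even_parity` is a hypothetical helper name.

```python
def with_even_parity(code):
    """Return an 8-bit value: a 7-bit ASCII code with an even-parity bit on top."""
    if not 0 <= code < 2 ** 7:
        raise ValueError("not a 7-bit ASCII code")
    ones = bin(code).count("1")   # how many 1 bits the 7-bit code contains
    parity = ones % 2             # 1 only if an extra 1 is needed to make the total even
    return (parity << 7) | code   # place the parity bit in the 8th (leftmost) position


print(2 ** 7)                                      # number of distinct 7-bit codes
print(format(with_even_parity(ord("A")), "08b"))   # 'A' = 100 0001 contains two 1s
print(format(with_even_parity(ord("C")), "08b"))   # 'C' = 100 0011 contains three 1s
```

With this scheme 'A' keeps a leading 0 (0100 0001), while 'C' gains a leading 1 (1100 0011) so that the total number of 1s is even.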

ASCII values can represent several kinds of character:

Numbers
Letters (capitals and lower case are separate)
Punctuation (?/|\$ etc.)
Non-printing commands (enter, escape, tab)

Take a look at your keyboard and see how many different keys you have. The number should be 104 for a Windows keyboard, or 101 for a traditional keyboard. Once you allow for the shifted values (a, A; b, B etc.) and recognise that some keys duplicate others (two shift keys, the num pad), a keyboard can perform roughly 128 functions.

Binary	Dec	Hex	Abbr
000 0000	0	00	NUL
000 0001	1	01	SOH
000 0010	2	02	STX
000 0011	3	03	ETX
000 0100	4	04	EOT
000 0101	5	05	ENQ
000 0110	6	06	ACK
000 0111	7	07	BEL
000 1000	8	08	BS
000 1001	9	09	HT
000 1010	10	0A	LF
000 1011	11	0B	VT
000 1100	12	0C	FF
000 1101	13	0D	CR
000 1110	14	0E	SO
000 1111	15	0F	SI
001 0000	16	10	DLE
001 0001	17	11	DC1
001 0010	18	12	DC2
001 0011	19	13	DC3
001 0100	20	14	DC4
001 0101	21	15	NAK
001 0110	22	16	SYN
001 0111	23	17	ETB
001 1000	24	18	CAN
001 1001	25	19	EM
001 1010	26	1A	SUB
001 1011	27	1B	ESC
001 1100	28	1C	FS
001 1101	29	1D	GS
001 1110	30	1E	RS
001 1111	31	1F	US

111 1111	127	7F	DEL

Binary	Dec	Hex	Glyph
010 0000	32	20	(space)
010 0001	33	21	!
010 0010	34	22	"
010 0011	35	23	#
010 0100	36	24	$
010 0101	37	25	%
010 0110	38	26	&
010 0111	39	27	'
010 1000	40	28	(
010 1001	41	29	)
010 1010	42	2A	*
010 1011	43	2B	+
010 1100	44	2C	,
010 1101	45	2D	-
010 1110	46	2E	.
010 1111	47	2F	/
011 0000	48	30	0
011 0001	49	31	1
011 0010	50	32	2
011 0011	51	33	3
011 0100	52	34	4
011 0101	53	35	5
011 0110	54	36	6
011 0111	55	37	7
011 1000	56	38	8
011 1001	57	39	9
011 1010	58	3A	:
011 1011	59	3B	;
011 1100	60	3C	<
011 1101	61	3D	=
011 1110	62	3E	>
011 1111	63	3F	?

Binary	Dec	Hex	Glyph
100 0000	64	40	@
100 0001	65	41	A
100 0010	66	42	B
100 0011	67	43	C
100 0100	68	44	D
100 0101	69	45	E
100 0110	70	46	F
100 0111	71	47	G
100 1000	72	48	H
100 1001	73	49	I
100 1010	74	4A	J
100 1011	75	4B	K
100 1100	76	4C	L
100 1101	77	4D	M
100 1110	78	4E	N
100 1111	79	4F	O
101 0000	80	50	P
101 0001	81	51	Q
101 0010	82	52	R
101 0011	83	53	S
101 0100	84	54	T
101 0101	85	55	U
101 0110	86	56	V
101 0111	87	57	W
101 1000	88	58	X
101 1001	89	59	Y
101 1010	90	5A	Z
101 1011	91	5B	[
101 1100	92	5C	\
101 1101	93	5D	]
101 1110	94	5E	^
101 1111	95	5F	_

Binary	Dec	Hex	Glyph
110 0000	96	60	`
110 0001	97	61	a
110 0010	98	62	b
110 0011	99	63	c
110 0100	100	64	d
110 0101	101	65	e
110 0110	102	66	f
110 0111	103	67	g
110 1000	104	68	h
110 1001	105	69	i
110 1010	106	6A	j
110 1011	107	6B	k
110 1100	108	6C	l
110 1101	109	6D	m
110 1110	110	6E	n
110 1111	111	6F	o
111 0000	112	70	p
111 0001	113	71	q
111 0010	114	72	r
111 0011	115	73	s
111 0100	116	74	t
111 0101	117	75	u
111 0110	118	76	v
111 0111	119	77	w
111 1000	120	78	x
111 1001	121	79	y
111 1010	122	7A	z
111 1011	123	7B	{
111 1100	124	7C	|
111 1101	125	7D	}
111 1110	126	7E	~

If you look carefully at the ASCII representation of each character you might notice some patterns. For example:

Binary	Dec	Hex	Glyph
110 0001	97	61	a
110 0010	98	62	b
110 0011	99	63	c

As you can see, a = 97, b = 98, c = 99. This means that if we are told what value a character is we can easily work out the value of subsequent or prior characters.
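
This pattern is easy to check in Python, whose built-in `ord()` and `chr()` convert between a character and its code. The helper name `shift_letter` is hypothetical and only used for this sketch.

```python
# ord() gives the code of a character; chr() turns a code back into a character.
for letter in "abc":
    print(letter, ord(letter), format(ord(letter), "07b"))


def shift_letter(known, steps):
    """Return the character `steps` places after (or, if negative, before) `known`."""
    return chr(ord(known) + steps)


print(shift_letter("a", 2))    # two places after 'a'
print(shift_letter("c", -2))   # two places before 'c'
```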

Example: ASCII characters

Without looking at the ASCII table above: if we are told that the ASCII value for the character '5' is 011 0101, what is the ASCII value for '8'?

We know that '8' is three characters after '5', as 5,6,7,8. This means that the ASCII value of '8' will be three bigger than that for '5':

  011 0101  ASCII '5'
+      011
  --------  
  011 1000  ASCII '8'

Checking against the table above, this is the correct value.
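
The same binary addition can be reproduced in Python as a quick check. This is only a sketch of the working shown above; the variable names are illustrative.

```python
five = int("0110101", 2)      # the given 7-bit code for '5'
eight = five + 0b011          # add three (binary 011), as in the working above

print(format(eight, "07b"))   # the result as a 7-bit binary string
print(chr(eight))             # the character that this code represents
```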

If you are worried about making mistakes with binary addition, you can work with the decimal numbers instead. For example, suppose you are given the ASCII value of 'g', 110 0111, and asked for the value of 'e'.

We know that 'e' is two characters before 'g', as e, f, g. This means that the ASCII value of 'e' will be two smaller than that for 'g'.

64 32 16  8  4  2  1
 1  1  0  0  1  1  1 = 103₁₀ = ASCII value of 'g'

103 - 2 = 101₁₀

64 32 16  8  4  2  1
 1  1  0  0  1  0  1 = 101₁₀ = ASCII value of 'e'
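
The place-value method used above can be written out step by step. The following Python sketch uses hypothetical helper names (`bits_to_decimal`, `decimal_to_bits`) and the same column weights as the working.

```python
WEIGHTS = [64, 32, 16, 8, 4, 2, 1]   # column headings for a 7-bit value


def bits_to_decimal(bits):
    """Add up the weights of every column that holds a 1."""
    digits = bits.replace(" ", "")
    return sum(weight for weight, bit in zip(WEIGHTS, digits) if bit == "1")


def decimal_to_bits(value):
    """Work left to right, writing a 1 wherever the weight still fits."""
    digits = []
    for weight in WEIGHTS:
        if value >= weight:
            digits.append("1")
            value -= weight
        else:
            digits.append("0")
    joined = "".join(digits)
    return joined[:3] + " " + joined[3:]   # same 3 + 4 grouping as the tables


g = bits_to_decimal("110 0111")   # the given code for 'g'
e = g - 2                         # 'e' is two characters before 'g'
print(g, e, decimal_to_bits(e))
```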

Exercise: ASCII

Without using the crib table (you won't get it in the exam!) answer the following questions:

The ASCII code for the letter 'Z' is 90 (base 10). What is the letter 'X' stored as?

Answer:

88, as it is 2 characters earlier in the alphabet.

How many ASCII 'characters' does the following piece of text use:

Hello Pete,
ASCII rocks!

Answer:

27 or 26. If you said 23 you'd be wrong, because you must include the non-printing characters at the end of each line. On systems such as Windows, each end of line is marked by a carriage return (CR) followed by a line feed (LF), making the text look like this:

Hello Pete,[CR][LF]
ASCII rocks![CR][LF]

That gives 23 + 4 = 27 characters, or 26 if the final line is ended by a single character.
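
The count can be checked with a short Python sketch. It assumes both lines end with the two-character sequence carriage return plus line feed, written `\r\n` in Python; other systems may end lines differently.

```python
# Both lines of the text, each assumed to end with carriage return + line feed.
text = "Hello Pete,\r\nASCII rocks!\r\n"

visible = sum(1 for ch in text if ch not in "\r\n")   # letters, spaces, punctuation
print(visible)      # characters you can see, including the spaces
print(len(text))    # every character, including the non-printing line endings
```

Here the visible count is 23 and the full count is 27, matching the larger of the two answers above.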

For the Latin alphabet ASCII is generally fine, but what if you wanted to write something in Mandarin, or Hindi? We need another coding scheme!

Extension: Coding ASCII

You might have to use ASCII codes when reading from text files. To see what each ASCII code means we can use the function ChrW(x), which returns the character whose denary code is x. Try out the following code to see the first 128 characters. What is special about character 10?

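' Print each ASCII code from 0 to 127 alongside its character
For x = 0 To 127
  Console.WriteLine("ASCII for " & x & " = " & ChrW(x))
Next
' Wait for Enter so the console window stays open
Console.ReadLine()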

Unicode

The problem with ASCII is that it only allows you to represent a small number of characters (128, or 256 for Extended ASCII). This might be OK if you are living in an English-speaking country, but what happens if you live in a country that uses a different character set? For example:

Chinese characters 汉字
Japanese characters 漢字
Cyrillic Кири́ллица
Gujarati ગુજરાતી
Urdu اردو

You can see that we quickly run into trouble, as ASCII can't possibly store these many thousands of extra characters in just 7 bits. What we use instead is Unicode. There are several Unicode encodings, each using a different number of bits to store data:

Name	Description
UTF-8	The most common Unicode encoding. Characters can take as little as 8 bits, maximizing compatibility with ASCII, but the variable-width encoding expands to 16, 24 or 32 bits for characters outside the ASCII range.
UTF-16	16-bit, variable-width encoding; can expand to 32 bits.
UTF-32	32-bit, fixed-width encoding. Each character takes exactly 32 bits.
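
The difference between these encodings can be seen by measuring a few characters. This is a minimal Python sketch: `bits_needed` is a hypothetical helper, and the `-be` codec variants are used so that Python does not add a byte-order mark to the count.

```python
# Three sample characters, written as escapes so the source stays plain ASCII:
# z, the CJK character for water (U+6C34), and the musical G clef (U+1D11E).
samples = ["z", "\u6c34", "\U0001d11e"]


def bits_needed(ch, encoding):
    """Number of bits that one character occupies in the given encoding."""
    return len(ch.encode(encoding)) * 8


for ch in samples:
    sizes = [bits_needed(ch, enc) for enc in ("utf-8", "utf-16-be", "utf-32-be")]
    print(f"U+{ord(ch):04X}", sizes)
```

For these samples the UTF-8 sizes are 8, 24 and 32 bits, the UTF-16 sizes are 16, 16 and 32 bits, and UTF-32 always uses 32 bits.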

With over a million possible code points, Unicode should be able to store every character from every language on the planet. Take a look at these examples:

code point	glyph	character	UTF-16 code units (hex)
U+007A	z	LATIN SMALL LETTER Z	007A
U+6C34	水	CJK UNIFIED IDEOGRAPH-6C34 (water)	6C34
U+10000	𐀀	LINEAR B SYLLABLE B008 A	D800, DC00
U+1D11E	𝄞	MUSICAL SYMBOL G CLEF	D834, DD1E
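
The two-unit entries in the last column are surrogate pairs, which UTF-16 uses for code points above U+FFFF. The Python sketch below shows how they can be calculated; the function name `utf16_code_units` is hypothetical and the code does no validation of its input.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as integers) for one Unicode code point."""
    if code_point < 0x10000:
        return [code_point]            # fits in a single 16-bit unit
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return [high, low]


for cp in (0x007A, 0x6C34, 0x10000, 0x1D11E):
    units = ", ".join(format(unit, "04X") for unit in utf16_code_units(cp))
    print(f"U+{cp:04X}", units)
```

The printed code units match the last column of the table above.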

You can find out more about Unicode encoding on Wikipedia.

Exercise: ASCII and Unicode

Without using the crib table (you won't get it in the exam!) answer the following questions:

The ASCII code for the letter 'D' is 100 0100. What is the letter 'G' stored as?

Answer:

100 0111 - as it is 3 characters further on in the alphabet

The ASCII code for the letter 's' is 111 0011. What is the letter 'm' stored as?

Answer:

110 1101, as it is 6 characters earlier in the alphabet.

Give a benefit of using ASCII:

Answer:

Each character only takes up 8 bits, meaning that storing data in ASCII may take up less memory than Unicode.

Give a benefit of using Unicode over ASCII:

Answer:

Unicode can represent a far larger character set than ASCII, so you are not limited to the Latin character set and can represent characters from other languages.

How many different characters can 7-bit ASCII represent?

Answer:

2^7 = 128

You are designing a computer system for use worldwide, what character encoding scheme should you use and why?

Answer:

Unicode, as it would allow you to display non-Latin character sets such as Hindi and Cyrillic.