Fundamentals of Data Representation: ASCII and unicode

From Wikibooks, open books for an open world

ASCII

The 104-key PC US English QWERTY keyboard layout evolved from the standard typewriter keyboard, with extra keys for computing.

ASCII is a 7-bit code, although each character is normally stored in 8 bits (1 byte). The 8th bit was often used as a parity (check) bit, meaning that only 7 bits are available to store each character. This gives ASCII the ability to store a total of 2^7 = 128 different values.
The 95 printable ASCII characters, numbered from 32 to 126 (decimal)

ASCII values can represent many kinds of character:

  • Numbers
  • Letters (capitals and lower case are separate)
  • Punctuation (?/|\$ etc.)
  • Non-printing control characters (enter, escape, backspace)

Take a look at your keyboard and see how many different keys you have. The number should be 104 for a Windows keyboard, or 101 for a traditional keyboard. Allowing for the shifted values (a, A; b, B etc.), and recognising that some keys have repeated functionality (two shift keys, the num pad), we have roughly 128 functions that a keyboard can perform.

Binary Dec Hex Abbr
000 0000 0 00 NUL
000 0001 1 01 SOH
000 0010 2 02 STX
000 0011 3 03 ETX
000 0100 4 04 EOT
000 0101 5 05 ENQ
000 0110 6 06 ACK
000 0111 7 07 BEL
000 1000 8 08 BS
000 1001 9 09 HT
000 1010 10 0A LF
000 1011 11 0B VT
000 1100 12 0C FF
000 1101 13 0D CR
000 1110 14 0E SO
000 1111 15 0F SI
001 0000 16 10 DLE
001 0001 17 11 DC1
001 0010 18 12 DC2
001 0011 19 13 DC3
001 0100 20 14 DC4
001 0101 21 15 NAK
001 0110 22 16 SYN
001 0111 23 17 ETB
001 1000 24 18 CAN
001 1001 25 19 EM
001 1010 26 1A SUB
001 1011 27 1B ESC
001 1100 28 1C FS
001 1101 29 1D GS
001 1110 30 1E RS
001 1111 31 1F US
111 1111 127 7F DEL
Binary Dec Hex Glyph
010 0000 32 20 (space)
010 0001 33 21 !
010 0010 34 22 "
010 0011 35 23 #
010 0100 36 24 $
010 0101 37 25 %
010 0110 38 26 &
010 0111 39 27 '
010 1000 40 28 (
010 1001 41 29 )
010 1010 42 2A *
010 1011 43 2B +
010 1100 44 2C ,
010 1101 45 2D -
010 1110 46 2E .
010 1111 47 2F /
011 0000 48 30 0
011 0001 49 31 1
011 0010 50 32 2
011 0011 51 33 3
011 0100 52 34 4
011 0101 53 35 5
011 0110 54 36 6
011 0111 55 37 7
011 1000 56 38 8
011 1001 57 39 9
011 1010 58 3A :
011 1011 59 3B ;
011 1100 60 3C <
011 1101 61 3D =
011 1110 62 3E >
011 1111 63 3F ?
Binary Dec Hex Glyph
100 0000 64 40 @
100 0001 65 41 A
100 0010 66 42 B
100 0011 67 43 C
100 0100 68 44 D
100 0101 69 45 E
100 0110 70 46 F
100 0111 71 47 G
100 1000 72 48 H
100 1001 73 49 I
100 1010 74 4A J
100 1011 75 4B K
100 1100 76 4C L
100 1101 77 4D M
100 1110 78 4E N
100 1111 79 4F O
101 0000 80 50 P
101 0001 81 51 Q
101 0010 82 52 R
101 0011 83 53 S
101 0100 84 54 T
101 0101 85 55 U
101 0110 86 56 V
101 0111 87 57 W
101 1000 88 58 X
101 1001 89 59 Y
101 1010 90 5A Z
101 1011 91 5B [
101 1100 92 5C \
101 1101 93 5D ]
101 1110 94 5E ^
101 1111 95 5F _
Binary Dec Hex Glyph
110 0000 96 60 `
110 0001 97 61 a
110 0010 98 62 b
110 0011 99 63 c
110 0100 100 64 d
110 0101 101 65 e
110 0110 102 66 f
110 0111 103 67 g
110 1000 104 68 h
110 1001 105 69 i
110 1010 106 6A j
110 1011 107 6B k
110 1100 108 6C l
110 1101 109 6D m
110 1110 110 6E n
110 1111 111 6F o
111 0000 112 70 p
111 0001 113 71 q
111 0010 114 72 r
111 0011 115 73 s
111 0100 116 74 t
111 0101 117 75 u
111 0110 118 76 v
111 0111 119 77 w
111 1000 120 78 x
111 1001 121 79 y
111 1010 122 7A z
111 1011 123 7B {
111 1100 124 7C |
111 1101 125 7D }
111 1110 126 7E ~
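
Each row of the tables above is the same number written in three ways, followed by the character it stands for. As a rough sketch in Python (the function name ascii_row is made up for this illustration), a row can be rebuilt from the decimal code alone:

```python
def ascii_row(code):
    """Return one table row: 7-bit binary, decimal, hex and glyph."""
    bits = format(code, "07b")          # e.g. 65 -> "1000001"
    binary = bits[:3] + " " + bits[3:]  # split 3 + 4 as in the tables
    return f"{binary} {code} {code:02X} {chr(code)}"

# Rebuild the rows for 'A', 'B' and 'C'
for code in range(65, 68):
    print(ascii_row(code))
```

This only makes sense for the printable codes 32 to 126, because the control characters have no visible glyph.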

If you look carefully at the ASCII representation of each character you might notice some patterns. For example:

Binary Dec Hex Glyph
110 0001 97 61 a
110 0010 98 62 b
110 0011 99 63 c

As you can see, a = 97, b = 98, c = 99. This means that if we are told what value a character is we can easily work out the value of subsequent or prior characters.
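
The same idea can be tried out in code. The sketch below uses Python's built-in ord (character to code) and chr (code to character); shift_char is a hypothetical helper written just for this example.

```python
def shift_char(ch, offset):
    """Return the character `offset` places after `ch` in the ASCII table."""
    return chr(ord(ch) + offset)

# Consecutive letters have consecutive codes
print(ord("a"), ord("b"), ord("c"))

# Two places after 'a', and one place before 'c'
print(shift_char("a", 2))
print(shift_char("c", -1))
```

A negative offset moves backwards through the table, which is how you find a prior character.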

Example: ASCII characters

Without looking at the ASCII table above: if we are told that the ASCII value for the character '5' is 011 0101, what is the ASCII value for '8'?

We know that '8' is three characters after '5', as 5,6,7,8. This means that the ASCII value of '8' will be three bigger than that for '5':

  011 0101  ASCII '5'
+      011
  --------  
  011 1000  ASCII '8'

Checking the table above confirms that this is the correct value.
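
If you want to check a binary addition like this one, a short Python sketch can do the sum for you. The helper name add_to_ascii_bits is an assumption for this example, and it expects a 7-bit pattern written with or without the space.

```python
def add_to_ascii_bits(bits, offset):
    """Add `offset` to a 7-bit binary string and return the 7-bit result."""
    value = int(bits.replace(" ", ""), 2)  # binary string -> integer
    return format(value + offset, "07b")   # integer -> 7-bit binary string

five = "011 0101"                   # ASCII '5'
eight = add_to_ascii_bits(five, 3)  # '8' is three characters after '5'
print(eight)
print(chr(int(eight, 2)))           # the character that pattern stands for
```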

If you are worried about making mistakes with binary addition, you can deal with the decimal numbers instead. Suppose you are given the ASCII value of 'g', 110 0111. What is the value of 'e'?

We know that 'e' is two characters before 'g', as e, f, g. This means that the ASCII value of 'e' will be two smaller than that for 'g'.

64 32 16  8  4  2  1
 1  1  0  0  1  1  1 = 103 (base 10) = ASCII value of 'g'

103 - 2 = 101 (base 10)

64 32 16  8  4  2  1
 1  1  0  0  1  0  1 = 101 (base 10) = ASCII value of 'e'
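
The place-value method above can be sketched in Python as follows. The names PLACE_VALUES, bits_to_decimal and decimal_to_bits are invented for this illustration, and the code assumes 7-bit patterns and values below 128.

```python
PLACE_VALUES = [64, 32, 16, 8, 4, 2, 1]

def bits_to_decimal(bits):
    """Add up the place values wherever the bit pattern has a 1."""
    digits = [int(b) for b in bits if b in "01"]  # ignore spaces
    return sum(weight * digit for weight, digit in zip(PLACE_VALUES, digits))

def decimal_to_bits(value):
    """Build a 7-bit pattern by subtracting place values, largest first."""
    bits = []
    for weight in PLACE_VALUES:
        if value >= weight:
            bits.append("1")
            value -= weight
        else:
            bits.append("0")
    return "".join(bits)

g = bits_to_decimal("110 0111")  # ASCII value of 'g'
e = g - 2                        # 'e' is two characters before 'g'
print(g, e, decimal_to_bits(e))
```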

Exercise: ASCII

Without using the crib table (you won't get it in the exam!) answer the following questions:

The ASCII code for the letter 'Z' is 90 (base 10). What is the letter 'X' stored as?

Answer:

88 - as it is 2 characters earlier in the alphabet

How many ASCII 'characters' does the following piece of text use?

Hello Pete,
ASCII rocks!

Answer:

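27 or 26. If you said 23 you'd be wrong, because you must include the non-printing characters at the end of each line. Each line ends with an end-of-line (EOL) character, which in ASCII is the line feed (LF), and starting a new line also needs a carriage return (CR). That makes 26 characters for the text as shown below, or 27 if the last line also ends with a carriage return: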

Hello Pete,[EOL][CR]
ASCII rocks![EOL]
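
You can check this count with a few lines of Python. The sketch below assumes that [EOL] is stored as the line feed character (code 10, written "\n") and [CR] as the carriage return (code 13, written "\r"). Real files differ in which of these they use, so the total depends on the system.

```python
lines = ["Hello Pete,", "ASCII rocks!"]

# Characters you can see, including the spaces
printable = sum(len(line) for line in lines)

# The text exactly as laid out above: [EOL][CR] after line 1, [EOL] after line 2
as_shown = "Hello Pete,\n\rASCII rocks!\n"

# Codes below 32 are the non-printing control characters
control = sum(1 for ch in as_shown if ord(ch) < 32)

print(printable, control, len(as_shown))
```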

For the Latin alphabet ASCII is generally fine, but what if you wanted to write something in Mandarin, or Hindi? We need another coding scheme!

Extension: Coding ASCII

You might have to use ASCII codes when reading from text files. To see which character each ASCII code stands for we can use the function ChrW(x), which returns the character whose code has the denary value x. Try out the following code to see the first 128 characters. What is special about character 10?

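Module Module1
    Sub Main()
        For x As Integer = 0 To 127 ' every 7-bit ASCII code
            Console.WriteLine("ASCII for " & x & " = " & ChrW(x))
        Next
        Console.ReadLine() ' wait for Enter so the window stays open
    End Sub
End Module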

Unicode

The problem with ASCII is that it only allows you to represent a small number of characters (128, or 256 for Extended ASCII). This might be OK if you are living in an English-speaking country, but what happens if you live in a country that uses a different character set?

We quickly run into trouble, as ASCII can't possibly store the many thousands of extra characters used by other writing systems in just 7 bits. What we use instead is Unicode. There are several Unicode encodings, each using a different number of bits to store data:

Name Description
UTF-8 8-bit, variable-width encoding, and the most common Unicode format. Characters can take as little as 8 bits, maximizing compatibility with ASCII, but a character can expand to 16, 24 or 32 bits when dealing with larger sets of characters.
UTF-16 16-bit, variable-width encoding. Can expand to 32 bits.
UTF-32 32-bit, fixed-width encoding. Each character takes exactly 32 bits.
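
To see the difference between the three encodings, the Python sketch below counts how many bits each one needs for three sample characters. The helper encoded_bits is a name made up for this example. The "-be" codec names are used so that Python does not add a byte order mark, which would inflate the count.

```python
def encoded_bits(ch, encoding):
    """Return how many bits `ch` occupies in the given encoding."""
    return len(ch.encode(encoding)) * 8

# 'z' (U+007A), the CJK ideograph for water (U+6C34), Linear B syllable (U+10000)
samples = ["z", "\u6c34", "\U00010000"]

for ch in samples:
    sizes = [encoded_bits(ch, enc) for enc in ("utf-8", "utf-16-be", "utf-32-be")]
    print(f"U+{ord(ch):04X}", sizes)
```

The UTF-8 and UTF-16 sizes change from character to character, while UTF-32 always uses 32 bits.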

With over a million possible code points, Unicode should be able to store every character from every language on the planet. Take a look at these examples:

code point glyph character UTF-16 code units (hex)
U+007A z LATIN SMALL LETTER Z 007A
U+6C34 水 CJK UNIFIED IDEOGRAPH-6C34 (water) 6C34
U+10000 𐀀 LINEAR B SYLLABLE B008 A D800, DC00
U+1D11E 𝄞 MUSICAL SYMBOL G CLEF D834, DD1E
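
The last two rows need two 16-bit code units each, because their code points do not fit in 16 bits. The sketch below shows how such a pair (a "surrogate pair") can be worked out. The function name utf16_code_units is hypothetical, and the code assumes it is given a valid code point that is not itself in the range D800 to DFFF.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units for one code point as hex strings."""
    if code_point < 0x10000:
        return [f"{code_point:04X}"]        # fits in a single 16-bit unit
    offset = code_point - 0x10000           # leaves a 20-bit number
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return [f"{high:04X}", f"{low:04X}"]

for cp in (0x007A, 0x6C34, 0x10000, 0x1D11E):
    print(f"U+{cp:04X}", utf16_code_units(cp))
```

The results can be compared with the right-hand column of the table above.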

You can find out more about Unicode encoding on Wikipedia.

Exercise: ASCII and Unicode

Without using the crib table (you won't get it in the exam!) answer the following questions:

The ASCII code for the letter 'D' is 100 0100. What is the letter 'G' stored as?

Answer:

100 0111 - as it is 3 characters further on in the alphabet

The ASCII code for the letter 's' is 111 0011. What is the letter 'm' stored as?

Answer:

110 1101 - as it is 6 characters earlier in the alphabet

Give a benefit of using ASCII:

Answer:

Each character only takes up 8 bits, meaning that storing data in ASCII may take up less memory than Unicode.

Give a benefit of using Unicode over ASCII:

Answer:

Unicode stores a much larger character set than ASCII, meaning that you are not limited to the Latin character set and can represent characters from other languages.

How many different characters can 7-bit ASCII represent?

Answer:

2^7 = 128

You are designing a computer system for use worldwide. What character encoding scheme should you use, and why?

Answer:

Unicode, as it would allow you to display non-Latin scripts such as Devanagari (used for Hindi) and Cyrillic.