Fundamentals of data representation: Information coding systems
ASCII and Unicode
A standard method for representing all the keyboard characters, including the numbers and other commonly used symbols, is ASCII, the American Standard Code for Information Interchange. The original standard is a 7-bit code; the extended version is an 8-bit code allowing for 256 characters.
The limitations of ASCII:
- 256 characters are not sufficient to represent all of the possible characters, numbers and symbols.
- It was initially developed in English and therefore did not represent all of the other languages and scripts in the world.
- Widespread use of the web made it more important to have a universal international coding system.
- The range of platforms and programs has increased dramatically, with more developers from around the world using a much wider range of characters.
As a result, a new standard called Unicode has emerged. It follows the same basic principle as ASCII: in one of its forms it has a unique 8-bit code for every keyboard character on a standard English keyboard.
ASCII codes have been subsumed within Unicode, meaning that the ASCII code for capital letter A is 65 and so is the Unicode code for the same character. Unicode also includes international characters for over 29 countries and even includes classical and ancient characters.
To represent these extra characters it is obviously necessary to use more than 8 bits per character, and there are two common encodings of Unicode in use today: UTF-8 and UTF-16. As the names suggest, these use 8-bit and 16-bit units respectively, and both can combine more than one unit to represent characters outside their basic range.
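The relationship between ASCII and Unicode described above can be checked directly in Python, where `ord` gives a character's Unicode code point and `encode` produces the bytes of a chosen encoding (the euro sign is just one illustrative non-Latin character):

```python
# ASCII is a subset of Unicode: 'A' has code 65 in both systems.
print(ord('A'))                    # 65

# In UTF-8, ASCII characters still occupy a single byte...
print(len('A'.encode('utf-8')))    # 1

# ...while UTF-16 uses 16-bit units even for them.
print(len('A'.encode('utf-16-be')))  # 2

# Characters outside the ASCII range need more UTF-8 bytes, e.g. the euro sign:
print(len('€'.encode('utf-8')))    # 3
```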
Error checking and correction
A parity bit is a method of detecting errors in data during transmission. When data is sent, it travels as a series of 0s and 1s.
In the figure above, a Unicode character is transmitted as the binary code 0111000110101011. It is quite possible that this code could get corrupted as it is passed around, either inside the computer or across a network.
|Parity can only detect odd numbers of errors and cannot repair the damaged bits.|
One method for detecting errors is to count the number of 1s in each byte before the data is sent, to see whether there is an even or odd number. In the top example the parity bit is set to 0 to maintain an even number of 1s. At the receiving end, the code can be checked to see whether the number is still odd or even.
|We have to remember to count 1s not 0s and the parity bit is normally put as the MSB.|
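A minimal sketch of even parity in Python (the function names are illustrative, not from the text). The parity bit is placed as the MSB, as the note above reminds us:

```python
def add_even_parity(bits: str) -> str:
    """Prefix a parity bit (the MSB) so the total number of 1s is even."""
    parity = '1' if bits.count('1') % 2 else '0'
    return parity + bits

def check_even_parity(received: str) -> bool:
    """Return True if the received code still contains an even number of 1s."""
    return received.count('1') % 2 == 0

sent = add_even_parity('0111011')   # five 1s, so the parity bit is 1
print(sent)                         # '10111011'
print(check_even_parity(sent))      # True - no error detected
print(check_even_parity('10111001'))  # False - a single flipped bit is caught
```

Note that flipping any two bits of `sent` would leave the count even, which is exactly the limitation stated above: parity only detects odd numbers of errors.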
Majority voting is another method of identifying errors in transmitted data. In this case each bit is sent three times. So the binary code 1001 would be sent as: 111 000 000 111.
When the data is checked, you would expect to see patterns of three bits. In this case, it is 111 for the first bit, then 000 and so on. Where there is a discrepancy, you can use majority voting to see which bit occurs the most frequently. For example, suppose the same code 1001 was received as: 110 010 000 111.
You can assume that the first bit should be 1, as two out of the three bits are 1, and that the second bit is 0, as two of the three bits are 0. The last two bits are 0 and 1, as there appear to be no errors in this part of the code.
|Majority vote can repair errors but if two errors are made on the same bit then it will not be detected. And of course three times as much data must be transmitted.|
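The triple-and-vote scheme can be sketched in a few lines of Python (helper names are illustrative). Decoding keeps whichever bit appears at least twice in each group of three:

```python
def triple(bits: str) -> str:
    """Send each bit three times: '1001' -> '111000000111'."""
    return ''.join(b * 3 for b in bits)

def majority_decode(received: str) -> str:
    """Split into groups of three and keep the bit that occurs most often."""
    groups = [received[i:i + 3] for i in range(0, len(received), 3)]
    return ''.join('1' if g.count('1') >= 2 else '0' for g in groups)

print(triple('1001'))                 # '111000000111'
corrupted = '110010000111'            # one bit flipped in each of the first two groups
print(majority_decode(corrupted))     # '1001' - both errors repaired
```

As the note above warns, two flips within the same group of three would outvote the correct bit, so the error would go undetected.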
Like a parity bit, a check digit is a value added to the end of a number to try to ensure that the number is not corrupted in any way. The check digit is created by taking the digits that make up the number itself and processing them in some way to produce a single digit. The simplest but most error-prone method is to add the digits of the number together, and keep on adding the digits until only a single digit remains.
So the digits of 123456 add up to 21, and 2 and 1 in turn add up to 3, so the number with its check digit becomes 1234563. When the data is processed, the check digit is recalculated and compared with the digit that has been transmitted. Where the check digit is the same, it is assumed that the data is correct. Where there is a discrepancy, an error message is generated.
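The repeated digit-sum method described above (sometimes called the digital root) can be sketched as follows, assuming the number is handled as a string of digits:

```python
def digit_sum_check_digit(number: str) -> str:
    """Repeatedly sum the digits until a single digit remains."""
    total = number
    while len(total) > 1:
        total = str(sum(int(d) for d in total))
    return total

def with_check_digit(number: str) -> str:
    """Append the check digit to the original number."""
    return number + digit_sum_check_digit(number)

print(with_check_digit('123456'))   # '1234563' (1+2+3+4+5+6 = 21, then 2+1 = 3)
```

A receiver would recompute `digit_sum_check_digit` on the first six digits and compare it with the transmitted seventh digit. Note that swapping two digits (e.g. 213456) gives the same sum, which is one reason this method is error-prone.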
- Binary codes can be used to represent text, characters, numbers, graphics, video and audio.
- ASCII and Unicode are systems for representing characters.
- It is possible that the data can get corrupted at any point when it is being either processed or transmitted.
- Error detection and correction methods include check digits and majority voting.
A character code uses a unique number/code to represent each different character
b = 1100010 e = 1100101
ASCII uses 7 or 8 bits per character and represents only Latin characters and extended symbols. Unicode uses 16 or more bits per character and can represent characters from effectively any language.
Unicode contains ASCII as a subset so every ASCII character can also be stored in Unicode. ASCII characters have the same character codes as they do in Unicode.
Parity checks are quick and relatively cheap in terms of data transmission, but only detect odd numbers of errors and cannot repair data. Check digits require more processing and can detect many kinds of error, but they cannot repair data either. Majority voting can catch a lot of errors and requires little processing; it can repair errors, but takes three times the amount of data for transmission.
The character has been received correctly as there is an odd number of 1s.
0 0 1 1 1 0 0 1.