Data Compression/Coding

From Wikibooks, open books for an open world
Jump to: navigation, search


Most people think that compression is mostly about coding. Well at one time it was. Before information theory, people spent years developing the perfect code to store data efficiently. To understand the limits of coding as a compression mechanism, we have to understand what coding is.

The ASCII binary translation of the word Wikipedia morphs into the English translation.

The computer really understands only one type of data, strings of 0s and 1s. You know 0 is off, 1 is on, and all that rot. These actually have to do with the state of a memory cell, where 0 is a low voltage output of the cell, and 1 is a slightly higher voltage. The actual voltage spread involved is getting smaller as the size of the circuits gets smaller and the voltage needed to jump the gap between parallel circuits gets smaller.

Hardware configuration used to define the size of a code, by limiting the number of bits as the two state memory element is called, that could be moved at a single pass. An example is the 4bit 8bit, 16bit 32bit, 64bit, and now 128bit architectures found during the development of microcomputers.

Now essentially everything in a computer is stored as sequences of bits. The storage capacity of the computer is limited by how many of those bits, are wasted. Ideally we don't want to waste bits, so we tend to use codes that minimize the number of bits that are wasted for any particular architecture of computer they will run on. The ultimate code is the code that covers the exact amount of variations that are possible in a data type, with the least amount of wasted bits.

It is interesting to find that this is not a consideration in the human brain, in fact we have had to define a term to explain the redundant information carried in the codes for the brain, and that term is degenerate coding, a pejorative term that really just means that there is more than one code for the same information.

Consider a string of numbers, that mean something. Are they a string of digits, in which case we want to code them to minimize the storage for each digit, An integer value, in which case we want to code them to minimize the storage for an arbitrarily sized integer, A floating point value, in which case we want to code them to minimize the storage for an arbitrarily sized mantissa, and include information on what power the number is raised to, and how many bits past the point we are going to store. Or an address string in which case we want to code it to minimize the number of bits it takes for a character in a specific language.

To a human all of these different types of storage look the same on the printed page. But it takes time to convert between the different types of storage, so you might not want to translate them into the tightest coding possible while you are using them. However later when you want to store them away for posterity, it looks like a good idea to compress them down to the smallest size possible. That is the essence of compression.

limits to coding density[edit]

So the limit to coding density, is determined by the type of data, you are trying to code and also by the amount of information that is available from a separate location.

As an example, the military have the practice to limit radio chatter during a campaign, this is often considered good thing. Even taking it to the point of limiting themselves to a click on their mike, to trigger the next phase of the attack. To a person who is fully briefed on it, a click on a mike will be all that is needed to send a wealth of information about the state of the operation. The event will have no meaning to the enemy waiting in the trenches, so that the first evidence they have of attack is when guns start going off.

Ideally all storage could be achieved in one bit files that contained whole encyclopedias. In actuality that is not likely to happen. Coding is still important, but it is very limited in what it can now do for compression. New algorithms for compression are increasingly hard to come by, hence the path of specialization on the source data new approaches take, true advances have recently been reduced to the hardware where the code will run.

As a side note, regarding information density, consider the mathematical theory that every thing can be found represented in some form at a specific section of Pi (a mathematical constant, whose decimal representation never ends or repeats). This would permit that if a fast enough computation of the number was possible, the simple information regarding the boundaries of the significative section would serve to represent an extremely large block of information. That would indeed be information with a large density ratio.

bits, bytes, symbols, pixels, streams, files[edit]

Suffice it to say that real compression algorithms are a complex technology, bordering on an art form.

—Leo A. Notenboom, [1]

In the field of data compression, it is convenient to measure the "size" of data in terms of the fundamental unit of data: bits.

3 basic common primitive types char,short int,long int.

Many data compression algorithms produce a compressed data stream that is a stream of bits -- with no particular alignment to any other size.

It is sometimes convenient to consider the input data in terms of "symbols". Some algorithms compress English text in terms of the symbols from an input and process them converting them in a reversible representation on the compressed output. Symbols are usually sets of bits but they in turn can represent any type of character representation/encoding, even pixels or waveforms.

When compressing CD-quality audio, the symbols are the 2^16 possible amplitude levels -- the 2^16 possible physical positions of the speaker cone.

Some file compression utilities consider the uncompressed file in terms of 257 different symbols. At any point in the file, the "next symbol" could be any one of 257 symbols: one of the 256 possible bytes -- or we could have already reached the end of the file, the 257th symbol indicating "end-of-file".

When comparing image compression routines, sometimes the term "bpp" (bits per pixel) is used. An uncompressed full-color bitmap image is 24 bpp. An uncompressed 2 color bitmap image (containing only black pixels and white pixels) is 1 bpp. Some image compression algorithms can compress some images to much less than 0.1 bpp. The same image compression algorithm may be doing pretty good to compress some other image to 7.5 bpp.

fixed-length coding[edit]

variable-length coding[edit]

With variable-length coding, we can make some symbols very short -- shorter than any fixed-length encoding of those symbols. This is great for compressing data.

However, variable-length coding almost always end up make other symbols slightly longer than best fixed-length encoding of those symbols.

Using more bits store a symbol sounds ridiculous at first -- Why don't we simply store a shorter fixed-length version of such symbols?

OK, yeah, *other* symbols could be shorter. So ... why can't we use the variable-length version when that would be shorter, and use the fixed-length version when that would be shorter? The best of both worlds? Perhaps use a "1" prefix to indicate "the fixed-length representation follows" ?


To do:
Put a simple intuitive answer to the above questions here. Hold off on abstract mathematical proofs until we get to the entropy coding section.

We discuss variable-length coding in more detail in the entropy coding section of this book.

Representing integers[edit]

Digital information can be encoded as a series of integers. With modern digital electronics, we typically first (1) encode the data (etc.) in some relatively simple (but verbose) encoding as a series of integers, then (2 & 3) data compression algorithms take that series of integers, and find some other way to represent the entire series in fewer bits.

For example, when we store music in a MP3 file, the machine often first (1) digitizes the analog sound with a microphone and an ADC as a (relatively simple, but verbose) 16-bit PCM encoding at 44.1 kHz sampling rate per channel (Compact Disc digital audio, uncompressed WAV audio, etc.). Then the software processes that raw data first to (2) transform it into a representation that requires fewer integers, then (3) represents each of those integers as bits in a way that the entire song can be stored in relatively few bits.

(In the early days of computers, people spent a lot of time figuring out how to encode music (etc.) into a reasonable number of keystrokes, both compressing and encoding the data "manually" with human brainpower before entering it into the computer).

People have invented a surprisingly large number of ways of representing integers as bits.

... (say a few words here about 32-bit unsigned integers) ...

... (say a few words here about 2's complement signed integers) ...

Floating-point numbers are often stored as pairs of signed integers -- the significand (the significant digits) and the exponent.

fixed-length code[edit]

A fixed-length code is the simplest way to represent a series of integers.


comma code[edit]

A comma code is one way to represent a series of arbitrarily long integers.

Each integer is represented by a series of digits.

A special symbol -- called the comma -- is used between each code word, to mark the end of one integer and the beginning of the next.


SCDC is one way to represent a series of arbitrarily long integers.

The (s,c) codes, also called (s,c)-dense codes (SCDC), have lengths that are multiples of 8 bits. [1]

SCDC has a single parameter s chosen as 0 < s < 256. Each codeword consists of a sequence of 8-bit bytes, where the last byte in the codeword has a value less than s, and the other bytes (if any) have a value greater than or equal to s.

The end-tagged dense code (ETDC), also called variable-length quantity (VLQ), is the special case of SCDC for which s=128.

Because every SCDC codeword is aligned on byte boundaries, SCDC decompression is simpler and faster than Huffman decompression.[1]

Both SCDC and Fibonacci codes support direct search in the compressed file.[2][1]

ZigZag encoding[edit]

ZigZag encoding is one way to represent signed integers. It maps "small" signed integers (close to zero) to "small" unsigned integers (the most-significant bits are all zero).[3][4] It's designed to have a one-to-one mapping between 32-bit ZigZag-encoded signed integers and 32-bit 2's complement signed integers. ZigZag encoding (like delta encoding) doesn't save any space by itself, but it often transforms data in such a way that other downstream compression algorithms work better. (Delta encoding followed by ZigZag encoding often produces better results -- with the same downstream compression algorithm -- than either one alone).[5]

   encoded     integer
   0000 0005 : -3
   0000 0003 : -2
   0000 0001 : -1
   0000 0000 :  0
   0000 0002 : +1
   0000 0004 : +2
   0000 0006 : +3

  1. a b c Shmuel T. Klein; and Miri Kopel Ben-Nissan. [ "On the Usefulness of Fibonacci Compression Codes"]. 2009.
  2. Shmuel T. Klein; and Dana Shapira. "Practical Fixed Length Lempel Ziv Coding".
  3. Google Developers. "Protocol Buffers :Encoding".
  4. "protobuf-c.c"
  5. "Delta zigzag encoding binary packing"