Data Compression/Order/Entropy
From Wikibooks, the open-content textbooks collection
By taking the consistencies out of a storage element, such as the redundancies, similar components, longest most used words, etc, Compression increases the entropy of the file, until there is so much entropy in the file that it can't be further compressed.
At the same time, the decompression algorythm adds more and more consistency and order to the entropic file, until it means something. One of the reasons that compression does not make a good form of encryption, is that there is information hidden in the entropy of the file, that indicates to the decompression program how to recover it. In fact in extreme compression the amount of data needed to recover the compressed text, is often a greater percentage of the file, than the actual remaining data that has yet to be compressed.
Compression at some point becomes how do you further compress already entropic data, that consists mostly of instructions to decompress a minimal amount of data?
There are two approaches that might be tried, the first approach is to inject order, without increasing the length of the file, so that you can recover more compression, and the other approach is much more controversial because it suggests the injection of entropy, without changing the length of the file, so that you can recover more compression.
The latter approach is based on a little known scientific principle first explained in "A New Kind of Science", which suggested that entropy and order are cyclical at least in incremental or decremental type IV cellular automata. This approach which might explain self-organization despite the Second Law of Thermodynamics would mean that sometimes when enough entropy is injected into a data field, the data self-organizes into an orderly state again. If we can keep track of the entropic states that it has gone through to get to the self-organization level, we can compress the new order, with some hope of recovering the original state.
Contents |
[edit] entropy coding
statistical symbol compression
[edit] Huffman encoding
[edit] improving on Huffman
The original Huffman algorithm is optimal for symbol-by-symbol coding of a stream of unrelated symbols with a known probability distribution.
However, several algorithms can be used to compress some kinds of files smaller than the Huffman algorithm. These algorithms exploit one or another of the caveats in the Huffman optimality proof:
- unrelated symbols: most real data files have some mutual information between symbols. One can do better than plain Huffman by "decorrelating" the symbols, and then using the Huffman algorithm on these decorrelated symbols.
- symbol-by-symbol: Several compression algorithms, such as range coding, give better compression than binary Huffman by relaxing the binary Huffman restriction that each input symbol must be encoded as an integer number of bits.
- known probability distribution: Usually the receiver does not know the exact probability distribution. So typical Huffman compression algorithms send a frequency table first, then send the compressed data. Several "adaptive" compression algorithms can get better compression than Huffman because they converge on the probability distribution, or adapt to a changing probability distribution, without ever explicitly sending a frequency table.
[edit] DEFLATE
DEFLATE combines LZ77 with Huffman encoding. It is specified in RFC 1951.
It is extremely widely used to decode ZIP files and gzip files and PNG image files.
[edit] zlib
The "zlib compressed data format specification" is a standardized way of packaging compressed data in a file, with a simple header that indicates the type of compression and helps detect the most common kinds of data corruption. It is often used to store data compressed with DEFLATE with a window size up to 32K.
The "zlib compressed data format specification" is specified in RFC 1950.
[edit] arithmetic range coding
[edit] universal codes
There are many "universal codes", a kind of prefix code.
Any particular prefix code (including any universal code) can be seen as a kind of Huffman code for a particular corresponding frequency distribution.
Typical audio data, and typical audio data prediction residues, have approximately Laplacian distribution. Encoding a Laplacian distribution with a general Huffman code gives a prefix code equivalent to using a Golomb code. The Golomb code is much simpler and faster than a general Huffman code to compress and decompress. If the audio data is "close enough" to a Laplacian distribution, a Golomb code may give even better net compression than a general Huffman code, because a Golomb decoder requires a smaller overhead of 1 parameter, while general Huffman decoder requires the overhead of a Huffman table of probabilities.
Rice coding is a subset of Golomb coding that is much faster to compress and decompress. Because Rice codes closely approximate Huffman codes for a Laplacian distribution, they compress almost as well as general Huffman codes.
[edit] audio compression
fourier transform
perceptual coding
"The Gerzon/Craven theorems show that the level of the optimally de-correlated signal is given by the average of the original signal spectrum when plotted as dB versus linear frequency. ... this dB average can have significantly less power than the original signal, hence the reduction in data rate. ... In practice, the degree to which any section of music data can be ‘whitened’ depends on the content and the complexity allowed in the prediction filter." [1]
Typical audio data, and typical audio data prediction residues (after decorrelating), have approximately Laplacian distribution, and so they are often decoded with a Golomb decoder or Rice decoder rather than a general Huffman decoder.
Some audio decoders can be told to use Rice decoders during passages of music that compress well with them, and later told to switch to some other decoder during passages of music (such as pure sine wave tones, or all values equally probable white noise) that don't compress well with Rice decoders.
In some audio applications, in particular 2-way telephone-like communication, end-to-end latency is critical. The recommended maximum time delay for telephone service is 150 milliseconds[citation needed](Wikipedia:1 E-1 s). This rules out many popular compression algorithms. For example, a standard Huffman compressed block size of 150 milliseconds or longer won't work. Standard Huffman compression requires analyzing an entire block before sending out the compressed prefix code for the first symbol in the block. In this case the Huffman compressor uses up all the time allowed waiting for the block to fill up, leaving no time for time-of-flight transmission or decoding. (A block size of 150 milliseconds or longer may work fine with an adaptive decoder).
In other audio applications, end-to-end latency is irrelevant. When compressing a song for later distribution, or the soundtrack of a movie, the compressor typically has the whole thing available to it before it begins compressing. Such applications may use low-latency algorithms when they are adequate; but they also allow other algorithms to be used that may give better net compression or a lower peak bit rate.
[edit] References
- ↑ "MLP Lossless Compression" by JR Stuart, PG Craven, MA Gerzon, MJ Law, RJ Wilson 1999. section 4.2 "Prediction".