- 1 Entropy
- 2 entropy coding
- 3 Something about trees
- 4 Dynamic Huffman: storing the frequency table
- 5 prefix codes
- 6 audio compression
- 7 arithmetic coding
- 8 combination
- 9 References
Entropy is a measure of unpredictability. Understanding entropy not only helps you understand data compression, but can also help you choose good passwords and avoid easily-guessed passwords.
For example, 1000 bits of data representing 1000 consecutive tosses of a fair coin has an entropy of 1000 bits, since there is no way to predict whether "heads" or "tails" will come up next. 1000 bits of data representing 1000 consecutive tosses of an unfair two-headed coin has an entropy of zero (0) bits, since the coin will always come up heads.
The entropy of a message is in a certain sense a measure of how much information it really contains.
By taking the consistencies out of a storage element, such as the redundancies, similar components, longest most used words, etc.,
Compression increases the entropy of the file, until there is so much entropy in the file that it can't be further compressed. Compression when it shrinks a file, can lead to a higher entropy density per length of file. Which makes a properly compressed file when followed by an encryption pass harder to break in ciphertext only attacks than the file encrypted without the compression done first. See:
Of course it does not help with most plain text attack modes if attacker gets to pick the plain text you encrypt.
At the same time, the decompression algorythm adds more and more consistency and order to the entropic file, until it means something.
One of the reasons that compression does not make a good form of encryption, is that there is information hidden in the entropy of the file, that indicates to the decompression program how to recover it. In fact in extreme compression the amount of data needed to recover the compressed text, is often a greater percentage of the file, than the actual remaining data that has yet to be compressed.
Entropy is always relative to some model. It is not a property of a particular string of bits that you are given, but of the set of bitstrings that you could plausibly have been given.
Entropy is always relative to some model. It is not a property of a particular string of bits that you are given, but of the set of bitstrings that you could plausibly have been given. For a given set of bits, the model that gives the lowest entropy is the model that best predicts those bits and therefore compresses the data the smallest. Most compressed file formats, store (a) what model we used, (b) the particular parameters of that model (if any), and (c) the data compressed according to that model. Some kinds of models have a lot of "parameters" and require a lot of space to store, so often we use completely unrealistic models -- but ones that require very little space to describe -- in order to reduce the net compressed file size (model description + compressed data).
One of the simplest models is "Every byte of the file was independently and randomly generated by throwing darts at a wall while blindfolded. Some letters occur more frequently because they have a larger area on the wall." This is the order-0 Markov model, also known as the Shannon entropy when each byte is considered an independent message. Using that model, the entropy of any particular letter in a file F is
(This is the number of bits required to represent that letter using an entropy coder)
And the entropy of the entire file is the sum of the entropy of each letter in the file, (the number of bits required to represent the entire file is the sum of the number of bits required to represent each letter in that file)
In terms of the number of unique possible messages n (any particular letter in the file is one of a list n possible letters, from , any of which may occur 0, 1, or possibly N times)
- is some unique letter, for example 'e'
* is the number of times that unique letter x_i occurs in the file
- is the probability of the symbol (e.g. the character 'e') in the message
- is the number of times the letter at position i in the file occurs in the file (obviously this is at least 1)
- is the length of the file (the total number of messages in the file)
- is the number of possible messages, often n=257
- is the probability that some letter k occurs.
Often some particular letter x_i in the character set never occurs anywhere in some particular file -- the way the letter 'e' never occurs in Gadsby (1939). There can be any number of such letters that never actually occur. Letters that never occur make no difference to the entropy of any particular letter that does occur, or to the total of them, which is the entropy of the entire file under the order-0 model. When some letter x_i exists in the character set, but it is never actually used in some particular file, that letter's frequency is zero, its probability is zero, and the value is effectively zero.
Any arbitrary file, when analyzed using a order-1 Markov model ("first order model"), will give a number for the entropy that is smaller or the same as the order-0 entropy.
We discuss order-1 Markov and higher-order Markov models in more detail later in this book (Markov models).
How do you further compress data
Compression at some point becomes how do you further compress already entropic data, that consists mostly of instructions to decompress a minimal amount of data?
There are two approaches that might be tried, the first approach is to inject order, without increasing the length of the file, so that you can recover more compression, and the other approach is much more controversial because it suggests the injection of entropy, without changing the length of the file, so that you can recover more compression.
The latter approach is based on a little known scientific principle first explained in "A New Kind of Science", which suggested that entropy and order are cyclical at least in incremental or decremental type IV cellular automata. This approach which might explain self-organization despite the Second Law of Thermodynamics would mean that sometimes when enough entropy is injected into a data field, the data self-organizes into an orderly state again. If we can keep track of the entropic states that it has gone through to get to the self-organization level, we can compress the new order, with some hope of recovering the original state.
statistical symbol compression
For each prefix code in the compressed data, a Huffman decompressor looks up the compressed symbol in a table and emits the corresponding plaintext symbol.
In the following examples, we assume the plaintext symbol is some 8 bit byte. (Most Huffman decompressors have 257 symbols in their alphabet: the 256 possible bytes, and a special 257th code that indicates "end-of-block". Some Huffman decompressors have additional compressed symbols that decompress to common byte pairs or entire words, which can lead to smaller compressed file sizes). 
Where does the translation table of code-to-letters come from? The particular source chosen is perhaps the biggest difference between one Huffman compressed file format and another.
- fixed Huffman codes
- static Huffman codes (sometimes called "semi-adaptive Huffman") (The RFC 1951 definition of DEFLATE calls this "dynamic Huffman codes")
- adaptive Huffman codes
(Yes, it's confusing to use "static" as a synonym for "dynamic". Is it too late to change this now?)
When a decompressor uses a fixed Huffman code, it uses a translation table hard-wired into the decompressor that never changes.
When a decompressor uses a static Huffman code, it reads the translation table at the beginning of the compressed block. Then it uses that data to decode the rest of the block. Most of the "optimality proofs" of Huffman assume this sort of static code.
Often one block of text will have slightly different letter frequencies from another, even inside the same file. Most static Huffman decompressors watch for a special 257th "end-of-block" symbol, and then discard the old Huffman table and read in a new static compressed table. The decompressor uses that table to decode the new block.
DEFLATE and some other decompressors put a few bits of "model selector" metadata at the beginning of every block. That commands the decompressor to either
- (a) fetch the static Huffman table metadata immediately after the model selector;
- (b) copy the default fixed Huffman table stored inside the decompressor;
- (c) copy the raw data without trying to decompress it -- useful for "uncompressible" data; or
- (d) use some other compression algorithm entirely.
When a decompressor uses an adaptive Huffman code, it begins with some default translation table. However, it updates that table with every symbol that goes through.
getting the compressor and the decompressor to use the same codewords
To get lossless compression, when the compressor represents the byte "e" with the bitstring "010", the decompressor has to somehow know that the bitstring "010" must be decoded into the byte "e".
There are several ways for a Huffman decompressor to "know" the symbol-to-bitstring map (sometimes called a "frequency table") used by the corresponding Huffman compressor:
- fixed Huffman codes, hard-wired into the decompressor. Avoids sending a frequency table, but can give poor compression -- perhaps even expanding the file -- if the actual probability distribution is significantly different than the assumed probability distribution. The estimated probability distribution changes only when the compressor software changes. This has the lowest latency -- the first few symbols can be immediately encoded and transmitted.
- one dynamic Huffman code, with a single frequency table sent before the file. This requires the compressor to store the entire file, which may be arbitrarily long, while counting symbol frequencies, before generating and sending the first compressed codeword -- this has the worst latency. The estimated probability distribution changes once per file.
- several dynamic Huffman codes, with the file broken up into several blocks and a frequency table sent before each block. This has less latency than one-table-per-file. In cases where the probability distribution changes significantly from one block to the next, the total number of bits needed to store all the codewords can be much less -- but the overhead of the number of bits to send the tables can be much more. Also, the compressor can spend a lot of time trying to squeeze out a few more bits by "optimizing" where exactly to split the file. The estimated probability distribution changes once per block.
- A few file formats (such as bzip2), rather than always sending a complete frequency table that indicates the exact frequency of the next block, sometimes send a short reference to one of the previously-sent frequency tables that (hopefully) is close enough to the frequency of the next block that the expansion (if any) of the codewords in that block is more than paid back by not having to send a complete frequency table.
- adaptive Huffman compression: The estimated probability distribution changes (slightly) once per symbol. The probability distribution of the letters that have streamed past is used to estimate the future probability distribution. This has the second-lowest latency -- the first few symbols can be more-or-less immediately transmitted -- but the decompressor does more work to update the tables on each symbol.
Per-file and per-block dynamic Huffman codes can guarantee that certain symbols never occur in a block, and so some people write code that entirely eliminates those never-occurs symbols from the tree, reducing the number of bits in at least one codeword that does occur in the compressed text.
Many compression algorithms (including fixed Huffman codes and adaptive Huffman compression) estimate probabilities in a way that cannot guarantee that some symbol will never occur. In particular, if some symbol is so rare that it has not yet occurred in the training text -- it has zero frequency so far -- it is tempting to assign an estimated zero probability to that symbol. That leads to The zero-frequency problem.
Canonical Huffman code
A canonical Huffman code is a kind of Huffman code that can be very compactly described. That makes it super-wonderful for static Huffman compression. A large text file may have thousands of sections, each with a slightly different letter frequency. Every unnecessary byte in the format we choose for storing the description of the Huffman code table bloats that file by thousands of unnecessary bytes.
For example, the article on x-rays has a much higher frequency of the letter x than the previous article on yellow-bellied sapsuckers. Giving each section its own optimized Huffman table minimizes the number of bits required to store that section, if we pretend that the Huffman table overhead is not significant. Alas, that Huffman table overhead is often significant. Many compressors will spend a lot of time thinking about whether the total file will be shorter if we (a) use a separate Huffman table for each of several consecutive sections or (b) merge consecutive sections together into a single block and use a single compromise Huffman table for the whole block.
"drop some trailing ones, and then increment"
The decompressor needs two things to regenerate a block of plaintext: the exact Huffman code used by the compressor, and the series of compressed codewords.
Given any one Huffman code, one can construct many other equivalent codes, all of which have exactly the same codeword length for any particular plaintext symbol, but vary in the exact bit pattern assigned to each plaintext symbol. Since any particular plaintext symbol is assigned a codeword of a certain length, a length that is the same for every equivalent code, no matter which particular code you choose, any series of plaintext symbols will be compressed into a series of codewords of exactly the same total length. If the overhead of the prefix code description were insignificant, then it would not matter which of these many equivalent codes you would use -- just pick one at random. All equivalent codes compress that data to the same number of bits.
However, rather than picking one of them randomly and sending that to the decompressor, we can save bits in the prefix code description by picking one particular canonical Huffman code.
Given only the length, in bits, of the codeword for each symbol of any prefix code, we can construct the canonical Huffman code that gives every symbol the same length as that original prefix code. What makes a code a canonical Huffman (CH) code is that code words are lexicographically ordered by their code lengths; A given set of probabilities does not result in a unique CH code; one is free to reorder symbols that have the same code length.
The steps to create a canonical Huffman (CH) code are as follows:
- Construct a Huffman tree and throw out all information other than the symbols and the lengths of the codewords.
- Sort the symbols by increasing code lengths.
- Store the number of symbols with each code length along with the sorted symbols themselves.
Something about trees
Every prefix code can be thought about as analogous to a tree.
Every leaf on the tree is tagged with a plaintext letter or other symbol.
The length, in bits, of each codeword is sometimes called the path length of the symbol.
Dynamic Huffman: storing the frequency table
(This is a continuation of "getting the compressor and the decompressor to use the same codewords", for the common case of dynamic Huffman).
Inside a typical compressed file, there is a header before each block of compressed data. For blocks compressed with dynamic Huffman, that header for that block tells the decompressor the set of codewords used in that block, and the mapping from codewords to plaintext symbols.
Typically the compressor and the decompressor have some hard-coded alphabet of S plaintext symbols they need to handle. Often S=258 symbols, the 256 possible bytes and a few other special symbols such as the end-of-block symbol. Each symbol in the alphabet is represented, in the compressed data, by some codeword with a length in bits from 1 bit to some MAX_CODEWORD_LENGTH bits.
Many simple Huffman file formats simply store that list of 258 bitlengths as a block of 258 bytes. Each position represents some plaintext symbol -- even ones that are never used in the plaintext -- typically position 0 represents the 0 byte, position 32 (0x20) represents the space character, position __ (0x65) represents letter 'e', position 255 represents the 255 byte, position 256 represents some special command, and position 257 represents the end-of-block indicator.
The number at each position represents the length, in bits, of the codeword corresponding to that plaintext symbol. (Since we use canonical Huffman codes, the length of the codeword corresponding to each plaintext symbol is all the decompressor need to regenerate the actual set of codewords used in this block). Usually the special "length" of "0 bits" is reserved to indicate letters that never occur in this block, and so should not be included in the Huffman tree.
That list of S numbers is often called the "frequency table". Sometimes that list of S numbers is called the "Kraft vector", because it must satisfy w: Kraft's inequality.
That block of 258 bytes in the header represents a lot of overhead, especially if the compressed file includes many separate blocks, each with its own header. Some Huffman compressors, such as English-word-oriented Huffman compressors, use vastly more symbols (one for each word in the English dictionary it uses), requiring a much longer Kraft vector. There are a variety of ways to get smaller compressed file sizes by designing the file format in a way that reduces the size of this block, including:
- Use exactly the same codeword-to-plaintext-symbol dynamic Huffman mapping, but store that mapping in a more compact form:
- use some kind of length-limited Huffman compression. If there are no codewords longer than 15 bits, then we can store each length in 4 bits, rather than a full 8 bit byte.
- compress the frequency table in isolation, perhaps using RLE or some variable-length code, such as fixed Huffman compression or even another dynamic Huffman "pre-table".
- When a datafile alternates between two kinds of data -- for example, English text and long tables of numbers -- use simple differential compression to compactly represent later tables as a reference to some earlier table, as in bzip2.
- When a one block of data has slightly different frequencies than the previous block -- for example, an article on X-rays has about the same letter frequencies, except much more 'X' and somewhat fewer 'y', than a previous article on yellow-bellied sap-suckers -- use differential compression to compactly represent the new table by giving the (hopefully relatively small) changes from the previous table, as in LZX. We discuss other kinds of differential compression in a later chapter, data differencing.
- or some combination of the above.
- Use some other codeword-to-plaintext-symbol mapping other than dynamic Huffman mapping
- adaptive Huffman: estimate the future letter frequency on-the-fly from the letters already decoded in the plaintext
- universal codes: ...
Implementation tips and tricks
People have spent a long time figuring out ways to make Huffman compression and Huffman decompression really, really fast. These are just a few of the techniques used to make decompressors (or compressors, or both) that give bit-identical the same output that a non-optimized program gives, while running faster.
Software to implement Huffman decompressors typically has 2 parts:
- the setup pulls a compact "frequency table" header, and expands it into some sort of lookup table arranged to make the next part really fast
- the inner loop that converts codewords to plaintext symbols.
The inner loop of a Huffman decompressor finds the codeword that matches the next part of the compressed text, and emits the corresponding plaintext symbol. It's possible for a codeword to be more than 32 bits long, even when compressing 7-bit ASCII text, and bit-aligning and comparing such long words can be a bit of a hassle. There are three approaches to avoiding such long comparisons: (a) length-limited Huffman code. Often "natural" files never use such long codewords, only specially-constructed specifically designed to be worst-case files. Some compressors choose to deliberately limit the length of the Huffman code words to some maximum length. This makes such worst-case files, after compression, slightly longer than they would have been without the limit; it has no effect on the length of compressed "natural" files. Most file formats intended to be decompressed in hardware (video decompressed with a FPGA, etc.) limit the length in order to reduce the cost of that hardware. (b) A short, fast table that handles short codewords and has a "long codeword" flag, and other table(s) that handle longer codewords. Since the most frequent Huffman codes are short, the decoder can usually decode a codeword quickly with one hit to the short, fast table that fits entirely in cache; occasionally the decoder would hit a flagged entry, and take a little longer to look up the long code in the other table(s). There's a time/space tradeoff on exactly how short to make the first-level table: a decoder program with a smaller first-level table, compared to a decoder program that uses a larger first-level table, can still correctly decode the same file. The smaller-table decoder uses less RAM, the larger-table decoder decodes more words in the first hit and therefore runs faster (unless the table gets too big to fit in the cache). (c) rather than comparing directly to codeword bitstrings, first do a zero-count (to the next 1 bit, or to the end of file, whichever comes first) finding Z zeros, discard the 1 bit, and then the next n bits.
Prefix code decompressor implementations can be distinguished by how many bits they read from the compressed stream at one time, and how they use those bits to decide what to do next. People have implemented "fast" decompressors a variety of ways, including:
- One bit at a time: A "simple" prefix decoder typically reads bit-by-bit, one bit at a time, from the compressed stream. This is probably the slowest way.
- always max_codelength bits at a time: one big lookup table indexed directly by multiple bits read from the compressed stream
- one byte (8 bits) at a time: one big lookup table indexed directly by a "tree position" state and a byte read from the compressed stream
- X initial bits, followed by Y following bits, per symbol (where X is some constant, often chosen to be half the number of bits as max_codelength, and Y typically varies depending on the particular value read in the first X bits; X+Y is at most max_codelength):
- a first lookup table indexed by some bits read from the compressed stream, which chooses one of the secondary index tables, indexed by more bits read from the compressed stream
- any number of zero bits, followed by F trailing bits: (This works for canonical Huffman codes, but may not be useful for other prefix codes).
- one big lookup table indexed by a zero count and bits read from the compressed stream
- a first lookup table indexed by a zero count of bits from the compressed stream, which chooses one of the secondary index tables, indexed by more bits read from the compressed stream.
Conceptually the simplest one is the "one big lookup table" approach:
- one big lookup table
This is probably the fastest technique if your maximum codelength is relatively short.
One fast inner loop for a fast prefix code decoder reads the compressed data a byte at a time (rather than the bit-by-bit for a "simple" prefix decoder). The "current tree position" and the compressed byte are used as an index into a large table that holds the resulting "new tree position" and up to 8 decompressed plaintext symbols.
- (new_tree_position, [list of decompressed symbols]) := lookup_table( current_tree_position, compressed_byte )
Any codeword can be represented by two numbers, (Z count of initial zeros, F value of trailing bits). For example, the codeword "000101" can be represented by the two values (3,5). The codeword "11" can be represented by the two values (0,3).
One quirk of the canonical Huffman code is that, if all our plaintext symbols fit in a n bit fixed-length code, then Z can be represented in n bits, and F can also be stored in n bits. (This is *not* true for all prefix codes). (When the decompressor does the actual zero-count in the compressed text, consecutive all-zero code words can lead to an indefinite number of consecutive zeros -- what I'm trying to point out here is that the number of initial zeros in any one particular codeword can be represented in n bits). This allows several different time/speed tradeoffs that all give identically the same output file. Perhaps the fastest inner loop is something like For each symbol:
- bit-count the zeros into Z
- discard the following 1 bit, keep the "next bit to read" pointer pointing to the *following* bit, but peek ahead at the following n-1 bits and put them into F.
- Pack the least-significant n bits of Z and the n-1 bits of F into a 2n-1 bit index.
- (plaintext_symbol, useful_bits) := lookup_table[index]
- Bump the "next bit to read" pointer by useful_bits bits -- which can be 0 bits for codewords with only one bit after the zeros ( 1, 01, 001, 0001, etc.) up to n-1 bits for codewords with the full n-1 bits after the first 1 bit in the codeword.
- emit the plaintext symbol.
Repeat until we hit the end of file.
This ignores the special-case handling of all-zeros codeword. Many Huffman coders, such as DEFLATE, design a Huffman table with a "fake" entry such that the compressed text never uses the all-zeros codeword; so the decompressor never has to deal with that complication. This also ignores special handling of codewords that map to some special command, rather than mapping to a literal plaintext symbol. This requires a lookup table of size 2^(2n-1). For a typical Huffman code with n=9 bits, able to handle all 2^8 bytes a few other special symbols, using a canonical Huffman code with this kind of implementation requires a lookup table of size 2^(17) = 131 072 entries. If a plaintext symbol is mapped to a codelength that is less than the maximum codelength, then multiple entries in that table map to the same symbol. There are other implementations that use far less RAM. Other implementations may even be faster, since they allow more of the working memory to fit in cache.
In general, for alphabet_size plaintext symbols, the worst-case maximum "compressed" Huffman code length is alphabet_size - 1 bits long.
For 257 plaintext symbols, the worst (unlikely) case is a maximum bitlength of 2 characters that are "compressed" to 256 bits, so the maximum zero count (except when we have 1 or more consecutive all-zero codes) is 255.
Handling the end-of-text is often awkward, since the compressed string of bits is usually stored into some integer number of bytes, even though the compressed string of bits is usually *not* a multiple of 8 bits long. Most Huffman decompressors use one or the other of the following techniques for handling end-of-text:
- Both the compressor and the decompressor include a special end-of-text symbol in the tree (a 257th symbol when doing binary byte-oriented compression, or the null byte when compressing a null-terminated C string). Since it only occurs once, its frequency is 1 and the code that represents it is therefore is tied in length with one or more longest-possible-length symbols.
- The decompressor immediately returns when it sees the end-of-text symbol, and ignores any pad bits after that symbol that fill up the last byte of storage. Or,
- The compressor tells the decompressor exactly how many valid input bytes are used to hold the compressed bitstring. The decompressor forces the code representing the end-of-text symbol to be the all-zeros symbol. The decompressor fakes reading in zero bits past the last valid input byte -- possibly reading a few zero bits at the end of some other valid symbol -- eventually reading the all-zero end-of-text symbol.
- Some decompressors have no special end-of-text symbol.
- The compressor tells the decompressor exactly how many output symbols are represented by the input compressed bitstring, and the decompressor immediately returns after emitting exactly that many symbols. Or,
- The compressor tells the decompressor exactly how many valid input bytes are used to hold the input compressed bitstring. Occasionally the last valid symbol ends cleanly at the end of the last byte, and the decompressor returns. Other times there are a few leftover bits not long enough to make a complete valid codeword -- all codewords with that prefix are longer. When the leftover bits are all zero, throw away those leftover zero bits and return. (This technique *assumes* that the all-zeros code is at least 8 bits long -- which is always true when there are at least 2^7+1 = 129 symbols in the tree and you use a canonical Huffman code. If you have fewer than 129 symbols, you must use some other technique to handle end-of-text).
improving on Huffman
The original Huffman algorithm is optimal for symbol-by-symbol coding of a stream of unrelated symbols with a known probability distribution.
However, several algorithms can be used to compress some kinds of files smaller than the Huffman algorithm. These algorithms exploit one or another of the caveats in the Huffman optimality proof:
- unrelated symbols: most real data files have some mutual information between symbols. One can do better than plain Huffman by "decorrelating" the symbols, and then using the Huffman algorithm on these decorrelated symbols.
- symbol-by-symbol: Several compression algorithms, such as range coding, give better compression than binary Huffman by relaxing the binary Huffman restriction that each input symbol must be encoded as an integer number of bits.
- known probability distribution: Usually the receiver does not know the exact probability distribution. So typical Huffman compression algorithms send a frequency table first, then send the compressed data. Several "adaptive" compression algorithms, such as Polar tree coding, can get better compression than Huffman because they converge on the probability distribution, or adapt to a changing probability distribution, without ever explicitly sending a frequency table.
This is reminiscent of the the way many algorithms compress files smaller than the "optimum" Fixed Bit Length Encoding codeword length by exploiting the caveat in the Fixed Bit Length Encoding optimality proof:
Fixed Bit Length Encoding is optimal for symbol-by-symbol coding into fixed-length codewords of a stream of symbols with a known finite alphabet size A. The optimum codeword length for Fixed Bit Length Encoding is .
- finite alphabet: Fixed Bit Length Encoding is useless for infinite alphabets; "universal codes" (described below) handle such alphabets easily.
- symbol-by-symbol: Several compression algorithms, such as LZW, have fixed-length codewords with better compression than Fixed Bit Length Encoding, by using some codewords to represent symbol pairs or longer strings of symbols.
- fixed-length codewords: variable-length codes often give better compression.
arithmetic coding and range coding
The "universal codes" map an infinitely large known alphabet -- often the set of all positive integers, or the set of all integers -- to prefix codes.
There are many "universal codes", a kind of prefix code.
Any particular prefix code (including any universal code) can be seen as a kind of Huffman code for a particular corresponding frequency distribution.
Typical audio data, and typical audio data prediction residues, have approximately Laplacian distribution. Prediction error coding for image compression and geometry compression typically produces an approximately Laplacian distribution. Encoding a Laplacian distribution with a general Huffman code gives a prefix code equivalent to using a Golomb code. The Golomb code is much simpler and faster than a general Huffman code to compress and decompress. If the audio data is "close enough" to a Laplacian distribution, a Golomb code may give even better net compression than a general Huffman code, because a Golomb decoder requires a smaller overhead of 1 parameter, while general Huffman decoder requires the overhead of a Huffman table of probabilities.
Rice coding is a subset of Golomb coding that is much faster to compress and decompress. Because Rice codes closely approximate Huffman codes for a Laplacian distribution, they compress almost as well as general Huffman codes.
Todo: add a few words about Fibonacci codes, a kind of "universal code".
"The Gerzon/Craven theorems show that the level of the optimally de-correlated signal is given by the average of the original signal spectrum when plotted as dB versus linear frequency. ... this dB average can have significantly less power than the original signal, hence the reduction in data rate. ... In practice, the degree to which any section of music data can be ‘whitened’ depends on the content and the complexity allowed in the prediction filter." 
Typical audio data, and typical audio data prediction residues (after decorrelating), have approximately Laplacian distribution, and so they are often decoded with a Golomb decoder or Rice decoder rather than a general Huffman decoder.
Some audio decoders can be told to use Rice decoders during passages of music that compress well with them, and later told to switch to some other decoder during passages of music (such as pure sine wave tones, or all values equally probable white noise) that don't compress well with Rice decoders.
In some audio applications, in particular 2-way telephone-like communication, end-to-end latency is critical. The recommended maximum time delay for telephone service is 150 milliseconds(Wikipedia:1 E-1 s). This rules out many popular compression algorithms. We discuss latency in more detail in the Data_Compression/Evaluating_Compression_Effectiveness#latency section of this book.
Context-adaptive binary arithmetic coding
- Wikipedia: entropy (information theory). Retrieved 2011-07-29.
- "Constructing Word-Based Text Compression Algorithms (1992)" by Nigel Horspool and Gordon Cormack.
- Tomáš Kuthan and Jan Lánský "Genetic algorithms in syllable based text compression". 2007. "The number of unique syllables in one language is measured in tens of thousands ... HuffSyllable is a statistical syllable-based text compression method. This technique was inspired by the principles of HuffWord algorithm."
- "Hacking Data Compression" by Andy McFadden
- "Canonical Huffman" by Arturo San Emeterio Campos. 1999.
- Canonical Huffman Codes by Michael Schindler.
- "Huffman Code Discussion and Implementation", including Canonical Huffman Codes. by Michael Dipperstein.
- "Decoding of Canonical Huffman Codes with Look-Up Tables" by Yakov Nekritch. 2000.
- "Space- and time-efficient decoding with canonical Huffman trees" by Shmuel T. Klein 1997.
- Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes. New York: Van Nostrand Reinhold, 1994. ISBN 9780442018634. pp. 33-35.
- "Huffman encoding [DRAFT"] apparently by Lara Hopley and Jo van Schalkwyk (?). mentions canonical Huffman: "We will allocate '1's to shorter codes." mentions "even a canonical Huffman tree needn't be unique. We still have some scope for playing!" (describing situations where, even though we assign *different* lengths to letters and therefore different canonical Huffman trees, we end up with the same number of compressed bits)
- Michael Schindler. "Practical Huffman coding"
- Renato Pajarola. "Fast Prefix Code Processing". "Proceedings IEEE ITCC Conference". 2003.
- NES development: Fixed Bit Length Encoding
- Andrew Polar. "Non-Huffman binary tree". July, 2010.
- Shmuel T. Klein, Miri Kopel Ben-Nissan. "On the Usefulness of Fibonacci Compression Codes". 2004. via 
- "MLP Lossless Compression" by JR Stuart, PG Craven, MA Gerzon, MJ Law, RJ Wilson 1999. section 4.2 "Prediction".