Data Compression/Markov models
Markov algorithms are data compression algorithms that have the Markov assumption: the plaintext was produced by something that can be approximated by a Markov model -- i.e., a Markov chain or hidden Markov model or variable-order Markov model or something similar.
Dictionary algorithms can always be emulated by Markov algorithms, but not the converse. 
Fully general Markov models are used in some of the most complicated algorithms with the best compression ratio known -- alas, they are also some of the slowest algorithms in use.
PPP Predictor Compression Protocol
It looks at (approximately) the 3 or 4 most-recently-decoded bytes, finds the last time that sequence occurred, and guesses that the following byte will be the same following byte as last time.
The inner loop of the Predictor decompressor is something like this:
- The Predictor decoder reads a flag byte, containing 8 flags. Each flag bit corresponds to one byte of reconstructed decompressed data.
- For each of the 8 flag bits:
- If the flag bit is 0, then the guess was wrong. The Predictor decoder reads a literal byte from the compressed data, and pass it straight through to the decompressed text. Also store that literal byte in the GuessTable[context], so perhaps the next time this happens we will guess right.
- If the flag bit is 1, then the guess was right. Read the guess from GuessTable[context] and write it to the decompressed text.
- Update the context from the byte we just wrote out to the decompressed text.
- Repeat until there are no more compressed bytes to read.
In any given context, one byte value is the most-likely next-byte. The Predictor compresses that most-likely next-byte to 1 bit. When any other byte occurs, the Predictor expands it to 9 bits. The Predictor can get pretty good compression even when it guesses wrong half the time.
1st order Markov
The 1st order Markov decompressor algorithm is a little more sophisticated than the Predictor algorithm.
After decompressing some byte, a 1st order Markov decompressor looks at every time that byte has ever occurred before. In any given 1-byte context, one byte value is the most-likely next-byte; some other byte value is the next-most-likely next-byte ... and so on; typically there are a few bytes that have only occurred once each in this context -- those are the next-least-likely next-byte values; and typically there are a few bytes that have never occurred in this context -- those are the least-likely next-byte values.
The 1st order Markov decompressor uses some entropy coder to assign each possible next-byte value a unique prefix codeword, such that the more-likely bytes are given a short codeword, the less-likely bytes are given a longer codeword (perhaps 8 bits), and the least-likely bytes are given a long codeword (longer than 8 bits).
In some cases, where the next-byte is overwhelmingly likely to be one particular byte (for example, the byte following 'q' is almost always 'u' in English text), 1st order Markov compresses those cases just as well as the Predictor algorithm. In other cases, Markov compresses significantly better than the Predictor algorithm, because it can compress 'likely' bytes even when they are not the one most-likely byte.
Higher-order letter-level and word-level Markov chains are better at predicting "likely" English phrases. Some people create "parody generators" out of Markov decoders by allowing them to build a Markov chain, but instead of feeding the decoder the particular compressed codewords that give the Markov decoder enough clues to reconstruct some particular plaintext, they feed the Markov decoder randomly-generated bits.
The decompression algorithms with the best compression typically use several different models, each one giving predictions of the next bit or byte, and somehow mix those predictions together to get a final prediction to feed into the entropy coder. As long as the final prediction is not zero or one (never-happens or always-happens), the decompressor can decode all files without error (see The zero-frequency problem for details).
Alas, so far we do not yet know the "best" way to combine such predictions. There are several ad-hoc ways that people have developed to mix predictions together. Some mixing techniques appear to be "better", in that they produce smaller compressed files than other mixing techniques (and better than using any one of the underlying predictions alone).
Before we get to the actual "context mixing" algorithms, let's look at PPM and DMC, which directly use the predictions from one single context or another (rather than mixing predictions):
Prediction by partial matching is perhaps the simplest to understand mixing technique. PPM maintains Markov models of several different orders. A PPM(5) decoder uses a 5th order Markov model whenever possible. If no prediction can be made based on all 5 context symbols -- i.e., the 5 most-recently decoded bytes have never occurred in that order before -- then it falls back on a 4th order Markov model. If the 4th order model cannot make a prediction -- i.e., the 4 most-recently-decoded bytes have never occurred in that order before -- then it falls back on a 3rd order Markov model. The PPM decoder will eventually find some Markov model it can use to compress the text, even if it has to fall all the way back to the default zero order Markov model.
Dynamic Markov compression
Dynamic Markov compression was developed by Gordon Cormack and Nigel Horspool. Gordon Cormack published his DMC implementation in C in 1993, under a non-commercial license. Dynamic Markov compression is a lossless data compression algorithm very similar to PPM, except it predicts one bit at a time, rather than predicting a byte at a time. This makes it slower but gives slightly better compression. It is used as a model or submodel in several highly experimental implementations.
The predictions are fed into an arithmetic coder. (Because the predictions are made one bit at a time, simple Huffman coding rather than arithmetic coding would give no compression).
Some model mixing / context mixing approaches include:
Context tree weighting is another mixing technique.
- Ross Williams. "LZRW4: ZIV AND LEMPEL MEET MARKOV". 1991.
- RFC 1978
- Rather than literally scanning the entire plaintext-so-far for each byte it decompresses, most programmers use a much faster approach that gives the same results: the program updates an internal hash table or 2D array for each byte it decompresses, then directly fetches the data it needs for the next byte from that data structure.
- Jon Bentley. "Generating Text" (Section 15.3 of Programming Pearls) 2000.
- Wikipedia: Prediction by partial matching
- Wikipedia: Dynamic Markov compression
- Gordon V. Cormack. "Dynamic Markov Compression (DMC) Version 0.0.0". 1993. dmc.c
- Wikipedia: Context tree weighting