Data Compression/Evaluating Compression Effectiveness
When an application programmer is deciding which compression algorithm to use, there are a number of factors that must be balanced: 
- Speed: how fast is it?
- compression ratio: There's really no point if it doesn't make our data smaller than the uncompressed size.
- complexity: How big is the executable? (This has both hard costs -- how much ROM or disk space does the software use? -- as well as soft costs -- if I need to change it, how long does it take to go through all the source code to find the right thing to change, and once changed, how long does it take to test every part of the code to make sure the change doesn't break something else?)
- space: how much RAM does it need to run?
- latency: how long do I have to wait before I hear the first note of the song?
- interoperability: Will the files I generate be readable by any standard archive utility?
When a compression library programmer has tweaked a compression algorithm and trying to decide if it is really better or if he should revert to the previous version, he uses the same criteria.
Speed[edit | edit source]
When evaluating data compression algorithms, speed is always in terms of uncompressed data handled per second.
- speed = number of uncompressed bits that can be handled in one second.
Some applications use data compression techniques even when they have so much RAM and disk space that there's no real need to make files smaller. File compression and delta compression are often used to speed up copying files from one end of a slow connection to another. Even on a single computer, some kinds of operations are significantly faster when performed on compressed versions of data rather than directly on the uncompressed data. In particular, some compressed file formats are designed so that compressed pattern matching -- searching for a phrase in a compressed version of a text file -- is significantly faster than searching for that same phrase in the original uncompressed text file.
(Is "zgrep" or "zcat" relevant here?)
Some compression algorithms specifically designed to compress bitmaps -- the byte-aligned bitmap code (BBC), the word-aligned hybrid code (WAH), the position list word aligned hybrid (PLWAH), the compressed adaptive index (COMPAX), and the compressed 'n' composable integer set (CONCISE) -- allow bitwise operations to be directly applied to compressed data. This can be much faster than decompressing the data, applying the bitwise operation, and then re-compressing the result.
(FIXME: move some of the above to compressed domain processing. )
Decompression Speed[edit | edit source]
In many applications, the decompression speed is critical. If a particular implementation of an audio decompressor running on a prototype portable music player hardware cannot sustain 1.4 Mbit/s to the headphones, then it is unusable. No one will buy it unless you switch to a different implementation or faster hardware (or both) that can keep up with standard stereo audio speeds.
Compression Speed[edit | edit source]
In a few applications, the compression speed is critical. If a particular implementation of an audio compressor running on a prototype voice recorder cannot sustain 7 bits/sample/channel x 1 channel x 8 kSamples/s = 56 kbit/s from the microphones to storage, then it is unusable. No one wants their recorded voice to have silent gaps where the hardware could not keep up. No one will buy it unless you switch to a different implementation or faster hardware (or both) that can keep up with standard telephone-quality voice speeds.
Decompression Speed and Compression Speed[edit | edit source]
The speed varies widely from one machine to another, from one implementation to another. Even on the same machine and same benchmark file and same implementation source code, using a different compiler may make a decompressor run faster.
The speed of a compressor is almost always slower than the speed of its corresponding decompressor.
Even with a fast modern CPU, compressed filesystem performance is often limited by the speed of the compression algorithm. Many modern embedded systems -- as well as many of the early computers that data compression algorithms were first developed on -- are heavily constrained by speed. There are a limited number of compression algorithms that are fast enough to be usable on extremely speed-constrained systems: 
- algorithms in the LZSS family;
- algorithms in the LZW family;
- fixed byte pair encoding such as PalmDoc;
- delta encoding + adaptive Huffman.
... any others? ...
Space[edit | edit source]
Many modern embedded systems—as well as many of the early computers that data compression algorithms were first developed on—are RAM-limited. When available RAM is so small that there is not enough room for both decompressed text and also a separate dictionary—such as the 12 KByte dictionary needed by a GIF decoder for the LZW dictionary—very few compression algorithms work under those constraints.  With so little RAM,
- If we need the entire decompressed text in RAM—such a self-extracting executable—then we are forced to rule out LZW and use the decompressed text as the dictionary, like LZ77 and Pucrunch
- If we don't need the entire decompressed text in RAM—such as external modems decompressing data sent over "slow" telephone links before forwarding the plaintext to a PC—then for any given amount of RAM, LZW-like compressed formats usually give better compression than LZ77-like compressed formats. For typical English text, the more words and phrases you store in the decompressor's dictionary, the better the compression. LZ77 uses a lot of RAM to hold common words that repeat frequently. LZW uses the available RAM to store each unique word exactly once, and so gives better compression than LZ77 when RAM is limited.
When designing the compressed file format, there is typically a speed/space tradeoff between variable-length formats and byte-aligned formats. Most systems can handle byte-aligned formats much faster. The variable-length formats typically give better compression. The byte-aligned formats can and often do use data sizes other than 8 bit data.  For example, many LZ77-like decompressors use byte-aligned formats with 16-bit "items" that the decompressor breaks into a 3 bit length and a 13 bit offset. Some decompressors use a mix of 1-bit, 8-bit, and 16-bit items, where (for speed) the 1 bit items are carefully packaged into 8-bit "control bytes" so everything else can stay byte-aligned. (Later in the book we discuss byte-aligned formats in more detail: Data Compression/Dictionary compression#Implementation tips and tricks).
Evaluating Compression Ratio[edit | edit source]
In this book, we define the compression ratio as
A algorithm that can take a 2 MB compressed file and decompress it to a 10 MB file has a compression ratio of 10/2 = 5, sometimes written 5:1 (pronounced "five to one").
For streaming audio and video, the compression ratio is defined in terms of uncompressed and compressed bit rates instead of data sizes:
For example, songs on a CD are uncompressed with a data rate of 16 bits/sample/channel x 2 channels x 44.1 kSamples/s = 1.4 Mbit/s. That same song encoded at (lossy "high quality") 128 kbit/s Vorbis stream (or 128 kbit/s MP3 stream or a 128 kbit/s AAC file) yields a compression ratio of about 11:1 ("eleven to one").
That same song encoded with a typical lossless audio compressor such as FLAC or WavPack typically gives a compression radio of about 2:1 to 3:1 ("three to one"), although a few songs give no compression (1:1) and some kinds of classical music give better than 3:1 compression with such a lossless audio compressor.
Using this definition of "compression ratio", for a given uncompressed file, a higher compression ratio results in a smaller compressed file.
(Unfortunately, a few other texts define "compression ratio" as the inverse, or with arbitrary other factors of 8 or 100 inserted, or some other formula entirely).
Some typical compression ratios for lossless compression of text files:
- 2:1 or worse: Extremely simple and fast algorithms such as LZRW1-a that can saturate a 100 MB/s hard drive with compressed data on a 2GHz CPU.
- 2.5:1 to 3.5:1 compression on text files: Slightly more sophisticated (and therefore slightly slower) algorithm such as DEFLATE
- 5:1 Unfortunately, all known algorithms for getting a compression factor better than 5:1 on English language text all run extremely slowly, 5 or more hours to decompress 100 MB of uncompressed data on a 2GHz P4.
- 6.27:1 The current (2009) Hutter prize world record compression
All lossless data compression algorithms give different data compression ratios for different files. For almost any data compression algorithm, it is easy to artificially construct a "benchmarketing" file that can be compressed at amazingly high compression ratio and decompressed losslessly. Unfortunately, such artificially high compression ratios tell us nothing about how well those algorithms work on real data. A variety of standard benchmark files are available. Using a large collection of such benchmark files helps a programmer avoid accidentally over-tuning an algorithm so much that, while it works great on one particular file, it is horrible for other files.
Some standard benchmark files are listed later in this book -- Data Compression/References#Benchmark files.
Generic Compression[edit | edit source]
Some programmers focus on "generic compression" algorithms that are not tied to any particular format, such as text or images. These programmers tune their algorithms on a collection of benchmark files that include a variety of formats.
Format-Specific Compression[edit | edit source]
Other programmers focus on one specific kind of file -- video compression, still image compression, single-human speech compression, high-fidelity music compression, English text compression, or some other specific kind of file. Often these format-specific programmers try to find some kind of redundancies in the raw data that can be exploited for lossless compression -- for example, music often has one dominant tone that repeats over and over at one specific frequency for a few tenths of a second -- but each repeat is never quite exactly the same as any of the others -- then a similar dominant tone that repeats over and over at some other frequency for a few more tenths of a second. Often these format-specific programmers try to find limits to human perception that can be exploited for lossy compression. Often these programmers come up with ways to "transform" or "preprocess" or "de-correlate" the data, and then hand the intermediate results off to some "generic compression" algorithm. We discuss this more at Data Compression/Multiple transformations.
For the particular format it was tuned for, such format-specific compression algorithms generally give much better results than a generic compression algorithm alone. Alas, such algorithms generally give worse results than a generic compression algorithm for other kinds of files.
Latency[edit | edit source]
Latency refers to a short period of delay (usually measured in milliseconds) between when an audio signal enters and when it emerges from a system.
Compression adds 2 kinds of latency: compression latency and decompression latency, both of which add to end-to-end latency.
In some audio applications, in particular 2-way telephone-like communication, end-to-end latency is critical. The recommended maximum time delay for telephone service is 150 milliseconds(Wikipedia:1 E-1 s). This rules out many popular compression algorithms. For example, a standard Huffman compressed block size of 150 milliseconds or longer won't work. Standard Huffman compression requires analyzing an entire block before sending out the compressed prefix code for the first symbol in the block. In this case the Huffman compressor uses up all the time allowed waiting for the block to fill up, leaving no time for time-of-flight transmission or decoding. (A block size of 150 milliseconds or longer may work fine with an adaptive decoder).
In other audio applications, end-to-end latency is irrelevant. When compressing a song for later distribution, or the soundtrack of a movie, the compressor typically has the whole thing available to it before it begins compressing. Such applications may use low-latency algorithms when they are adequate; but they also allow other algorithms to be used that may give better net compression or a lower peak bit rate.
In a few applications, only decompression latency is important. For example, if a particular implementation of an audio decompressor running on a prototype portable music player hardware has a latency of 10 minutes, then it is almost unusable. No one wants to wait 10 minutes after selecting a song before starting to hear it. No one will buy it unless you switch to a different implementation or faster hardware (or both).
Many compression algorithms have a minimum information-theoretic latency, measured in bits. (Is there a better name for what this paragraph and the next discusses than "information-theoretic latency"?) Given a constant uncompressed bitrate, this corresponds to the worst-case delay between when a (uncompressed) bit goes into the compressor, to the time the corresponding (uncompressed) bit comes out the decompressor, in situations where the bitrate is so slow that we can neglect the time required for the computations done in the compressor and the decompressor and the time-of-flight.
A fixed prefix coder or an adaptive Huffman coder typically has an extremely short information-theoretic latency. Many of them have latencies of less than 16 bits.
MP3 compressors sacrifice latency and quality to gain much higher compression ratios. They have a latency of at least 576 time-domain 16-bit samples at 44.1 kHz, giving a latency of at least 9,216 bits or 13 milliseconds, often longer to take advantage of the "byte reservoir".
Energy[edit | edit source]
There has been little research done on the amount of energy used by compression algorithms.
In some sensor networks, the purpose of compression is to save energy. By spending a little energy in the CPU compressing the data, so we have fewer bytes to transmit, we save energy in the radio -- the radio can be turned on less often, or for shorter periods of time, or both.
The best compression algorithms for such sensor networks are the ones that minimize the total energy, and so maximize the runtime, the length of time between battery replacements. Such algorithms are in a niche between, on one side, algorithms that produce smaller compressed files, but use up too much CPU energy to produce them; and on the other side, algorithms that use less CPU energy, but produce (relatively) longer files that take far more energy to transmit.
Comparisons[edit | edit source]
We use the game-theory term "dominates" to indicate that one algorithm is faster, gives smaller compressed files, and has lower latency than some other algorithm. Of course, some particular implementations are not very well optimized, and so they could be tweaked to run *much* faster and still implement the same algorithm with the same compressed file format. But any abstract algorithm necessarily requires some minimum number of operations, and so it is unlikely that an algorithm that requires a couple orders of magnitude more operations than some currently-faster algorithm can ever be optimized enough to dominate that faster algorithm.
Many historically important compression algorithms are now obsolete, having been dominated by some other, more useful algorithm. But currently (and for the foreseeable future) there is no one "best" compression algorithm even for a fixed set of benchmark files -- there is a spectrum of many "best" algorithms along the Pareto frontier; that spectrum of algorithms together dominates and makes obsolete all other known algorithms. Blazingly fast algorithms that give some but not a lot of compression sit at one end of the Pareto frontier. At the far end of the Pareto frontier sit the most compact known ways to compress benchmark files -- but, alas, running so slowly that they are not very useful.
Further reading[edit | edit source]
- "Hacking Data Compression" by Andy McFadden 1993
- "A Survey of Compressed Domain Processing Techniques" by Brian C. Smith
-  "SASE: Implementation of a Compressed Text Search Engine" (1997) by Srinidhi Varadarajan And , Srinidhi Varadarajan , Srinidhi Varadarajan , Tzi-cker Chiueh , Tzi-cker Chiueh
- Shmuel T. Klein and Miri Kopel Ben-Nissan. "On the Usefulness of Fibonacci Compression Codes". 2004. 2009.
- "A Text Compression Scheme That Allows Fast Searching Directly In The Compressed File" 1993 by Udi Manber
- "Phrase-based pattern matching in compressed text" by J. Shane Culpepper and Alistair Moffat
- Wikipedia: Compressed suffix array
- Shmuel T. Klein and Dana Shapira. "Searching in Compressed Dictionaries".
- Carlos Avendaño Pérez, and Claudia Feregrino Uribe. "Approximate Searching on Compressed Text".
- Udi Manber. "A text compression scheme that allows fast searching directly in the compressed file".
- Yusuke Shibata , Takuya Kida , Shuichi Fukamachi , Masayuki Takeda , Ayumi Shinohara , Takeshi Shinohara , Setsuo Arikawa. "Byte Pair Encoding: A Text Compression Scheme That Accelerates Pattern Matching" 1999. "pattern matching in BPE compressed text is even faster than matching in the original text."
- Wikipedia: bitmap index: compression. FIXME: replace this reference with better reference(s).
- "Pucrunch: An Optimizing Hybrid LZ77 RLE Data Compression Program, aka Improving Compression Ratio for Low-Resource Decompression" by Pasi Ojala
- "Unzipping the GZIP compression protocol" by Joe Rash
- Matt Mahoney. "Generic Compression Benchmark". Further discussion: "benchmark for generic compression".
- IETF. "6LoWPAN Generic Compression of Headers and Header-like Payloads"
- Christopher M. Sadler and Margaret Martonosi "Data Compression Algorithms for Energy-Constrained Devices in Delay Tolerant Networks"