Data Compression/Streaming Compression
file vs streaming compression
Many data compression algorithms are file oriented. They assume that underlying transmission protocols have transmitted every bit of the compressed data file without error.
A few data compression algorithms are designed to be used with streaming one-way broadcast. For example, the MPEG-2 compression algorithm used by almost all over-the-air digital TV. At any time, a receiver may be turned on and start listening to the broadcast. In order for a receiver to start up in mid-stream -- or recover after a few seconds of severe transmission errors -- such systems typically break up the stream into "blocks". The decoder uses synchronization marks or checksums to detect the beginning of the next block, and starts decoding from there. At each new block, it re-initializes and re-starts the decoding process. Block decoders that use a static per-block dictionary or static Huffman table forget the old one and read a new dictionary or table at the beginning of each block. Block decoders that use an adaptive dictionary forget the old dictionary, reinitialize the default dictionary, and begin building a new dictionary.
The block size is a compromise between compression (longer blocks usually give better compression) and quick restarts (shorter blocks allow the decompressor to restart more quickly after an error or after turning on). Because practically all errors can be detected with CRC codes, a streaming decoder knows which blocks are in error, and typically rejects the entire block. Streaming audio decoders use a variety of techniques to mask such errors -- they may attempt to "fill in" with a prediction that previous tones will continue. To avoid "clicks", they may choose to ramp down the volume to silence at the end of the last good block, play silence during the bad block, and then ramp back up to normal volume at the beginning of the next good block.
There has been little research on alternatives to "blocks". Some people speculate that, compared to using a block size of B bytes, it may be possible to achieve better compression and just as quick restarts with a (non-blocking) converging decoder designed so that all decoders, no matter when they are turned on, eventually converge on the same adaptive dictionary after at most B bytes. An adaptive block decoder has very little context (and poor compression) at the beginning of each block, gradually rising to lots of context (and much better compression) at the end of the block. A non-adaptive block decoder has the overhead of a dictionary or Huffman table at the beginning of each block (in effect, zero compression during the dictionary or table, and high compression during the data in the remainder of the block). The converging decoder, in theory, has some intermediate context (and some intermediate compression) at all times, without dictionary or table overhead or times of poor compression.
packet header compression
tape storage compression
One of the earliest applications for data compression was to compress data for backup tapes. Many tape drives have an implementation of a data compression algorithm embedded in the firmware of the tape drive itself. (With modern computers, one can get better compression by turning off this "hardware compression" and using a modern "software compression" algorithm. Software compression algorithms takes advantage of the much faster CPU and much larger RAM available to the main processor than to the processor inside the tape drive. Early computers could not use software compression because they were not fast enough to keep up with the tape).
Many tape drive manufacturers use a LZ algorithm. IBM and QIC use the ALDC algorithm. Exabyte uses the IDRC algorithm. DLT uses the DLZ1 algorithm.
The VXA tape backup format (developed by Ecrix and produced by Exabyte) uses ALDC data compression.
Adaptive Lossless Data Compression algorithm (ALDC) is standardized by ECMA-222.
The "Linear Tape-Open" (TO) tape backup format (produced by several manufacturers) uses LTO-DC data compression, also called Streaming Lossless Data Compression (SLDC).
For legal reasons, many data compression algorithm patents are described as being part of such tape backup storage systems.
Most data compression algorithms are described in terms of abstract "streams" of data of indefinite length. In practice, data almost always comes in finite-length blocks with a distinct beginning and end. Systems that compress and decompress such data eventually come to the end of the block of data.
How does the decompressor know when to stop?
Some people argue that end-handling is "outside of the scope of data compression". And so many programmers use the "single responsibility principle" try separate the "end handling" code from the "decompression" code -- "end handling" is considered the responsibility of some other layer in the protocol stack. (Those other layers are described in more detail in other books, such as Data Coding Theory, Communication Networks, Serial Programming, etc. ).
Archive file formats generally store compressed data in a "container format" that stores a block of compressed data with some metadata. There are 5 popular approaches to implementing such archive formats and streaming protocols:
- The metadata includes both
- the compressed length (this makes it much easier to skip over compressed files in an archive to decompress just one desired file),
- and* the decompressed length (this makes implementing the decompressor simpler)
- The metadata describes exactly how many bits/bytes/symbols/pixels are in the *decompressed* data.
The software that decodes the container format passes this "uncompressed length" and the compressed data to the decompressor, and the decompressor keeps a count of exactly how many bits/bytes/symbols/pixels it has produced so far, and stops when it produces enough. (Sometimes it is simpler to allow the decompressor to decode a few extra bits/bytes/symbols/pixels into a temporary buffer, and then copy the correct number into the final output). In some systems, the decompressor is required to return a count of how many *compressed* bytes it consumed, so the container handler knows where to continue decoding the next part of the file.
- The compressed data is stored in some kind of "container format" that includes metadata describing exactly how many significant bits are in the *compressed* data.
The software that decodes the container format passes this "compressed length" and the compressed data to the decompressor, and the decompressor keeps a count of exactly how many bits it eats up so far. In some systems, the decompressor is required to return a count of how many *decompressed* bytes it produces, so the container handler knows how many decompressed bytes in the buffer are "real" data.
- The compressed data is stored in some kind of "container format" that includes metadata describing exactly how many bytes are in the *compressed* data,
but the number of significant bits in the last byte is unknown. (Often the compressor pads out the compressed data with 0 bits until it fills out a whole byte. Somehow the decompressor needs to figure out the difference between "this is a codeword that happens to end with some (significant) zeros" -- in particular, the all-zeros codeword -- vs "this is not a real codeword but only zero padding". Software to handle this distinction is notoriously difficult to get exactly right and bug-free in all cases). In some systems, the decompressor is required to return a count of how many *decompressed* bytes it produces, so the container handler knows how many decompressed bytes in the buffer are "real" data.
- The compressed data is stored or transmitted in a way so that neither the "compressed length" nor the "uncompressed length" are known,
and somehow the decompressor is responsible for detecting the end and returning both the decompressed data and that "end point" metadata to the rest of the software.
Container formats are sometimes described as having separate "metadata sections" and "compressed data sections". But when Evaluating Compression Effectiveness, we must consider *all* the data given to the decompressor as the "compressed size" or "compressed rate" -- both the "compressed data sections" and data such as "uncompressed data size", "compressed size in bytes", and "compressed size in bits" that may normally be considered part of the "metadata sections".
A few people prefer compressed data formats designed such that concatenating a bunch of compressed files into one big file, and then later decompressing that big file, "works correctly" -- i.e., it produces one big decompressed file that is identical to concatenating all the original decompressed files into one big output file.
file compression vs compressed file systems
FIXME: Is there a better place in this book to discuss compressed file systems and flash memory?
Many data compression algorithms are whole-file oriented. They assume that the entire original file is available up-front, and people will want to decompress the entire thing from beginning to end.
Compressed file systems (and especially flash-memory file systems) break that assumption. The ZFS file system uses the LZJB compression algorithm. Many internet routers and internet firewalls use a compressed flash-memory file system, often the cramfs file system. (See Reverse Engineering/File Formats#Compression, Encryption & Scrambling for an example). Some "solid-state hard drives" (SSDs) internally use compression algorithms to reduce the amount of flash "used up" by files -- this doesn't actually give users any more storage space, but the larger amount of empty storage space can be used internally by the SSD in ways that extend the lifetime of the drive.
These systems require relatively rapid random-access to data stored in the file system -- decompressing everything from the beginning would be too slow.
One approach is to use an index that, given some file name we want to look up (or a "logical block number"), it tells where exactly on the disk (or in the bulk flash) to start decompressing, combined with a streaming compression algorithm -- so the decompressor can jump into the middle of the compressed data and start decompressing from there.
A much more difficult requirement of these systems is to allow files to be edited. (This is so difficult that some compressed file systems, such as cramfs, such as python executable zipfiles, don't allow it -- they must be created all-at-once from a directory of uncompressed files. After a cramfs system is created, it can only be mounted as read-only).
FIXME: go into a little more detail on why unmodified file-oriented compression algorithms won't work for compressed file systems, and what techniques are used to either (a) modify those algorithms so they are suitable, or (b) other algorithms that are useful for compressed flash file systems, or (c) combinations of both.
- Craft, D. J. "A fast hardware data compression algorithm and some algorithmic extensions". IBM Journal of Research and Development. 1998.
- ECMA. "Standard ECMA-222: Adaptive Lossless Data Compression Algorithm"
- Radomir Dopieralski. "Your Python Application as a Single File"