Data Compression/Multiple transformations

From Wikibooks, open books for an open world
Jump to: navigation, search

combination[edit]

Early text data compression techniques were divided into two very different-seeming approaches:

  • fixed-to-variable decompressors that decompress fixed-size "references" into plaintext strings of varying length (Ziv and Lempel), and
  • variable-to-fixed decompressors that decompress variable-length "prefix codes" or "arithmetic codes" into plaintext strings of some fixed length (Markov).

Researchers are searching for some way to combine them, to get the benefits of both. [1]

A typical paragraph of text has several different kinds of patterns. Some archive software applies a series of different transformations, each transformation targeting a different kind of pattern in the text.

Such a transformation is often based on some kind of model of how the file was produced.

Compression software (especially image compression software) often sends the raw data through one or more stages of decorrelation -- such as Karhunen–Loève transform[2], color space transformation, linear filtering, DFT, transform coding, etc. -- the "precompressor" stages -- before feeding the result into an entropy compressor.

Many transforms used in compression, counter-intuitively, produce an intermediate result that requires more bits of dynamic range to store. Transforming raw 24-bit RGB pixels into 26-bit YCoCg-R pixels[2] or transforming raw 16-bit audio samples into 17-bit delta coded differences may seem to "waste" bits, but it improves net compression performance: In many cases, the entropy compressor gives smaller compressed file sizes when given such "de-correlated" intermediate data than when given the original raw data. (The improvement in results resulting from such transformations is often called "coding gain").

color transforms[edit]

The JPEG 2000 Reversible Color Transform (RCT) is a YUV-like color space that does not introduce quantization errors, so it is fully reversible.

The compressor converts RGB to YCbCr with:


Y = \left\lfloor \frac{R+2G+B}{4} \right\rfloor ;
C_B = B - G ;
C_R = R - G ;

and then compresses the image.

The decompressor decompresses the image, then takes YCbCr pixels and restores the original RGB image with:


G = Y - \left\lfloor \frac{C_b + C_r}{4} \right\rfloor ;
R = C_R + G ;
B = C_B + G.

A pixel in the JPEG 2000 YCbCr format requires 26 bits to represent losslessly.

The Red, Green, and Blue layers of an image are typically highly correlated. So compressing the R, G, and B layers independently ends up storing much of the same data 3 times.

Even though "decorrelating" a 24-bit-per-pixel RGB image into a YCbCr requires *more* bits (26 bits) to hold each pixel, decorrelating helps later stages in the compression pipeline do a better job. So independently compressing the Y, Cb, and Cr planes usually results in a smaller compressed file than independently compressing the R, G, and B layers of the same file.

Most (but not all!) lossless color space transforms, including the JPEG 2000 RCT, are "expanding". "Expanding" color space transforms convert 24 bit RGB pixels to intermediate YCbCr pixels that require *more* than 24 bits to represent losslessly.

A few lossless color space transforms are non-expanding. "Non-expanding" color space transforms convert 24 bit RGB pixels to intermediate YCoCg24 pixels that require the same 24 bits to represent losslessly.

All color space transforms that convert 24 bit RGB pixels to something less than 24 bits per pixel are lossy. (Do I need to spell out why?).

Clipboard

To do:
list some alternative lossless color space transforms, perhaps even one of the rare "non-expanding" transforms listed at http://stackoverflow.com/questions/10566668/lossless-rgb-to-ycbcr-transformation

DEFLATE[edit]

DEFLATE combines LZ77 with Huffman encoding. It is specified in RFC 1951.

It is extremely widely used to decode ZIP files and gzip files and PNG image files.

zlib[edit]

The "zlib compressed data format specification" is a standardized way of packaging compressed data in a file, with a simple header that indicates the type of compression and helps detect the most common kinds of data corruption. It is often used to store data compressed with DEFLATE with a window size up to 32K.

The "zlib compressed data format specification" is specified in RFC 1950.

LZX[edit]

The LZX file format and compression algorithm was invented by Jonathan Forbes and Tomi Poutanen. LZX is used in a popular stand-alone AMIGA file archiver. Microsoft Compressed HTML Help (CHM) files use LZX compression. Windows Imaging Format (".WIM") files support several data compression methods, including LZX. Xbox Live Avatars use LZX compression.

LZX uses both LZ77-style sliding window compression and dynamic Huffman codes, somewhat like DEFLATE uses both. The LZ77-style window size (WINDOW_SIZE) is some power-of-two, at least 2^15 and at most 2^21. LZX uses several Huffman tree structures, each of them canonical.

The most important Huffman tree in LZX is the "main tree", composed of 256 elements corresponding to all possible ASCII characters, plus 8*NUM_POSITION_SLOTS elements corresponding to matches. (The NUM_POSITION_SLOTS is in the range of 30 slots to 50 slots). The second most important Huffman tree in LZX is the "length tree", composed of 249 elements. Other trees have smaller roles.

The LZX compressor arbitrarily divides up a file into one or more blocks. Each block is one of 3 types: "verbatim block", "aligned offset block", "uncompressed block". The "uncompressed block" has a few more things in the header, and then a literal copy of the uncompressed data.

In the data portion of the "verbatim block" and "aligned offset block" blocks, LZW uses length-limited Huffman codewords; LZW limits each codeword to at most 16 bits wide. Each "verbatim block" and "aligned offset block" has a larger header that gives the information (in a very compact differentially-compressed form) required for the decoder to regenerate the various Huffman trees used in that block.

The header is decoded to get the codeword length for each symbol, has a length of 0 to 16 bits (inclusive), where "0 bits" indicates the corresponding symbol never occurs in this block, and "16 bits" are used for symbols that occur extremely rarely. A variety of techniques are used to compress the header into relatively few bits.

When the "distance" value is decoded, there are several special values. It happens quite often that two long phrases of English text are *almost* identical, but there is a small typo or substitution in the middle of the otherwise-identical long phrase. With the original LZ77, the result is 3 copy items: [long distance X to first part, length of first part], [distance Y to the insertion, length of insertion], [exactly the same long distance X to start of second part, length of second part]. LZX, rather than repeat exactly the same long distance X twice, uses a special "distance" value to indicate that we are "continuing" to copy a previous block after making a substitution in the middle. (The special distance value ends up occupying fewer bits than the full long distance).

  • "distance" == 0: "repeat the most recent match distance, which was not itself a repeated distance"
  • "distance" == 1: "repeat the second-most-recent match distance, which was not itself a repeated distance"
  • "distance" == 2: "repeat the third-most-recent match distance, which was not itself a repeated distance"
  • distance == 3: copy starting from 1 byte back, the most-recently-decoded byte (pseudo-RLE on bytes), the closest allowable.
  • distance == 4: copy starting from 2 bytes back, the next-most-recently-decoded byte (pseudo-RLE on byte pairs)
  • distance == 5: copy starting from 3 bytes back
  • distance == 6: copy starting from 4 bytes back
  • ...
  • distance == 1000: copy starting from 998 bytes back
  • ...
  • distance = x+2: copy starting from x bytes back
  • ...
  • distance == WINDOW_SIZE-1 (all bits high): copy starting from WINDOW_SIZE-3 bytes back, the most distant copy possible.

After the decompressor uses the block header to set up all the trees, the decompressor's main loop in a verbatim block is:

  • Read a codeword from the compressed text using the main tree and decode it into a "main element".
  • If the "main element" is less than 256, emit the "main element" as a literal plaintext byte.
  • If the "main element" is 256 or more, we're going to have a copy item.
    • Break the "main element - 256" into two parts: a "slot" and a 3 bit "length".
    • If the "length" is the special all-ones value "7", read another codeword from the compressed text using the "length tree" and decode it into a "long length"; add that to the 7 already in the length.
    • If the "slot" is in the range 0..3, then set the distance the same as the slot.
    • If the "slot" is in the range 4..37, then extra = ((slot >> 1) - 1) more bits are needed to get the exact distance.
      • read those extra raw "footer" bits from the compressed text, (not using any Huffman tree, just the raw bits). The distance is (in binary) an implied 1 bit, followed by the (slot & 1) bit, followed by the extra raw footer bits.
    • If the distance is 0..2, it indicates a "repeated distance" (as described above) -- replace it with the actual distance we used before.
    • Do a normal LZ77-style copy from the previously decoded plaintext: copy from a distance (as described above) a total of "length + 2" bytes into the decoded plaintext.
  • repeat until we've decoded the total number of bytes that the header says the uncompressed plaintext should contain.

After the decompressor uses the block header to set up all the trees, the decompressor's main loop in an aligned block is:

  • Read a codeword from the compressed text using the main tree and decode it into a "main element".
  • If the "main element" is less than 256, emit the "main element" as a literal plaintext byte.
  • If the "main element" is 256 or more, we're going to have a copy item.
    • Break the "main element - 256" into two parts: a "slot" and a 3 bit "length".
    • If the "length" is the special all-ones value "7", read another codeword from the compressed text using the "length tree" and decode it into a "long length"; add that to the 7 already in the length.
    • If the "slot" is in the range 0..3, then set the distance the same as the slot.
    • If the "slot" is in the range 4..37, then extra = ((slot >> 1) - 1) more bits are needed to get the exact distance.
      • if extra > 3, read (extra - 3) raw "footer" bits from the compressed text, (not using any Huffman tree, just the raw bits). Then read another codeword from the compressed text using the "aligned tree" and decode it to 3 bits (0 to 7). The distance is (in binary) an implied 1 bit, followed by the (slot & 1) bit, followed by the (extra - 3) raw footer bits, followed by the 3 aligned bits.
      • if extra = 3, read another codeword from the compressed text using the "aligned tree" and decode it to 3 aligned bits (0 to 7). The distance is (in binary) an implied 1 bit, followed by the (slot & 1) bit, followed by the 3 aligned bits.
      • if extra is 1 or 2 bits, read those extra raw "footer" bits from the compressed text, (not using any Huffman tree, just the raw bits). The distance is (in binary) an implied 1 bit, followed by the (slot & 1) bit, followed by the extra raw footer bits.
    • If the distance is 0..2, it indicates a "repeated distance" (as described above) -- replace it with the actual distance we used before.
    • Do a normal LZ77-style copy from the previously decoded plaintext: copy from a distance (as described above) a total of "length + 2" bytes into the decoded plaintext. (The minimum copy length is 2 bytes).
  • repeat until we've decoded the total number of bytes that the header says the uncompressed plaintext should contain.

The Microsoft "cabinet" (".CAB") compressed archive file format supports 3 data compression methods: DEFLATE, LZX, and Quantum compression. The compressor that puts executable files into a ".CAB" archive may optionally "detect "CALL" instructions, converting their operands from relative addressing to absolute addressing, thus calls to the same location resulted in repeated strings that the compressor could match, improving compression of 80x86 binary code." The ".CAB" decompressor, after doing LZX decompression described above, and if the "preprocessing" bit in the header was set to one, must convert those operands back to relative addressing. We will discuss such preprocessor/decorrelators in general, and Executable Compression specifically, in more detail in a later chapter.

Clipboard

To do:
Is this CALL-filter preprocessor used for all executable files in a cabinet, or just LZX-compressed files?

[3][4][5][6]

LZX DELTA[edit]

LZX DELTA (LZXD) is a variation of LZX modified to support efficient delta compression (patching).

A LZX DELTA decompressor takes the old version of some plaintext file (this must be exactly the same as the old version used by the compressor) and the LZXD-compressed patch file, and uses them to generate the new version of that plaintext file.

The main difference between LZXD and LZX is that the window is pre-loaded with the reference file, with the decompressor decoding the first byte of the output file into the window at the location immediately after the last byte of the reference file. Normal LZX copy items have a distance that is at most the full distance back to the first byte of the output file; LZXD allows a distance that reaches much further back to allow copies from the reference file.

LZXD supports many more "slots", up to 290 slots, which are encoded into the main tree. This allows LZXD copy items to grab things much further into the past, allowing it to take advantage of very long reference files.

LZXD also supports much larger "length" values. Much as LZX has a "short length" of 0 to 7, where "7" is used as an escape for to get a "long length" from 7 to 255, LZXD uses the same "short length" and "long length" and goes one step further, using a long length of "257" as an escape for an "extra-long length" that can copy 32 KByte in a single copy.

[7]

LZARI[edit]

LZARI uses LZSS pre-processor followed by arithmetic encoding. LZARI was developed by Haruhiko Okumura.

LZH[edit]

LZH uses LZSS pre-processor followed by Huffman coding. LZH was developed by Haruyasu Yoshizaki for the LHA archiver, based on LZARI.

LZMA[edit]

The Lempel–Ziv–Markov chain algorithm (w:LZMA) uses a LZ77-like compression algorithm, whose output is encoded with range coding.

LZMA is one of the compression methods used in the open-source 7-Zip file archiver.

The LZ77-like compression part of LZMA has many similarities with the corresponding LZ77-like compression part of LZX. LZMA does not use Huffman codes -- instead, it uses context-coded binary arithmetic codes.

LZHAM[edit]

LZHAM (LZ, Huffman, Arithmetic, Markov) is a general purpose lossless data compression library.[8]

implementation tips[edit]

Many compressed file formats must be decompressed using an entropy decoder, followed by a whole series of transformations -- a kind of Wikipedia: Pipeline (software).

There are 4 ways to implement such decompressors (and the corresponding compressors):


  • Multi-pass compression: each inner transformation reads all the input from some temporary file, and writes all its output to some temporary file, before the next stage starts.
  • Run every stage more-or-less simultaneously
    • Use co-routines or threads or processes or Unix pipes. Alas, these techniques are not supported in many embedded system environments.
    • Use decompression functions that call "down" to fetch more data. The outer application directly calls only the final transformation stage, which calls some intermediate transformation stage, which calls some other intermediate transformation stage, which calls the entropy decoder at the beginning of the pipeline, which calls the routine that fetches bytes from the file. The outer application repeatedly calls that final transformation stage until all the data is processed.
    • Use decompression functions that return "up" to fetch more data. The outer application directly calls each stage in turn. Each stage reads data from its input buffer, updates its internal state, and writes data to its output buffer. That output buffer is the input buffer of the next stage. The outer application repeatedly cycles through all the stages until all the data is processed.

Converting software from a "call down to fetch more data" API to a "return up to fetch more data" API is difficult.[9]

encryption[edit]

Because encrypt-then-compress doesn't make the file any smaller, many people do compress-then-encrypt.[10][11][12][13][14][15][16][17][18]

A few people are doing research on algorithms that compress and encrypt simultaneously.[19][20]

Some researchers test encryption algorithms by trying to compress the encrypted data. If compression produces a compressed output file that is smaller than the encrypted input file, that tells the researcher that there is a flaw in the encryption software -- either a bug in the implementation, or it's a correct implementation of an insecure algorithm.

implementation tips[edit]

Some people using lossless compression prefer to use "idempotent" (Is there a better term for this?) compression software, even if it takes a little longer to compress.[21] In other words, they expect that when you compress a file with "maximum compression" (often "-9n") to create one compressed file, then uncompress it, then re-compress it, they expect the resulting second compressed file is identical to the first compressed file.

This generally requires that each and every transformation implementation (at least when passed the "-9n" option) is idempotent. Some things that are not idempotent (and so should be avoided, at least when passed the "-9n" option) are:

  • Some compressors store the original filename and timestamp of the uncompressed file in the compressed file. To be idempotent, either the compressor must leave out the name and timestamp (that's what the "-n" typically does) or else the decompressor must restore the original name and timestamp.
  • color transforms (if any) should be reversible (lossless)[2]
  • For any given implementation of a decompression algorithm, there are almost always many different ways for a compressor to represent the original file -- there are many compressed files that will produce identically the same output file. (The exception is "bijective compression"). When attempting to gain "maximum compression", we select the way(s) that are the shortest -- but even then there are usually several ways that will give the same compressed filesize. To be idempotent, the implementation of the compressor must always pick the same way.

some information theory terminology[edit]

(FIXME: replace "idempotent" above with "non-drifting lossy", as defined below? Or is there an even better term?)

Some information theory people[22] classify transformations into 4 categories of loss:

  • bijective (ED = 1 and DE = 1)
  • idempotent (lossless but not bijective: ED = 1 but DE !=1). Nearly all compression algorithms designed for text or executable code are in this category.
  • non-drifting lossy (EDE = E). Most lossy compression algorithms designed for audio and video and still images are in this category. Quantization is a non-drifting lossy transform. Starting from an initial image I0, the first encode/decode cycle gives you a slightly different image I1 (hopefully different only in ways normal humans are unlikely to notice). Starting from I1 and encode/decode to produce I2, then encode/decode I2 to produce I3 doesn't give any more loss: I1 == I2 == I3 == I4 ...
  • The general case: degrading. Every encode/decode cycle generally produces a slightly different image (such as a series of FAX transmissions or a multi-generational copy using physical paper copy machines).


Some transforms are known not to completely remove all redundancy. In particular, researchers have designed special LZ77 encoders and LZW encoders that encode extra "redundant bits" in the choice of which copy was selected, in a way that is backwards-compatible: Standard LZ77 or LZW decoders can extract the original file. Special LZ77 or LZW decoders can extract the original file plus these extra "redundant bits". These extra bits can be used for Reed-Solomon or other forms of error correction, or to store metadata about the original file such as a public signature or message authentication code.[23][24]

References[edit]

  1. [1] "LZRW4: Ziv and Lempel meet Markov" by Ross Williams. 23-Jul-1991.
  2. a b c Henrique S. Malvar, Gary J. Sullivan, and Sridhar Srinivasan. "Lifting-based reversible color transformations for image compression". "the Karhunen-Loève transform... provides maximum decorrelation."
  3. Wikipedia: LZX (algorithm)
  4. "Microsoft LZX Data Compression Format" [2][3]
  5. "lzx.c - LZX decompression source code"
  6. "lzx.c - LZX decompression source code" (has many comments on the difference between the documentation and the implementation, clarifying several details the "official" documentation neglects to specify)
  7. "LZX DELTA Compression and Decompression"
  8. Google Code: lzham by Rich Geldreich.
  9. "Hacking Data Compression" by Andy McFadden
  10. "Can I compress an encrypted file?"
  11. "Compress and then encrypt, or vice-versa?"
  12. "Composing Compression and Encryption"
  13. "Compress, then encrypt tapes"
  14. "Is it better to encrypt a message and then compress it or the other way around? Which provides more security?"
  15. "Compressing and Encrypting files on Windows"
  16. "Encryption and Compression"
  17. "Do encrypted compression containers like zip and 7z compress or encrypt first?"
  18. "When compressing and encrypting, should I compress first, or encrypt first?"
  19. Dahua Xie and C.-C. Jay Kuo ENHANCED MULTIPLE HUFFMAN TABLE (MHT) ENCRYPTION SCHEME USING KEY HOPPING http://viola.usc.edu/Publication/PDF/ISCAS/2004_ISCAS_Xie.pdf 2004
  20. Qing Zhou, Kwok-wo Wong, Xiaofeng Liao, Yue Hu. "On the security of multiple Huffman table based encryption" 2011.
  21. "gzip -9n sometimes generates a different output file on different architectures"
  22. https://groups.google.com/forum/?fromgroups=#!topic/comp.compression/r96nB43uuLs
  23. Stefano Lonardi and Wojciech Szpankowski. "Joint Source-Channel LZ'77 coding". Data Compression Conference, 273-282, Snowbird, 2003.
  24. Yonghui Wu, Stefano Lonardi, Wojciech Szpankowski. "Error-Resilient LZW Data Compression". Data Compression Conference, 193-202, Snowbird, 2006.