Data Compression/Inference Compression
The idea behind inference compression is that, since the algorithm and its creators have no prior knowledge of the source data, yet compression can be optimized for different types of data, the compressor bases its choice of method on a process of type inference. This process is mostly dynamic and non-deterministic: it reads part of the source data into a buffer (a "sniff"), executes some tests to infer the data's type and characteristics, and then decides which compression algorithm to apply. This selection is often rule-based (pre-defined in the algorithm).
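The sniff-and-decide step can be sketched as follows. This is a hypothetical illustration, not any real compressor's rule base: the printable-byte heuristic, the byte-diversity threshold, and the scheme names are all assumptions chosen for clarity.

```python
def sniff(sample: bytes) -> str:
    """Infer a coarse data type from a sniffed buffer (hypothetical rules)."""
    if not sample:
        return "empty"
    # Rule 1: mostly printable ASCII suggests text, which suits a
    # dictionary-based coder.
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in sample)
    if printable / len(sample) > 0.95:
        return "text"
    # Rule 2: very high byte diversity suggests already-compressed or
    # random data, best stored without further compression.
    if len(set(sample)) > 250:
        return "incompressible"
    return "binary"

# The rule base then maps each inferred type to a compression scheme.
SCHEME = {
    "text": "dictionary coder",
    "binary": "LZ77",
    "incompressible": "store",
    "empty": "store",
}
```

A real compressor would sniff only a fixed-size prefix of the input (a few kilobytes, say) so the test stays cheap relative to the compression itself.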
This of course leaves some latitude in how the inference is performed. Some algorithms use, for instance, the file extension (if present) or even the file header alone (using file type identifiers such as magic numbers or file designator strings) to recognize well-known file formats. They then use that information to narrow the guessing process and determine the best compression scheme. An example is the LHA/LZH family of compression tools, which contain a number of compression rules, permitting dynamic optimization in response to rule-based differentiation of the different file types.
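Header-based identification amounts to matching the first bytes of a file against a table of known magic numbers. The signatures below are real and well documented, but the table is a tiny sketch; tools like the Unix file(1) command carry databases of thousands of entries.

```python
# A small magic-number table keyed on each format's leading bytes.
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "png",   # PNG image (already compressed)
    b"GIF89a": "gif",              # GIF image
    b"%PDF-": "pdf",               # PDF document
    b"PK\x03\x04": "zip",          # ZIP archive (already compressed)
}

def identify(header: bytes):
    """Match a file's first bytes against known magic numbers."""
    for magic, kind in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return kind
    return None  # unknown format: fall back to content sniffing
```

Recognizing a format this way tells the compressor more than the data's statistics alone would; for example, a ZIP or PNG match signals data that is already compressed and should simply be stored.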
Depending on constraints such as speed and memory utilization, the rule-base selection can also be dynamic: for instance, an asymmetric compression scheme can run multiple inferences, letting them compete to see which compression scheme does the best job of compressing the data.
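A minimal sketch of this competitive approach, using three real codecs from the Python standard library: compress the data with each candidate and keep the smallest result, tagged with the winner's name so decompression knows which codec to invert.

```python
import bz2
import lzma
import zlib

def best_compress(data: bytes):
    """Compress with several codecs and keep the smallest output.

    Returns (codec_name, compressed_bytes); the name must be stored
    alongside the data so the right decompressor can be chosen later.
    """
    candidates = {
        "zlib": zlib.compress(data, 9),
        "bz2":  bz2.compress(data, 9),
        "lzma": lzma.compress(data),
    }
    winner = min(candidates, key=lambda name: len(candidates[name]))
    return winner, candidates[winner]
```

The scheme is asymmetric in cost: compression pays for every candidate codec, while decompression runs only the one that won, so it suits write-once, read-many data.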
The idea is that the more closely you can infer the type of a file, the more tightly you can compress it with a compression rule tailored specifically to that type of data. By tuning to the exact nature of the data, compression can often generate files that are significantly smaller than techniques which only approximate the type of data. For instance, a compression scheme that assumes Shakespearean text -- lots of common English words -- will probably not deal well with stock market tables -- rows of data showing the ticker symbol of a company, its closing price, etc.
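One concrete way to "tune to the data" is zlib's preset-dictionary feature: priming the compressor with bytes likely to occur in the input lets even the first occurrence of a common word be encoded as a back-reference. The word list and sample text below are illustrative; a real English-tuned dictionary would be built from a large corpus.

```python
import zlib

def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress with zlib, optionally primed with a preset dictionary."""
    if zdict:
        c = zlib.compressobj(level=9, zdict=zdict)
    else:
        c = zlib.compressobj(level=9)
    return c.compress(data) + c.flush()

# A tiny stand-in for an "English text" model: words the compressor
# can back-reference before it has seen any input.
ENGLISH_DICT = b" it was the best of and to in that worst times"

text = b"It was the best of times, it was the worst of times."
plain = deflate(text)
primed = deflate(text, ENGLISH_DICT)
# primed is smaller: early words already appear in the dictionary.
```

The same dictionary would do little for a stock-market table, whose repeated substrings (ticker symbols, price formats) are entirely different; a scheme tuned for one kind of data gives up ground on another.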
We discuss Markov models in more detail in a later chapter, Markov models.
We discuss data differencing in more detail in a later chapter, data differencing.