Data Compression/References
From Wikibooks, open books for an open world
Benchmark files
- The Canterbury Corpus (1997) is the main benchmark for comparing compression methods. Of these 11 files, the largest is roughly 1 MByte. That web page also links to a few other test files which are useful for debugging common errors in compression algorithms.
- The Silesia Corpus (2003) contains files between 6 MB and 51 MB. The 12 files include two medical images, the SAO star catalog, a few executable files, etc.
- Matt Mahoney has published a large benchmark text file used in the "Large Text Compression Benchmark".
- The Large Text Compression Benchmark uses the file "enwik9" (1,000,000,000 bytes), the first 10^9 bytes of the English Wikipedia dump of Mar. 3, 2006.
- The Hutter Prize involves compressing the file "enwik8" (100,000,000 bytes), the first 10^8 bytes of enwik9, ultimately from Wikipedia.
- A large-file text compression corpus, maintained by Andrew Tridgell, is oriented towards relatively large, highly redundant files. It contains 5 files between 27 MB and 85 MB (uncompressed), mostly English text and Lisp, assembly, and C source code. It helps test (implementations of) compression algorithms designed to detect and compress very long-range redundancies, such as lzip[1] and rzip[2].
- "The Calgary corpus"[3][4] is a series of 14 files, most of them ASCII text, and was the de facto standard for comparing lossless compressors before the Canterbury Corpus.
- "The Calgary corpus Compression & SHA-1 crack Challenge" (formerly known as "The Calgary Corpus Compression challenge") by Leonid A. Broukhis has paid out several prizes of around $100 each for "significantly better" compression of all 14 files in the Calgary Corpus.
- "The Data Compression News Blog" edited by Sachin Garg. Sachin Garg has also published benchmark images and image compression benchmark results.
- Lasse Collin uses open-source software in his executable compression benchmark.
- Elephants Dream: Original lossless video and audio available: Matt suggests "It would be great to see Elephants Dream become the new standard source footage for video and audio compression testing!"
- Alex Ratushnyak maintains the Lossless Photo Compression Benchmark.
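Benchmarks built on corpora like these usually report, for each compressor, the compressed size together with the compression and decompression times. The following is a minimal sketch of such a measurement, not the method used by any of the benchmarks listed above. It assumes only the codecs that ship with the Python standard library (zlib, bz2, lzma); the `benchmark` function and the sample input are hypothetical stand-ins for a real harness and a real corpus file.

```python
import bz2
import lzma
import time
import zlib

# Codecs from the Python standard library; each entry is (compress, decompress).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(data: bytes) -> dict:
    """Return {codec name: (compressed size, compress seconds, decompress seconds)}."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        mid = time.perf_counter()
        restored = decompress(packed)
        end = time.perf_counter()
        # A result only counts if the round trip is lossless.
        if restored != data:
            raise ValueError(f"{name} failed to round-trip the input")
        results[name] = (len(packed), mid - start, end - mid)
    return results


if __name__ == "__main__":
    # Stand-in sample; replace with the bytes of a real corpus file.
    sample = b"the quick brown fox jumps over the lazy dog\n" * 2000
    for name, (size, c_time, d_time) in benchmark(sample).items():
        print(f"{name}: {len(sample)} -> {size} bytes, "
              f"compress {c_time:.4f}s, decompress {d_time:.4f}s")
```

Timings from a single run like this are noisy; a more careful harness would repeat each measurement several times and use files much larger than the toy sample.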
Open-source example code
Most data compression creators release open-source implementations of their algorithms. This makes it much easier to evolve the algorithms by combining clever ideas from many different sources.
- Compression Interface Standard by Ross Williams. Is there a better interface standard for compression algorithms?
- jvm-compressor-benchmark is a benchmark suite for comparing the time and space performance of open-source compression codecs on the JVM platform. It currently includes the Canterbury corpus and a few other benchmark file sets, and compares LZF, Snappy, LZO-java, gzip, bzip2, and a few other codecs. (Is the API used by the jvm-compressor-benchmark to talk to these codecs a good interface standard for compression algorithms?)
- inikep has put together a benchmark for comparing the time and space performance of open-source compression codecs that can be compiled with C++. It currently includes 100 MB of benchmark files (bmp, dct_coeffs, english_dic, ENWIK, exe, etc.), and compares snappy, lzrw1-a, fastlz, tornado, lzo, and a few other codecs.
- "Compression the easy way": a simple C/C++ implementation of LZW (variable bit length) in one .h file and one .c file, with no dependencies.
- BALZ by Ilia Muraviev - the first open-source implementation of ROLZ compression[1]
- QUAD - an open-source ROLZ-based compressor from Ilia Muraviev
- LZ4 "the world's fastest compression library" (BSD license)
- QuickLZ "the world's fastest compression library" (GPL and commercial licenses)
- FastLZ "free, open-source, portable real-time compression library" (MIT license)
- The .xz file format (one of the compressed file formats supported by 7-Zip and LZMA SDK) supports "Multiple filters (algorithms): ... Developers can use a developer-specific filter ID space for experimental filters." and "Filter chaining: Up to four filters can be chained, which is very similar to piping on the UN*X command line."
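The filter chaining described for the .xz format can be tried from Python, whose standard `lzma` module accepts a list of filter specifications. The sketch below chains a delta filter into LZMA2. The `pack` and `unpack` helper names, the 4-byte delta distance, and the sample data are assumptions chosen for illustration; the developer-specific filter ID space mentioned above is not demonstrated here.

```python
import lzma

# A chain of two filters: a delta filter feeding LZMA2.
# The 4-byte distance is an assumption that suits the 32-bit sample data below.
FILTERS = [
    {"id": lzma.FILTER_DELTA, "dist": 4},
    {"id": lzma.FILTER_LZMA2, "preset": 6},
]


def pack(data: bytes) -> bytes:
    """Compress into an .xz container using the filter chain above."""
    return lzma.compress(data, format=lzma.FORMAT_XZ, filters=FILTERS)


def unpack(blob: bytes) -> bytes:
    """Decompress; the .xz container records the chain, so no filter list is passed."""
    return lzma.decompress(blob, format=lzma.FORMAT_XZ)


if __name__ == "__main__":
    # Hypothetical sample: a run of increasing 32-bit little-endian integers.
    data = b"".join(i.to_bytes(4, "little") for i in range(10000))
    chained = pack(data)
    plain = lzma.compress(data, format=lzma.FORMAT_XZ)
    assert unpack(chained) == data
    print(f"original {len(data)} bytes, LZMA2 only {len(plain)} bytes, "
          f"delta+LZMA2 {len(chained)} bytes")
```

Whether a pre-filter helps depends entirely on the data: a delta filter tends to pay off on regularly spaced numeric samples and can make ordinary text compress worse.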
Further reading
- Fedora And Red Hat System Administration/Archives And Compression has some practical information on how to use compression
- JPEG - Idea and Practice explains in more detail how compression techniques are applied to JPEG image compression.
- Data Coding Theory/Data Compression
- Kdenlive/Video codecs briefly mentions the most popular video codecs
- Movie Making Manual/Post-production/Video codecs discusses the most popular video codecs used in making movies and video, in a little more detail.
- Movie Making Manual/Cinematography/Cameras and Formats/Table of Formats lists the most popular compressed and uncompressed video formats
- Probability
- The hydrogenaudio wiki has a comparison of popular lossless audio compression codecs.
- a data compression wiki
Non-wiki resources
- "comp.compression" newsgroup
- comp.compression Frequently Asked Questions by Jean-loup Gailly, 1999. (Is there a more recent FAQ?)
- http://data-compression.info/ has some information on several compression algorithms, several "data compression corpora" (benchmark files for data compression), and the results from running a variety of data compression programs on those benchmarks (measuring compressed size, compression time, and decompression time).
- "Data Compression Explained" by Matt Mahoney. It discusses many things neglected in most other discussions of data compression, such as the practical features of a typical archive format (the stuff in the thin wrapper around your precious compressed data) and the close relation between data compression and artificial intelligence.
- Mark Nelson writes about data compression
- The Encode's Forum claims to be "probably the biggest forum about the data compression software and algorithms on the web".
- "The LZW controversy" by Stuart Caie. (LZ78, LZW, GIF, PNG, Unisys, patents, etc.)