Cryptography/Frequency analysis


In the field of cryptanalysis, frequency analysis is a method for "breaking" simple substitution ciphers: not just the Caesar cipher but all monoalphabetic substitution ciphers. These ciphers replace each letter of the plaintext with another to produce the ciphertext, and in the simplest and most easily broken of them any particular plaintext letter always turns into the same ciphertext letter. For instance, all E's will turn into X's.
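
As a minimal sketch of such a cipher in Python, the substitution can be written as a fixed letter-for-letter table. The key alphabet KEY and the function names below are hypothetical, chosen only so that E turns into X as in the example; any permutation of the 26 letters would serve.

```python
import string

# Hypothetical key, for illustration only: the plaintext alphabet A..Z is
# replaced position by position with these letters, so E (5th) becomes X.
KEY = "QWERXYUIOPASDFGHJKLZTCVBNM"

def make_cipher_table(key_alphabet):
    """Build a translation table mapping A-Z onto key_alphabet."""
    if sorted(key_alphabet) != list(string.ascii_uppercase):
        raise ValueError("key must be a permutation of the 26 uppercase letters")
    return str.maketrans(string.ascii_uppercase, key_alphabet)

def encipher(plaintext, key_alphabet):
    """Monoalphabetic substitution: each plaintext letter always becomes
    the same ciphertext letter; spaces and punctuation pass through."""
    return plaintext.upper().translate(make_cipher_table(key_alphabet))

print(encipher("MEET ME", KEY))  # DXXZ DX
```

Every E in the plaintext has become an X, which is exactly the regularity that frequency analysis exploits.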

Graph of the relative frequency of letters in the English language

Frequency analysis is based on the fact that certain letters, and combinations of letters, appear with characteristic frequencies in essentially all texts in a particular language. For instance, in the English language E is very common, while X is not. Likewise, ST, NG, TH, and QU are common combinations, while XT, NZ, and QJ are exceedingly uncommon, or "impossible". Given our example of all E's turning into X's, a ciphertext message containing lots of X's already suggests one pair in the substitution mapping.
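
The counting itself is simple to sketch in Python. The helper names below are hypothetical, and the sketch assumes the text uses only the 26 unaccented Latin letters; anything else is ignored. Letter pairs are counted within words only, so that the last letter of one word and the first of the next are not treated as a combination.

```python
from collections import Counter

def letter_frequencies(text):
    """Return the relative frequency of each letter A-Z that occurs in text."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return {}
    counts = Counter(letters)
    return {letter: n / len(letters) for letter, n in counts.items()}

def digram_counts(text):
    """Count adjacent letter pairs (digrams) inside each word of text."""
    counts = Counter()
    for word in text.upper().split():
        letters = [c for c in word if "A" <= c <= "Z"]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    return counts

sample = "the other path"
print(letter_frequencies(sample)["T"])   # 0.25 (3 of the 12 letters)
print(digram_counts(sample)["TH"])       # 3
```

Run over a long English text, counts of this kind reproduce the pattern described above: E near the top of the letter table, and pairs such as TH far ahead of pairs such as QJ.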

In practice, frequency analysis consists of first counting the frequency of each ciphertext letter and then assigning "guessed" plaintext letters to them. Many letters occur with roughly the same frequency, so a ciphertext containing X's may indeed map X onto R, but could also map X onto G or M. In every language written with an alphabet, however, a few letters occur much more often than the rest; if there are more X's in the ciphertext than anything else, it's a good guess for English plaintext that X stands for E. But T and A are also very common in English text, so X might be either of them. It is very unlikely to be Z or Q, which are uncommon in English. Thus the cryptanalyst may need to try several combinations of mappings between ciphertext and plaintext letters. Once the common letters are 'solved', the technique typically moves on to pairs and other patterns. These often have the advantage of linking in less commonly used letters, filling the gaps in the candidate mapping table being built. For instance, Q is rare in English, but it is nearly always followed by U.
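
The first round of guessing can be sketched in Python as follows. This is only an illustrative starting point, not a complete solver: the function names are hypothetical, and the ranking string ENGLISH_BY_FREQUENCY is an assumed approximate ordering of English letters (published tables differ slightly, especially in the middle of the ranking).

```python
from collections import Counter

# Assumed approximate ordering of English letters, most frequent first.
ENGLISH_BY_FREQUENCY = "ETAOINSHRDLCUMWFGYPBVKJXQZ"

def first_guess_mapping(ciphertext):
    """Pair the most frequent ciphertext letter with E, the next with T,
    and so on. Ties in count are broken alphabetically."""
    counts = Counter(c for c in ciphertext.upper() if "A" <= c <= "Z")
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return dict(zip(ranked, ENGLISH_BY_FREQUENCY))

def apply_mapping(ciphertext, mapping, unknown="_"):
    """Substitute guessed letters; letters with no guess yet show as
    `unknown`, and non-letters are copied unchanged."""
    return "".join(
        mapping.get(c, unknown) if "A" <= c <= "Z" else c
        for c in ciphertext.upper()
    )

guess = first_guess_mapping("XXXX ZZZ QQ D")
print(guess)                                   # {'X': 'E', 'Z': 'T', 'Q': 'A', 'D': 'O'}
print(apply_mapping("DXXZ DX", {"X": "E"}))    # _EE_ _E
```

On real ciphertexts only the top few guesses are usually right, since many letters have nearly equal frequencies. The analyst then edits the mapping by hand, or tries alternatives, using pairs and word patterns in the partially decoded text.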

Frequency analysis is extremely effective against the simpler substitution ciphers and will break astonishingly short ciphertexts with ease. This fact was the basis of Edgar Allan Poe's claim, in his famous newspaper cryptanalysis demonstrations in the mid-1800s, that no cipher devised by man could defeat him. Poe was overconfident in his proclamation, however, for polyalphabetic substitution ciphers (invented by Alberti around 1467) defy simple frequency analysis attacks. The electro-mechanical cipher machines of the first half of the 20th century (e.g., the Hebern machine, the Enigma, the Japanese Purple machine, the SIGABA, the Typex, ...) were, if properly used, essentially immune to straightforward frequency analysis attack, being fundamentally polyalphabetic ciphers. They were broken using other attacks.

Frequency analysis was first discovered in the Arab world, and is known to have been in use by about 1000 CE. It is thought that close textual study of the Koran first brought to light that Arabic has a characteristic letter frequency which can be used in cryptanalysis. Its use spread, and by the Renaissance it was so widely used by European states that cryptographers invented several schemes to defeat it. These included the use of several alternatives for the most common letters in otherwise monoalphabetic substitution ciphers (i.e., for English, both X and Y in the ciphertext might mean plaintext E) and the use of several alphabets, chosen in assorted, more or less devious, ways (Leon Alberti seems to have been the first to propose this). They culminated in such schemes as using only pairs or triplets of plaintext letters as the 'mapping index' to ciphertext letters (e.g., the Playfair cipher invented by Charles Wheatstone in the mid-1800s). The disadvantage of all these attempts to defeat frequency counting attacks is that they complicate both enciphering and deciphering, leading to mistakes. Famously, a British Foreign Secretary is said to have rejected the Playfair cipher because, even if schoolboys could learn it as Wheatstone and Playfair had shown, 'our attachés could never learn it!'.

Frequency analysis requires a basic understanding of the language of the plaintext, as well as tenacity, some problem-solving skills, and considerable tolerance for extensive letter bookkeeping. Neat handwriting also helps. During WWII, both the British and Americans recruited codebreakers by placing crossword puzzles in major newspapers and running contests for who could solve them the fastest. Several of the ciphers used by the Axis were breakable using frequency analysis (e.g., the 'consular' ciphers used by the Japanese). Mechanical methods of letter counting and statistical analysis (generally IBM card machinery) were first used in WWII. Today, the hard work of letter counting and analysis has been replaced by the tireless speed of the computer, which can carry out this analysis in seconds. No mere substitution cipher can be thought credibly safe in modern times.

However, modern ciphers are not simple substitution ciphers in any guise. They are much more complex than WWII ciphers, and are immune to simple frequency analysis, and even to advanced statistical methods. The best of them must be attacked using fundamental mathematical methods not based on the peculiarities of the underlying plaintext language. See Cryptography/Differential cryptanalysis or Cryptography/Linear cryptanalysis as examples of such techniques.
