# Formal Languages

Central to ideas of complexity in computer science is the concept of formal languages. A formal language is a collection of words, which are strings of finite length drawn from a finite alphabet of symbols.

While each word in a language must be finite, a language may contain words of arbitrary length, and thus may contain infinitely many words. One example is the language of all strings over the alphabet {a,b} that contain an equal number of a's and b's. Among its words are all strings of the form $a^nb^n$, where $a^n$ denotes the symbol a repeated n times. Since n can be any positive integer, this language must contain an infinite number of words.
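To make this concrete, here is a small sketch in Python (the function name is ours, chosen for illustration) that builds the word $a^nb^n$ for any given n; since each n yields a distinct word, the language has at least one word for every positive integer:

```python
def word(n: int) -> str:
    """Build the word a^n b^n: the symbol a repeated n times,
    followed by the symbol b repeated n times."""
    return "a" * n + "b" * n
```

For example, `word(3)` is `"aaabbb"`, a string with three a's and three b's, and hence a member of the language.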

Many formal languages have an associated generative grammar: a set of replacement rules built from three kinds of symbols: ε, which denotes the empty string (a string of length 0); non-terminal symbols, which are not in the alphabet of the language; and terminal symbols, which are in the alphabet of the language. By convention, non-terminal symbols are usually capital letters, while terminal symbols (and thus the alphabet of the associated formal language) are usually lower-case letters. Each grammar also designates one non-terminal, called the start symbol, as the string from which every derivation begins.

By choosing which replacement rule to apply at each step, the start symbol can be converted into any word in the language associated with that grammar, and into no other words, which is why the grammar is said to generate the language. The restrictions placed on the form of the replacement rules depend on the language class involved, and each set of restrictions defines a class of generative grammars. Not every class of formal languages has a class of generative grammars associated with it.
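As a sketch of this derivation process (the rule table and helper below are our own illustration, not a standard API), consider the grammar with start symbol S and rules S → aSb and S → ε, which generates the language of strings $a^nb^n$. The function repeatedly replaces the leftmost non-terminal in every sentential form, collecting the fully terminal strings it reaches:

```python
# Grammar for { a^n b^n : n >= 0 }: start symbol S, rules S -> aSb | ε.
# The empty string "" stands in for ε.
RULES = {"S": ["aSb", ""]}

def generate(max_steps: int) -> set:
    """Enumerate every word derivable from the start symbol S
    in at most max_steps rule applications. Non-terminals are
    upper-case letters, terminals lower-case, as per convention."""
    words, frontier = set(), {"S"}
    for _ in range(max_steps):
        next_frontier = set()
        for form in frontier:
            # Find the leftmost non-terminal (upper-case) symbol.
            i = next((k for k, c in enumerate(form) if c.isupper()), None)
            if i is None:
                # No non-terminals remain: this is a finished word.
                words.add(form)
                continue
            # Apply every rule for that non-terminal.
            for replacement in RULES[form[i]]:
                next_frontier.add(form[:i] + replacement + form[i + 1:])
        frontier = next_frontier
    # Keep any finished words left on the frontier.
    words.update(f for f in frontier if not any(c.isupper() for c in f))
    return words
```

Running `generate(3)` yields the set {ε, ab, aabb}: each extra rule application can grow the derived word, which is how a finite grammar generates an infinite language.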

The other object often associated with a language class is a recognizing machine. How these machines operate and what their restrictions are depends on the language class they correspond to, but a machine is said to recognize a language if, when given any word in that language as input, the machine always "accepts" the word, and never "accepts" a word that is not in the language. Not all formal language classes have a type of recognizing machine associated with them, but most languages studied by computer scientists do.
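As an informal sketch of this accept/reject behaviour (our own illustration, not one of the standard machine models), here is a one-counter recognizer for the earlier language of strings with equally many a's and b's: it counts +1 for each a and −1 for each b, and accepts exactly when the count ends at zero:

```python
def accepts(word: str) -> bool:
    """Sketch of a recognizing machine for the language of strings
    over {a, b} with an equal number of a's and b's: accept iff the
    running count of (a's seen) - (b's seen) is zero at the end."""
    count = 0
    for symbol in word:
        if symbol == "a":
            count += 1
        elif symbol == "b":
            count -= 1
        else:
            return False  # symbol outside the alphabet {a, b}: reject
    return count == 0
```

This machine accepts every word of the language, including words like "abab" that are not of the form $a^nb^n$, and rejects everything else, which is exactly the recognition condition described above.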

There are a number of different language classes, each defined by the complexity of the languages it contains. The best-known language classes are the four classes of the Chomsky hierarchy, which were defined by Noam Chomsky in his 1959 paper "On Certain Formal Properties of Grammars". Information on other language classes can be found in the chapter Other Language Classes.

In summary:

• An alphabet is a finite collection of symbols.
• A word is a string of finite length whose symbols are drawn from an alphabet.
• A formal language is a collection of words.
• A generative grammar is a set of replacement rules that allow you to create any word in the language (and no other words) from a start symbol.
• A recognizing machine is a machine that "accepts" any word in the language (and no other words).
• A language class is a set of languages of similar complexity, so the languages in a class have similar types of generative grammars and similar types of recognizing machines.