# Regular Languages

Regular languages are the most restricted, and the simplest, languages in the Chomsky Hierarchy. Languages in this class are usually described like mathematical sets, with a description in curly brackets. Any word that matches that description is part of the language, and any word that doesn't match the description isn't part of the language.

Informally, the class of regular languages is those languages which can be described by characters from its alphabet with concatenation, parentheses, the ∪ (or) operator, and the * operator. An example would be ${\displaystyle \{(b\cup (ab))*\}}$, the language containing any number of as and bs, where every a is followed by a b.

It should be noted that any finite language, which is a language comprised of some finite sequence of words ${\displaystyle w_{1},w_{2},\ldots ,w_{n}}$, can be written as ${\displaystyle \{w_{1}\cup w_{2}\cup \ldots \cup w_{n}\}}$. As a result, every finite language is regular. The converse of this, however, is not true. The example language given earlier, ${\displaystyle \{(b\cup (ab))*\}}$, contains an infinite number of words, and is regular.

An alternative, though equivalent, definition of the regular languages is those languages which can be generated by a regular grammar. A regular grammar is a set of rules which, starting from an initial symbol, replace "non-terminal symbols" (symbols which are not in the alphabet of the language) with "terminal symbols" (symbols which are in the alphabet of the language) through replacement rules that replace each non-terminal symbol with either a terminal symbol, a terminal symbol followed by a non-terminal symbol, or ε (the empty string). As an example, the language ${\displaystyle \{(b\cup (ab))*\}}$ above can be described by the following grammar (let S be the starting symbol):

• S -> ε
• S -> aB
• S -> b
• S -> bA
• A -> aB
• A -> b
• A -> bA
• B -> bA
• B -> b

Note that by making different choices about which rules to apply when, this grammar can produce, from a starting string of "S", the empty string or a string of as and bs with every a followed by a b. This is exactly the same as the words matching the set notation above, making the languages produced by the two methods identical.

## Finite Automata

Finite automata are machines that consist of a finite number of states, and transition between those states based only on the current state and a character input. When the input has all been consumed, the machine either "accepts" or "rejects" the string of inputs, based on the final state of the machine. These machines happen to correspond exactly with regular languages, with the set of strings accepted by any finite automaton being a regular language, and with every regular language having a machine that accepts all and only strings from that language. For this reason we say that the class of regular languages is equivalent to the class of languages recognized by finite automata.

### DFAs & NFAs

There are two types of finite automata, deterministic and non-deterministic. A deterministic finite automaton, or DFA, makes one transition per character in the input, and contains a transition from each state for each character in the alphabet.

A non-deterministic finite automaton, or NFA, can consume either one or no characters per transition, does not need a transition for each character from each state, and can have more than one possible transition for a specific input and a specific state. These properties allow the NFA to contain computation branches. If any computation branch generated by a string accepts, the machine accepts that string, and only rejects if no branch accepts. Since DFAs are more restrictive than NFAs, every DFA is also an NFA, and any NFA can be converted into a DFA, so the set of languages recognized by all NFAs and DFAs is the same, and the two machines are equivalent.

For either type, the automaton can be specified completely by its set of states, its alphabet, the set of transitions, the start state, and the accepting states.

## Limitations

Not all languages are regular. Take the language ${\displaystyle \{a^{n}b^{n}|n\geq 0\}}$, whose words contain some number of as followed by an equal number of bs. The only memory that a finite automaton has is its current state, so it can count a number of as no larger than the number of states it possesses. Since the language contains all words of this type, the words can have an arbitrarily large number of as, so for any machine, there must be some strings whose number of as it cannot store. If it cannot store the number of as, then it cannot compare the number of as and bs, and so cannot verify that any string is a member of the language. As a result, there is no finite automaton that recognizes this language, and so it cannot be a regular language.

## Examples

The code below is a sample DFA emulator in Perl. Given the description of a machine and an input string, it simulates the machine processing the input string, and shows whether the machine accepts or not.

The syntax is: progname.pl DFAFile inputFile, where DFAFile is a text file containing the DFA instructions, and inputFile is a text file containing the input string. Some sample inputs, including a set of DFA instructions, can be found under sample DFA inputs

```#!usr/bin/perl
use Text::ParseWords;
use strict;
use warnings;

our (%states, %accepts, %alphabet, @rules, @string);

# Grabs the filenames for the machine and the word to be run on it.
my \$dfaFile = \$ARGV[0];
my \$input = \$ARGV[1];

# Uses subroutines to parse and verify the data in the input files.
# The data is stored in the global hashes and arrays, with the exception of the start state, which is provided here.
# The newstate is a temporary storage point for when a new state is calculated.
my \$newstate;

# For each symbol in the input word, the next state is calculated.
for (0..@string-1)
{
# The input word is printed, with the next symbol highlighted.
print "@string[0..\$_-1] <".\$string[\$_]."> @string[\$_+1..@string-1]\n";
# A new state is calculated.
\$newstate = \$rules[\$states{\$state}][\$alphabet{\$string[\$_]}];
# The state transition is printed.
print "State: ".\$state." -> ".\$newstate."\n\n";
# The state changes to the new state.
\$state = \$newstate;
}
# When the input is exhausted, the machine is in its final state.
print "Final state is ".\$state."\n";
# If that state is in the accept states list, the machine accepts the input.
if (defined(\$accepts{\$state})) { print "The machine accepts the string.\n"; }
# If not, the machine rejects.
else { print "The machine does not accept the string.\n" }

###################################################

# This subroutine reads the machine data from the specified file into variables (mostly hashes).
{
open(INFILE, shift) or die "Can't open ".\$dfaFile.": \$!";

# This block reads the list of states from the machine file.
<INFILE>;
my \$line = <INFILE>;
chomp(\$line);
my @words = &parse_line('\s+', 0, \$line);
my \$counter = 0;
for (@words)
{
# assigns each state name a number by creating a hash,
\$states{\$_} = \$counter;
\$counter++;
}
# and records the number of states.
my \$stateCount = \$counter;

# This block reads the start state from the machine file.
<INFILE>;
my \$startState = <INFILE>;
# takes the whole line as the start state,
chomp(\$startState);
# and makes sure that the start state is defined in the list of states.
defined(\$states{\$startState}) or die "The start state \$startState isn't a state!";

# This block reads the list of accepting states from the machine file.
<INFILE>;
\$line = <INFILE>;
chomp(\$line);
# breaks up the line into state names,
@words = &parse_line('\s+', 0, \$line);
for (@words)
{
# checks to make sure that the accept states are defined states,
defined(\$states{\$_}) or die "The start state \$_ isn't a state!";
# and defines those names in a new hash.  The use of a hash makes it easier to determine later if a specific state name accepts or not.
\$accepts{\$_} = 1;
}

# This block reads the list of symbols in the alphabet from the machine file.
<INFILE>;
\$line = <INFILE>;
chomp(\$line);
# breaks up the line into alphabet symbols (note that the symbols can be of arbitrary length),
@words = &parse_line('\s+', 0, \$line);
\$counter = 0;
for (@words)
{
# assigns each symbol a number, and creates a hash to reference them,
\$alphabet{\$_} = \$counter;
\$counter++;
}
# and then records the number of symbols in the alphabet.
my \$alphabetCount = \$counter;

# This block reads the state transition rules from the machine file.
<INFILE>;
while(<INFILE>)
{
# breaks each rule into start state, input symbol, and end state,
@words = &parse_line('\s+', 0, \$_);
# checks that all three pieces are defined in the state and alphabet hashes,
defined(\$alphabet{\$words[1]}) or die "\$words[1] isn't defined in the alphabet!";
defined(\$states{\$words[0]}) or die "\$words[0] isn't a defined state!";
defined(\$states{\$words[2]}) or die "\$words[2] isn't a defined state!";
# then uses the numbers assigned to the start state and the input symbol as indexes in a 2-dimensional array whose value is the name of the end state.
\$rules[\$states{\$words[0]}][\$alphabet{\$words[1]}] = \$words[2];
}

# This last block makes sure the set of rules is comprehensive.
for my \$i (0..\$stateCount-1)
{
defined(\$rules[\$i]) or die "Rules are incomplete.";
for my \$j (0..\$alphabetCount-1)
{
defined(\$rules[\$i][\$j]) or die "Rules are incomplete.";
}
}

# Reading complete, the subroutine closes the file and returns the name of the start state.
close INFILE;
return \$startState;
}