Compiler Construction/Lexical analysis

Lexical Analysis

Lexical analysis is the process of analyzing a stream of individual characters (normally arranged as lines) and converting it into a sequence of lexical tokens (tokenization), for instance the "words" and punctuation symbols that make up source code, to feed into the parser. This is roughly the equivalent of splitting ordinary text written in a natural language (e.g. English) into a sequence of words and punctuation symbols. Lexical analysis is often done with tools such as lex, flex and jflex.

Strictly speaking, tokenization may be handled by the parser. The reason why we tend to bother with tokenising in practice is that it makes the parser simpler, and decouples it from the character encoding used for the source code.

For example given the input string:

     integer aardvark := 2, b;

A tokeniser might output the following tokens:

keyword integer
word aardvark
assignment operator
integer 2
comma
word b
semi_colon
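
For illustration only (the type and constant names below are our own choice, not a fixed API), a token stream like this is commonly represented in code as a small tagged value. A minimal Java sketch:

// A minimal token representation; the kinds and fields are illustrative only.
enum TokenKind { KEYWORD, WORD, ASSIGN_OP, INTEGER, COMMA, SEMICOLON }

record Token(TokenKind kind, String lexeme) { }   // requires Java 16+

class TokenDemo {
  public static void main(String[] args) {
    // The token stream for "integer aardvark := 2, b;" from above:
    Token[] tokens = {
      new Token(TokenKind.KEYWORD, "integer"),
      new Token(TokenKind.WORD, "aardvark"),
      new Token(TokenKind.ASSIGN_OP, ":="),
      new Token(TokenKind.INTEGER, "2"),
      new Token(TokenKind.COMMA, ","),
      new Token(TokenKind.WORD, "b"),
      new Token(TokenKind.SEMICOLON, ";")
    };
    for (Token t : tokens)
      System.out.println(t.kind() + " '" + t.lexeme() + "'");
  }
}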

What is a token

In computing, a token is a categorized block of text; the underlying text it categorizes, an indivisible sequence of characters, is known as a lexeme. A lexical analyzer initially reads in lexemes and categorizes them according to function, giving them meaning. This assignment of meaning is known as tokenization. A token can look like anything: English, gibberish symbols, anything; it just needs to be a useful part of the structured text.

Consider the following table:

lexeme    token
sum       IDENT
=         ASSIGN_OP
3         NUMBER
+         ADD_OP
2         NUMBER
;         SEMICOLON

Tokens are frequently defined by regular expressions, which are understood by a lexical analyzer such as lex. The lexical analyzer reads in a stream of lexemes and categorizes them into tokens. This is called "tokenizing." If the lexer finds an invalid token, it will report an error.
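
As a concrete illustration of tokens defined by regular expressions (a minimal sketch of the idea, not lex itself; the class name and patterns are our own), the tokens from the table above could be recognized in Java with java.util.regex:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizer {
  // One regular expression per token kind, using the names from the table above.
  private static final String[] NAMES = { "IDENT", "NUMBER", "ASSIGN_OP", "ADD_OP", "SEMICOLON" };
  private static final Pattern[] PATTERNS = {
    Pattern.compile("[A-Za-z_][A-Za-z0-9_]*"),  // IDENT
    Pattern.compile("[0-9]+"),                  // NUMBER
    Pattern.compile("="),                       // ASSIGN_OP
    Pattern.compile("\\+"),                     // ADD_OP
    Pattern.compile(";")                        // SEMICOLON
  };

  public static void main(String[] args) {
    String input = "sum = 3 + 2;";
    int pos = 0;
    while (pos < input.length()) {
      // Skip whitespace between lexemes.
      if (Character.isWhitespace(input.charAt(pos))) { pos++; continue; }
      boolean matched = false;
      for (int i = 0; i < PATTERNS.length; i++) {
        Matcher m = PATTERNS[i].matcher(input).region(pos, input.length());
        if (m.lookingAt()) {   // the match must start exactly at 'pos'
          System.out.println(NAMES[i] + " '" + m.group() + "'");
          pos = m.end();
          matched = true;
          break;
        }
      }
      if (!matched) {
        // No pattern matched: report an invalid token, as described above.
        System.out.println("Invalid token at position " + pos);
        pos++;
      }
    }
  }
}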

Following tokenizing is parsing. From there, the interpreted data may be loaded into data structures, for general use, interpretation, or compiling.

Consider a text describing a calculation: "46 - number_of(cows);". The lexemes here might be: "46", "-", "number_of", "(", "cows", ")", and ";". The lexical analyzer will categorize the lexeme "46" as a number, "-" as an operator character, and "number_of" as a separate identifier token. Even the lexeme ";" in some languages (such as C) has some special meaning.

Whitespace lexemes are sometimes ignored later, by the syntax analyzer. A token doesn't need to be valid in order to be recognized as a token: "cows" may be nonsense to the language, and "number_of" may be nonsense too, but they are tokens nonetheless in this example.

Finite State Automaton

We first study what is called a finite state automaton, or FSA for short. An FSA is commonly used to perform lexical analysis.

An FSA consists of a set of states, a starting state, one or more accepting states, and a transition table. The automaton reads an input symbol and changes state according to the table. If the FSA is in an accepting state after the input string has been read to its end, the string is said to be accepted or recognized. The set of strings recognized by the FSA is called the language recognized by the FSA.

Suppose a language in which each string starts with 'ab' followed by one or more 'c's. With a State class, an automaton for this language can be written like this:

State
  s0 = new State(),
  s1 = new State(),
  s2 = new State(),
  s3 = new State();

s0.setTransition('a', s1);
s1.setTransition('b', s2);
s2.setTransition('c', s3);
s3.setTransition('c', s3);

State[] r = s0.move(inputString);
if (Arrays.asList(r).contains(s3))
  System.out.println("the input string is accepted.");
else
  System.out.println("the input string is not accepted.");

Given the input string "abccc", the automaton moves through the states s0 -> s1 -> s2 -> s3 -> s3 -> s3.
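
The State class used above is not spelled out in the text; the following is one minimal way it could look in Java (an assumption on our part, providing just the methods the example uses). Note that the containment check on s3 works for this particular automaton because s3's only outgoing transition leads back to itself, so any fully consumed input whose trace reaches s3 also ends in s3.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal State class matching the interface assumed by the example above.
public class State {
  private final Map<Character, State> transitions = new HashMap<>();

  public void setTransition(char symbol, State target) {
    transitions.put(symbol, target);
  }

  // Returns the sequence of states visited while reading the whole input,
  // or an empty array if the automaton gets stuck (no transition defined).
  public State[] move(String input) {
    List<State> visited = new ArrayList<>();
    State current = this;
    visited.add(current);
    for (char c : input.toCharArray()) {
      current = current.transitions.get(c);
      if (current == null)
        return new State[0];   // stuck: the string is rejected
      visited.add(current);
    }
    return visited.toArray(new State[0]);
  }
}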

Simple Hand-Written Scanning

In this section, we'll create a simple, object-oriented scanner / lexer, written in Object Pascal, for a small language. Consider the following EBNF:

program     ::= { instruction }
instruction ::= "print" expr | "var" IDENTIFIER ":=" expr
expr        ::= simple-expr [ ("<" | "<>" ) simple-expr ]
simple-expr ::= factor { ( "+" | "-" ) factor }
factor      ::= INTEGER | IDENTIFIER

where

INTEGER    = [0-9]+
IDENTIFIER = [A-Za-z_][A-Za-z0-9_]*

From the above EBNF, the tokens we're about to recognize are: IDENTIFIER, INTEGER, the keyword "print", the keyword "var", :=, +, -, <, <>, EOF and the unknown token. The chosen tokens are intended both for brevity and for covering all types of token lexeme: exact single character (+, -, <, EOF and unknown), exact multiple characters (print, var, := and <>), infinitely many forms (IDENTIFIER, INTEGER and unknown), overlapping prefix (< and <>) and overlapping as a whole (IDENTIFIER and keywords). Identifiers and keywords here are case-insensitive. Note that some lexemes are classified into more than one type.

The Token Class

TToken = class
protected
  FLine, FCol: LongWord;
public
  constructor Create(const ALine, ACol: LongWord);
  destructor Destroy; override;
end;

The base class of a token is simply an object containing the line and column number where the token was found. From this base class, tokens with an exact lexeme (either single or multiple characters) can be implemented as direct descendants.

TEOFToken = class(TToken)
end;

TPlusToken = class(TToken)
end;

TMinusToken = class(TToken)
end;

TAssignToken = class(TToken)
end;

TLessToken = class(TToken)
end;

TNotEqualToken = class(TToken)
end;

TPrintToken = class(TToken)
end;

TVarToken = class(TToken)
end;

Next, we need to create a descendant for tokens with a variadic lexeme:

TVariadicToken = class(TToken)
protected
  FLexeme: String;
public
  constructor Create(const ALine, ACol: LongWord; const ALexeme: String);
  destructor Destroy; override;
end;

The only difference from the base token class is the lexeme field, since it possibly has infinitely many forms. From here, we create descendant classes for tokens whose lexemes have infinitely many forms:

TUnknownToken = class(TVariadicToken)
end;

TIdentifierToken = class(TVariadicToken)
end;

TIntegerToken = class(TVariadicToken)
end;

That's all for the tokens; on to the lexer.

The Lexer Class

TLexer = class
private
  FLine: LongWord;
  FCol: LongWord;
  FStream: TStream;
  FCurrentToken: TToken;
  FLastChar: Char;
  function GetChar: Char;
public
  constructor Create(AStream: TStream);
  destructor Destroy; override;
  procedure NextToken;
  property CurrentToken: TToken read FCurrentToken;
end;

A lexer keeps track of its position in the source code (line and column), the stream representing the source code, the current (i.e. last recognized) token, and the last character read. To encapsulate movement through the source code, reading a character from the stream is implemented in the GetChar method. Despite its simple-looking name, the implementation can be a bit involved, as we'll see soon. GetChar is used by the public method NextToken, whose job is to advance the lexer by one token. On to the GetChar implementation:

function TLexer.GetChar: Char;
begin
  try
    FLastChar := Chr(FStream.ReadByte);
    Inc(FCol);
    // Handles 3 types of line endings
    if FLastChar in [#13, #10] then begin
      FCol := 0;
      Inc(FLine);
      // CR met, probably CRLF
      if FLastChar = #13 then begin
        FLastChar := Chr(FStream.ReadByte);
        // Not CRLF, but CR only, move backward 1 position
        if FLastChar <> #10 then
          FStream.Seek(-1,soFromCurrent);
      end;
      // Always returns as LF for consistency
      FLastChar := #10;
    end;
  except // Exception? Returns EOF
    FLastChar := #26;
  end;
  GetChar := FLastChar;
end;

As stated earlier, GetChar's job is not as simple as its name suggests. First, it has to read one character from the input stream and increment the lexer's column position. Then it has to check whether this character is one of the possible line endings (our lexer is capable of handling CR-, LF- and CRLF-style line endings). Next is the core of our lexer, NextToken:

procedure TLexer.NextToken;
var
  StartLine,StartCol: LongWord;

  function GetNumber: TIntegerToken;
  var
    NumLex: String;
  begin
    NumLex := '';
    repeat
      NumLex := NumLex + FLastChar;
      FLastChar := GetChar;
    until not (FLastChar in ['0' .. '9']);
    Result := TIntegerToken.Create(StartLine, StartCol, NumLex);
  end;
  
  function GetIdentifierOrKeyword: TVariadicToken;
  var
    IdLex: String;
  begin
    IdLex := '';
    // Lowercase everything: this is how we handle case-insensitiveness
    FLastChar := LowerCase(FLastChar);
    repeat
      IdLex := IdLex + FLastChar;
      FLastChar := LowerCase(GetChar);
    until not (FLastChar in ['a' .. 'z','0' .. '9','_']);
    // Need to check for keywords
    case IdLex of
      'print' : Result := TPrintToken.Create(StartLine,StartCol);
      'var'   : Result := TVarToken.Create(StartLine,StartCol);
      otherwise Result := TIdentifierToken.Create(StartLine,StartCol,IdLex);
    end;
  end;

begin
  // Eat whitespaces
  while FLastChar in [#32,#9,#13,#10] do
    FLastChar := GetChar;
  // Save first token position, since GetChar would change FLine and FCol
  StartLine := FLine;
  StartCol := FCol;
  if FLastChar = #26 then
    FCurrentToken := TEOFToken.Create(StartLine, StartCol)
  else
    case LowerCase(FLastChar) of
      // Exact single character
      '+': begin
        FCurrentToken := TPlusToken.Create(StartLine, StartCol);
        FLastChar := GetChar;
      end;
      '-': begin
        FCurrentToken := TMinusToken.Create(StartLine, StartCol);
        FLastChar := GetChar;
      end;
      // Exact multiple character
      ':': begin
        FLastChar := GetChar;
        if FLastChar = '=' then begin
          FCurrentToken := TAssignToken.Create(StartLine, StartCol);
          FLastChar := GetChar;
        end
        // ':' alone has no meaning in this language, therefore it's invalid
        else
          FCurrentToken := TUnknownToken.Create(StartLine, StartCol, ':');
      end;
      // Overlapping prefix
      '<': begin
        FLastChar := GetChar;
        // '<>'
        if FLastChar = '>' then begin
          FCurrentToken := TNotEqualToken.Create(StartLine, StartCol);
          FLastChar := GetChar;
        end
        // '<'
        else
          FCurrentToken := TLessToken.Create(StartLine, StartCol);
      end;
      // Infinitely many is handled in its own function to cut down line length here
      '0' .. '9': begin
        FCurrentToken := GetNumber;
      end;
      'a' .. 'z','_': begin
        FCurrentToken := GetIdentifierOrKeyword;
      end;
      else begin
        FCurrentToken := TUnknownToken.Create(StartLine, StartCol, FLastChar);
        FLastChar := GetChar;
      end;
    end;
end;

As you can see, the core is a (possibly big) case statement. The other parts are quite self-documenting and well commented. Last but not least, the constructor:

constructor TLexer.Create(AStream: TStream);
begin
  FStream := AStream;
  FLine := 1;
  FCol := 0;
  FLastChar := GetChar;
  NextToken;
end;

It sets up the initial line and column position (guess why it's 1 for the line but 0 for the column :)), and it also reads the first token, so CurrentToken is already available right after the constructor is called; there is no need to call NextToken explicitly first.

Test Program

uses
  Classes,SysUtils,lexer,tokens;
var
  Stream: TStream;
  lex: TLexer;
begin
  Stream := THandleStream.Create(StdInputHandle);
  lex := TLexer.Create(Stream);
  while not(lex.CurrentToken is TEOFToken) do begin
    WriteLn(lex.CurrentToken.ToString);
    lex.NextToken;
  end;
  lex.Free;
  Stream.Free;
end.

As an exercise, you could try extending the lexer with floating point numbers, strings, numbers with base other than 10, scientific notation, comments, etc.

Table-Driven Hand-Written Scanning
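
As a rough idea of the technique (a minimal Java sketch of our own, not taken from this book), a table-driven scanner replaces the chain of if/case decisions with an explicit transition table indexed by the current state and the input symbol. Here is the "ab followed by one or more c" automaton from the FSA section, driven by a table:

public class TableDrivenScanner {
  // Transition table: rows are states s0..s3, columns are the symbols 'a', 'b', 'c'.
  // -1 means "no transition" (the input is rejected).
  private static final int[][] TRANSITIONS = {
    //        'a'  'b'  'c'
    /* s0 */ {  1,  -1,  -1 },
    /* s1 */ { -1,   2,  -1 },
    /* s2 */ { -1,  -1,   3 },
    /* s3 */ { -1,  -1,   3 }
  };
  private static final int START_STATE = 0;
  private static final int ACCEPT_STATE = 3;

  private static int column(char c) {
    switch (c) {
      case 'a': return 0;
      case 'b': return 1;
      case 'c': return 2;
      default:  return -1;   // symbol not in the alphabet
    }
  }

  public static boolean accepts(String input) {
    int state = START_STATE;
    for (char c : input.toCharArray()) {
      int col = column(c);
      if (col == -1) return false;
      state = TRANSITIONS[state][col];
      if (state == -1) return false;   // stuck: reject
    }
    return state == ACCEPT_STATE;
  }

  public static void main(String[] args) {
    System.out.println(accepts("abccc"));  // true
    System.out.println(accepts("abc"));    // true
    System.out.println(accepts("ab"));     // false
  }
}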

Compiling Channel Constant

Lexical Analysis Tool

Scanning via a Tool - lex/flex


To do:
Mine Wikipedia:en:Flex lexical analyser article.


Scanning via a Tool - JavaCC

JavaCC is the standard Java compiler-compiler. Unlike the other tools presented in this chapter, JavaCC is a parser and a scanner (lexer) generator in one. JavaCC takes just one input file (called the grammar file), which is then used to create both classes for lexical analysis, as well as for the parser.

In JavaCC's terminology the scanner/lexical analyser is called the token manager. In fact the generated class that contains the token manager is called ParserNameTokenManager. Of course, following the usual Java file name requirements, the class is stored in a file called ParserNameTokenManager.java. The ParserName part is taken from the input file. In addition, JavaCC creates a second class, called ParserNameConstants. That second class, as the name implies, contains definitions of constants, especially token constants. JavaCC also generates a boilerplate class called Token. That one is always the same, and contains the class used to represent tokens. One also gets a class called ParseException. This is an exception which is thrown if something went wrong.

It is possible to instruct JavaCC not to generate the ParserNameTokenManager, and instead provide your own, hand-written token manager. Usually - this holds for all the tools presented in this chapter - a hand-written scanner/lexical analyser/token manager is much more efficient. So, if you figure out that your generated compiler gets too large, give the generated scanner/lexical analyzer/token manager a good look. Rolling your own token manager is also handy if you need to parse binary data and feed it to the parsing layer.

Since, however, this section is about using JavaCC to generate a token manager, and not about writing one by hand, this is not discussed any further here.

Defining Tokens in the JavaCC Grammar File

A JavaCC grammar file usually starts with code which is relevant for the parser, and not the scanner. For simple grammar files it looks similar to:

options {
  LOOKAHEAD = 1;
}

PARSER_BEGIN(ParserName)

public class ParserName {
  // code to enter the parser goes here
}

PARSER_END(ParserName)

This is usually followed by the definitions for tokens. These definitions are the information we are interested in in this chapter. JavaCC understands four different kinds of token definitions, indicated by four different keywords:

TOKEN
Regular expressions which specify the tokens the token manager should be able to recognize.
SPECIAL_TOKEN
SPECIAL_TOKENs are similar to TOKENs, except that the parser ignores them. This is useful to e.g. specify comments, which are supposed to be understood, but have no significance to the parser.
SKIP
Input data which is supposed to be completely ignored by the token manager. This is commonly used to ignore whitespace. A SKIP token still breaks up other tokens, though: e.g. if whitespace is skipped and a token "else" is defined, then the input "el se" does not match that token.
MORE
This is used for an advanced technique, where a token is built up gradually. MORE tokens are put in a buffer until the next TOKEN or SPECIAL_TOKEN matches. Then all the data, i.e. the accumulated tokens in the buffer as well as the last TOKEN or SPECIAL_TOKEN, is returned.
One example where MORE tokens are useful are constructs in which one would like to match some start string, arbitrary data, and some end string. Comments or string literals in many programming languages match this form. E.g. to match string literals delimited by ", one would not return the first " found as a token. Instead, one would accumulate more tokens until the closing " of the string literal is found, and then return the complete literal. See Comment Example for an example where this is used to scan comments.

Each of the above mentioned keywords can be used as often as desired. This makes it possible to group the tokens, e.g. in a list for operators and another list for keywords. All sections of the same kind are merged as if just one section had been specified.

Every specification of a token consists of the token's symbolic name, and a regular expression. If the regular expression matches, the symbol is returned by the token manager.

Simple Example

Let's look at an example:

//
// Skip whitespace in input
// if the input matches space, carriage return, new line or a tab,
// it is just ignored
//
SKIP: { " " | "\r" | "\n" | "\t" }
//
// Define the tokens representing our operators
//
TOKEN: {
    < PLUS: "+" >
  | < MINUS: "-" >
  | < MUL: "*" >
  | < DIV: "/" >
}

//
// Define the tokens representing our keywords
//
TOKEN: {
    < IF: "if" >
  | < THEN: "then" >
  | < ELSE: "else" >
}

All the above token definitions use simple regular expressions, where just constants are matched. It is recommended to study the JavaCC documentation for the full set of possible regular expressions.

When the above file is run through JavaCC, a token manager is generated which understands the above declared tokens.
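
To get a feel for the generated code, a small driver like the following can exercise the token manager directly. This is only a sketch: it assumes the grammar's parser is named ParserName, that JavaCC was run with its default options (so the SimpleCharStream class is generated), and that the input consists solely of the tokens declared above.

import java.io.StringReader;

public class TokenManagerDemo {
  public static void main(String[] args) {
    // Wrap the input in the character stream class generated by JavaCC.
    SimpleCharStream stream = new SimpleCharStream(new StringReader("if + then - else * /"));
    ParserNameTokenManager tm = new ParserNameTokenManager(stream);

    // Ask the token manager for one token at a time until EOF is reached.
    Token t = tm.getNextToken();
    while (t.kind != ParserNameConstants.EOF) {
      // tokenImage maps a token kind to a printable name.
      System.out.println(ParserNameConstants.tokenImage[t.kind] + " : " + t.image);
      t = tm.getNextToken();
    }
  }
}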

Comment Example

Eliminating (ignoring) comments in a programming language is a common task for a lexical analyzer. Of course, when JavaCC is used, this task is usually given to the token manager, by specifying special tokens.

Basically, a standard idiom has evolved for ignoring comments with JavaCC. It combines tokens of kind SPECIAL_TOKEN and MORE with a lexical state. Let's assume we have a programming language where comments start with a --, and either end with another -- or at the end of the line (this is the comment schema of ASN.1 and several other languages). Then a way to construct a scanner for this would be:

//
// Start of comment. Don't return as token, instead
// shift to special lexical state.
//
SPECIAL_TOKEN: {  <"--"> : WithinComment }
//
// While inside the comment lexical state, look for the end
// of the comment (either another '--' or EOL). Don't return
// the accumulated data. Instead switch back to the normal
// lexical state.
//
<WithinComment> SPECIAL_TOKEN: { <("--" | "\n" )> : DEFAULT }

//
// While inside the comment state, accumulate all contents
//
<WithinComment> MORE: { <~[]> }



To do:
Document lexical states; Token manager declaration; Details about the regular expressions; internal (#) regular expressions.