Compiler Construction/Lexical analysis
Contents |
[edit] Lexical Analysis
Lexical analysis (alternatively known as scanning or tokenization) is the process of scanning the stream of individual characters that make up the source code, expressed as characters (arranged on lines), into a sequence of lexical tokens ("words" and punctuation symbols that make up that source code) to feed to the parser. Roughly the equivalent of splitting ordinary text written in a natural language (e.g. English) into a sequence of words and punctuation symbols. Lexical analysis is often done with tools such as lex, flex and jflex.
Strictly speaking, tokenizing is not necessary, as it can be handled by the parser (it is a form of parsing in its own right). The reason why we tend to bother with tokenising in practice is that it makes the parser simpler, and decouples it from the character encoding used for the source code.
For example given the input string:
integer aardvark := 2, b;
a tokeniser might output the following tokens:
keyword integer word aardvark assignment operator integer 2 comma word b semi_colon
[edit] What is a token
In computing, a token is a categorized block of text, usually consisting of indivisible characters known as lexemes. A lexical analyzer initially reads in lexemes and categorizes them according to function, giving them meaning. This assignment of meaning is known as tokenization. A token can look like anything: English, gibberish symbols, anything; it just needs to be a useful part of the structured text.
Consider the following table:
| lexeme | token |
|---|---|
| sum | IDENT |
| = | ASSIGN_OP |
| 3 | NUMBER |
| + | ADD_OP |
| 2 | NUMBER |
| ; | SEMICOLON |
Tokens are frequently defined by regular expressions, which are understood by a lexical analyzer such as lex. The lexical analyzer reads in a stream of lexemes and categorizes them into tokens. This is called "tokenizing." If the lexer finds an invalid token, it will report an error.
Following tokenizing is parsing. From there, the interpreted data may be loaded into data structures, for general use, interpretation, or compiling.
Consider a text describing a calculation : "46 - number_of(cows);". The lexemes here might be: "46", "-", "number_of", "(", "cows", ")", and ";". The lexical analyzer will denote lexemes 4 and 6 as 'number' and - as character, and 'number_of ' as a separate token. Even the lexeme ';' in some languages (such as C) has some special meaning.
The whitespace lexemes are sometimes ignored later by the syntax analyzer. A token doesn't need to be valid, in order to be recognized as a token. "cows" may be nonsense to the language, "number_of" may be nonsense. But they are tokens nonetheless, in this example.
[edit] Finite State Automaton
We first study what is called finite state automaton, or FSA for short. FSA is usually used to do lexical analysis.
FSA consists of states, starting state, accept state and transition table. The automaton reads an input symbol and moves the state accordingly. If FSA reaches the accept state after the input string is read until its end, the string is said to be accepted or recognized. A set of recognized strings is said to be a language recognized by the FSA.
Suppose this language in which each string starts with 'ab' and ends with one or more 'c'. With state class, this can be written like this:
State
s0 = new State (),
s1 = new State (),
s2 = new State (),
s3 = new State ();
s0.setTransition ('a', s1);
s1.setTransition ('b', s2);
s2.setTransition ('c', s3);
s3.setTransition ('c', s3);
State r[] = s0.move (inputString);
if (Arrays.asList (r).contains (s3))
System.out.println ("the input string is accepted.");
else
System.out.println ("the input string is not accepted.");
Suppose an input string "abccc". Then the automaton moves like: s0 -> s1 -> s2 -> s3 -> s3 -> s3.
[edit] Simple Hand-Written Scanning
In this section, we'll create a simple, object-oriented scanner / lexer for a simple language implemented in Object Pascal. Consider the following EBNF:
program ::= { instruction }
instruction ::= "print" expr | "var" IDENTIFIER ":=" expr
expr ::= simple-expr [ ("<" | "<>" ) simple-expr ]
simple-expr ::= factor { ( "+" | "-" ) factor }
factor ::= INTEGER | IDENTIFIER
where
INTEGER = [0-9]+ IDENTIFIER = [A-Za-z_][A-Za-z0-9_]*
From the above EBNF, The tokens we're about to recognize are: IDENTIFIER, INTEGER, keyword "print", keyword "var", :=, +, -, <, <>, EOF and unknown token. The chosen tokens are intended for both brevity and the ability to recognize all types of token's lexeme: exact single character (+, -, <, EOF and unknown), exact multiple character (print, var, := and <>), infinitely many (IDENTIFIER, INTEGER and unknown), overlapping prefix (< and <>) and overlapping as a whole (IDENTIFIER and keywords). Identifier and keywords here are case-insensitive. Note that some lexemes are classified to more than one type of lexeme.
[edit] The Token Class
TToken = class protected FLine, FCol: LongWord; public constructor Create(const ALine, ACol: LongWord); destructor Destroy; override; end;
The base class of a token is simply an object containing line and column number where it's declared. From this base class, tokens with exact lexeme (either single or multiple characters) could be implemented as direct descendants.
TEOFToken = class(TToken) end; TPlusToken = class(TToken) end; TMinusToken = class(TToken) end; TAssignToken = class(TToken) end; TLessToken = class(TToken) end; TNotEqualToken = class(TToken) end; TPrintToken = class(TToken) end; TVarToken = class(TToken) end; TEOFToken = class(TToken) end;
Next, we need to create descendant for token with variadic lexeme:
TVariadicToken = class(TToken) protected FLexeme: String; public constructor Create(const ALine, ACol: LongWord; const ALexeme: String); destructor Destroy; override; end;
The only difference from the base token class is the lexeme property, since it's possibly has infinitely many form. From here, we create descendant classes for tokens whose lexeme is infinitely many:
TUnknownToken = class(TVariadicToken) end; TIdentifierToken = class(TVariadicToken) end; TIntegerToken = class(TVariadicToken) end;
That's all for the token, up to the lexer.
[edit] The Lexer Class
TLexer = class private FLine: LongWord; FCol: LongWord; FStream: TStream; FCurrentToken: TToken; FLastChar: Char; function GetChar: Char; public constructor Create(AStream: TStream); destructor Destroy; override; procedure NextToken; property CurrentToken: TToken read FCurrentToken; end;
A lexer consists of its position in the source code (line and column), the stream representing the source code, current (or last recognized / formed) token and last character read. To encapsulate the movement in the source code, reading character from the stream is implemented in GetChar method. Despite its maybe simple look, the implementation could be complicated as we'll see soon. GetChar is used by public method NextToken, whose job is to advance lexer movement by 1 token ahead. On to GetChar implementation:
function TLexer.GetChar: Char; begin try FLastChar := Chr(FStream.ReadByte); Inc(FCol); // Handles 3 types of line endings if FLastChar in [#13, #10] then begin FCol := 0; Inc(FLine); // CR met, probably CRLF if FLastChar = #13 then begin FLastChar := Chr(FStream.ReadByte); // Not CRLF, but CR only, move backward 1 position if FLastChar <> #10 then FStream.Seek(-1,soFromCurrent); end; // Always returns as LF for consistency FLastChar := #10; end; except // Exception? Returns EOF FLastChar := #26; end; GetChar := FLastChar; end;
As stated earlier, GetChar's job is not as simple as its name. First, it has to read one character from the input stream and increment the lexer's column position. Then it has to check whether this character is one of the possible line endings (our lexer is capable of handling CR-, LF- and CRLF-style line ending). Next is the core of our lexer, NextToken:
procedure TLexer.NextToken; var StartLine,StartCol: LongWord; function GetNumber: TIntegerToken; var NumLex: String; begin NumLex := ''; repeat NumLex := NumLex + FLastChar; FLastChar := GetChar; until not (FLastChar in ['0' .. '9']); Result := TIntegerToken.Create(StartLine, StartCol, NumLex); end; function GetIdentifierOrKeyword: TVariadicToken; var IdLex: String; begin IdLex := ''; repeat IdLex := IdLex + FLastChar; // This is how we handle case-insensitiveness FLastChar := LowerCase(GetChar); until not (FLastChar in ['a' .. 'z','0' .. '9','_']); // Need to check for keywords case IdLex of 'print' : Result := TPrintToken.Create(StartLine,StartCol); 'var' : Result := TVarToken.Create(StartLine,StartCol); otherwise Result := TIdentifier.Create(StartLine,StartCol,IdLex); end; end; begin // Eat whitespaces while FLastChar in [#32,#9,#13,#10] do FLastChar := GetChar; // Save first token position, since GetChar would change FLine and FCol StartLine := FLine; StartCol := FCol; if FLastChar = #26 then FCurrentToken := TEOFToken.Create(StartLine, StartCol) else case LowerCase(FLastChar) of // Exact single character '+': begin FCurrentToken := TPlusToken.Create(StartLine, StartCol); FLastChar := GetChar; end; '-': begin FCurrentToken := TMinusToken.Create(StartLine, StartCol); FLastChar := GetChar; end; // Exact multiple character ':': begin FLastChar := GetChar; if FLastChar = '=' then FCurrentToken := TAssignToken.Create(StartLine, StartCol) // ':' alone has no meaning in this language, therefore it's invalid else FCurrentToken := TUnknownToken.Create(StartLine, StartCol); end; // Overlapping prefix '<': begin FLastChar := GetChar; // '<>' if FLastChar = '>' then FCurrentToken := TNotEqualToken.Create(StartLine, StartCol) // '<' else FCurrentToken := TLessToken.Create(StartLine, StartCol); end; // Infinitely many is handled in its own function to cut down line length here '0' .. '9': begin FCurrentToken := GetNumber; end; 'a' .. 'z','_': begin FCurrentToken := GetIdentifierOrKeyword; end; else begin FCurrentToken := TUnknownToken.Create(StartLine, StartCol, FLastChar); FLastChar := GetChar; end; end; end;
As you can see, the core is a (probably big) case statement. The other parts is quite self documenting and well commented. Last but not least, the constructor:
constructor TLexer.Create(AStream: TStream; AErrorList: TErrorList); begin FStream := AStream; FLine := 1; FCol := 0; FLastChar := GetChar; NextToken; end;
It sets up the initial line and column position (guess why it's 1 for line but 0 for column :)), and also sets up the first token available so CurrentToken would be available after calling the constructor, no need to explicitly call NextToken after that.
[edit] Test Program
uses Classes,SysUtils,lexer,tokens; var Stream: TStream; lex: TLexer; begin Stream := THandleStream.Create(StdInputHandle); lex := TLexer.Create(Stream,nil); while not(lex.CurrentToken is TEOFToken) do begin WriteLn(lex.CurrentToken.ToString); lex.NextToken; end; lex.Free; Stream.Free; end.
As an exercise, you could try extending the lexer with floating point numbers, strings, numbers with base other than 10, scientific notation, comments, etc.
[edit] Table-Driven Hand-Written Scanning
[edit] Scanning via a Scanner Generator via a Table Supplied by the User
There are three scanner generator methods. Mathematically, either the information is coded as a regular expression or expression or inside code or not. At Compilers Principles Techniques and Tools 2nd edition it helps to start with some other description this article proves that this other description is valid. A perfect general skeleton compiler can be formed from the extended program. A perfect grep for patterns length two can be formed. If a compiler is built without many redundant operations such as keywords key symbols according to frequency, decimal numbers converted from characters to digits without many redundant operations in case of fail, then a scanner generator like this is needed. One pass lookahead needs an infinite lookahead in case of hexadecimal numbers ending with H alone only which means lookahead is not a possible method because of this case. In the ACM "Lexical Analysis Tool" Article (PDF), this method is described as beginning with a table to perform lexical analysis. Lexicalising where there are unclassified general id's (like 2a^nb^nc^nd^n which is important in case of letters with double underline identification) and unclassified general numbers. Also in the file, any key symbols not surrounded with blanks, with longest key symbols accepted. Also any keywords surrounded with blanks or key symbols, accepted. Also with strings untouched, and comments skipped. i.e. for l patterns, at ll length of text, maximum pattern length le, patterns include random integers (e.g. random(1 to 2)random(1 to 2) for example), random id's (e.g. random(a to e)random(a to b)random(a to e) for example), large sigma alphabet, then the naive string matching minimum time is O(ll*l*|sigma|**le), the deterministic finite automaton method minimum time is O(|sigma|**le)+O(ll), this method is O(ll*le**2*l).
The skeleton compiler writing method offered as a "Lexical Analysis Tool" can be one of twenty five commercially accepted compiler skeleton methods published till publishing date. It solves with iterative usage of keyword tables, with translation as comments, then comments symbols removed, i.e. bottom up parsing, i.e. shift reduce (which does not have stack overflow and backtracking problems of top down parsing), 8500 programming languages, published academically till publishing date. Allows lexical analyses of 8500 programming languages published academically till publishing date, professionally (pre publishing date academically published ad hoc methods did not solve the +++++ compiling, file of +'s with key symbol - before + at the compiler can be optimised by changing key symbol table order, parallel requires signal replication error prone). It provides constructs of databases, operating systems, memory management, and gives a skeleton of a window with mouse keyclicks. It might be good for writing a simple compiler for compiling a statement like print("hello"), and understanding how it works. It might be also good as a safebox scanner for emergencies at airports. It also provides a more efficient grep implementation for an expression length l at a file length l (same with (one symbol +) pre expression post expression) (with two symbols same operations no fail function). (grep is found at 50 billion copies on windows computers which run grep=findstr.exe at booting). grep is improved as is by the article extended program as is in many cases. It is more efficient than lex at some cases. Compilers Principles Techniques and Tools 1986 edition by Ullman page 22 unworkable is corrected. It provides an operating system command line interpreter, which can be optimised according to usage, by the keywords table. Same way optimise instruction register (by an instruction table sequentially searched according to frequency), keyboard (change keyboard for symbol keys more common than numerical keys and vice versa paths of electricity). It might also be good for a compiler which click after click advances the compiling showing the intercompiling results. It might also be good for processing a file into a file with blanks which is more readable, and for translating a file by steps. It might also be good for copying files without fail. It might also be good for manufacturing new programming languages with fast lexical analysis. It might also be for manufacturing a chip to monitor transactions on a bus according to patterns like card numbers. It might also be good for digital photo focusing. It might also be good as a search engine.
Fortran can be lexical analysed five times faster than Fortran lexical analysers published academically till publishing date by an extended program which puts labels into keywords table, then iteration for no blanks, then with an extended filler function, lexical analysis. All lexical analysers published academically till publishing date are equal or slower than the extended program by one to two factor or more.
There is a lexical analysis which can not be achieved by any means with one pass, but can be achieved with two passes, same with three passes, when there is no information about the tokens, and we want to identify tokens, then there is some information about the tokens, and we want to identify tokens, etc.
Lexical analysis is the base of all programming languages. This is the method for programming languages according to page 22 of the book Compilers Principles Techniques and Tools, which writes that this method is valid, and tempting, while the other methods require much more work by the user, as written at Principles of Compiler Design page 22. This method solves the problem by lambda calculus matrix analysis arithmetic, instead of by DFA's with many states, i.e. for a large source program, to be compiled many times, an ad hoc compiler might have many redundant operations, while this method can be optimised according to frequency of key symbols, by a key symbol table. Lex requires more work and creates large tables i.e. this is the method for optimised compiling in this case. This might be the method of choice when lexicalising hexadecimal numbers(without 0x at the beginning, which is far from the source, taking into consideration that sometimes most of the computer file is hexadecimal numbers, and we might want to convert hexadecimal digits to binary). An example of two passes instead of one is, when we might have a keyword at length of encyclopedia or less, with longest match required, the text is half encyclopedia++half encyclopedia, then the lookahead at one pass must be a whole encyclopedia, which will accumalate to O(encyclopedia ** 2), two passes will be O(2*encyclopedia + encyclopedia) which is less for a large encyclopedia.
[edit] Scanning via a Tool - lex/flex
Wikipedia Flex lexical analyzer article.
[edit] Scanning via a Tool - JavaCC
JavaCC is the standard Java compiler-compiler. Unlike the other tools presented in this chapter, JavaCC is a parser and a scanner (lexer) generator in one. JavaCC takes just one input file (called the grammar file), which is then used to create both classes for lexical analysis, as well as for the parser.
In JavaCC's terminology the scanner/lexical analyser is called the token manager. And in fact the generated class that contains the token manager is called ParserNameTokenManager. Of course, following the usual Java file name requirements, the class is stored in a file called ParserNameTokenManager.java. The ParserName part is taken from the input file. In addition, JavaCC creates a second class, called ParserNameConstants. That second class, as the name implies, contains definitions of constants, especially token constants. JavaCC also generates a boilerplate class called Token. That one is always the same, and contains the class used to represent tokens. One also gets a class called ParseError. This is an exception which is thrown if something went wrong.
It is possible to instruct JavaCC not to generate the ParserNameTokenManger, and instead provide your own, hand-written, token manager. Usually - this holds for all the tools presented in this chapter - a hand-written scanner/lexical analyser/token manager is much more efficient. So, if you figure out that your generated compiler gets too large, give the generated scanner/lexical analyzer/token manager a good look. Rolling your own token manager is also handy if you need to parse binary data and feed it to the parsing layer.
Since, however, this section is about using JavaCC to generate a token manager, and not about writing one by hand, this is not discussed any further here.
[edit] Defining Tokens in the JavaCC Grammar File
A JavaCC grammar file usually starts with code which is relevant for the parser, and not the scanner. For simple grammars files it looks similar to:
options { LOOKAHEAD=1; }
PARSER_BEGIN(ParserName)
public class ParserName {
// code to enter the parser goes here
}
PARSER_END(ParserName)
This is usually followed by the definitions for tokens. These definitions are the information we are interested in in this chapter. Four different kinds, indicated by four different keywords are understood by JavaCC when it comes to the definition of tokens:
- TOKEN
- Regular expressions which specify the tokens the token manager should be able to recognize.
- SPECIAL_TOKEN
- SPECIAL_TOKENs are similar to TOKENS. Only, that the parser ignores them. This is useful to e.g. specify comments, which are supposed to be understood, but have no significance to the parser.
- SKIP
- Tokens (input data) which is supposed to be completely ignored by the token manager. This is commonly used to ignore whitespace. A SKIP token still breaks up other tokens. E.g. if one skips white space, has a token "else" defined, and if the input is "el se", then the token is not matched.
- MORE
- This is used for an advanced technique, where a token is gradually built. MORE tokens are put in a buffer until the next TOKEN or SPECIAL_TOKEN matches. Then all data, the accumulated token in the buffer, as well as the last TOKEN or SPECIAL_TOKEN is returned.
- One example, where the usage of MORE tokens is useful are constructs where one would like to match for some start string, arbitrary data, and some end string. Comments or string literals in many programming languages match this form. E.g. to match string literals, delimited by ", one would not return the first found " as a token. Instead, one would accumulate more tokens, until the closing " of the string literal is found. Then the complete literal would be returned. See Comment Example for an example where this is used to scan comments.
Each of the above mentioned keywords can be used as often as desired. This makes it possible to group the tokens, e.g. in a list for operators and another list for keywords. All sections of the same kind are merged together as if just one section had been specified.
Every specification of a token consists of the token's symbolic name, and a regular expression. If the regular expression matches, the symbol is returned by the token manager.
[edit] Simple Example
Lets see an example:
//
// Skip whitespace in input
// if the input matches space, carriage return, new line or a tab,
// it is just ignored
//
SKIP: { " " | "\r" | "\n" | "\t" }
// Define the tokens representing our operators
//
TOKEN: {
< PLUS: "+" >
| < MINUS: "-" >
| < MUL: "*" >
| < DIV: "/" >
}
//
// Define the tokens representing our keywords
//
TOKEN: {
< IF: "if" >
| < THEN: "then" >
| < ELSE: "else" >
}
All the above token definitions use simple regular expressions, where just constants are matched. It is recommended to study the JavaCC documentation for the full set of possible regular expressions.
When the above file is run through JavaCC, a token manager is generated which understands the above declared tokens.
[edit] Comment Example
Eliminating (ignoring) comments in a programming language is a common task for a lexical analyzer. Of course, when JavaCC is used, this task is usually given to the token manager, by specifying special tokens.
Basically, a standard idiom for JavaCC has evolved on how to ignore comments. It combines tokens of kind SPECIAL_TOKEN and MORE with a lexical state. Lets assume we e.g. have a programming language where comments start with a --, and either end with another -- or the end of the line (this is the comment schema of ASN.1 and several other languages). Then a way to construct a scanner for this would be:
//
// Start of comment. Don't return as token, instead
// shift to special lexical state.
//
SPECIAL_TOKEN: { <"--"> : WithinComment }
//
// While inside the comment lexicals state, look for end
// of comment (either another '--' or EOL). Don't return
// the (accumulated data). Instead switch back to normal
// lexical state
//
<WithinComment> SPECIAL_TOKEN:
{
<("--" | "\n" )> : DEFAULT
}
//
// While inside the comment state, accumulate all contents
//
<WithinComment> MORE: { <~[]> }
The