Compiler Construction/Introduction

From Wikibooks, the open-content textbooks collection

Jump to: navigation, search

This is the introductory chapter for Compiler Construction.

You should read most of this chapter, since the rest of the book will assume it as background information.

Compiler construction is normally considered as an advanced rather than a novice programming task, mainly due to the quantity of code needed (and the difficulties of grokking this amount of code) rather than the difficulty of any particular coding constructs.

Contents

[edit] Assumptions

[edit] Appropriate Background

To get the most out of this wikibook, your background/experience/capabilities should include most of the following:

  • Have a few years experience in one or (preferably) more high-level programming languages.
  • Be able to cope with programs containing several thousand lines of source code.
  • You don't need any specific mathematical background, though some acquaintance with simple algebra might be helpful.
  • Programming requires a logical mind capable of precise attention to detail.
  • Be prepared to learn about various computer science concepts. In particular, you should be able to understand EBNF as described in Section 6 of this chapter. Other concepts will be introduced as required, e.g. stacks, recursion.
  • Be prepared to learn about simple assembly languages and about various aspects of computer architecture and addressing modes.

Even if you do not have the background given above you should still be able to pick up a few ideas by reading the text. You may be able to implement some or all of the programs supplied in this wikibook. If you have some high-level programming experience you should also be able to play around at extending the capabilities of the interpreter.

[edit] Implementation Software

Since this is a freely available book, it seems appropriate to use some freely available implementation language. I have chosen to use a subset of C++ in conjunction with the GNU GCC compiler. The subset corresponds roughly to using C++ as a better C, including use of standard C++ libraries, but not getting into the object-oriented side of C++.

If you use GNU/Linux then this compiler will/should already be available to you. Otherwise you will have to search the World Wide Web for a free version which will run on your system.

As a starting point for your search you might like to consider this website or this Swedish website which contain a variety of freeware downloads.
If you use Microsoft Windows then you might like to consider Code::Blocks, a free IDE(Integrated Development Environment) with the MinGW compiler(a port of GCC), which is available from www.codeblocks.org

[edit] Simplify the Source Language

To reduce the complexity of this book, we will make some simplifying assumptions about the source language which we write a compiler for:

  1. The source language does not allow subroutine declarations to be nested.
  2. All identifiers must be declared before use.
  3. Syntax analysis of the source language can be done in one pass.
  4. During syntax analysis, we need only look at the next source token to decide what to do next.

In later chapters we will give more consideration to the effects of relaxing these assumptions.

In practical terms these assumptions mean that the source language is easy to translate using simple techniques, and that some run-time complexity has been eliminated.

The first assumption rules out Algol 60 and Pascal.
The second assumption allows one-pass compilation.
The third assumption rules out Algol 68; according to Hunter, up to 4 passes may be needed for syntax analysis.
The second and fourth assumptions rule out Fortran; the following are all valid Fortran statements, but you can't even decide what sort of statement each is until you have looked near/at the end of the statement (in Fortran, spaces are not significant, variables need not be declared, and there are no reserved keywords).

Example

Output

IF (expression) GO TO 567
IF (expression) GO TO = 2.71
IF (expression) 123,234
IF (expression) 123,234,345
IF (expression) THEN
IF (expression) THEN = 3.14
IF (expression) = 36
DO 10 I = 1,5
DO 10 I = 1.5

conditional jump
conditional assignment
2-way branch: false, true
3-way branch: <0, =0, >0
start multi-line if
conditional assignment
assignment to array element
start a loop
assignment to variable DO10I

The difference between the last two lines is reported to have caused the loss of a space probe costing millions of dollars.

[edit] Requirements

Any compiler has some essential requirements, which are perhaps more stringent than for most programs:

  • Any valid program must be translated correctly, i.e. no incorrect translation is allowed.
  • Any invalid program must be rejected and not translated.

There will inevitably be some valid programs which can't be translated due to their size or complexity in relation to the hardware available, for example problems due to memory size. The compiler may also have some fixed-size tables which place limits on what can be compiled (some language definitions place explicit lower bounds on the sizes of certain tables, to ensure that programs of reasonable size/complexity can be compiled).

There are also some desirable requirements, some of which may be mutually exclusive:

  • Errors should be reported in terms of the source language or program.
  • The position at which an error was detected should be indicated; if the actual error probably occurred somewhat earlier then some indication of possible cause(s) should also be provided.
  • Compilation should be fast.
  • The translated program should be fast.
  • The translated program should be small.
  • If the source language has some national or international standard:
    • Ideally the entire standard should be implemented.
    • Any restrictions or limits should be well and clearly documented.
    • If extensions to the standard have been implemented:
      • These extensions should be documented as such.
      • There should be some way of turning these extensions off.

There are also some possibly controversial requirements to consider (see chapter on dealing with errors:

  • Errors detected when the translated program is running should still be reported in relation to the original source program e.g. line number.
  • Errors detected when the translated program is running should include division by 0, running out of memory, use of an array subscript/index which is too big or too small, attempted use of an undefined variable, incorrect use of pointers, etc.

[edit] Overall Structure

For ease of exposition we will divide the compiler into a front end and a back end. These need not even be written in the same implementation language, providing they can communicate effectively via some intermediate representation.

The following lists itemise the tasks carried out by the front end and the back end. Note that the tasks are not carried out in any particular order, as outlined below, and discussed in more detail in subsequent chapters.

  • Front end
    • lexical analysis - convert characters to tokens
    • syntax analysis - check for valid sequence of tokens
    • semantic analysis - check for meaning
    • some global/high-level optimisation
  • Back end
    • some local optimisation
    • register allocation
    • peep-hole optimisation
    • code generation
    • instruction scheduling

Almost all the source-language aspects are handled by the front end. It is possible to have different front ends for different high-level languages, and a common back end which does most of the optimisation.

Almost all the machine-dependent aspects are handled by the back end. It is possible to have different back ends for different computers so that the compiler can produce code for different computers.

The front end is normally controlled by the syntax analysis processing. As necessary, the syntax analysis code will call a routine which performs some lexical analysis and returns the next token. At selected points during syntax analysis, appropriate semantic routines are called which perform any relevant semantic checks and/or add information to the internal representation.

[edit] Performance

'Optimization' to improve space and/or speed for the translated program is discussed in a later chapter. This section is concerned with the performance of the compiler itself.

On modern computers, a compiler can be considered to have satisfactory perfomance if it translates a moderate size source program (say about 1000 lines) in a matter of seconds. The way to get a compiler with satisfactory performance is more or less the same way you would get any program performing well.

  • Design using good algorithms.
  • Ensure your data structures match the algorithms.
  • Structure using modules with clean simple interfaces.
  • Implement using clear straightforward code.
  • When there is an overall performance problem
    • Measure the actual performance in reasonable detail.
    • Identify the troublesome areas.
    • Redesign and re-implement these problem areas.

In this book we will consider various algorithms and data structures and discuss their likely impact on performance.

Note that actual measurement is crucial, since the problems are often not where you guess they might be. For your initial implementation you may well have selected simple algorithms which are known to perform poorly in order to get something working quickly. Nevertheless, you should still measure performance in detail, since there may be some other source of (at least some of) your problems.

If you are very lucky, your implementation language might have some optional facility for selectively measuring CPU time. Take care to only activate such a feature for a few crucial routines; the timing overhead could easily exceed the execution time for small routines and distort the result.
More commonly you will be unlucky and will have to explicitly add timing code to selected routines; make sure you can easily disable it and enable it as required. Typically you will have to insert calls to some CPU timing routine at the beginning and end of a routine, and then subtract the two values to get the time for that routine, which will include the time for any routines called by it.

Various measurements on the performance of actual compilers have been reported over the years. Specific areas which have been known to cause problems include:

  • Multiple routine calls during lexical analysis for each and every source character.
  • Skipping white space during lexical analysis.
  • Skipping over a comment.
  • Decoding a tightly packed parse table during syntax analysis.
  • Looking things up in the name table during semantic analysis.
  • Determining whether some name is a reserved keyword or a user-definable identifier.

[edit] Describing a Programming Language

[edit] Background Information

Over the years it has been found that trying to describe a programming language using some natural language (e.g. English) is unsatisfactory due to possible omissions, contradictions, ambiguities, and vagueness. A significant milestone was the publication of the Algol 60 Report which used a formal grammar (BNF) to describe the syntax of the language. The semantics (meaning) was carefully described using English. A variation of Extended BNF (EBNF) will be used in this book as described later in this section.

There have been several attempts to formalise the semantics, including the grammar generator used for defining Algol 68, and the Vienna Definition Language which was used at one stage to define the meaning of PL/1. While theoretically adequate, these attempts are so complicated that they have proved to be of little practical use.

In 1956, Chomsky (a linguist) developed a hierarchy of formal grammars (types 0 to 3) for use in the study of English and other natural languages. Type 0 is the most powerful and type 3 the least powerful. It turns out that type-3 regular grammars are just what is needed to describe the requirements of lexical analysis, and that type-2 context-free grammars are just what is needed to describe the syntax of many programming languages. But even type 0 grammars are inadequate for describing the English language.

[edit] Analysis Tools

Programming tools have been developed which can accept the description of a grammar as input, and which produce as output the code for using that grammar for analysis. Note that there may be several different grammars which can be used to describe a particular programming language, and that not all of these grammars are usable by these tools.

Lexical analysis code using regular grammars can be produced by tools such as lex, flex, and JavaCC.

Syntax analysis code using context-free grammars can be produced by tools such as yacc, bison, and JavaCC.

Even if you have a suitable grammar, such tools only automate a relatively small part of the job of writing a compiler or interpreter. Simple lexical and syntax analysis code can in fact be written by hand without undue effort; we will see some examples in later chapters. Good hand-written code may be faster than the code produced by the tools.

[edit] Extended BNF

Words in italics (e.g. WholeNumber) are used to name grammatical concepts. It is a good idea to select names for these concepts which are semantically meaningful to a human reader, to make things easier to read and understand. The Algol 60 report used diamond brackets '<' and '>' rather than italics to flag grammatical concepts.

A vertical bar | is used to separate alternatives.

Single quotation marks are used to indicate characters to be used 'as is'.

The compound symbol ::= is used to mean 'is defined as' and a semicolon ; is used to end a definition.

Square brackets [ ] are used to indicate options and/or repetition.

  [ xxx ]     xxx is optional.
  [ yyy ]*    yyy may be repeated zero or more times.
  [ zzz ]+    zzz may be repeated one or more times.

[edit] Examples for Lexical Analysis

We will start with some simple examples which satisfy the rules for regular grammars.

The following defines a single decimal Digit as being any one of the characters from '0' to '9'.

  Digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;

Understood as

  Digit is defined as: '0' OR '1' OR '2' OR '3' OR '4' OR '5' OR '6' OR '7' OR '8' OR '9'.

A similar (somewhat tedious) definition can be given for Letter to cover 'a' to 'z' and 'A' to 'Z'.

A WholeNumber is a sequence of one or more decimal digits, which can be expressed as one of two ways

First way

  WholeNumber ::= [ Digit ]+ ;

Understood as

  WholeNumber is defined as: Digit may be repeated one or more times

Second way

  WholeNumber ::= Digit [ Digit ]* ;

Understood as

  WholeNumber is defined as: Digit AND Digit repeated zero or more times.

These definitions allow whole numbers such as 007. Some people dislike the idea of allowing whole numbers to have leading zeroes. We can handle this way of looking at things by using the following definitions, but we have to include the number zero (just 0) as a special case.

  NonZeroDigit ::= '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
 
  Digit ::= '0' | NonZeroDigit ;
 
  WholeNumber ::= '0' | NonZeroDigit [ Digit ]* ; 

Understood as

  NonZeroDigit is defined as: '1' OR '2' OR '3' OR '4' OR '5' OR '6' OR '7' OR '8' OR '9'.
  Digit is defined as: 0 OR NonZeroDigit.
  WholeNumber is defined as: 0 OR as NonZeroDigit AND Digit repeated zero or more times.

This last definition is a formal way of saying that a whole number is either 0 or starts with a non-zero digit, and that this non-zero digit may be followed by any number of digits (including none).

In most programming languages an Identifier is any sequence of letters and digits which starts with a letter; Some programming languages may allow a few special characters as well (e.g. underline _ or hash #). There may also be some restrictions on the length of an Identifier, but such restrictions are normally given as semantic constraints expressed in a natural language such as English rather than as part of the formal grammar (e.g. up to Fortran 77, Identifiers in Fortran were limited to a maximum of 6 characters).

[edit] Example for Syntax Analysis

We now look at some examples which, in combination, satisfy the rules for context-free grammars.

Consider an arithmetic expression such as 3 + 4 * 5 . The normal convention is that multiplication and division are done before addition and subtraction, so the value of 3 + 4 * 5 is 23 . If we wanted the addition to be done first in this example, we would use brackets, so that (3 + 4) * 5 has value 35 .

We can express these conventions regarding brackets and operator priority as follows. These definitions also specify that operators with the same priority are applied from left to right.

  AddSub ::= '+' | '-' ;
 
  MulDiv ::= '*' | '/' ;
 
  Expression ::= Secondary [ AddSub Secondary ]* ;
 
  Secondary  ::= Primary   [ MulDiv Primary ]* ;
 
  Primary    ::= '(' Expression ')' | WholeNumber | Identifier ;

There are some extremely simple expressions which are not covered by the above grammar. Can you work out what they might be before reading the third paragraph below?

Note that this is a relatively simple example, with only two priority levels for operators. Languages such as C/C++/Java have more than a dozen priority levels.

The particular aspect that makes this context-free rather than regular is the way in which an Expression can be a Secondary, which can be a Primary, which can be a bracketed Expression, i.e. Expression can be partially defined in terms of itself; this is what is known as a recursive definition.

The expressions which are not covered include such things as -5 or +4. To handle these we need to change one definition:

  Expression ::= [AddSub] Secondary [ AddSub Secondary ]* ;

This grammatical rule allows -(+(-4)), but it does not allow -+-4.


[edit] Glossary

This glossary is intended to provide definitions of words or phrases which relate particularly to compiling. It is not intended to provide definitions of general computing jargon, for which a reference to Wikipedia may be more appropriate.