Compiler Construction/Syntax Analysis

From Wikibooks, open books for an open world
Jump to navigation Jump to search

Syntax Analysis[edit | edit source]

This is alternatively known as parsing. It is roughly the equivalent of checking that some ordinary text written in a natural language (e.g. English) is grammatically correct (without worrying about meaning).

The purpose of syntax analysis or parsing is to check that we have a valid sequence of tokens. Tokens are valid sequence of symbols, keywords, identifiers etc. Note that this sequence need not be meaningful; as far as syntax goes, a phrase such as "true + 3" is valid but it doesn't make any sense in most programming languages.

The parser takes the tokens produced during the lexical analysis stage, and attempts to build some kind of in-memory structure to represent that input. Frequently, that structure is an 'abstract syntax tree' (AST).

The parser needs to be able to handle the infinite number of possible valid programs that may be presented to it. The usual way to define the language is to specify a grammar. A grammar is a set of rules (or productions) that specifies the syntax of the language (i.e. what is a valid sentence in the language).

There can be more than one grammar for a given language. Furthermore, it is easier to build parsers for some grammars than for others.

Fortunately, there exists a class of programs (often called `compiler-compilers', even though this is a silly name) that can take a grammar and generate a parser for it in some language. (Sound familiar?). Examples of such programs are:

  • ANTLR (targeting Java, C++ and C#)
  • SableCC (targeting Java)
  • YACC (targeting C)
  • YACC++ (targeting C++)
  • plus many others....

These can make the compiler-writer's job considerably easier, if they can write down the grammar for their language.

Recursive Descent Parsing[edit | edit source]

A recursive descent parser is a top-down parser built from a set of mutually-recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the production rules of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.

A predictive parser is a recursive descent parser with no backup. Predictive parsing is possible only for the class of LL(k) grammars, which are the context-free grammars for which there exists some positive integer k that allows a recursive descent parser to decide which production to use by examining only the next k tokens of input. (The LL(k) grammars therefore exclude all ambiguous grammars, as well as all grammars that contain left recursion. Any context-free grammar can be transformed into an equivalent grammar that has no left recursion, but removal of left recursion does not always yield an LL(k) grammar.) A predictive parser runs in linear time.

Recursive descent with backup is a technique that determines which production to use by trying each production in turn. Recursive descent with backup is not limited to LL(k) grammars, but is not guaranteed to terminate unless the grammar is LL(k). Even when they terminate, parsers that use recursive descent with backup may require exponential time.

Although predictive parsers are widely used, programmers often prefer to create LR or LALR parsers via parser generators without transforming the grammar into LL(k) form.

Some authors define the recursive descent parsers as the predictive parsers. Other authors use the term more broadly, to include backed-up recursive descent.[citation needed]

Example parser[edit | edit source]

The following EBNF grammar (for Niklaus Wirth's PL/0 programming language, from Algorithms + Data Structures = Programs) is in LL(1) form (for simplicity, ident and number are assumed to be terminals):

program = block "." .

block =
    ["const" ident "=" number {"," ident "=" number} ";"]
    ["var" ident {"," ident} ";"]
    {"procedure" ident ";" block ";"} statement .

statement =
    [ident ":=" expression
    | "call" ident
    | "begin" statement {";" statement} "end"
    | "if" condition "then" statement
    | "while" condition "do" statement
    ] .

condition =
    "odd" expression
    | expression ("="|"#"|"<"|"<="|">"|">=") expression
    .

expression = ["+"|"-"] term {("+"|"-") term} .

term = factor {("*"|"/") factor} .

factor = ident | number | "(" expression ")" .

Terminals are expressed in quotes (except for ident and number). Each nonterminal is defined by a rule in the grammar.

Notice how closely the predictive parser below mirrors the grammar above. There is a procedure for each nonterminal in the grammar. Parsing descends in a top-down manner, until the final nonterminal has been processed. The program fragment depends on a global variable, sym, which contains the next symbol from the input, and the global function getsym, which updates sym when called.

typedef enum {ident, number, lparen, rparen, times, slash, plus,
   minus, eql, neq, lss, leq, gtr, geq, callsym, beginsym, semicolon,
   endsym, ifsym, whilesym, becomes, thensym, dosym, constsym, comma,
   varsym, procsym, period, oddsym} Symbol;

Symbol sym;
void getsym(void);
void error(const char msg[]);
void expression(void);

int accept(Symbol s) {
   if (sym == s) {
       getsym();
       return 1;
   }
   return 0;
}

int expect(Symbol s) {
   if (accept(s))
       return 1;
   error("expect: unexpected symbol");
   return 0;
}

void factor(void) {
   if (accept(ident)) {
       ;
   } else if (accept(number)) {
       ;
   } else if (accept(lparen)) {
       expression();
       expect(rparen);
   } else {
       error("factor: syntax error");
       getsym();
   }
}

void term(void) {
   factor();
   while (sym == times || sym == slash) {
       getsym();
       factor();
   }
}

void expression(void) {
   if (sym == plus || sym == minus)
       getsym();
   term();
   while (sym == plus || sym == minus) {
       getsym();
       term();
   }
}

void condition(void) {
   if (accept(oddsym)) {
       expression();
   } else {
       expression();
       if (sym == eql || sym == neq || sym == lss ||
           sym == leq || sym == gtr || sym == geq) {
           getsym();
           expression();
       } else {
           error("condition: invalid operator");
           getsym();
       }
   }
}

void statement(void) {
   if (accept(ident)) {
       expect(becomes);
       expression();
   } else if (accept(callsym)) {
       expect(ident);
   } else if (accept(beginsym)) {
       do {
           statement();
       } while (accept(semicolon));
       expect(endsym);
   } else if (accept(ifsym)) {
       condition();
       expect(thensym);
       statement();
   } else if (accept(whilesym)) {
       condition();
       expect(dosym);
       statement();
   }
}

void block(void) {
   if (accept(constsym)) {
       do {
           expect(ident);
           expect(eql);
           expect(number);
       } while (accept(comma));
       expect(semicolon);
   }
   if (accept(varsym)) {
       do {
           expect(ident);
       } while (accept(comma));
       expect(semicolon);
   }
   while (accept(procsym)) {
       expect(ident);
       expect(semicolon);
       block();
       expect(semicolon);
   }
   statement();
}

void program(void) {
   getsym();
   block();
   expect(period);
}

Implementation in functional languages[edit | edit source]

Recursive descent parsers are particularly easy to implement in functional languages such as Haskell or ML.

See Functional Pearls: Monadic Parsing in Haskell

See also[edit | edit source]

  • JavaCC - a recursive descent parser generator
  • Coco/R - a recursive descent parser generator
  • ANTLR - a recursive descent parser generator
  • CodeCodex: Recursive descent parsing - sample recursive descent parser generator code for C, Java and OCaml
  • Tail Recursive Parser - a variant of the recursive descent parser

Table-Driven Parsing[edit | edit source]

Operator Precedence Parsing[edit | edit source]

Operator precedence parsing is used in shift-reduce parsing. operator grammar No production has two nonterminal symbols in sequence on the right hand side. An operator grammar can be parsed using shift-reduce parsing and precedence relations between terminal symbols to find handles. This strategy is known as operator precedence.

Parsing via a Tool - yacc/bison[edit | edit source]

Wikipedia GNU bison article.

Parsing via a Tool - JavaCC[edit | edit source]

Please refer to Lexical Analysis: Scanning via a Tool - JavaCC for information on generating a token manager.

Using the generated Token manager[edit | edit source]

In order to use the generated token manager, an instance of it has to be created. When doing so, the constructor expects a Reader as the source of the input data (other sources are also possible, see the JavaCC documentation).

Once created, the token manager object can be used to get tokens via the

Token ParserNameTokenManager.getNextToken() throws ParseError;

method.

Each Token object as returned by the getNextToken() method. Such a Token object provides a field kind which identifies the token (ParserNameConstants.java defines the corresponding constants). It also contains the field image, which just holds the matched input data as a String.

example of syntax analyzer is

 #include<stdio.h>
 #include<conio.h>
 #include<string.h>
 #include<ctype.h>
 //
 void main() {
   int i,j,k=0,count,inc=0,n;
   char name[30],open[30],ch,chh,o[30];
   char op[20]={'=','+','-','*','/','%','^','&','|'};
   clrscr();
   textcolor(3);
   cprintf("--Syntax Analyser--");
   printf("\n");
   printf("\n Enter Syntax");
   printf("\n");
   scanf("%s",name);
   n=strlen(name);
   for(i=0;i<n;i++) {
     ch=tolower(name[i]);
     for(j=0;j<9;j++) {
       if(ch==op[j]) {
         open[k]=i;
         o[k]=ch;
         k++;
       }
     }
   }
   for(i=0;i<k;i++) {
     count=open[i];
     ch=tolower(name[count-1]);
     chh=tolower(name[count+1]);
     if(isalpha(ch)&&isalpha(chh)||isdigit(chh))
       ++inc;
   }
   if(k==inc)
     printf("\n %s is a valid syntax",name);
   else
     printf("\n %s is an invalid syntax",name);
   getch();
 }