F Sharp Programming/Lexing and Parsing
F# : Lexing and Parsing |
Lexing and parsing is a very handy way to convert source-code (or other human-readable input which has a well-defined syntax) into an abstract syntax tree (AST) which represents the source-code. F# comes with two tools, FsLex and FsYacc, which are used to convert input into an AST.
FsLex and FsYacc have more or less the same specification as OCamlLex and OCamlYacc, which in turn are based on the Lex and Yacc family of lexer/parser generators. Virtually all material concerned with OCamlLex/OCamlYacc can transfer seamlessly over to FsLex/FsYacc. With that in mind, SooHyoung Oh's OCamlYacc tutorial and companion OCamlLex Tutorial are the single best online resources to learn how to use the lexing and parsing tools which come with F# (and OCaml for that matter!).
Lexing and Parsing from a High-Level View
[edit | edit source]Transforming input into a well-defined abstract syntax tree requires (at minimum) two transformations:
- A lexer uses regular expressions to convert each syntactical element from the input into a token, essentially mapping the input to a stream of tokens.
- A parser reads in a stream of tokens and attempts to match tokens to a set of rules, where the end result maps the token stream to an abstract syntax tree.
It is certainly possible to write a lexer which generates the abstract syntax tree directly, but this only works for the most simplistic grammars. If a grammar contains balanced parentheses or other recursive constructs, optional tokens, repeating groups of tokens, operator precedence, or anything which can't be captured by regular expressions, then it is easiest to write a parser in addition to a lexer.
With F#, it is possible to write custom file formats, domain specific languages, and even full-blown compilers for your new language.
Extended Example: Parsing SQL
[edit | edit source]The following code will demonstrate step-by-step how to define a simple lexer/parser for a subset of SQL. If you're using Visual Studio, you should add a reference to FSharp.PowerPack.dll
to your project. If you're compiling on the commandline, use the -r
flag to reference the aforemented F# powerpack assembly.
Step 1: Define the Abstract Syntax Tree
[edit | edit source]We want to parse the following SQL statement:
SELECT x, y, z
FROM t1
LEFT JOIN t2
INNER JOIN t3 ON t3.ID = t2.ID
WHERE x = 50 AND y = 20
ORDER BY x ASC, y DESC, z
This is a pretty simple query, and while it doesn't demonstrate everything you can do with the language, it's a good start. We can model everything in this query using the following definitions in F#:
// Sql.fs
type value =
| Int of int
| Float of float
| String of string
type dir = Asc | Desc
type op = Eq | Gt | Ge | Lt | Le // =, >, >=, <, <=
type order = string * dir
type where =
| Cond of (value * op * value)
| And of where * where
| Or of where * where
type joinType = Inner | Left | Right
type join = string * joinType * where option // table name, join, optional "on" clause
type sqlStatement =
{ Table : string;
Columns : string list;
Joins : join list;
Where : where option;
OrderBy : order list }
A record type neatly groups all of our related values into a single object. When we finish our parser, it should be able to convert a string in an object of type sqlStatement
.
Step 2: Define the parser tokens
[edit | edit source]A token is any single identifiable element in a grammar. Let's look at the string we're trying to parse:
SELECT x, y, z
FROM t1
LEFT JOIN t2
INNER JOIN t3 ON t3.ID = t2.ID
WHERE x = 50 AND y = 20
ORDER BY x ASC, y DESC, z
So far, we have several keywords (by convention, all keywords are uppercase): SELECT
, FROM
, LEFT
, INNER
, RIGHT
, JOIN
, ON
, WHERE
, AND
, OR
, ORDER
, BY
, ASC
, DESC
, and COMMA
.
There are also a few comparison operators: EQ
, GT
, GE
, LT
, LE
.
We also have non-keyword identifiers composed of strings and numeric literals. which we’ll represent using the keyword ID
, INT
, FLOAT
.
Finally, there is one more token, EOF
, which indicates the end of our input stream.
Now we can create a basic parser file for FsYacc, name the file SqlParser.fsp:
%{
open Sql
%}
%token <string> ID
%token <int> INT
%token <float> FLOAT
%token AND OR
%token COMMA
%token EQ LT LE GT GE
%token JOIN INNER LEFT RIGHT ON
%token SELECT FROM WHERE ORDER BY
%token ASC DESC
%token EOF
// start
%start start
%type <string> start
%%
start:
| { "Nothing to see here" }
%%
This is boilerplate code with the section for tokens filled in.
Compile the parser using the following command line: fsyacc SqlParser.fsp
- Tips:
- If you haven't done so already, you should add the F# bin directory to your PATH environment variable.
- If you're using Visual Studio, you can automatically generate your parser code on each compile. Right-click on your project file and choose "Properties". Navigate to the Build Events tab, and add the following to the 'Pre-build event command line' and use the following:
fsyacc "$(ProjectDir)SqlParser.fsp"
. Also remember to exclude this file from the build process: right-click the file, choose "Properties" and select "None" against "Build Action".
If everything works, FsYacc will generate two files, SqlParser.fsi and SqlParser.fs. You'll need to add these files to your project if they don't already exist. If you open the SqlParser.fsi file, you'll notice the tokens you defined in your .fsl file have been converted into a union type.
Step 3: Defining the lexer rules
[edit | edit source]Lexers convert text inputs into a stream of tokens. We can start with the following boiler plate code:
{
// header: any valid F# can appear here.
open Lexing
}
// regex macros
let char = ['a'-'z' 'A'-'Z']
let digit = ['0'-'9']
let int = '-'?digit+
let float = '-'?digit+ '.' digit+
let whitespace = [' ' '\t']
let newline = "\n\r" | '\n' | '\r'
// rules
rule tokenize = parse
| whitespace { tokenize lexbuf }
| newline { lexbuf.EndPos <- lexbuf.EndPos.NextLine; tokenize lexbuf; }
| eof { () }
This is not "real" F# code, but rather a special language used by FsLex.
The let
bindings at the top of the file are used to define regular expression macros. eof
is a special marker used to identify the end of a string buffer input.
rule ... = parse ...
defines our lexing function, called tokenize
above. Our lexing function consists of a series of rules, which has two pieces: 1) a regular expression, 2) an expression to evaluate if the regex matches, such as returning a token. Text is read from the token stream one character at a time until it matches a regular expression and returns a token.
We can fill in the remainder of our lexer by adding more matching expressions:
{
open Lexing
// opening the SqlParser module to give us access to the tokens it defines
open SqlParser
}
let char = ['a'-'z' 'A'-'Z']
let digit = ['0'-'9']
let int = '-'?digit+
let float = '-'?digit+ '.' digit+
let identifier = char(char|digit|['-' '_' '.'])*
let whitespace = [' ' '\t']
let newline = "\n\r" | '\n' | '\r'
rule tokenize = parse
| whitespace { tokenize lexbuf }
| newline { lexbuf.EndPos <- lexbuf.EndPos.NextLine; tokenize lexbuf; }
| int { INT(Int32.Parse(lexeme lexbuf)) }
| float { FLOAT(Double.Parse(lexeme lexbuf)) }
| ',' { COMMA }
| eof { EOF }
Notice the code between the {
's and }
's consists of plain old F# code. Also notice we are returning the same tokens (INT
, FLOAT
, COMMA
and EOF
) that we defined in SqlParser.fsp. As you can probably infer, the code lexeme lexbuf
returns the string our parser matched. The tokenize
function will be converted into function which has a return type of SqlParser.token
.
We can fill in the rest of the lexer rules fairly easily:
{
open System
open SqlParser
open Lexing
let keywords =
[
"SELECT", SELECT;
"FROM", FROM;
"WHERE", WHERE;
"ORDER", ORDER;
"BY", BY;
"JOIN", JOIN;
"INNER", INNER;
"LEFT", LEFT;
"RIGHT", RIGHT;
"ASC", ASC;
"DESC", DESC;
"AND", AND;
"OR", OR;
"ON", ON;
] |> Map.ofList
let ops =
[
"=", EQ;
"<", LT;
"<=", LE;
">", GT;
">=", GE;
] |> Map.ofList
}
let char = ['a'-'z' 'A'-'Z']
let digit = ['0'-'9']
let int = '-'?digit+
let float = '-'?digit+ '.' digit+
let identifier = char(char|digit|['-' '_' '.'])*
let whitespace = [' ' '\t']
let newline = "\n\r" | '\n' | '\r'
let operator = ">" | ">=" | "<" | "<=" | "="
rule tokenize = parse
| whitespace { tokenize lexbuf }
| newline { lexbuf.EndPos <- lexbuf.EndPos.NextLine; tokenize lexbuf; }
| int { INT(Int32.Parse(lexeme lexbuf)) }
| float { FLOAT(Double.Parse(lexeme lexbuf)) }
| operator { ops.[lexeme lexbuf] }
| identifier { match keywords.TryFind(lexeme lexbuf) with
| Some(token) -> token
| None -> ID(lexeme lexbuf) }
| ',' { COMMA }
| eof { EOF }
Notice we've created a few maps, one for keywords and one for operators. While we certainly can define these as rules in our lexer, its generally recommended to have a very small number of rules to avoid a "state explosion".
To compile this lexer, execute the following code on the commandline: fslex SqlLexer.fsl
. (Try adding this file to your project's Build Events as well.) Then, add the file SqlLexer.fs to the project. We can experiment with the lexer now with some sample input:
open System
open Sql
let x = "
SELECT x, y, z
FROM t1
LEFT JOIN t2
INNER JOIN t3 ON t3.ID = t2.ID
WHERE x = 50 AND y = 20
ORDER BY x ASC, y DESC, z
"
let lexbuf = Lexing.from_string x
while not lexbuf.IsPastEndOfStream do
printfn "%A" (SqlLexer.tokenize lexbuf)
Console.WriteLine("(press any key)")
Console.ReadKey(true) |> ignore
This program will print out a list of tokens matched by the string above.
Step 4: Define the parser rules
[edit | edit source]A parser converts a stream of tokens into an abstract syntax tree. We can modify our boilerplate parser as follows (will not compile):
%{
open Sql
%}
%token <string> ID
%token <int> INT
%token <float> FLOAT
%token AND OR
%token COMMA
%token EQ LT LE GT GE
%token JOIN INNER LEFT RIGHT ON
%token SELECT FROM WHERE ORDER BY
%token ASC DESC
%token EOF
// start
%start start
%type <Sql.sqlStatement> start
%%
start: SELECT columnList
FROM ID
joinList
whereClause
orderByClause
EOF {
{ Table = $4;
Columns = $2;
Joins = $5;
Where = $6;
OrderBy = $7 }
}
value:
| INT { Int($1) }
| FLOAT { Float($1) }
| ID { String($1) }
%%
Let's examine the start:
function. You can immediately see that we have a list of tokens which gives a rough outline of a select statement. In addition to that, you can see the F# code contained between {
's and }
's which will be executed when the code successfully matches—in this case, its returning an instance of the Sql.sqlStatement
record.
The F# code contains "$1", "$2", :$3", etc. which vaguely resembles regex replace syntax. Each "$#" corresponds to the index (starting at 1) of the token in our matching rule. The indexes become obvious when they’re annotated as follows:
(* $1 $2 *)
start: SELECT columnList
(* $3 $4 *)
FROM ID
(* $5 *)
joinList
(* $6 *)
whereClause
(* $7 *)
orderByClause
(* $8 *)
EOF {
{ Table = $4;
Columns = $2;
Joins = $5;
Where = $6;
OrderBy = $7 }
}
So, the start
rule breaks our tokens into a basic shape, which we then use to map to our sqlStatement
record. You're probably wondering where the columnList
, joinList
, whereClause
, and orderByClause
come from—these are not tokens, but are rather additional parse rules which we'll have to define. Let’s start with the first rule:
columnList:
| ID { [$1]}
| ID COMMA columnList { $1 :: $3 }
columnList
matches text in the style of "a, b, c, ... z
" and returns the results as a list. Notice this rule is defined recursively (also notice the order of rules is not significant). FsYacc's match algorithm is "greedy", meaning it will try to match as many tokens as possible. When FsYacc receives an ID
token, it will match the first rule, but it also matches part of the second rule as well. FsYacc then performs a one-token lookahead: it the next token is a COMMA
, then it will attempt to match additional tokens until the full rule can be satisfied.
- Note: The definition of columnList above is not tail recursive, so it may throw a stack overflow exception for exceptionally large inputs. A tail recursive version of this rule can be defined as follows:
start: ... EOF { { Table = $4; Columns = List.rev $2; Joins = $5; Where = $6; OrderBy = $7 } } columnList: | ID { [$1]} | columnList COMMA ID { $3 :: $1 }
- The tail-recursive version creates the list backwards, so we have to reverse when we return our final output from the parser.
We can treat the JOIN clause in the same way, however its a little more complicated:
// join clause
joinList:
| { [] } // empty rule, matches 0 tokens
| joinClause { [$1] }
| joinClause joinList { $1 :: $2 }
joinClause:
| INNER JOIN ID joinOnClause { $3, Inner, $4 }
| LEFT JOIN ID joinOnClause { $3, Left, $4 }
| RIGHT JOIN ID joinOnClause { $3, Right, $4 }
joinOnClause:
| { None }
| ON conditionList { Some($2) }
conditionList:
| value op value { Cond($1, $2, $3) }
| value op value AND conditionList { And(Cond($1, $2, $3), $5) }
| value op value OR conditionList { Or(Cond($1, $2, $3), $5) }
joinList
is defined in terms of several functions. This results because there are repeating groups of tokens (such as multiple tables being joined) and optional tokens (the optional "ON" clause). You've already seen that we handle repeating groups of tokens using recursive rules. To handle optional tokens, we simply break the optional syntax into a separate function, and create an empty rule to represent 0 tokens.
With this strategy in mind, we can write the remaining parser rules:
%{
open Sql
%}
%token <string> ID
%token <int> INT
%token <float> FLOAT
%token AND OR
%token COMMA
%token EQ LT LE GT GE
%token JOIN INNER LEFT RIGHT ON
%token SELECT FROM WHERE ORDER BY
%token ASC DESC
%token EOF
// start
%start start
%type <Sql.sqlStatement> start
%%
start: SELECT columnList
FROM ID
joinList
whereClause
orderByClause
EOF {
{ Table = $4;
Columns = List.rev $2;
Joins = $5;
Where = $6;
OrderBy = $7 }
}
columnList:
| ID { [$1]}
| columnList COMMA ID { $3 :: $1 }
// join clause
joinList:
| { [] }
| joinClause { [$1] }
| joinClause joinList { $1 :: $2 }
joinClause:
| INNER JOIN ID joinOnClause { $3, Inner, $4 }
| LEFT JOIN ID joinOnClause { $3, Left, $4 }
| RIGHT JOIN ID joinOnClause { $3, Right, $4 }
joinOnClause:
| { None }
| ON conditionList { Some($2) }
conditionList:
| value op value { Cond($1, $2, $3) }
| value op value AND conditionList { And(Cond($1, $2, $3), $5) }
| value op value OR conditionList { Or(Cond($1, $2, $3), $5) }
// where clause
whereClause:
| { None }
| WHERE conditionList { Some($2) }
op: EQ { Eq } | LT { Lt } | LE { Le } | GT { Gt } | GE { Ge }
value:
| INT { Int($1) }
| FLOAT { Float($1) }
| ID { String($1) }
// order by clause
orderByClause:
| { [] }
| ORDER BY orderByList { $3 }
orderByList:
| orderBy { [$1] }
| orderBy COMMA orderByList { $1 :: $3 }
orderBy:
| ID { $1, Asc }
| ID ASC { $1, Asc }
| ID DESC { $1, Desc}
%%
Step 5: Piecing Everything Together
[edit | edit source]Here is the complete code for our lexer/parser:
SqlParser.fsp
%{
open Sql
%}
%token <string> ID
%token <int> INT
%token <float> FLOAT
%token AND OR
%token COMMA
%token EQ LT LE GT GE
%token JOIN INNER LEFT RIGHT ON
%token SELECT FROM WHERE ORDER BY
%token ASC DESC
%token EOF
// start
%start start
%type <Sql.sqlStatement> start
%%
start: SELECT columnList
FROM ID
joinList
whereClause
orderByClause
EOF {
{ Table = $4;
Columns = List.rev $2;
Joins = $5;
Where = $6;
OrderBy = $7 }
}
columnList:
| ID { [$1]}
| columnList COMMA ID { $3 :: $1 }
// join clause
joinList:
| { [] }
| joinClause { [$1] }
| joinClause joinList { $1 :: $2 }
joinClause:
| INNER JOIN ID joinOnClause { $3, Inner, $4 }
| LEFT JOIN ID joinOnClause { $3, Left, $4 }
| RIGHT JOIN ID joinOnClause { $3, Right, $4 }
joinOnClause:
| { None }
| ON conditionList { Some($2) }
conditionList:
| value op value { Cond($1, $2, $3) }
| value op value AND conditionList { And(Cond($1, $2, $3), $5) }
| value op value OR conditionList { Or(Cond($1, $2, $3), $5) }
// where clause
whereClause:
| { None }
| WHERE conditionList { Some($2) }
op: EQ { Eq } | LT { Lt } | LE { Le } | GT { Gt } | GE { Ge }
value:
| INT { Int($1) }
| FLOAT { Float($1) }
| ID { String($1) }
// order by clause
orderByClause:
| { [] }
| ORDER BY orderByList { $3 }
orderByList:
| orderBy { [$1] }
| orderBy COMMA orderByList { $1 :: $3 }
orderBy:
| ID { $1, Asc }
| ID ASC { $1, Asc }
| ID DESC { $1, Desc}
%%
SqlLexer.fsl
{
open System
open SqlParser
open Lexing
let keywords =
[
"SELECT", SELECT;
"FROM", FROM;
"WHERE", WHERE;
"ORDER", ORDER;
"BY", BY;
"JOIN", JOIN;
"INNER", INNER;
"LEFT", LEFT;
"RIGHT", RIGHT;
"ASC", ASC;
"DESC", DESC;
"AND", AND;
"OR", OR;
"ON", ON;
] |> Map.of_list
let ops =
[
"=", EQ;
"<", LT;
"<=", LE;
">", GT;
">=", GE;
] |> Map.of_list
}
let char = ['a'-'z' 'A'-'Z']
let digit = ['0'-'9']
let int = '-'?digit+
let float = '-'?digit+ '.' digit+
let identifier = char(char|digit|['-' '_' '.'])*
let whitespace = [' ' '\t']
let newline = "\n\r" | '\n' | '\r'
let operator = ">" | ">=" | "<" | "<=" | "="
rule tokenize = parse
| whitespace { tokenize lexbuf }
| newline { lexbuf.EndPos <- lexbuf.EndPos.NextLine; tokenize lexbuf; }
| int { INT(Int32.Parse(lexeme lexbuf)) }
| float { FLOAT(Double.Parse(lexeme lexbuf)) }
| operator { ops.[lexeme lexbuf] }
| identifier { match keywords.TryFind(lexeme lexbuf) with
| Some(token) -> token
| None -> ID(lexeme lexbuf) }
| ',' { COMMA }
| eof { EOF }
Program.fs
open System
open Sql
let x = "
SELECT x, y, z
FROM t1
LEFT JOIN t2
INNER JOIN t3 ON t3.ID = t2.ID
WHERE x = 50 AND y = 20
ORDER BY x ASC, y DESC, z
"
let lexbuf = Lexing.from_string x
let y = SqlParser.start SqlLexer.tokenize lexbuf
printfn "%A" y
Console.WriteLine("(press any key)")
Console.ReadKey(true) |> ignore
Altogether, our minimal SQL lexer/parser is about 150 lines of code (including non-trivial lines of code and whitespace). I'll leave it as an exercise for the reader to implement the remainder of the SQL language spec.
2011-03-06: I tried the above instructions with VS2010 and F# 2.0 and PowerPack 2.0. I had to make a few changes:
- Add "module SqlLexer" on the 2nd line of SqlLexer.fsl
- Change Map.of_list to Map.ofList
- Add " --module SqlParser" to the command line of fsyacc
- Add FSharp.PowerPack to get Lexing module
2011-07-06: (Sjuul Janssen) These where the steps I had to take in order to make this work.
If you get the message "Expecting a LexBuffer<char> but given a LexBuffer<byte> The type 'char' does not match the type 'byte'"
- Add "fslex.exe "$(ProjectDir)SqlLexer.fsl" --unicode" to the pre-build
- in program.fs change "let lexbuf = Lexing.from_string x" to "let lexbuf = Lexing.LexBuffer<_>.FromString x"
- in SqlLexer.fsi change "lexeme lexbuf" to "LexBuffer<_>.LexemeString lexbuf"
If you get the message that some module doesn't exist or that some module is declared multiple times. Make sure that in the solution explorer the files come in this order:
- Sql.fs
- SqlParser.fsp
- SqlLexer.fsl
- SqlParser.fsi
- SqlParser.fs
- SqlLexer.fs
- Program.fs
If you get the message "Method not found: 'System.Object Microsoft.FSharp.Text.Parsing.Tables`1.Interpret(Microsoft.FSharp.Core.FSharpFunc`2<Microsoft.FSharp.Text.Lexing.LexBuffer`1<Byte>,!0>, ..." Go to http://www.microsoft.com/download/en/details.aspx?id=15834 and reinstall Visual Studio 2010 F# 2.0 Runtime SP1 (choose for repair)
2011-07-06: (mkduffi) Could someone please provide a sample project. I have followed all of your changes but still can not build. Thanks.
Sample
[edit | edit source]https://github.com/obeleh/FsYacc-Example
2011-07-07 (mkduffi) Thanks for posting the sample. Here is what I did to the Program.fs file:
namespace FS
module Parser =
open System
open Sql
let x = "
SELECT x, y, z
FROM t1
LEFT JOIN t2
INNER JOIN t3 ON t3.ID = t2.ID
WHERE x = 50 AND y = 20
ORDER BY x ASC, y DESC, z
"
let ParseSql x =
let lexbuf = Lexing.LexBuffer<_>.FromString x
let y = SqlParser.start SqlLexer.tokenize lexbuf
y
let y = (ParseSql x)
printfn "%A" y
Console.WriteLine("(press any key)")
Console.ReadKey(true) |> ignore
I added a C# console project for testing and this is what is in the Program.cs file:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string query =
@"SELECT x, y, z
FROM t1
LEFT JOIN t2
INNER JOIN t3 ON t3.ID = t2.ID
WHERE x = 50 AND y = 20
ORDER BY x ASC, y DESC, z";
Sql.sqlStatement stmnt = FS.Parser.ParseSql(query);
}
};
}
I had to add the YaccSample project reference as well as a reference to the FSharp.Core assembly to get this to work.
If anyone could help me figure out how to support table aliases that would be awesome.
Thanks
2011-07-08 (Sjuul Janssen) Contact me through my github account. I'm working on this and some other stuff.