Saturday, April 11, 2009

Writing a Compiler with F# Part 4

Now that we've defined the AST for our language and written a simple evaluator, let's make it easier to construct ASTs. Obviously we don't create C# programs by directly constructing the syntax trees; instead, we type in text which is lexed and then parsed into the tree.

It turns out that creating lexers and parsers is such a common language task that there are many tools available to generate them for us from some basic definitions. One common lexer generator is called Lex and F# ships with a version of it called fslex. Likewise, fsyacc is the F# variant of the Yacc parser generators. I'm not going to talk about the different types of grammars as those are well documented elsewhere, suffice to say that fsyacc is capable of generating parsers that can handle languages you're probably familiar with.

Recall from last time that lexers take a stream of characters and produce tokens. The do this by matching a regular expression against the input to identify each character. If we assume we have some tokens defined for the elements of our language like PRINT, INT, SEMI (for a ; to separate statements in a statement list), PLUS, MINUS, LPAREN, and RPAREN, we can easily define the regular expressions that produce these tokens from the input.

Adding an .fsl file to the project allows us to define the lexer rules. When the project is built, fslex will run against our file and produce a generated F# file that contains our lexer. Here's what the definition looks like:

open System
open Parser
open Lexing
let num = ['0'-'9']+
let integer = '-'? num
let whitespace = ' ' | '\t'
let newline = '\n' | '\r' '\n'

rule token = parse
| whitespace { token lexbuf }
| newline { (lexbuf: lexbuf).EndPos <- lexbuf.EndPos.NextLine; token lexbuf }
| "(" { LPAREN }
| ")" { RPAREN }
| "+" { PLUS }
| "-" { MINUS }
| ";" { SEMI }
| integer { INT (Int32.Parse(lexeme lexbuf)) }
| eof { EOF }
| "print" { PRINT }

The initial { } contains F# code that is directly inserted into our generated file. The tokens themselves (anything in all caps) are defined in the parser which we'll see in a minute so we have to "open Parser" to be able to see them. Next we set up some basic patterns including "num" for strings of digits.

Then, we define the token production rules. The left side (immediately following "|") defines the pattern we are matching, while the braces contain the action to execute (and token to return) should the pattern be matched. Note that "whitespace" results in simply calling our "token" function again with the remaining input, effectively ignoring the whitespace. Also, "newline" updates the current line number so should there be a problem the error can correctly report the line it was on. The only other point of interest is that the INT token carries along a piece of integer data (the actual number that was typed).

Armed with our lexer, we can now create an fsy file and build our parser. It looks like this:

open Ast

%start start

%token <int> INT


%type <stmt> start


start: StmtList EOF { Seq(List.rev $1) }

| INT { Int $1 }
| Expr PLUS Expr { BinOp (Plus, $1, $3) }
| Expr MINUS Expr { BinOp (Minus, $1, $3) }
| LPAREN Expr RPAREN { $2 }

| PRINT Expr { Print $2 }

| Stmt { [$1] }
| StmtList SEMI Stmt { $3 :: $1 }

Like the lexer, the parser starts with a code block that simply opens the Ast module with the Ast types defined in it. "%start" indicates the name given to the generated parsing function, while "%type" later on indicates that the function will return a "stmt". Next we have a list of tokens (note that INT is defined to carry a piece of data of type int) and an indication that PLUS and MINUS are left associative. This also defines precedence so we can do things like correct multiplication and division later. Finally we have a list of patterns to match. Like the lexer, the left side matches the input (although now we're matching tokens rather than raw characters) and the right side (in the braces) indicates which type defined in Ast should be returned. The "$x" escapes refer to the tokens matched. For example, Expr PLUS Expr results in a BinOp where the left hand side is the value of the first token ($1), i.e., the first "Expr" and the right hand side is the value of the third token ($3), or the second "Expr". The PLUS token itself is "$2". The last trick is that we end up building the StmtList backwards, for efficiency sake (prepending an element to a list is O(1) while adding to the end is O(n)), so in the "start:" pattern we have to reverse $1 before returning the Seq.

Now we can use our parser with something like this:

let parseText text =
let lexbuf = Lexing.from_string text
Parser.start Lexer.token lexbuf

Calling "parseText" with a valid program ("print (1 + 1)") will result in an instance of "stmt" that we can pass to the evaluator we build in the last article.

Now we have a pretty stable base to work from. Next time we'll add variables and functions and then after that we'll add types other than "int" so we can start to worry about type checking.

1 comment:

Alexander Beletsky said...

that was good serios of post.. you should consider to go on :)