Building a Compiler in .Net (2 of 4)
This post is part of a series of 4. Here you have the links to each one:
- Building a Compiler in .Net (1 of 4)
- Building a Compiler in .Net (2 of 4)
- Building a Compiler in .Net (3 of 4)
- Building a Compiler in .Net (4 of 4)
Introduction
In this second post we will implement the two first steps of a compiler: Scanner and Parser. Remember from part 1 a Scanner, also known as Lexical analyser, reads the characters from source code file and compose tokens. This tokens will be taken by Parser (or Syntax Analyser) and will validate it’s order using a grammar, generating the syntax-tree.
Very Simple Language
It seems a good moment to write some lines about the language we will be able to compile. It’s called Very Simple Language (VSL for friends). We will be able to declare variables without specifying it’s type and make assignments, make conditional statements (if-then-else
), loops (while
), show information of screen (print
) and read from console (read_int
and read_str
). As you can see we will not have functions or procedures, neither dynamic variable. I think it’s the minimum expression of a “hight-level programming language”.
The grammar for VSL is:
< instr_stmt > ::=
| var < identifier > = < expr_stmt >
| < identifier > = < expr_stmt >
| read_int < identifier >
| read_str < identifier >
| print < expr_stmt >
| < instr_stmt >; < instr_stmt >
| if < expr_stmt > then < instr_stmt > { else < instr_stmt > } endif
| while < expr_stmt > do < instr_stmt > endwhile
< expr_stmt > ::= < bool_stmt > { < bool_op > < bool_stmt > }
< bool_op > ::= == | <>
< bool_stmt > ::= < term_stmt > { < term_op > < term_stmt > }
< term_op > ::= + | -
< term_stmt > ::= < factor_stmt > { < factor_op > < factor_stmt > }
< factor_op > ::= * | /
< factor_stmt > ::=
| < variable >
| < number_lt >
| < string_lt >
| ( < expr_stm > )
< variable > ::= < char > < ident_rest >*
< ident_rest > ::= < char > | < digit >
< number_lt > ::= < digit >+
< digit > ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
< string_lt > ::= " < string_elem >* "
< string_elem > ::= < any char other than " >
Take care with the grammar. A good grammar will be longer than this one: Or numbers can’t be negative, we haven’t specified how big the numbers can be (32 bit integer?). A true and complete BNF grammar definition would precisely define all these details. By the way, in our VSLCompiler v0 this will be our grammar (If I have time, in the future, I will try to improve the grammar and the compiler for VSL).
Here we can see a sample of a VSL program:
var times = 0;
print "Type the number:";
read_int times;
var cnt = 0;
while cnt <> times do
print "lambda-function";
cnt = cnt + 1;
endwhile;
Implementation
Scanner
The primary job of the Scanner is to break the input text in tokens. It’s in charge to determinate which tokens will be sent to the Parser so it’s able to throw out things that are not defined in the grammar (like tabs, comments,…).
So, at the end, we need to define with symbols will be interpreted as operands (in VSL case: +
, -
, *
, /
, =
, ==
, <>
) or special operands ((
, )
, ;
):
The Scanner implemented for VSL is as simple as the language. The structure of the class is the following one:
With this structure we get an “input text stream” with a TextReader
object and we will process it in Scan
method. This method is, basically, a loop that walks over every character trying to recognise it according to the grammar. Every time a character is recognised a token will be formed and pushed into the tokens-list result
. We assume that when we find a character "
it will encapsulate a string token.
At this point we have all the tokens in result
, ready to be sent to Parser.
Parser
The Parser developed for Very Simple Language has a umber of jobs: The main one is to ensure source program conforms to the grammar specification (if not it raises an error). It also creates the syntax-tree, the unique representation of the source code, consumed by Code Generator, and figures out what runtime types to use.
On the first hand we are going to take a look the the syntax-tree structure. In my mind I have developed the syntax-tree as a instanced-class tree, lets see:
I prefer to think syntax-tree keeps the structure of the source code according to the language’s grammar. Sure you can find a different way to implement the syntax-tree (may be an abstract syntax-tree).
There are many algorithms available for parsing implementation. The definition of our grammar implies we will use a LL strategy (Left-to-right, Left-most derivation) and a top-down parser (for more information you can see LR, and bottom-up grammars). At the end, LL, simple means that it reads text from left to down, building the syntax-tree based on the next available token.
The constructor for my Parser class is as simply as Scanner ones. It takes a list of tokens that was created by the scanner:
The core of the parsing work is done by ParseInstr
method. As we can see in the code before it returns a InstrStmt
node (which serves as a root node of the syntax-tree). The Parser traverses the list of tokens using an index as the current position. For each token it tries to assign to an “Instruction case”, in other case it raises an error. In each “instruction case” the ParseInstr
will analyses the statement using the other Parsing methods (ParseExprs
, ParseBool
, ParseTerm
, ParseFactr
).
We are going to see an example of ParseInstr: print 3;
.
Firstly we see the ParseIntr
statement to parse this order:
The print
token is discarded and we call ParseExpr
to obtain the expression needed after the token print
(grammar rules). We follow the call-list until ParseFactr
. The implementation of ParseFactr
is:
Here, at the lowest level of the parsing function we define with is the operand of each operation. In this case we return a int literal.
Consecutive instructions are allows using the “Squence token”, implemented in ParseInstr
as:
This sequence node contains two pointers to InstrStmt
nodes and forms the basis of the syntax-tree structure.