## Making a toy programming language in Lua, part 3

This is part three of my series on writing a toy programming language using LPeg. You should start with part one here.

Last time, we added variables and arrays to the language. Assigning to arrays was starting to strain our design some, so this time we’ll refactor it a lot, and add two features that would be impossible without that refactoring: conditional statements and loops.

## Abstract syntax trees

Right now, our parser evaluates expressions as soon as it can, while it’s still in the process of parsing them. This was fine while we were just parsing mathematical expressions, and it’s even fine when we start to include variables (as long as assignment statements can’t appear inside expressions), but conditionals and loops break that: we need to *parse* the entire “if” statement, but we only want the body to *happen* after we evaluate the condition. With our current parser, we would evaluate the body (and cause the side-effects of any assignment statements it contains) as soon as we parse it, so those statements would run whether the condition was true or false.
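To make the problem concrete, here’s a minimal sketch in plain Lua (no LPeg) that models `if(1 < 0) b = 5` both ways. The function names `parse_assign_eager`, `parse_assign_deferred`, and `run` are hypothetical stand-ins invented for this illustration; they aren’t part of the series’ code.

```lua
-- Hypothetical stand-ins for the parser: they only model
-- "what happens at the moment a statement is parsed".
local VARS = {}

-- Eager style: parsing an assignment performs it immediately.
local function parse_assign_eager(name, value)
  VARS[name] = value
end

-- Deferred style: parsing an assignment only records what to do later.
local function parse_assign_deferred(name, value)
  return { "assign", name, value }
end

-- Running a recorded node is a separate step.
local function run(node)
  if node[1] == "assign" then
    VARS[node[2]] = node[3]
  end
end

-- Model `if(1 < 0) b = 5` in both styles.
local condition = 1 < 0

-- Eager: the body has already run before the condition is consulted.
parse_assign_eager("b", 5)
local eager_b = VARS.b

-- Deferred: the body is just data until we decide to run it.
VARS = {}
local body = parse_assign_deferred("b", 5)
if condition then run(body) end
local deferred_b = VARS.b

print(eager_b, deferred_b)
```

In the eager version “b” ends up set to 5 even though the condition is false; in the deferred version it stays unset.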

In order to fix this, we first need to do a major overhaul of our language. I’m going to depart from *The Unix Programming Environment* a little here: they jumped right to compilation and code generation. I’ll get there eventually, but first I’m going to take the more common route and build an *abstract syntax tree*.

So, the first thing we need to do is decide how to represent an AST. Let’s have each node be a table with index 1 being the type of node, and the rest being the captures. So, the AST for “2+3” would be:

    { "expr", { "term", 2 }, "+", { "term", 3 } }

We can construct this easily enough, and actually much more easily than if we were using Yacc. We’ll use two more LPeg functions, `Cc` and `Ct`. `Ct` is a *table capture,* which wraps up all the captures for a pattern into a table, and `Cc` is a *constant capture,* which consumes none of the input but captures its argument. By taking a nonterminal like EXPR, which looks like this:

    EXPR = V("TERM") * ( expr_op * V("TERM") )^0 / eval,

…removing the function capture (because we don’t want to evaluate it yet), wrapping it up in `Ct`, and prefacing it with `Cc` to tag it:

    EXPR = Ct( Cc("expr") * V("TERM") * ( expr_op * V("TERM") )^0 ),

…we can make our parser generate an AST. The way LPeg can build this directly into the grammar is really quite neat, I think.

When we do this to the entire grammar, here’s what we’re left with:

    stmt = spc * P{
      "STMT";
      STMT = Ct( Cc("assign") * V("REF") * "=" * spc * V("VAL") ) + V("EXPR"),
      EXPR = Ct( Cc("expr") * V("TERM") * ( expr_op * V("TERM") )^0 ),
      TERM = Ct( Cc("term") * V("FACT") * ( term_op * V("FACT") )^0 ),
      REF = Ct( Cc("ref") * name * (lbrack * V("EXPR") * rbrack)^0 ),
      FACT = number + lparen * V("EXPR") * rparen + V("REF"),
      ARRAY = Ct( Cc("array") * lbrack * Ct( V("VAL_LIST")^-1 ) * rbrack ),
      VAL_LIST = V("VAL") * (comma * V("VAL"))^0,
      VAL = V("EXPR") + V("ARRAY")
    }

As you can see, we’ve only changed a few of them, because some things are simple enough not to need their own nodes. Numbers and REFs don’t really need to be wrapped in FACT nodes, for example, because a FACT will always just evaluate to its only child. The same goes for STMT: we only really care if it’s an assignment or an EXPR, because a STMT that contains an EXPR just evaluates that EXPR.

So, now, trying to match “2+3” yields the exact tree that we’d expect:

    -- stmt:match("2+3") gives:
    { "expr", { "term", 2 }, "+", { "term", 3 } }
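Once the trees get bigger than this, it helps to be able to print them. Here’s a minimal sketch of a serializer, assuming nodes contain only strings, numbers, and nested array-like tables. `dump` is a hypothetical helper name, not something from the series’ code.

```lua
-- Hypothetical debugging helper: turn an AST node into a one-line string.
-- Assumes nodes hold only strings, numbers, and nested array-like tables.
local function dump(node)
  if type(node) == "table" then
    local parts = {}
    for i, child in ipairs(node) do
      parts[i] = dump(child)
    end
    return "{ " .. table.concat(parts, ", ") .. " }"
  elseif type(node) == "string" then
    return string.format("%q", node)
  else
    return tostring(node)
  end
end

-- The tree for "2+3", written out by hand.
local tree = { "expr", { "term", 2 }, "+", { "term", 3 } }
print(dump(tree))
```

Printing the result of `stmt:match` this way is a quick check that a grammar change produced the node shapes you meant it to.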

## Evaluating an AST

Obviously we’ll need to change the `eval` function to handle these. In fact, let’s split it up into several functions. TERMs and EXPRs can share a function that’s a lot like the current `eval`:

    function eval_expr(expr)
      local accum = eval(expr[2]) -- because 1 is "expr"
      for i = 3, #expr, 2 do
        local operator = expr[i]
        local num2 = eval(expr[i+1])
        if operator == '+' then
          accum = accum + num2
        elseif operator == '-' then
          accum = accum - num2
        elseif operator == '*' then
          accum = accum * num2
        elseif operator == '/' then
          accum = accum / num2
        end
      end
      return accum
    end

We’re still doing the same basic thing, with the loop and accumulator. The main differences are that we start at index 2 (because index 1 is the name of the node, either “expr” or “term”), and that all the subvalues, both the initial `accum` and each `num2`, get sent to `eval` before being used. This recursion lets us handle EXPRs that have nested EXPRs or TERMs in them.

The `eval` function itself becomes fairly simple as well. There aren’t many things we can possibly send it:

    function eval(ast)
      if type(ast) == 'number' then
        return ast
      elseif ast[1] == 'expr' or ast[1] == 'term' then
        return eval_expr(ast)
      elseif ast[1] == 'array' then
        local new = {}
        for _, el in ipairs(ast[2]) do
          table.insert(new, eval(el))
        end
        return new
      elseif ast[1] == 'ref' then
        return lookup(ast)
      elseif ast[1] == 'assign' then
        return assign(ast[2], eval(ast[3]))
      end
    end

Numbers are returned untouched, expressions and terms get sent to `eval_expr`, and arrays are recursively evaluated and returned in a table. References and assignments are a little more complicated.
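To see the recursion at work without involving the parser, we can evaluate a couple of hand-built trees. The sketch below uses condensed copies of `eval` and `eval_expr` that handle only numbers, “expr” nodes, and “term” nodes, so that it runs on its own. The variable names `precedence` and `left_to_right` are just labels for this example.

```lua
-- Condensed copies of eval and eval_expr (numbers, "expr", "term" only).
local eval

local function eval_expr(expr)
  local accum = eval(expr[2])
  for i = 3, #expr, 2 do
    local operator, num2 = expr[i], eval(expr[i + 1])
    if operator == '+' then accum = accum + num2
    elseif operator == '-' then accum = accum - num2
    elseif operator == '*' then accum = accum * num2
    elseif operator == '/' then accum = accum / num2
    end
  end
  return accum
end

eval = function(ast)
  if type(ast) == 'number' then
    return ast
  elseif ast[1] == 'expr' or ast[1] == 'term' then
    return eval_expr(ast)
  end
end

-- Hand-built tree for "2+3*4": the multiplication lives inside a term
-- node, so it is fully evaluated before the addition ever sees it.
local precedence = { "expr", { "term", 2 }, "+", { "term", 3, "*", 4 } }

-- Hand-built tree for "10-4-3": the loop folds left to right, (10-4)-3.
local left_to_right =
  { "expr", { "term", 10 }, "-", { "term", 4 }, "-", { "term", 3 } }

print(eval(precedence), eval(left_to_right))
```

Operator precedence never shows up in the evaluator: it’s already encoded in the shape of the tree, and `eval_expr` only has to fold each node from left to right.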

## Variables in an AST

To evaluate a REF, we need to follow a chain of indices: the first element in the chain is a variable name, but the variable could be an array, in which case there will be more elements, until finally you reach a number. So here’s the `lookup` function to do this, which has a lot in common with `makeref` from last time:

    function lookup(ref)
      local current = VARS
      for i = 2, #ref do
        local next_index = ref[i]
        if type(next_index) == 'table' then
          next_index = eval(next_index)
        end
        current = current[next_index]
      end
      return current
    end

At each step, the next index can be a variable name, a number, or an expression that evaluates to a number. So, if it’s a table, it must be an expression, and we `eval` it.

An assignment is pretty much the same: we take a ref and a value, follow down to one before the end of the chain of indices, and then set the value:

    function assign(ref, value)
      local current = VARS
      for i = 2, #ref do
        local next_index = ref[i]
        if type(next_index) == 'table' then
          next_index = eval(next_index)
        end
        if i == #ref then
          -- last one, set the value
          current[next_index] = value
          return value
        else
          -- not the last, keep following the chain
          current = current[next_index]
        end
      end
    end
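Here’s a small worked example of following a chain of indices, using a hand-built ref for `a[2][1]`. It’s a sketch under a few assumptions: `eval` is cut down to handle only a bare number wrapped in “expr” and “term” nodes, `index` is a hypothetical helper for building those nodes, and the nested array is put into `VARS` directly rather than through an assignment statement.

```lua
-- Pre-populated variable table: a is a 2x2 nested array.
local VARS = { a = { { 10, 20 }, { 30, 40 } } }

-- Simplified eval: assumes every index node is { "expr", { "term", n } }.
local function eval(ast)
  if type(ast) == 'number' then return ast end
  return eval(ast[2])
end

local function lookup(ref)
  local current = VARS
  for i = 2, #ref do
    local next_index = ref[i]
    if type(next_index) == 'table' then
      next_index = eval(next_index)
    end
    current = current[next_index]
  end
  return current
end

local function assign(ref, value)
  local current = VARS
  for i = 2, #ref do
    local next_index = ref[i]
    if type(next_index) == 'table' then
      next_index = eval(next_index)
    end
    if i == #ref then
      current[next_index] = value
      return value
    else
      current = current[next_index]
    end
  end
end

-- Hypothetical helper: the node shape the parser gives a literal index.
local function index(n) return { "expr", { "term", n } } end

-- The ref for a[2][1]; indices are 1-based because arrays are Lua tables.
local ref = { "ref", "a", index(2), index(1) }
local before = lookup(ref)
assign(ref, 99)
local after = lookup(ref)
print(before, after, VARS.a[2][1])
```

Both functions walk the same chain; `assign` just stops one step early so it still has the containing table in hand when it stores the value.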

So, with these changes, and one minor tweak in our `test` function (add `stmt = stmt / eval` to the top of it), all our tests will pass again. Now we’re ready to start adding functionality to use the AST.

## Conditionals

First, let’s modify the parser to handle conditionals. We’ll need to add three new concepts: a statement list (that will go inside the conditional), a boolean expression (that will be the predicate of the conditional), and the conditional itself. Booleans seem easy enough to start with. Match the possible operators, trying the two-character ones first: LPeg’s choice is ordered, so if `S("<>")` came first it would match the “<” of “<=” and leave the “=” behind to break the parse:

    boolean = C( P("<=") + ">=" + "!=" + "==" + S("<>") ) * spc

Then add another nonterminal to the parser:

    BOOL = Ct( Cc("bool") * V("EXPR") * boolean * V("EXPR") )

Statement lists are easy enough, and they’ll become our new starting nonterminal, replacing STMT:

    LIST = V("STMT") + Ct( Cc("list") * lcurly * V("STMT") * ( ";" * spc * V("STMT") )^0 * rcurly ),

So a LIST is either a single STMT, or a series of them inside braces and separated by semicolons (`lcurly` and `rcurly` match left and right curly braces). Finally, here’s how we’ll define a conditional:

    IF = Ct( C("if") * spc * lparen * V("BOOL") * rparen * V("LIST") )

All together, here’s our parser now:

    stmt = spc * P{
      "LIST";
      LIST = V("STMT") + Ct( Cc("list") * lcurly * V("STMT") * ( ";" * spc * V("STMT") )^0 * rcurly ),
      STMT = Ct( Cc("assign") * V("REF") * "=" * spc * V("VAL") ) + V("EXPR") + V("IF"),
      EXPR = Ct( Cc("expr") * V("TERM") * ( expr_op * V("TERM") )^0 ),
      TERM = Ct( Cc("term") * V("FACT") * ( term_op * V("FACT") )^0 ),
      REF = Ct( Cc("ref") * name * (lbrack * V("EXPR") * rbrack)^0 ),
      FACT = number + lparen * V("EXPR") * rparen + V("REF"),
      ARRAY = Ct( Cc("array") * lbrack * Ct( V("VAL_LIST")^-1 ) * rbrack ),
      VAL_LIST = V("VAL") * (comma * V("VAL"))^0,
      VAL = V("EXPR") + V("ARRAY"),
      BOOL = Ct( Cc("bool") * V("EXPR") * boolean * V("EXPR") ),
      IF = Ct( C("if") * spc * lparen * V("BOOL") * rparen * V("LIST") )
    }

There’s one final tweak to make. An “if” statement will actually be parsed as a name, and `eval` will try to look up the variable called “if.” We need to tell the `name` pattern to not recognize “if” as a name; this is the first *keyword* in our language (there will be more):

    name = C( letter * (digit+letter+"_")^0 ) * spc
    keywords = P("if") * spc
    name = name - keywords

One caveat: `keywords` only checks the start of the input, so as written this also rejects any name that merely begins with “if”, like “iffy”.

Parsing an example “if” statement like “if(a > 0) { 2+3; a=4*5 }” now gives this parse tree:

    { "if",
      { "bool",
        { "expr", { "term", { "ref", "a" } } },
        ">",
        { "expr", { "term", 0 } } },
      { "list",
        { "expr", { "term", 2 }, "+", { "term", 3 } },
        { "assign", { "ref", "a" }, { "expr", { "term", 4, "*", 5 } } } } }

## Evaluating conditionals

This is all obviously going to mean some changes to the `eval` function. We’ll have to add a function like `eval_expr` to handle boolean expressions, and two new options in `eval` to handle statement lists and “if” statements. Here’s how we do booleans:

    function eval_bool(expr)
      local num1 = eval(expr[2])
      local operator = expr[3]
      local num2 = eval(expr[4])
      if operator == '<' then
        return num1 < num2
      elseif operator == '<=' then
        return num1 <= num2
      elseif operator == '>' then
        return num1 > num2
      elseif operator == '>=' then
        return num1 >= num2
      elseif operator == '==' then
        return num1 == num2
      elseif operator == '!=' then
        return num1 ~= num2
      end
    end

This is a lot simpler than `eval_expr` because we know exactly how many arguments there will be: we can’t chain together booleans like we can EXPRs and TERMs.

The two new paths through `eval` differ from the rest in one way: they’re not guaranteed to return anything. Statement lists and conditionals can’t be nested inside of other value-returning constructs (like EXPRs) so we won’t ever have to return a value from those to `eval` itself. Here’s the new `eval`:

    function eval(ast)
      if type(ast) == 'number' then
        return ast
      elseif ast[1] == 'expr' or ast[1] == 'term' then
        return eval_expr(ast)
      elseif ast[1] == 'array' then
        local new = {}
        for _, el in ipairs(ast[2]) do
          table.insert(new, eval(el))
        end
        return new
      elseif ast[1] == 'ref' then
        return lookup(ast)
      elseif ast[1] == 'assign' then
        return assign(ast[2], eval(ast[3]))
      elseif ast[1] == 'list' then
        for i = 2, #ast do
          eval(ast[i])
        end
      elseif ast[1] == 'if' then
        if eval_bool(ast[2]) then
          return eval(ast[3])
        end
      end
    end

Evaluating a list just means evaluating everything in it, and evaluating an “if” means `eval_bool`-ing the condition, then evaluating the body *only if it’s true*. This means we can now stick this in the bottom of `test` and it will pass:

    stmt:match("if(1 < 0) b = 5")
    assert(VARS.b ~= 5)

Because 1 is not less than 0, the part of the AST that does the assignment is never run, and nothing gets assigned to “b”.

## Loops

We can implement loops almost the same way. First we modify the parser to parse loops, which is already mostly done, since it’s the same general thing as conditionals:

    WHILE = Ct( C("while") * spc * lparen * V("BOOL") * rparen * V("LIST") )

Add “while” to the list of keywords:

    keywords = (P("if")+P("while")) * spc

And we should be able to parse loops:

    while(n<10) { n = n+1 }

turns into

    { "while",
      { "bool",
        { "expr", { "term", { "ref", "n" } } },
        "<",
        { "expr", { "term", 10 } } },
      { "list",
        { "assign",
          { "ref", "n" },
          { "expr", { "term", { "ref", "n" } }, "+", { "term", 1 } } } } }

Now we just need the `eval` function to actually run the loops. We'll do this in much the same way as the conditionals. Add a case to `eval` to handle "while" nodes:

    function eval(ast)
      -- stuff elided --
      elseif ast[1] == 'while' then
        while eval_bool(ast[2]) do
          eval(ast[3])
        end
      end
    end

And now we can write a unit test for this:

    VARS.n=0; VARS.x=1
    stmt:match("while(n < 8) { x = x * 2; n = n + 1 }")
    assert(VARS.x == 256)
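What makes the loop possible is that the same subtrees get evaluated over and over, long after parsing has finished. Here's a sketch that runs the tree for that test by hand and counts the evaluations. It assumes a condensed evaluator limited to what this one loop needs (scalar variables only, just `+`, `*` and `<`), and `cond_checks` and `body_runs` are hypothetical counters added purely for illustration.

```lua
-- Condensed evaluator: scalar variables only, just enough for this loop.
local VARS = { n = 0, x = 1 }
local cond_checks, body_runs = 0, 0  -- hypothetical instrumentation
local eval

local function eval_expr(expr)
  local accum = eval(expr[2])
  for i = 3, #expr, 2 do
    local operator, num2 = expr[i], eval(expr[i + 1])
    if operator == '+' then accum = accum + num2
    elseif operator == '*' then accum = accum * num2
    end
  end
  return accum
end

local function eval_bool(expr)
  cond_checks = cond_checks + 1
  -- only "<" is needed for this example
  return eval(expr[2]) < eval(expr[4])
end

eval = function(ast)
  if type(ast) == 'number' then
    return ast
  elseif ast[1] == 'expr' or ast[1] == 'term' then
    return eval_expr(ast)
  elseif ast[1] == 'ref' then
    return VARS[ast[2]]
  elseif ast[1] == 'assign' then
    VARS[ast[2][2]] = eval(ast[3])
  elseif ast[1] == 'list' then
    for i = 2, #ast do eval(ast[i]) end
  elseif ast[1] == 'while' then
    while eval_bool(ast[2]) do
      body_runs = body_runs + 1
      eval(ast[3])
    end
  end
end

-- Hand-built tree for: while(n < 8) { x = x * 2; n = n + 1 }
local n, x = { "ref", "n" }, { "ref", "x" }
local loop = {
  "while",
  { "bool", { "expr", { "term", n } }, "<", { "expr", { "term", 8 } } },
  { "list",
    { "assign", x, { "expr", { "term", x, "*", 2 } } },
    { "assign", n, { "expr", { "term", n }, "+", { "term", 1 } } } },
}

eval(loop)
print(VARS.x, cond_checks, body_runs)
```

The body subtree runs eight times and the condition subtree nine (the last check is the one that fails), all from a single parse.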

## Next steps

So, now that we've separated parsing our language from running programs we've parsed, we can easily add new features like control structures. Most new features you might want to add follow this basic pattern: modify the parser to recognize it, then modify the evaluator to run the parse tree. And that's mostly what we're going to do next time too: we'll modify the parser to recognize function definitions, and we'll modify the evaluator to store and call them.

As always, the completed code for this chapter is available here.