Play With Lua!

Making a toy programming language in Lua, part 3


This is part three of my series on writing a toy programming language using LPeg. You should start with part one here.

Last time, we added variables and arrays to the language. Assigning to arrays was starting to strain our design, so this time we’ll do a substantial refactoring and then add two features that would be impossible without it: conditional statements and loops.

Abstract syntax trees

Right now, our parser evaluates expressions as soon as it can, while it’s still in the process of parsing them. This was fine while we were just parsing mathematical expressions, and it was even fine once we added variables (as long as assignment statements can’t appear inside expressions), but conditionals and loops break that: we need to parse the entire “if” statement, but we only want the body to run after we evaluate the condition, and only if the condition is true. With our current parser, we would evaluate the body (and cause the side effects of any assignment statements it contains) as soon as we parse it, so those statements would run whether the condition was true or false.

In order to fix this, we first need to do a major overhaul of our language. I’m going to part with The Unix Programming Environment a little here: they jumped right to compilation and code generation. I’ll get to that, but not immediately. First I’m going to take the more common route and build an abstract syntax tree.

So, the first thing we need to do is decide how to represent an AST. Let’s have each node be a table with index 1 being the type of node, and the rest being the captures. So, the AST for “2+3” would be:

{ "expr",
  { "term", 2 },
  "+",
  { "term", 3 }
}

We can construct this easily enough, and actually much more easily than if we were using Yacc. We’ll use two more LPeg functions, Cc and Ct. Ct is a table capture, which wraps up all the captures for a pattern into a table, and Cc is a constant capture, which consumes none of the input but captures its argument. By taking a nonterminal like EXPR, which looks like this:

EXPR = V("TERM") * ( expr_op * V("TERM") )^0 / eval,

…Removing the function captures (because we don’t want to evaluate it yet), wrapping it up in Ct, and prefacing it with Cc to tag it:

EXPR = Ct( Cc("expr") * V("TERM") * ( expr_op * V("TERM") )^0 ),

…we can make our parser generate an AST. The way LPeg can build this directly into the grammar is really quite neat, I think.
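To see what Ct and Cc do on their own, here’s a minimal sketch outside the full grammar. The spc, number, and expr_op definitions are stand-ins I’m assuming for the helper patterns from the earlier parts, and the “sum” tag is made up for the example:

```lua
local lpeg = require "lpeg"
local C, Cc, Ct, R, S = lpeg.C, lpeg.Cc, lpeg.Ct, lpeg.R, lpeg.S

-- Hypothetical stand-ins for the series' helper patterns
local spc = S(" \t\n")^0
local number = (R("09")^1 / tonumber) * spc
local expr_op = C( S("+-") ) * spc

-- Untagged: the captures come back as separate return values
print(lpeg.match(number * expr_op * number, "2 + 3"))

-- Tagged: Cc supplies the node type without consuming input,
-- and Ct gathers every capture inside it into one table
sum = Ct( Cc("sum") * number * (expr_op * number)^0 )
local node = sum:match("2 + 3")
print(node[1], node[2], node[3], node[4])
```

The second match returns a single table, { "sum", 2, "+", 3 }, which is the node shape we’re after.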

When we do this to the entire grammar, here’s what we’re left with:

stmt = spc * P{
    "STMT";
    STMT = 
        Ct( Cc("assign") * V("REF") * "=" * spc * V("VAL") ) +
        V("EXPR"),
    EXPR = Ct( Cc("expr") * V("TERM") * ( expr_op * V("TERM") )^0 ),
    TERM = Ct( Cc("term") * V("FACT") * ( term_op * V("FACT") )^0 ),
    REF = Ct( Cc("ref") * name * (lbrack * V("EXPR") * rbrack)^0 ),
    FACT =
        number +
        lparen * V("EXPR") * rparen +
        V("REF"),
    ARRAY = Ct( Cc("array") * lbrack * Ct( V("VAL_LIST")^-1 ) * rbrack ),
    VAL_LIST = V("VAL") * (comma * V("VAL"))^0,
    VAL = V("EXPR") + V("ARRAY")
}

As you can see, we’ve only changed a few of them, because some things are simple enough not to need their own nodes. For example, numbers and REFs don’t really need to be wrapped in FACT nodes, because a FACT will always just evaluate to its only child. The same goes for STMT: we only really care whether it’s an assignment or an EXPR, because a STMT that contains an EXPR just evaluates that EXPR.

So, now, trying to match “2+3” yields the exact tree that we’d expect:

-- stmt:match("2+3") gives:
 
{ "expr",
  { "term", 2 },
  "+",
  { "term", 3 }
}

Evaluating an AST

Obviously we’ll need to change the eval function to handle these. In fact, let’s split it up into several functions. TERMs and EXPRs can share a function that’s a lot like the current eval:

function eval_expr(expr)
    local accum = eval(expr[2]) -- because 1 is "expr"
    for i = 3, #expr, 2 do
        local operator = expr[i]
        local num2 = eval(expr[i+1])
 
        if operator == '+' then
            accum = accum + num2
        elseif operator == '-' then
            accum = accum - num2
        elseif operator == '*' then
            accum = accum * num2
        elseif operator == '/' then
            accum = accum / num2
        end
    end
    return accum
end

We’re still doing the same basic thing, with the loop and accumulator. The main differences are that we start at index 2 (because index 1 is the name of the node, either “expr” or “term”), and that every operand (the initial value of accum and each num2) gets sent to eval before being used. This recursion lets us handle EXPRs that have nested EXPRs or TERMs in them.
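Since the evaluator only looks at tables, it can be tried without the parser at all. The sketch below is a compact variant, not the code from this post: it uses a stripped-down eval that only understands numbers and these two node types, and it swaps the if/elseif chain for a lookup table of operator functions (my substitution; it does the same thing for the four known operators).

```lua
-- Operator table: stands in for the if/elseif chain
local ops = {
    ["+"] = function(a, b) return a + b end,
    ["-"] = function(a, b) return a - b end,
    ["*"] = function(a, b) return a * b end,
    ["/"] = function(a, b) return a / b end,
}

-- Stripped-down eval: numbers pass through, anything else is an expr/term node
function eval(ast)
    if type(ast) == "number" then return ast end
    return eval_expr(ast)
end

function eval_expr(expr)
    local accum = eval(expr[2]) -- index 1 is the node name
    for i = 3, #expr, 2 do
        accum = ops[expr[i]](accum, eval(expr[i+1]))
    end
    return accum
end

-- Hand-built tree for "2+3*4": the multiplication lives inside its own term
ast = { "expr",
    { "term", 2 },
    "+",
    { "term", 3, "*", 4 }
}
print(eval(ast)) -- 14
```

Notice that the evaluator has no idea about operator precedence. The multiplication happens first only because the parser nested it inside a “term” node.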

The eval function itself becomes fairly simple as well. There aren’t many things we can possibly send it:

function eval(ast)
    if type(ast) == 'number' then
        return ast
    elseif ast[1] == 'expr' or ast[1] == 'term' then
        return eval_expr(ast)
    elseif ast[1] == 'array' then
        local new = {}
        for _, el in ipairs(ast[2]) do
            table.insert(new, eval(el))
        end
        return new
    elseif ast[1] == 'ref' then
        return lookup(ast)
    elseif ast[1] == 'assign' then
        return assign(ast[2], eval(ast[3]))
    end
end

Numbers are returned untouched, expressions and terms get sent to eval_expr, and arrays are recursively evaluated and returned in a table. References and assignments are a little more complicated.

Variables in an AST

To evaluate a REF, we need to follow a chain of indices: the first element in the chain is a variable name, but the variable could be an array, in which case there will be more elements, until finally you reach a number. Here’s the lookup function that does this, which has a lot in common with makeref from last time:

function lookup(ref)
    local current = VARS
    for i = 2, #ref do
        local next_index = ref[i]
        if type(next_index) == 'table' then
            next_index = eval(next_index)
        end
 
        current = current[next_index]
    end
    return current
end

At each step, the next index can be a variable name, a number, or an expression that evaluates to a number. So, if it’s a table, it must be an expression, and we eval it.

An assignment is pretty much the same: we take a ref and a value, follow down to one before the end of the chain of indices, and then set the value:

function assign(ref, value)
    local current = VARS
    for i = 2, #ref do
        local next_index = ref[i]
        if type(next_index) == 'table' then
            next_index = eval(next_index)
        end
 
        if i == #ref then -- last one, set the value
            current[next_index] = value
            return value
        else -- not the last, keep following the chain
            current = current[next_index]
        end
    end
end
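
Here’s a small sketch that exercises both functions on a nested array, using hand-built “ref” nodes instead of the parser. The eval here is a stub I’m assuming just for the demo (it only unwraps numeric index expressions), and index is a hypothetical helper that builds the node the parser would produce for a literal number. Indices are 1-based, because eval builds arrays with table.insert.

```lua
VARS = { a = { 10, { 20, 30 } } }

-- Stub evaluator: just enough to unwrap numeric index expressions
function eval(ast)
    if type(ast) == "number" then return ast end
    return eval(ast[2])
end

function lookup(ref)
    local current = VARS
    for i = 2, #ref do
        local next_index = ref[i]
        if type(next_index) == "table" then next_index = eval(next_index) end
        current = current[next_index]
    end
    return current
end

function assign(ref, value)
    local current = VARS
    for i = 2, #ref do
        local next_index = ref[i]
        if type(next_index) == "table" then next_index = eval(next_index) end
        if i == #ref then -- last link in the chain: store here
            current[next_index] = value
            return value
        end
        current = current[next_index]
    end
end

-- Hypothetical helper: the index node for a literal number
local function index(n) return { "expr", { "term", n } } end

a_2_1 = { "ref", "a", index(2), index(1) } -- a[2][1]
print(lookup(a_2_1)) -- 20
assign(a_2_1, 99)
print(lookup(a_2_1)) -- 99

-- b was never assigned, so walking into b[1] indexes nil and raises an error
print(pcall(assign, { "ref", "b", index(1) }, 5)) -- false, plus an error message
```

The last line shows a limitation worth knowing about: assigning through an array that doesn’t exist yet is a Lua error rather than a language-level one.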

So, with these changes, and one minor tweak in our test function (add stmt = stmt / eval to the top of it), all our tests will pass again. Now we’re ready to start adding functionality to use the AST.

Conditionals

First, let’s modify the parser to handle conditionals. We’ll need to add three new concepts: a statement list (that will go inside the conditional), a boolean expression (that will be the predicate of the conditional), and the conditional itself. Booleans seem easy enough to start with. Match the possible operators:

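-- Two-character operators must come first: LPeg's ordered choice commits to the first alternative that matches
boolean = C( P("<=") + ">=" + "!=" + "==" + S("<>") ) * spc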
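
The order of the alternatives matters here. In a PEG, a choice tries its alternatives left to right and commits to the first one that matches, so if “<” is tried before “<=”, the longer operator is never seen. A minimal sketch comparing the two orderings (short_first and long_first are names I made up for the comparison):

```lua
local lpeg = require "lpeg"
local C, P, S = lpeg.C, lpeg.P, lpeg.S

-- Single-character operators first: "<" wins before "<=" is ever tried
short_first = C( S("<>") + "<=" + ">=" + "!=" + "==" )
-- Two-character operators first: the longer match gets its chance
long_first = C( P("<=") + ">=" + "!=" + "==" + S("<>") )

print(short_first:match("<=")) -- <
print(long_first:match("<="))  -- <=
print(long_first:match("<"))   -- <
```

With the first ordering, a condition like “a <= 3” would capture “<” and then fail when the following EXPR hits the leftover “=”.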

Then add another nonterminal to the parser:

BOOL = Ct( Cc("bool") * V("EXPR") * boolean * V("EXPR") )

Statement lists are easy enough, and they’ll become our new starting nonterminal, replacing STMT:

LIST =
    V("STMT") +
    Ct( Cc("list") *
            lcurly *
            V("STMT") * ( ";" * spc * V("STMT") )^0 *
            rcurly ),

So a LIST is either a single STMT, or a series of them inside braces and separated by semicolons (lcurly and rcurly match left and right curly braces). Finally, here’s how we’ll define a conditional:

IF = Ct( C("if") * spc * lparen * V("BOOL") * rparen * V("LIST") )

All together, here’s our parser now:

stmt = spc * P{
    "LIST";
    LIST =
        V("STMT") +
        Ct( Cc("list") *
                lcurly *
                V("STMT") * ( ";" * spc * V("STMT") )^0 *
                rcurly ),
    STMT = 
        Ct( Cc("assign") * V("REF") * "=" * spc * V("VAL") ) +
        V("EXPR") +
        V("IF"),
    EXPR = Ct( Cc("expr") * V("TERM") * ( expr_op * V("TERM") )^0 ),
    TERM = Ct( Cc("term") * V("FACT") * ( term_op * V("FACT") )^0 ),
    REF = Ct( Cc("ref") * name * (lbrack * V("EXPR") * rbrack)^0 ),
    FACT =
        number +
        lparen * V("EXPR") * rparen +
        V("REF"),
    ARRAY = Ct( Cc("array") * lbrack * Ct( V("VAL_LIST")^-1 ) * rbrack ),
    VAL_LIST = V("VAL") * (comma * V("VAL"))^0,
    VAL = V("EXPR") + V("ARRAY"),
    BOOL = Ct( Cc("bool") * V("EXPR") * boolean * V("EXPR") ),
    IF = Ct( C("if") * spc * lparen * V("BOOL") * rparen * V("LIST") )
}

There’s one final tweak to make. An “if” statement will actually be parsed as a name, and eval will try to look up the variable called “if.” We need to tell the name pattern to not recognize “if” as a name; this is the first keyword in our language (there will be more):

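name = C( letter * (digit+letter+"_")^0 ) * spc
-- Only treat "if" as a keyword when no identifier character follows, so names like "iffy" still parse
keywords = P("if") * -(digit+letter+"_")
name = name - keywords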
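
One subtlety with excluding keywords: a pattern like P("if") * spc also matches the first two letters of a longer name such as “iffy”, because spc can match nothing. The sketch below tries the exclusion in isolation with a negative lookahead, so only the whole word is rejected. The letter, digit, and spc definitions are stand-ins I’m assuming for the ones from the earlier parts, and idchar is a name I introduced.

```lua
local lpeg = require "lpeg"
local C, P, R, S = lpeg.C, lpeg.P, lpeg.R, lpeg.S

-- Hypothetical stand-ins for the series' helper patterns
local spc = S(" \t\n")^0
local letter = R("az", "AZ")
local digit = R("09")
local idchar = letter + digit + "_"

-- A keyword only counts when no identifier character follows it
local keywords = P("if") * -idchar
-- "a - b" matches a only where b does not match
name = ( C( letter * idchar^0 ) * spc ) - keywords

print(name:match("iffy")) -- iffy
print(name:match("if("))  -- nil
```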

Parsing an example “if” statement like “if(a > 0) { 2+3; a=4*5 }” now gives this parse tree:

{ "if",
  { "bool",
    { "expr",
      { "term", { "ref", "a" } }
    },
    ">",
    { "expr", { "term", 0 } }
  },
  { "list",
    { "expr",
      { "term", 2 }, "+", { "term", 3 }
    },
    { "assign",
      { "ref", "a" },
      { "expr",
        { "term", 4, "*", 5 }
      }
    }
  }
}

Evaluating conditionals

This is all obviously going to mean some changes to the eval function. We’ll have to add a function like eval_expr to handle boolean expressions, and two new options in eval to handle statement lists and “if” statements. Here’s how we do booleans:

function eval_bool(expr)
    local num1 = eval(expr[2])
    local operator = expr[3]
    local num2 = eval(expr[4])
 
    if operator == '<' then
        return num1 < num2
    elseif operator == '<=' then
        return num1 <= num2
    elseif operator == '>' then
        return num1 > num2
    elseif operator == '>=' then
        return num1 >= num2
    elseif operator == '==' then
        return num1 == num2
    elseif operator == '!=' then
        return num1 ~= num2
    end
end

This is a lot simpler than eval_expr because we know exactly how many arguments there will be: we can’t chain together booleans like we can EXPRs and TERMs.

The two new paths through eval differ from the rest in one way: they’re not guaranteed to return anything. Statement lists and conditionals can’t be nested inside of other value-returning constructs (like EXPRs) so we won’t ever have to return a value from those to eval itself. Here’s the new eval:

function eval(ast)
    if type(ast) == 'number' then
        return ast
    elseif ast[1] == 'expr' or ast[1] == 'term' then
        return eval_expr(ast)
    elseif ast[1] == 'array' then
        local new = {}
        for _, el in ipairs(ast[2]) do
            table.insert(new, eval(el))
        end
        return new
    elseif ast[1] == 'ref' then
        return lookup(ast)
    elseif ast[1] == 'assign' then
        return assign(ast[2], eval(ast[3]))
    elseif ast[1] == 'list' then
        for i = 2, #ast do
            eval(ast[i])
        end
    elseif ast[1] == 'if' then
        if eval_bool(ast[2]) then
            return eval(ast[3])
        end
    end
end

Evaluating a list just means evaluating everything in it, and evaluating an “if” means eval_bool-ing the condition, then evaluating the body only if it’s true. This means we can now stick this in the bottom of test and it will pass:

stmt:match("if(1 < 0) b = 5"); assert(VARS.b ~= 5)

Because 1 is not less than 0, the part of the AST that does the assignment is never run, and nothing gets assigned to “b”.
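
The same behavior can be checked without the parser by handing eval a tree directly. This is a reduced sketch, not the full evaluator: it assumes plain variables only, single-operand expressions, and just the “<” and “>” operators. make_if is a hypothetical helper that builds the tree for “if(lhs < rhs) b = 5”.

```lua
VARS = {}

function eval_bool(node)
    local lhs, op, rhs = eval(node[2]), node[3], eval(node[4])
    if op == "<" then return lhs < rhs
    elseif op == ">" then return lhs > rhs
    end
end

-- Reduced evaluator: just enough node types for this demo
function eval(ast)
    if type(ast) == "number" then
        return ast
    elseif ast[1] == "expr" or ast[1] == "term" then
        return eval(ast[2]) -- single-operand nodes only in this sketch
    elseif ast[1] == "assign" then
        VARS[ast[2][2]] = eval(ast[3]) -- ast[2] is { "ref", name }
    elseif ast[1] == "if" then
        if eval_bool(ast[2]) then
            return eval(ast[3])
        end
    end
end

-- Hypothetical helper: the tree for "if(lhs < rhs) b = 5"
function make_if(lhs, rhs)
    return { "if",
        { "bool", { "expr", { "term", lhs } }, "<", { "expr", { "term", rhs } } },
        { "assign", { "ref", "b" }, { "expr", { "term", 5 } } }
    }
end

eval(make_if(1, 0))
print(VARS.b) -- nil: 1 < 0 is false, so the assignment never ran
eval(make_if(0, 1))
print(VARS.b) -- 5
```

The body is the same table in both calls. Whether it runs depends only on what eval_bool returns, which is exactly what evaluating during parsing couldn’t give us.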

Loops

We can implement loops almost the same way. First we modify the parser to parse loops, which is already mostly done, since it’s the same general thing as conditionals:

WHILE = Ct( C("while") * spc * lparen * V("BOOL") * rparen * V("LIST") )

Add V("WHILE") to STMT as another alternative, the same way we added V("IF"), and add “while” to the list of keywords:

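-- The trailing -(...) keeps names that merely start with a keyword, like "iffy", from being rejected
keywords = (P("if")+P("while")) * -(digit+letter+"_")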

And we should be able to parse loops:

while(n<10) {
  n = n+1
}

turns into

{ "while",
  { "bool",
    { "expr", { "term", { "ref", "n" } } },
    "<",
    { "expr", { "term", 10 } }
  },
  { "list",
    { "assign",
      { "ref", "n" },
      { "expr",
        { "term", { "ref", "n" } },
        "+",
        { "term", 1 }
      }
    }
  }
}

Now we just need the eval function to actually run the loops. We'll do this in much the same way as the conditionals. Add a case to eval to handle "while" nodes:

function eval(ast)
    -- stuff elided --
    elseif ast[1] == 'while' then
        while eval_bool(ast[2]) do
            eval(ast[3])
        end
    end
end

And now we can write a unit test for this:

VARS.n=0; VARS.x=1
stmt:match("while(n < 8) { x = x * 2; n = n + 1 }")
assert(VARS.x == 256)
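
To watch the loop run without the parser, here is a sketch that builds the tree for that test by hand. It is a reduced evaluator under a few assumptions of mine: plain variables only, only the three operators the test needs, “bool” nodes folded into eval instead of a separate eval_bool, and small ref/term/expr helper functions that exist only to keep the tree readable.

```lua
VARS = { n = 0, x = 1 }

local ops = {
    ["+"] = function(a, b) return a + b end,
    ["*"] = function(a, b) return a * b end,
    ["<"] = function(a, b) return a < b end,
}

-- Reduced evaluator: plain variables and only the operators in ops
function eval(ast)
    if type(ast) == "number" then return ast end
    local kind = ast[1]
    if kind == "expr" or kind == "term" then
        local accum = eval(ast[2])
        for i = 3, #ast, 2 do
            accum = ops[ast[i]](accum, eval(ast[i+1]))
        end
        return accum
    elseif kind == "bool" then
        return ops[ast[3]](eval(ast[2]), eval(ast[4]))
    elseif kind == "ref" then
        return VARS[ast[2]]
    elseif kind == "assign" then
        VARS[ast[2][2]] = eval(ast[3])
    elseif kind == "list" then
        for i = 2, #ast do eval(ast[i]) end
    elseif kind == "while" then
        -- the condition subtree is re-evaluated before every pass
        while eval(ast[2]) do eval(ast[3]) end
    end
end

-- Hypothetical helpers for building nodes
local function ref(name) return { "ref", name } end
local function term(...) return { "term", ... } end
local function expr(...) return { "expr", ... } end

-- Tree for "while(n < 8) { x = x * 2; n = n + 1 }"
loop = { "while",
    { "bool", expr(term(ref("n"))), "<", expr(term(8)) },
    { "list",
        { "assign", ref("x"), expr(term(ref("x"), "*", 2)) },
        { "assign", ref("n"), expr(term(ref("n")), "+", term(1)) }
    }
}

eval(loop)
print(VARS.n, VARS.x) -- n is 8, x is 256
```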

Next steps

So, now that we've separated parsing our language from running programs we've parsed, we can easily add new features like control structures. Most new features you might want to add follow this basic pattern: modify the parser to recognize it, then modify the evaluator to run the parse tree. And that's mostly what we're going to do next time too: we'll modify the parser to recognize function definitions, and we'll modify the evaluator to store and call them.

As always, the completed code for this chapter is available here.

Written by randrews

June 12th, 2015 at 11:41 pm
