Play With Lua!

A Toy Regular-Expression Matcher

without comments

There’s a book I have, Beautiful Code, which is a bunch of essays about programming by various famous programmers. It’s a great book. The chapters are on all different topics; the only thing they have in common is a piece of code that makes you think “wow, that’s really clever”. This post is about chapter one of that book, which is about a regular expression matcher.

Regular expressions are a domain-specific language for describing finite state machines. Specifically, finite state machines where the state transitions are successive characters in a string: you can use them to check if input strings match certain formats, like, a phone number: three digits followed by a dash followed by four digits. Kernighan’s matcher, which I’m going to translate to Lua and then extend a little bit, is really compact and beautiful. It’s about thirty lines, and recognizes these tokens:

  • A dot (“.”) matches any single character
  • A caret (“^”) matches the beginning of the string
  • A dollar sign (“$”) matches the end of the string
  • A star (“*”) matches zero or more repetitions of the previous token
  • Any other character (like “a”) matches itself.

My plan is first to duplicate Kernighan’s matcher in Lua and then extend it to also handle character sets, like “[abc]” would match a single “a”, “b”, or “c”. Because of this, I’m not going to port his code across directly, but rather make some changes for extensibility.

Basic matcher

Lua’s string metatable already has a match function, so we’re gonna, uh, destroy it. If you like yours feel free to call this my_match or something:

function string.match(str, pattern)
    if pattern:at(1) == "^" then
        return matchhere(str, pattern:sub(2))
    else
        repeat
            if matchhere(str, pattern) then return true end
            str = str:sub(2)
        until str == ""
    end
end

This is pretty straightforward, although you can see it calls a couple utility functions we’ll have to write. matchhere tries to find a match starting at the beginning of the string, so, if our pattern starts with a caret, we cut it off and try to match our pattern once and return. If it doesn’t, we enter a loop where we attempt to find a match starting at every character, until we find one or the string is empty. The code for matchhere is a little more complicated but not much:

function matchhere(str, pattern)
    -- Everything matches an empty pattern
    if pattern == "" then return true end
 
    -- Try to match end-of-string
    if pattern == "$" then return str == "" end
 
    local first_token, modifier, after = pattern:at(1), pattern:at(2), 2
    if modifier == "*" then after = 3 end
 
    if modifier == "*" then
        return matchstar(str, first_token, pattern:sub(after))
 
    elseif matches(str:at(1), first_token) then
        return matchhere(str:sub(2), pattern:sub(after))
    end
end

We first handle the two degenerate cases, matching against an empty pattern or an empty string. After that we want to find the token we’re going to match against (a single letter or a dot) and its modifier (a star or, well, not a star). The after variable is used to try to find the portion of the pattern left over after this token/modifier tuple. This is a difference between what I wrote and Kernighan’s code: he assumes that the tokens are one character long (either a char or a dot) and I plan to add character classes, so I can’t.

After that we either call another helper function (matchstar) or recurse. Since we directly return the result of the recursive call, this is actually not horrible: Lua supports tail-call elimination so this won’t blow out the stack. More on that later though.

The matches function is really simple so I won’t bother writing this first version, instead let’s look at matchstar:

function matchstar(str, token, rest)
    while true do
        if matchhere(str, rest) then return true end
 
        if not matches(str:at(1), token) or str == "" then break end
        str = str:sub(2) -- Next char
    end
end

We take a string, a token that’s allowed to repeat, and a tail of a pattern. In a loop, we try to match the string to the tail. If we can’t, and the string starts with the allowed-repeatable token, then that’s probably why, so we strip that off and try again.

So that’s everything you need. It can use up a few stack frames, but because it’s tail-recursive it only uses one extra one per star in the pattern, and there usually aren’t many of those. So now let’s add character classes to it.

Character classes

A character class is a set of characters inside brackets, like “[abc]”. The intent is for a character class to act like one token, but match any letter in the class, so “[abc]” would match “a” but not “d”, and “[01]*” would match “101010” but not “01210”. In order to add these, we have to modify two functions, matches (which we haven’t seen yet) and matchhere (where it finds first_token and after). Let’s do matches first:

function matches(ch, token)
    if token.chars then
        return token.chars:find(ch)
    else
        return token == "." or token == ch
    end
end

We’re going to say, now, that a token is just a string containing a single character, or a table {chars = "..."} containing a set of characters (this is to remove the ambiguity between “.” and “[.]”; we want the latter to only match a literal dot). Now we just need a way to pull the first token, whatever it is, off a pattern:

function grabtoken(pattern)
    local token, rest
 
    if pattern:at(1) == "[" then -- A char class!
        local close = pattern:find("]")
        token = {chars = pattern:sub(2, close-1)}
        rest = pattern:sub(close+1)
    else -- Not a class
        token = pattern:at(1)
        rest = pattern:sub(2)
    end
 
    local modifier = rest:at(1)
    if modifier == "*" then rest = rest:sub(2) end
 
    return token, modifier, rest
end

We look at the first character to tell if this is a class (which will end in a close brace) or a normal character. Once we pull off the token part, we determine what the rest of the pattern is after it, which we may cut a character (a star) off of later. Then we return the token, the char right after the token (in case it’s a star), and the rest of the pattern.

We can pretty much plug this directly into matchhere:

local first_token, modifier, rest = grabtoken(pattern)

It returns exactly what we’d expect. I did change matchhere to want the rest of the pattern instead of the index of the rest of the pattern, just to make it a little cleaner.

Demonstration

It works pretty much the same way a subset of Lua’s built-in matcher works. Here’s an example of using it to determine if a string of binary digits is even:

assert(("10010110100"):match("^[01]*0$"))

The real payoff here is in seeing how easy it is to add more features to it. Adding named character classes (“\d” for all digits, say) is easy; more useful would be to add escapes, so we can match a literal dot. More useful and a lot harder would be adding captures. What have you always wished regular expressions would do? Play around with it, see what you can add. The code (with unit tests using lt) is here.

Written by randrews

June 10th, 2011 at 1:16 am

Posted in Uncategorized