19 June 2021

Cheap interpreter, part 1: overview

A few months ago, my former boss Neil Mitchell gave a talk entitled "Cheaply Writing a Fast Interpreter". It's a very good talk, and I encourage you to watch it. The talk assumes an audience that already knows many ways to write an interpreter, and it presents the results of a study Neil and his team did comparing a number of different techniques on their performance-to-cost ratio, where cost is meant in terms of development and maintenance effort. The results should be of great interest to anyone planning to write an interpreter for some practical purpose.

I don't plan to write any real interpreter. What made the talk really interesting to me is that I did not know any of the techniques Neil presented, so to me the talk served as a great roadmap of things to go play with. And then write about, of course. This is the first in a series of posts about what I learned.

Parts of an interpreter

An interpreter is a program that takes in text that forms a valid program in some programming language and executes it according to the semantics of said language. Where the text comes from is not terribly important. Starting with the point where we have some text, the major parts of an interpreter are:

  • The parser is a function that takes in some "flat" text and produces some form of structured tree.
  • The optimizer takes in the tree produced by the parser, and transforms it into another tree. There may be any number of intermediate passes over any number of intermediate trees.
  • The evaluator reads in a tree emitted by the optimizer and actually does stuff as instructed by the input program. (A rough sketch of how these parts fit together follows this list.)
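
These three parts compose into a pipeline. As a rough sketch in Haskell (the types, names, and stubbed-out bodies below are placeholders I am making up purely for illustration, not a real implementation), the shape is something like:

-- Sketch only: the three stages as functions. Text, Tree, Value and all
-- bodies below are invented placeholders.
type Text = String

data Tree  = Lit Integer | Add Tree Tree deriving Show
data Value = IntV Integer deriving Show

-- parser: "flat" text in, structured tree out (stubbed out here)
parser :: Text -> Tree
parser _ = Add (Lit 1) (Lit 2)

-- optimizer: tree in, equivalent (hopefully cheaper) tree out
optimizer :: Tree -> Tree
optimizer (Add (Lit a) (Lit b)) = Lit (a + b)  -- constant folding
optimizer t                     = t

-- evaluator: tree in, actual result out
evaluator :: Tree -> Value
evaluator (Lit n)   = IntV n
evaluator (Add a b) = case (evaluator a, evaluator b) of
                        (IntV x, IntV y) -> IntV (x + y)

interpret :: Text -> Value
interpret = evaluator . optimizer . parser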

Naming is not very consistent across the sources I've encountered. "Parsing" as a transformation of text to tree is by far the most widely accepted term I've found. Coming in second would be "abstract syntax tree", in that any piece about interpreters will mention the words, though they're not always in agreement on what they mean. As mentioned, there can be any number of different trees in-between the textual representation and the one that eventually gets evaluated, and which of these trees gets to be called the abstract syntax tree varies from source to source.

Finally, in some cases the evaluator can modify the tree it is evaluating. This happens in broadly two cases:

  • If the language being evaluated is dynamic, it will have operations to modify its own tree at runtime.
  • The evaluator may modify the tree for optimization purposes. This can be called an optimizing evaluator, or a JIT (standing for "just-in-time compiler", because it's fun to make acronyms that don't cover all the words).

There are complex interplays between these things; for example, if you have a JIT, or a dynamic language, the role of the optimizer is a bit less clear. How much optimization should you do at compile time if you also have the opportunity to optimize at runtime? If you have a dynamic language, do you optimize the code you start with? What about the code you add at runtime? How do you balance the cost of running unoptimized code with the cost of running the optimizer at runtime?

Other questions can crop up, such as how much of the program the optimizer considers. Considering the whole program at once can give the optimizer a better understanding of the program and therefore find better optimizations, but it does not play well with adding code at runtime. It's also a bit impractical as it means any change requires recompiling everything, which can get slow.

Compilers

Mathematically, a compiler is any program that transforms programs written in language \(A\) into equivalent programs written in language \(B\). Compilers don't require computers at all; the concept can be (and has been) defined purely in terms of Turing machines. There is no restriction that \(B\) must be different from \(A\), or that \(B\) must be a specific language.

Still, when hearing "compiler", many programmers will think of a program that generates machine code. In a way, this is just a special case of an interpreter, where the evaluator is the CPU and the interpreter lets us serialize the result of the optimizer to disk rather than running it directly.

In a more general sense, compilers will look a lot like interpreters, in that there will be a parser that produces some tree, and then a series of steps that transform a series of trees. The evaluation part is replaced by the emission of a "flat" file that represents code in the target language.
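
Continuing the type sketch from earlier (again with invented placeholder names), the only part that changes is the last stage: the evaluator is replaced by an emitter that writes out the tree in the target language.

-- Sketch only: same pipeline as before, with the evaluator swapped out
-- for an emitter. Target and emit are invented placeholders.
type Target = String

emit :: Tree -> Target
emit (Lit n)   = show n
emit (Add a b) = "(" ++ emit a ++ " + " ++ emit b ++ ")"

compile :: Text -> Target
compile = emit . optimizer . parser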

Parsing

I will not delve into much detail on the parsing side in this series, because this exploration was mostly about what happens after parsing. Still, I can give a few pointers.

In order to parse a language, one first needs to define a grammar for it, i.e. define how to form correct sentences in that language and what these sentences mean. There may be many ways to do that, but, by and large, the world has settled on something called EBNF (Extended Backus-Naur Form) to describe useful programming languages.

A grammar described in EBNF takes the form of a succession of rules, each of the form:

name = production

where production is either a literal (or, as a shorthand, a collection of literals in the form of a regular expression) or a combination of literals and other rules. For example, a simple grammar for arithmetic on integers could be described by:

expr = term ('+' expr)?
term = factor ('*' term)?
factor = '(' expr ')' | nat
nat = #"\d+"

where nat represents natural numbers through a regular expression; '+', '*', '(' and ')' are literal characters; (unquoted) parentheses represent grouping; the question mark represents optionality; and | represents disjunction.

On the expression 4+5*6, this grammar would produce a parse tree along the lines of (using Clojure syntax for trees):

[:expr [:term [:factor [:nat "4"]]]
       "+"
       [:expr [:term [:factor [:nat "5"]]
                     "*"
                     [:term [:factor [:nat "6"]]]]]]

Going from an EBNF grammar to a program that can turn text into parse trees is not exactly trivial, but not very hard either. Different languages have different tools for that, but the one thing they all have in common is that you're expected to have a well-defined EBNF grammar before you start writing your code.

It's probably important to mention that EBNF is not a single agreed-upon syntax for describing grammars, but a loose set of conventions that people generally agree on. It's more like CSV than JSON. Every tool that reads "EBNF" will read a slightly different flavour of it.

Parser generators

In ye olden times, the grandparents of parsing technology were the tools known as lex and yacc (and later flex and bison), which are based on domain-specific languages and some notion of separating parsing (forming sentences out of words) from lexing (identifying individual words). The main parsing technology for Java, JavaCC, is a direct descendant and is based on the same ideas. The core principle seemed to be to annotate the EBNF grammar directly with snippets of code, which after a couple decades people realized is a really bad idea because it introduces way too much coupling.

More recent approaches tend to have a better separation between producing a parse tree and acting on it, which gives the option of producing a tree once and then having many different uses for it. In the Java world, ANTLR is probably the most widely used of those. Whereas JavaCC (and yacc and bison before it) requires the programmer to write a mix of Java (or C) and EBNF to execute code as the parse progresses, ANTLR takes in what is essentially "just" an EBNF grammar and generates both Java "data" classes to represent the parse tree and Java code to turn a string into said tree.

In Clojure, the wonderful instaparse library can directly turn an EBNF grammar into a parse tree; that is, in fact, what I used to produce the tree above:

t.core=> (def gram
    #_=>   "expr = term ('+' expr)?
    #_=>    term = factor ('*' term)?
    #_=>    factor = '(' expr ')' | nat
    #_=>    nat = #'\\d+'")
#'t.core/gram
t.core=> ((insta/parser gram) "4+5*6")
[:expr [:term [:factor [:nat "4"]]] "+" [:expr [:term [:factor [:nat "5"]] "*" [:term [:factor [:nat "6"]]]]]]
t.core=>

Because the language is dynamic and has vectors and keywords built in as default data structures, there is no code generation required. This is by far the easiest way I know of to get started with writing a parser.

Parser combinators

From the "purely functional" side comes a different way to approach writing a parser known as "parser combinators". This is essentially a monad-based DSL for parsing, where monadic sequencing is lexical sequencing.

Using Haskell notation, it can start with something like:

import Control.Monad (ap, liftM)
import Data.Char (isDigit)

-- a parser either fails, or yields a value along with the unconsumed input
newtype Parser tree = Parser { parse :: String -> Maybe (tree, String) }

instance Functor Parser where fmap = liftM
instance Applicative Parser where pure = return; (<*>) = ap
instance Monad Parser where
  return v = Parser (\s -> Just (v, s))
  p >>= f = Parser (\s ->
    case parse p s of
      Nothing -> Nothing
      Just (v, out) -> parse (f v) out)

-- consume a single character, failing on empty input
item :: Parser Char
item = Parser (\s -> case s of
                       [] -> Nothing
                       (x:xs) -> Just (x, xs))

-- the parser that always fails
failure :: Parser a
failure = Parser (\_ -> Nothing)

and then, one can fairly easily build on top of that. For example, a parser that reads three characters and discards the one in the middle could be written as:

p_1 :: Parser (Char, Char)
p_1 = do
  x <- item
  item
  y <- item
  return (x, y)
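
Assuming the definitions above are loaded in GHCi, running it looks something like this:

ghci> parse p_1 "abcde"
Just (('a','c'),"de")
ghci> parse p_1 "ab"
Nothing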

These are called combinators because one can create new parsers by combining existing ones, either among themselves or with existing functions. For example, creating a parser for a sequence of numeric characters can be done with:

-- combinator that builds a parser from a predicate on a character
sat :: (Char -> Bool) -> Parser Char
sat p = do
  x <- item
  if p x then return x else failure

-- parser for a single digit
digit :: Parser Char
digit = sat isDigit

-- combinator that takes two parsers and matches the second if the first fails
(+++) :: Parser a -> Parser a -> Parser a
p +++ q = Parser (\s -> case parse p s of
                          Nothing -> parse q s
                          success -> success)

-- combinators that repeat a parser: zero or more times (many), one or more times (many1)
many :: Parser a -> Parser [a]
many p = many1 p +++ return []

many1 :: Parser a -> Parser [a]
many1 p = do
  v <- p
  vs <- many p
  return (v:vs)

-- Parser for a natural number
integer :: Parser Integer
integer = do
  xs <- many1 digit
  return (read xs)

Parser combinators do not require a separate EBNF grammar. It is possible to define the operators in such a way that the grammar, expressed as combinators, reads almost as easily as an EBNF description (adapted for the host programming language, of course). Still, it is sometimes easier to reason about the grammar on paper in EBNF and then use that to guide the code.
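
As an illustration, here is a sketch of the earlier arithmetic grammar written with the combinators above; the Expr type and the symbol helper are things I am adding here for the example, not part of any library:

-- Sketch only: the EBNF grammar from earlier, as combinators.
-- Expr and symbol are invented for this example.
data Expr = Nat Integer | Add Expr Expr | Mul Expr Expr
  deriving Show

-- parser for one specific literal character
symbol :: Char -> Parser Char
symbol c = sat (== c)

-- expr = term ('+' expr)?
expr :: Parser Expr
expr = do
  t <- term
  (do _ <- symbol '+'
      e <- expr
      return (Add t e))
    +++ return t

-- term = factor ('*' term)?
term :: Parser Expr
term = do
  f <- factor
  (do _ <- symbol '*'
      t <- term
      return (Mul f t))
    +++ return f

-- factor = '(' expr ')' | nat
factor :: Parser Expr
factor =
  (do _ <- symbol '('
      e <- expr
      _ <- symbol ')'
      return e)
    +++ (Nat <$> integer)

If I got this right, parse expr "4+5*6" should yield Just (Add (Nat 4) (Mul (Nat 5) (Nat 6)),""), mirroring the parse tree from the EBNF section.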

This approach is most popular in the Haskell and Scala communities, but parser combinator libraries exist in many other languages.

What makes grammars hard

Writing a grammar can be quite tricky, because one has to come up with rules that are not ambiguous. That is, ideally, for any given input there should be at most one parse tree, never more. It is very easy to write "ambiguous" grammars, i.e. ones that can generate more than one valid parse tree for some inputs.

Another issue is non-termination. This is, in some sense, no different from any other recursive code, but it is very easy to end up with a recursive grammar that does not terminate. Take the simple arithmetic grammar, for example. It could be tempting to just go for something like:

expr = expr (('+' | '*') expr)?

and that might seem mathematically correct, but a naive derivation would yield a non-terminating program, and this grammar can generate multiple parse trees. It also does not respect the operator precedence between + and *.
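
To make the non-termination concrete, here is what a naive transcription of that rule into the combinators from the previous section would look like (reusing the Expr type and symbol helper from that sketch); it type-checks, but running it loops forever, because it calls itself before consuming any input:

-- Sketch only: a direct transcription of the left-recursive rule.
-- parse badExpr "4+5*6" never returns: the first thing badExpr does
-- is call badExpr again, without consuming a single character.
badExpr :: Parser Expr
badExpr = do
  l <- badExpr
  (do op <- symbol '+' +++ symbol '*'
      r <- badExpr
      return (if op == '+' then Add l r else Mul l r))
    +++ return l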

Both of these issues are somewhat fundamental to the expressive power of grammars. But while they are both properties of the grammar, what's more interesting is to think of them as properties of the language. As the arithmetic example shows, there may be languages for which there are multiple ways to express the grammar, some of which have non-termination issues and some of which don't.

A lot has been written about how to design languages and the associated grammars such that they do not end up having non-termination or ambiguity problems, but that is way outside the scope of this blog entry.

Languages as trees

There is a fairly simple answer, though, which in my eyes makes grammars a rather uninteresting subject. Let's consider the following points:

  • The goal of a grammar is to turn text into trees.
  • The meaning of the program depends on the generated tree. Programmers care about the meaning of the program, and the harder it is to infer the tree from the text, the harder it is to understand the code.
  • The harder it is to infer the tree from the program, the more complicated the grammar will be, and the harder it will be to get right.

It follows that the grammar that requires the least effort to understand for users of the language is also the one that requires the least effort to implement, and it is one that has a direct mapping to a tree. In other words, it is a textual representation of a tree.

The oldest way to do that is the original Lisp syntax. This is in fact how the Lisp syntax was invented: at the time, they wanted to experiment with language semantics, not with syntax. They did have a ("non-Lisp") syntax defined, and they were planning to implement a proper parser, but given limited resources (both in implementation effort and in runtime cost) they chose to temporarily go for a direct representation of the parse tree using parentheses. They then discovered that this was actually quite a nice way to work and never got around to adding the "real" parser.

Parentheses got a bit religious in the Lisp community, but they're really beside the point. What matters is that there is a direct, easy-to-see, unambiguous correspondence between the code and the underlying parse tree. (The arithmetic expression from earlier, 4+5*6, would simply be written (+ 4 (* 5 6)): the text is a direct spelling of the tree.) Clojure improved on the old model by realizing that and adding other types of delimiters to "de-overload" the meaning of parentheses. Elixir decided to eschew parentheses in favour of do/end. Parentheses are not the point; direct tree representation is.

As the most effective syntax is also the easiest one to parse, I find it really hard to get interested in how to write more complicated parsers for less effective syntaxes. Why make it harder for both the language implementor and the language user?

From the next post onward, I will completely ignore parsing and just assume I have a parse tree.

Tags: cheap interpreter