27 June 2021

Cheap interpreter, part 2: tree-walking

Last week I described the general structure of an interpreter, and gave a cursory introduction to parsing. From this point on, I will simply assume I am starting with a parse tree, and ignore how it was produced.

This series is based on Neil Mitchell's talk "Cheaply writing a fast interpeter". The talk compares a number of approaches to writing an interpreter and tries to find a good balance between complexity and interpreter overhead.

The following topics, while important, are out of scope:

  • Parsing. We're assuming we start with a parse tree.
  • Producing assembly code: the definition of "cheap" that Neil uses in the talk is "maintainers are not required to know another language" (specifically assembly).
  • Semantic optimizations (constant folding, algebraic transformations, etc.). The goal is to compare the interpretation overhead of various approaches, and semantic optimizations are considered orthogonal to that.
  • JIT is not explicitly excluded in the talk, but it is completely absent. This is probably also a consequence of the "cheap" constraint.

Code sample

The talk uses a specific code snippet as the running example across all approaches. We'll do the same. Here is the snippet as given in the talk:

x = 100;
for (i = 1000; i != 0; i--) {
  x = x + 4 + x + 3;
  x = x + 2 + 4;
}
x

Parse tree

The talk presents all of its approaches in Rust, but I don't know Rust so I made my explorations in a mix of Haskell and Clojure. I think there are interesting things to say about both, so, when relevant, I'll present both in this series.

The parse trees are going to look superficially different, but they encode the exact same structure. In Clojure, all I could do is show you an example, and hope you can infer the general, underlying structure. In Haskell, however, I can describe the structure itself. Here it is:

data Op
  = Add
  | NotEq
  deriving Show

data Exp
 = Lit Int
 | Var Int
 | Set Int Exp
 | Bin Op Exp Exp
 | Do Exp Exp
 | While Exp Exp
  deriving Show

This is not meant to be a complete language; it is only meant to cover the specific operations we see in our one snippet. We have, in order:

  • Lit i: literal values; for simplicity, we assume all values in the language are integers.
  • Var i: access a variable from the environment. Again, for simplicity, we assume variable names are integers. This can always be done, and usually is, because indexing by number is faster than looking up by string.
  • Set i e: set the value of variable i to be the result of evaluating expression e. For example, Set 0 (Lit 3) would represent the instruction x = 3.
  • Bin op e1 e2: apply the binary operator op to the results of evaluating expressions e1 and e2, for example Bin Add (Lit 3) (Lit 4) would represent 3 + 4.
  • Do e1 e2 is our main mean of sequencing operations. One could be tempted to represent a list of operations as a list, but representing it as a tree-shaped structure makes writing code against it easier. Concrete parse trees become a bit more cumbersome, but keep in mind that in general they are produced by the parser. Being easier to produce and consume is better than being easier to write down by hand, or to read as a human.
  • While condition body is our looping mechanism: we'll first evaluate the condition, and if it evaluates to 1, we'll evaluate the body and come back to condition.

We only need two binary operations for this snippet, so we could have just made two entries in Exp called Add Exp Exp and NotEq Exp Exp, but, as for Do, isolating binary operations makes writing code against the representation easier, because it makes the Exp type simpler. The tradeoff is that it makes Exp values more complex, but that is a tradeoff worth making.

With that representation, our code snippet will look like:

ast :: Exp
ast =
  -- x = 100
  (Do (Set 0 (Lit 100))
      -- i = 1000
      (Do (Set 1 (Lit 1000))
          -- for (; i != 0;)
          (Do (While (Bin NotEq (Lit 0)
                                (Var 1))
                     -- x = (((x + 4) + x) + 3)
                     (Do (Set 0 (Bin Add (Bin Add (Bin Add (Var 0)
                                                           (Lit 4))
                                                  (Var 0))
                                         (Lit 3)))
                         -- x = ((x + 2) + 4)
                         (Do (Set 0 (Bin Add (Bin Add (Var 0)
                                                      (Lit 2))
                                             (Lit 4)))
                             -- i = i + (-1)
                             (Set 1 (Bin Add (Lit (-1))
                                             (Var 1))))))
              -- return x
              (Var 0))))

Walking the tree

Let's start with the most straightforward approach. First, what are we trying to do? We want a function that takes in an Exp, but then what? We could try to handle side effects, but for the sake of simplicity here we'll just assume our programs return an Int. This would yield a function of type Exp -> Int.

While our language does not have "external" effects, such as printing, it does have "internal" effects, like setting the value of a variable. We'll need to represent that somehow. The most direct way to represent that is to carry around an environment, which we'll represent as a Map Int Int with insert and lookup wrapper functions:

newtype Env = Env (Data.Map Int Int)
  deriving Show

lookup :: Env -> Int -> Int
lookup (Env m) n = maybe bottom id (Data.Map.lookup n m)

insert :: Env -> Int -> Int -> Env
insert (Env m) n v = Env $ Data.Map.insert n v m

We'll also need an easy way to start from an empty one:

mt_env :: Env
mt_env = Env Data.Map.empty

It's been argued that the most natural way to traverse a tree is to use a recursive function. I strongly dispute the notion that there is anything "natural" about pretty much any programming approach, but recursion certainly is a good practical way of walking down a tree. Our recursion will need to carry over two pieces of state:

  1. The expression we (still) need to evaluate, and
  2. The current environment, i.e. the current value of all our variables.

As a first attempt, one could try thinking about the recursive loop type as Exp -> Env -> Int. That would not work, though, because we will have cases where we want to go down one child of the current node, collecting state changes on the way, and then process the next child of the current node using the updated state. In our case, the state is represented entirely by Env, so our inner loop type will be Exp -> Env -> (Int, Env).

If you have read my series on monads, this description should be enough to get you started thinking about a monadic approach. We'll get to that, but for now let's start with a more direct approach.

If we want our top-level API to be Exp -> Int and need our recursion to be Exp -> Env -> (Int, Env), we need an inner helper function, which we'll call loop, as its main purpose is to loop.

This gives us the following structure for our interpreter:

naive_ast_walk :: Exp -> Int
naive_ast_walk ex =
  let (r, _) = loop ex mt_env
  in r
  where
  loop :: Exp -> Env -> (Int, Env)
  loop exp0 env0 =
    case exp0 of
      Lit v -> undefined
      Var n -> undefined
      Set n exp1 -> undefined
      Bin op e1 e2 -> undefined
      Do first rest -> undefined
      While condition body -> undefined

All that is left is to give a precise meaning for each node type in our parse tree. Let's start with the easiest one: Lit i:

      Lit v -> (v, env0)

We don't change the environment, and we simply return the literal value we're given. Var n is just a little bit more complex, as we need to extract the value from the environment. Fortunately, we've already defined an abstraction for that around our Env representation, so it is not all that much harder:

      Var n -> (lookup env0 n, env0)

Next up is Set n exp1, which is the first operation where we need to recurse on a subtree. The important part here is that the right-hand-side of the assignment could, in principle, have side-effects, so we need to be careful not to drop the new state on the floor:

      Set n exp1 -> let (v, env1) = loop exp1 env0
                    in (v, insert env1 n v)

This is about as complex as it gets. The next operation, Bin, just applies the same recursion mechanism twice:

      Bin op e1 e2 -> do
        let (v1, env1) = loop e1 env0
        let (v2, env2) = loop e2 env1
        ((bin op) v1 v2, env2)

where we have defined bin as:

bin :: Op -> Int -> Int -> Int
bin = \case
  Add -> (+)
  NotEq -> \v1 v2 -> if v1 /= v2 then 1 else 0

Do is just a simpler version of Bin: we still recurse twice, but this time we don't need to keep the result from the first recursion (though we do still need the state from it):

      Do first rest -> do
        let (_, env1) = loop first env0
        loop rest env1

Finally, our While loop needs to evaluate its condition, then optionally evaluate its body and start over.

      While condition body -> do
        let (c, env1) = loop condition env0
        if c == 1
        then do
          let (_, env2) = loop body env1
          loop exp0 env2
        else (bottom, env1)

where bottom is defined as

bottom :: Int
bottom = undefined

Aside: a note on language design

Note that there are a few important language design choices being made here:

  • We have decided not to have booleans, and implicitly defined 1 as being true and any other integer value as being false.
  • We have decided that the result of a while loop is to crash, lazily. That means that if someone wrote something like this:
    a = while (...) {...}
    b = a + 1
    

    we would crash, at runtime, on the second line.

  • We have also chosen to define the return value of an assignment as the assigned value. More generally, we do not distinguish between statements and expressions.

To be clear, these are, overall, pretty bad choices, and if you are trying to design your own language, I am not advocating for these particular design choices. This exercise, however, is not about language design, but about exploring different ways to write an interpreter. Language design is a mostly orthogonal concern, and the semantics we have here, while not great for a practical language, are good enough for our purpose.

A monadic approach

As mentioned, this notion of traversing a tree while keeping some state around is a perfect use-case for monads. Here is an equivalent, monad-based implementation of the same tree-walking strategy:

data EvalExec a where
  EvalBind :: EvalExec a -> (a -> EvalExec b) -> EvalExec b
  EvalReturn :: a -> EvalExec a
  EvalLookup :: Int -> EvalExec Int
  EvalSet :: Int -> Int -> EvalExec ()

instance Functor EvalExec where fmap = liftM
instance Applicative EvalExec where pure = return; (<*>) = ap
instance Monad EvalExec where return = EvalReturn; (>>=) = EvalBind

-- name stands for "tree walking evaluator, monadic version"
twe_mon :: Exp -> Int
twe_mon exp =
  exec (eval exp) mt_env (\_ r -> r)
  where
  eval :: Exp -> EvalExec Int
  eval = \case
    Lit v -> return v
    Var n -> do
      v <- EvalLookup n
      return v
    Set n exp -> do
      v <- eval exp
      EvalSet n v
      return v
    Bin op e1 e2 -> do
      v1 <- eval e1
      v2 <- eval e2
      return $ (bin op) v1 v2
    Do first rest -> do
      _ <- eval first
      eval rest
    While condition body -> do
      c <- eval condition
      if 1 == c
      then do
        _ <- eval body
        eval (While condition body)
      else return bottom

  exec :: EvalExec a -> Env -> (Env -> a -> Int) -> Int
  exec m env cont = case m of
    EvalBind prev step -> exec prev env (\env ret -> exec (step ret) env cont)
    EvalReturn v -> cont env v
    EvalLookup n -> cont env (lookup env n)
    EvalSet n v -> cont (insert env n v) ()

The implementation of exec here relies on a technique called "continuation-passing style", commonly abbreviated CPS, which will be discussed in more depth in a future part of this series.

Overall, I would say the extra complexity of using a monad in this case does not pay for itself. The amount of state we have to thread through, and the amount of operations that we want to execute on it, are both small enough that the code is overall simpler with the direct, non-monadic approach.

This may not remain the case if we made the language more complex, for example by adding other types of side effects than setting variables. Like all abstractions, monads can be very useful, or just add cognitive and runtime overhead, depending on when and where they are used.

What about performance?

Speaking of overhead, the goal of this series is to, following the agenda of Neil's talk, compare and contrast different ways to write an interpreter on their complexity to performance ratio. This requires some way to measure performance, and possibly some baseline to compare to.

At this point, we already have two implementations, so this can be a good time to look into building some form of benchmarking process. This will be the subject of the next part in this series.

Tags: cheap interpreter