## Cheap interpreter, part 2: tree-walking

Last week I described the general structure of an interpreter, and gave a cursory introduction to parsing. From this point on, I will simply assume I am starting with a parse tree, and ignore how it was produced.

This series is based on Neil Mitchell's talk "Cheaply writing a fast interpeter". The talk compares a number of approaches to writing an interpreter and tries to find a good balance between complexity and interpreter overhead.

The following topics, while important, are out of scope:

- Parsing. We're assuming we start with a parse tree.
- Producing assembly code: the definition of "cheap" that Neil uses in the talk is "maintainers are not required to know another language" (specifically assembly).
- Semantic optimizations (constant folding, algebraic transformations, etc.). The goal is to compare the interpretation overhead of various approaches, and semantic optimizations are considered orthogonal to that.
- JIT is not explicitly excluded in the talk, but it is completely absent. This is probably also a consequence of the "cheap" constraint.

### Code sample

The talk uses a specific code snippet as the running example across all approaches. We'll do the same. Here is the snippet as given in the talk:

```
x = 100;
for (i = 1000; i != 0; i--) {
x = x + 4 + x + 3;
x = x + 2 + 4;
}
x
```

### Parse tree

The talk presents all of its approaches in Rust, but I don't know Rust so I made my explorations in a mix of Haskell and Clojure. I think there are interesting things to say about both, so, when relevant, I'll present both in this series.

The parse trees are going to look superficially different, but they encode the exact same structure. In Clojure, all I could do is show you an example, and hope you can infer the general, underlying structure. In Haskell, however, I can describe the structure itself. Here it is:

```
data Op
= Add
| NotEq
deriving Show
data Exp
= Lit Int
| Var Int
| Set Int Exp
| Bin Op Exp Exp
| Do Exp Exp
| While Exp Exp
deriving Show
```

This is not meant to be a complete language; it is only meant to cover the specific operations we see in our one snippet. We have, in order:

`Lit i`

: literal values; for simplicity, we assume all values in the language are integers.`Var i`

: access a variable from the environment. Again, for simplicity, we assume variable*names*are integers. This can always be done, and usually is, because indexing by number is faster than looking up by string.`Set i e`

: set the value of variable`i`

to be the result of evaluating expression`e`

. For example,`Set 0 (Lit 3)`

would represent the instruction`x = 3`

.`Bin op e1 e2`

: apply the binary operator`op`

to the results of evaluating expressions`e1`

and`e2`

, for example`Bin Add (Lit 3) (Lit 4)`

would represent`3 + 4`

.`Do e1 e2`

is our main mean of sequencing operations. One could be tempted to represent a list of operations as a list, but representing it as a tree-shaped structure makes writing code against it easier. Concrete parse trees become a bit more cumbersome, but keep in mind that in general they are produced by the parser. Being easier to produce and consume is better than being easier to write down by hand, or to read as a human.`While condition body`

is our looping mechanism: we'll first evaluate the`condition`

, and if it evaluates to 1, we'll evaluate the`body`

and come back to`condition`

.

We only need two binary operations for this snippet, so we could have just made
two entries in `Exp`

called `Add Exp Exp`

and `NotEq Exp Exp`

, but, as for
`Do`

, isolating binary operations makes writing code against the representation
easier, because it makes the `Exp`

type simpler. The tradeoff is that it makes
`Exp`

*values* more complex, but that is a tradeoff worth making.

With that representation, our code snippet will look like:

```
ast :: Exp
ast =
-- x = 100
(Do (Set 0 (Lit 100))
-- i = 1000
(Do (Set 1 (Lit 1000))
-- for (; i != 0;)
(Do (While (Bin NotEq (Lit 0)
(Var 1))
-- x = (((x + 4) + x) + 3)
(Do (Set 0 (Bin Add (Bin Add (Bin Add (Var 0)
(Lit 4))
(Var 0))
(Lit 3)))
-- x = ((x + 2) + 4)
(Do (Set 0 (Bin Add (Bin Add (Var 0)
(Lit 2))
(Lit 4)))
-- i = i + (-1)
(Set 1 (Bin Add (Lit (-1))
(Var 1))))))
-- return x
(Var 0))))
```

### Walking the tree

Let's start with the most straightforward approach. First, what are we trying
to do? We want a function that takes in an `Exp`

, but then what? We could try
to handle side effects, but for the sake of simplicity here we'll just assume
our programs return an `Int`

. This would yield a function of type `Exp -> Int`

.

While our language does not have "external" effects, such as printing, it does
have "internal" effects, like setting the value of a variable. We'll need to
represent that somehow. The most direct way to represent that is to carry
around an environment, which we'll represent as a `Map Int Int`

with `insert`

and `lookup`

wrapper functions:

```
newtype Env = Env (Data.Map Int Int)
deriving Show
lookup :: Env -> Int -> Int
lookup (Env m) n = maybe bottom id (Data.Map.lookup n m)
insert :: Env -> Int -> Int -> Env
insert (Env m) n v = Env $ Data.Map.insert n v m
```

We'll also need an easy way to start from an empty one:

```
mt_env :: Env
mt_env = Env Data.Map.empty
```

It's been argued that the most natural way to traverse a tree is to use a recursive function. I strongly dispute the notion that there is anything "natural" about pretty much any programming approach, but recursion certainly is a good practical way of walking down a tree. Our recursion will need to carry over two pieces of state:

- The expression we (still) need to evaluate, and
- The current environment, i.e. the current value of all our variables.

As a first attempt, one could try thinking about the recursive loop type as
`Exp -> Env -> Int`

. That would not work, though, because we will have cases
where we want to go down one child of the current node, *collecting state
changes on the way*, and then process the next child of the current node *using
the updated state*. In our case, the state is represented entirely by `Env`

, so
our inner loop type will be `Exp -> Env -> (Int, Env)`

.

If you have read my series on monads, this description should be enough to get you started thinking about a monadic approach. We'll get to that, but for now let's start with a more direct approach.

If we want our top-level API to be `Exp -> Int`

and need our recursion to be
`Exp -> Env -> (Int, Env)`

, we need an inner helper function, which we'll call
`loop`

, as its main purpose is to loop.

This gives us the following structure for our interpreter:

```
naive_ast_walk :: Exp -> Int
naive_ast_walk ex =
let (r, _) = loop ex mt_env
in r
where
loop :: Exp -> Env -> (Int, Env)
loop exp0 env0 =
case exp0 of
Lit v -> undefined
Var n -> undefined
Set n exp1 -> undefined
Bin op e1 e2 -> undefined
Do first rest -> undefined
While condition body -> undefined
```

All that is left is to give a precise meaning for each node type in our parse
tree. Let's start with the easiest one: `Lit i`

:

```
Lit v -> (v, env0)
```

We don't change the environment, and we simply return the literal value we're
given. `Var n`

is just a little bit more complex, as we need to extract the
value from the environment. Fortunately, we've already defined an abstraction
for that around our `Env`

representation, so it is not all that much harder:

```
Var n -> (lookup env0 n, env0)
```

Next up is `Set n exp1`

, which is the first operation where we need to recurse
on a subtree. The important part here is that the right-hand-side of the
assignment could, in principle, have side-effects, so we need to be careful not
to drop the new state on the floor:

```
Set n exp1 -> let (v, env1) = loop exp1 env0
in (v, insert env1 n v)
```

This is about as complex as it gets. The next operation, `Bin`

, just applies
the same recursion mechanism twice:

```
Bin op e1 e2 -> do
let (v1, env1) = loop e1 env0
let (v2, env2) = loop e2 env1
((bin op) v1 v2, env2)
```

where we have defined `bin`

as:

```
bin :: Op -> Int -> Int -> Int
bin = \case
Add -> (+)
NotEq -> \v1 v2 -> if v1 /= v2 then 1 else 0
```

`Do`

is just a simpler version of `Bin`

: we still recurse twice, but this time
we don't need to keep the result from the first recursion (though we do still
need the *state* from it):

```
Do first rest -> do
let (_, env1) = loop first env0
loop rest env1
```

Finally, our `While`

loop needs to evaluate its condition, then optionally
evaluate its body and start over.

```
While condition body -> do
let (c, env1) = loop condition env0
if c == 1
then do
let (_, env2) = loop body env1
loop exp0 env2
else (bottom, env1)
```

where `bottom`

is defined as

```
bottom :: Int
bottom = undefined
```

### Aside: a note on language design

Note that there are a few important *language design choices* being made here:

- We have decided not to have booleans, and implicitly defined
`1`

as being true and*any other integer value*as being false. - We have decided that the result of a while loop is to crash, lazily. That
means that if someone wrote something like this:
`a = while (...) {...} b = a + 1`

we would crash, at runtime, on the second line.

- We have also chosen to define the return value of an assignment as the assigned value. More generally, we do not distinguish between statements and expressions.

To be clear, these are, overall, pretty bad choices, and if you are trying to
design your own language, I am not advocating for these particular design
choices. This exercise, however, is *not* about language design, but about
exploring different ways to write an interpreter. Language design is a mostly
orthogonal concern, and the semantics we have here, while not great for a
practical language, are good enough for our purpose.

### A monadic approach

As mentioned, this notion of traversing a tree while keeping some state around is a perfect use-case for monads. Here is an equivalent, monad-based implementation of the same tree-walking strategy:

```
data EvalExec a where
EvalBind :: EvalExec a -> (a -> EvalExec b) -> EvalExec b
EvalReturn :: a -> EvalExec a
EvalLookup :: Int -> EvalExec Int
EvalSet :: Int -> Int -> EvalExec ()
instance Functor EvalExec where fmap = liftM
instance Applicative EvalExec where pure = return; (<*>) = ap
instance Monad EvalExec where return = EvalReturn; (>>=) = EvalBind
-- name stands for "tree walking evaluator, monadic version"
twe_mon :: Exp -> Int
twe_mon exp =
exec (eval exp) mt_env (\_ r -> r)
where
eval :: Exp -> EvalExec Int
eval = \case
Lit v -> return v
Var n -> do
v <- EvalLookup n
return v
Set n exp -> do
v <- eval exp
EvalSet n v
return v
Bin op e1 e2 -> do
v1 <- eval e1
v2 <- eval e2
return $ (bin op) v1 v2
Do first rest -> do
_ <- eval first
eval rest
While condition body -> do
c <- eval condition
if 1 == c
then do
_ <- eval body
eval (While condition body)
else return bottom
exec :: EvalExec a -> Env -> (Env -> a -> Int) -> Int
exec m env cont = case m of
EvalBind prev step -> exec prev env (\env ret -> exec (step ret) env cont)
EvalReturn v -> cont env v
EvalLookup n -> cont env (lookup env n)
EvalSet n v -> cont (insert env n v) ()
```

The implementation of `exec`

here relies on a technique called
"continuation-passing style", commonly abbreviated CPS, which will be discussed
in more depth in a future part of this series.

Overall, I would say the extra complexity of using a monad in this case does
*not* pay for itself. The amount of state we have to thread through, and the
amount of operations that we want to execute on it, are both small enough that
the code is overall simpler with the direct, non-monadic approach.

This may not remain the case if we made the language more complex, for example by adding other types of side effects than setting variables. Like all abstractions, monads can be very useful, or just add cognitive and runtime overhead, depending on when and where they are used.

### What about performance?

Speaking of overhead, the goal of this series is to, following the agenda of Neil's talk, compare and contrast different ways to write an interpreter on their complexity to performance ratio. This requires some way to measure performance, and possibly some baseline to compare to.

At this point, we already have two implementations, so this can be a good time to look into building some form of benchmarking process. This will be the subject of the next part in this series.