Cheap interpreter, part 2: tree-walking
Last week I described the general structure of an interpreter, and gave a cursory introduction to parsing. From this point on, I will simply assume I am starting with a parse tree, and ignore how it was produced.
This series is based on Neil Mitchell's talk "Cheaply writing a fast interpeter". The talk compares a number of approaches to writing an interpreter and tries to find a good balance between complexity and interpreter overhead.
The following topics, while important, are out of scope:
- Parsing. We're assuming we start with a parse tree.
- Producing assembly code: the definition of "cheap" that Neil uses in the talk is "maintainers are not required to know another language" (specifically assembly).
- Semantic optimizations (constant folding, algebraic transformations, etc.). The goal is to compare the interpretation overhead of various approaches, and semantic optimizations are considered orthogonal to that.
- JIT is not explicitly excluded in the talk, but it is completely absent. This is probably also a consequence of the "cheap" constraint.
Code sample
The talk uses a specific code snippet as the running example across all approaches. We'll do the same. Here is the snippet as given in the talk:
x = 100;
for (i = 1000; i != 0; i--) {
x = x + 4 + x + 3;
x = x + 2 + 4;
}
x
Parse tree
The talk presents all of its approaches in Rust, but I don't know Rust so I made my explorations in a mix of Haskell and Clojure. I think there are interesting things to say about both, so, when relevant, I'll present both in this series.
The parse trees are going to look superficially different, but they encode the exact same structure. In Clojure, all I could do is show you an example, and hope you can infer the general, underlying structure. In Haskell, however, I can describe the structure itself. Here it is:
data Op
= Add
| NotEq
deriving Show
data Exp
= Lit Int
| Var Int
| Set Int Exp
| Bin Op Exp Exp
| Do Exp Exp
| While Exp Exp
deriving Show
This is not meant to be a complete language; it is only meant to cover the specific operations we see in our one snippet. We have, in order:
Lit i
: literal values; for simplicity, we assume all values in the language are integers.Var i
: access a variable from the environment. Again, for simplicity, we assume variable names are integers. This can always be done, and usually is, because indexing by number is faster than looking up by string.Set i e
: set the value of variablei
to be the result of evaluating expressione
. For example,Set 0 (Lit 3)
would represent the instructionx = 3
.Bin op e1 e2
: apply the binary operatorop
to the results of evaluating expressionse1
ande2
, for exampleBin Add (Lit 3) (Lit 4)
would represent3 + 4
.Do e1 e2
is our main mean of sequencing operations. One could be tempted to represent a list of operations as a list, but representing it as a tree-shaped structure makes writing code against it easier. Concrete parse trees become a bit more cumbersome, but keep in mind that in general they are produced by the parser. Being easier to produce and consume is better than being easier to write down by hand, or to read as a human.While condition body
is our looping mechanism: we'll first evaluate thecondition
, and if it evaluates to 1, we'll evaluate thebody
and come back tocondition
.
We only need two binary operations for this snippet, so we could have just made
two entries in Exp
called Add Exp Exp
and NotEq Exp Exp
, but, as for
Do
, isolating binary operations makes writing code against the representation
easier, because it makes the Exp
type simpler. The tradeoff is that it makes
Exp
values more complex, but that is a tradeoff worth making.
With that representation, our code snippet will look like:
ast :: Exp
ast =
-- x = 100
(Do (Set 0 (Lit 100))
-- i = 1000
(Do (Set 1 (Lit 1000))
-- for (; i != 0;)
(Do (While (Bin NotEq (Lit 0)
(Var 1))
-- x = (((x + 4) + x) + 3)
(Do (Set 0 (Bin Add (Bin Add (Bin Add (Var 0)
(Lit 4))
(Var 0))
(Lit 3)))
-- x = ((x + 2) + 4)
(Do (Set 0 (Bin Add (Bin Add (Var 0)
(Lit 2))
(Lit 4)))
-- i = i + (-1)
(Set 1 (Bin Add (Lit (-1))
(Var 1))))))
-- return x
(Var 0))))
Walking the tree
Let's start with the most straightforward approach. First, what are we trying
to do? We want a function that takes in an Exp
, but then what? We could try
to handle side effects, but for the sake of simplicity here we'll just assume
our programs return an Int
. This would yield a function of type Exp -> Int
.
While our language does not have "external" effects, such as printing, it does
have "internal" effects, like setting the value of a variable. We'll need to
represent that somehow. The most direct way to represent that is to carry
around an environment, which we'll represent as a Map Int Int
with insert
and lookup
wrapper functions:
newtype Env = Env (Data.Map Int Int)
deriving Show
lookup :: Env -> Int -> Int
lookup (Env m) n = maybe bottom id (Data.Map.lookup n m)
insert :: Env -> Int -> Int -> Env
insert (Env m) n v = Env $ Data.Map.insert n v m
We'll also need an easy way to start from an empty one:
mt_env :: Env
mt_env = Env Data.Map.empty
It's been argued that the most natural way to traverse a tree is to use a recursive function. I strongly dispute the notion that there is anything "natural" about pretty much any programming approach, but recursion certainly is a good practical way of walking down a tree. Our recursion will need to carry over two pieces of state:
- The expression we (still) need to evaluate, and
- The current environment, i.e. the current value of all our variables.
As a first attempt, one could try thinking about the recursive loop type as
Exp -> Env -> Int
. That would not work, though, because we will have cases
where we want to go down one child of the current node, collecting state
changes on the way, and then process the next child of the current node using
the updated state. In our case, the state is represented entirely by Env
, so
our inner loop type will be Exp -> Env -> (Int, Env)
.
If you have read my series on monads, this description should be enough to get you started thinking about a monadic approach. We'll get to that, but for now let's start with a more direct approach.
If we want our top-level API to be Exp -> Int
and need our recursion to be
Exp -> Env -> (Int, Env)
, we need an inner helper function, which we'll call
loop
, as its main purpose is to loop.
This gives us the following structure for our interpreter:
naive_ast_walk :: Exp -> Int
naive_ast_walk ex =
let (r, _) = loop ex mt_env
in r
where
loop :: Exp -> Env -> (Int, Env)
loop exp0 env0 =
case exp0 of
Lit v -> undefined
Var n -> undefined
Set n exp1 -> undefined
Bin op e1 e2 -> undefined
Do first rest -> undefined
While condition body -> undefined
All that is left is to give a precise meaning for each node type in our parse
tree. Let's start with the easiest one: Lit i
:
Lit v -> (v, env0)
We don't change the environment, and we simply return the literal value we're
given. Var n
is just a little bit more complex, as we need to extract the
value from the environment. Fortunately, we've already defined an abstraction
for that around our Env
representation, so it is not all that much harder:
Var n -> (lookup env0 n, env0)
Next up is Set n exp1
, which is the first operation where we need to recurse
on a subtree. The important part here is that the right-hand-side of the
assignment could, in principle, have side-effects, so we need to be careful not
to drop the new state on the floor:
Set n exp1 -> let (v, env1) = loop exp1 env0
in (v, insert env1 n v)
This is about as complex as it gets. The next operation, Bin
, just applies
the same recursion mechanism twice:
Bin op e1 e2 -> do
let (v1, env1) = loop e1 env0
let (v2, env2) = loop e2 env1
((bin op) v1 v2, env2)
where we have defined bin
as:
bin :: Op -> Int -> Int -> Int
bin = \case
Add -> (+)
NotEq -> \v1 v2 -> if v1 /= v2 then 1 else 0
Do
is just a simpler version of Bin
: we still recurse twice, but this time
we don't need to keep the result from the first recursion (though we do still
need the state from it):
Do first rest -> do
let (_, env1) = loop first env0
loop rest env1
Finally, our While
loop needs to evaluate its condition, then optionally
evaluate its body and start over.
While condition body -> do
let (c, env1) = loop condition env0
if c == 1
then do
let (_, env2) = loop body env1
loop exp0 env2
else (bottom, env1)
where bottom
is defined as
bottom :: Int
bottom = undefined
Aside: a note on language design
Note that there are a few important language design choices being made here:
- We have decided not to have booleans, and implicitly defined
1
as being true and any other integer value as being false. - We have decided that the result of a while loop is to crash, lazily. That
means that if someone wrote something like this:
a = while (...) {...} b = a + 1
we would crash, at runtime, on the second line.
- We have also chosen to define the return value of an assignment as the assigned value. More generally, we do not distinguish between statements and expressions.
To be clear, these are, overall, pretty bad choices, and if you are trying to design your own language, I am not advocating for these particular design choices. This exercise, however, is not about language design, but about exploring different ways to write an interpreter. Language design is a mostly orthogonal concern, and the semantics we have here, while not great for a practical language, are good enough for our purpose.
A monadic approach
As mentioned, this notion of traversing a tree while keeping some state around is a perfect use-case for monads. Here is an equivalent, monad-based implementation of the same tree-walking strategy:
data EvalExec a where
EvalBind :: EvalExec a -> (a -> EvalExec b) -> EvalExec b
EvalReturn :: a -> EvalExec a
EvalLookup :: Int -> EvalExec Int
EvalSet :: Int -> Int -> EvalExec ()
instance Functor EvalExec where fmap = liftM
instance Applicative EvalExec where pure = return; (<*>) = ap
instance Monad EvalExec where return = EvalReturn; (>>=) = EvalBind
-- name stands for "tree walking evaluator, monadic version"
twe_mon :: Exp -> Int
twe_mon exp =
exec (eval exp) mt_env (\_ r -> r)
where
eval :: Exp -> EvalExec Int
eval = \case
Lit v -> return v
Var n -> do
v <- EvalLookup n
return v
Set n exp -> do
v <- eval exp
EvalSet n v
return v
Bin op e1 e2 -> do
v1 <- eval e1
v2 <- eval e2
return $ (bin op) v1 v2
Do first rest -> do
_ <- eval first
eval rest
While condition body -> do
c <- eval condition
if 1 == c
then do
_ <- eval body
eval (While condition body)
else return bottom
exec :: EvalExec a -> Env -> (Env -> a -> Int) -> Int
exec m env cont = case m of
EvalBind prev step -> exec prev env (\env ret -> exec (step ret) env cont)
EvalReturn v -> cont env v
EvalLookup n -> cont env (lookup env n)
EvalSet n v -> cont (insert env n v) ()
The implementation of exec
here relies on a technique called
"continuation-passing style", commonly abbreviated CPS, which will be discussed
in more depth in a future part of this series.
Overall, I would say the extra complexity of using a monad in this case does not pay for itself. The amount of state we have to thread through, and the amount of operations that we want to execute on it, are both small enough that the code is overall simpler with the direct, non-monadic approach.
This may not remain the case if we made the language more complex, for example by adding other types of side effects than setting variables. Like all abstractions, monads can be very useful, or just add cognitive and runtime overhead, depending on when and where they are used.
What about performance?
Speaking of overhead, the goal of this series is to, following the agenda of Neil's talk, compare and contrast different ways to write an interpreter on their complexity to performance ratio. This requires some way to measure performance, and possibly some baseline to compare to.
At this point, we already have two implementations, so this can be a good time to look into building some form of benchmarking process. This will be the subject of the next part in this series.