Last week I presented a simple strategy for building an interpreter, written in two different styles. This week we're going to start on the main purpose of this series: discussing optimization techniques, with a particular focus on the tradeoff between their added complexity and their added performance.

This series is based on Neil Mitchell's talk "Cheaply writing a fast interpeter". The talk compares a number of approaches to writing an interpreter and tries to find a good balance between complexity and interpreter overhead.

The following topics, while important, are out of scope:

• Producing assembly code: the definition of "cheap" that Neil uses in the talk is "maintainers are not required to know another language" (specifically assembly).
• Semantic optimizations (constant folding, algebraic transformations, etc.). The goal is to compare the interpretation overhead of various approaches, and semantic optimizations are considered orthogonal to that.
• JIT is not explicitly excluded in the talk, but it is completely absent. This is probably also a consequence of the "cheap" constraint.

### Very crude benchmarking

If we're going to talk about performance, we need to measure it. We also need to define what we measure. In this series, we're going to use a very simple setup, where we start with functions of type Exp -> () -> Int, give them our sample program, and then measure how long it takes to run the resulting () -> Int multiple times.

At this point this may seem like an unnecessary extra step compared to just starting with a Exp -> Int function. Looking back to our definition of an interpreter in part 1, we want to measure the time taken by the last step, which we called the evaluator. Measuring the entire process could also be interesting, but in general, code is read once and then most functions are executed many times. Therefore, in this series we assume that the cost of parsing and optimizing are irrelevant because we can amortize it over many evaluations.

Because we're starting with a parse tree (no parsing time) and we don't have any optimization step yet, for our two existing implementations we can just define the functions to benchmark as:

  let functions = [
("direct", direct),
("naive_ast_walk", \() -> naive_ast_walk ast),
("twe_mon", \() -> twe_mon ast)
]


I've included a new function, direct, as a sort of baseline for our benchmarking. Here is its definition:

direct :: () -> Int
direct _ =
loop 100 1000
where
loop :: Int -> Int -> Int
loop x0 i =
if (0 == i)
then x0
else let x1 = x0 + 4 + x0 + 3
x2 = x1 + 2 + 4
in loop x2 (i - 1)


This is a direct implementation of the same operations our sample code is doing, but written in Haskell itself. This will serve as a reference point to tell us how much overhead our interpreters are adding.

Our benchmarking function looks like:

bench :: Control.DeepSeq.NFData a => (String, () -> a) -> IO ()
bench (name, f) = do
let now = System.Clock.getTime System.Clock.Monotonic
let raw_string = Formatting.now . Data.Text.Lazy.Builder.fromString
let printDur = Formatting.fprint
(Formatting.Formatters.string
Formatting.%
raw_string " ("
Formatting.%
Formatting.Formatters.shown
Formatting.%
raw_string " runs): "
Formatting.%
Formatting.Clock.timeSpecs
Formatting.%
raw_string " ("
Formatting.%
Formatting.Formatters.string
Formatting.%
raw_string " μs/run)\n")
let ntimes :: Int -> IO ()
ntimes 0 = return ()
ntimes n = Control.DeepSeq.deepseq (f ()) (ntimes (n - 1))
let per_run t1 t2 n = do
let i1 = System.Clock.toNanoSecs t1
i2 = System.Clock.toNanoSecs t2
show $((i2 - i1) div n div 1000) start <- now ntimes 3 end <- now printDur name (3::Int) start end (per_run start end 3) start <- now ntimes 30 end <- now printDur name (30::Int) start end (per_run start end 30)  It uses the System.Clock and Formatting packages for the measurement and reporting, and the Control.DeepSeq package to force evaluation.1 Because the evaluation model of Haskell is sometimes a bit hard to predict, we run the function 3 then 30 times, and hope to get similar results. If we don't, it would probably indicate some unwanted memoization, or a GC run in the middle of the test. This is far from a scientific benchmark and if you're looking for how to benchmark Haskell code for some important purpose, please don't just reuse this function. However, it will suffice for our purposes here. Running this yields: direct (3 runs): 3.31 ms (1101 μs/run) direct (30 runs): 30.29 ms (1009 μs/run) naive_ast_walk (3 runs): 39.29 ms (13097 μs/run) naive_ast_walk (30 runs): 412.18 ms (13739 μs/run) twe_mon (3 runs): 146.17 ms (48722 μs/run) twe_mon (30 runs): 630.45 ms (21015 μs/run)  There are a couple interesting things to see here. First, the overhead of interpreting, with what should be our slowest possible interpreter, is only about 13x, which is way lower than I would have expected. For comparison, Neil reported a factor of 570 in his talk. There's obviously a chance I've done something wrong in my measurement, but I think the difference can be explained by the Rust compiler. The other strange thing in those numbers is the seemingly huge cost of running the monadic version three times. To explain why it seems so much faster to do it 30 times rather than 3, we have to invoke either garbage collection or memoization. Running the program again yields: direct (3 runs): 3.21 ms (1068 μs/run) direct (30 runs): 31.08 ms (1036 μs/run) naive_ast_walk (3 runs): 38.49 ms (12828 μs/run) naive_ast_walk (30 runs): 402.28 ms (13409 μs/run) twe_mon (3 runs): 66.85 ms (22283 μs/run) twe_mon (30 runs): 672.69 ms (22423 μs/run)  which makes me lean towards garbage collection. Finally, this gives us some estimate of the runtime cost of our monadic abstraction. In most cases, I would say the gain in clarity is worth a 2x performance penalty, but, as discussed in part 2, in this specific case I think even from a purely syntactic perspective the monadic approach is not really worth it. As a comparison point, for the Clojure version of this approach (which I'm not going to expand on as it is pretty much the exact same code, barring minor syntactic differences), the benchmark result is 2.7 microseconds per run for the baseline (direct) and 5.34 milliseconds for the naive tree walking, or a factor of nearly 2000. (The monadic version runs in about 20 milliseconds, so the overhead of my monad implementation in Clojure is about 4x). ### Building up functions So where is all of that time going? Our code sample is a simple loop, and on every single iteration of that loop, we walk the expression tree. The simplest way to speed this up is to just stop doing that: the tree itself is not changing, so we could walk it once and keep the results. Our tree is already the result of parsing another representation; we need to carefully choose the representation we'll use for "saving the result of walking the tree". We need something that is faster to traverse than a tree, but we need to account for the fact that even though the tree itself does not change, we'll traverse it with different values. In other words, we need a runnable representation of some computation to which we can pass different inputs. This can get hairy in some languages, but in "functional" ones2, we have first class functions, which means it's easy to create new functions at runtime. The main trick will be to try and make sure we compute as much as we can before creating the function. ### Compiling to host functions If we look back to the loop function in our interpreter, it takes two arguments, an Exp and an Env. When evaluating a While loop, the Exp part is going to be constant, whereas the Env part should change with each iteration (otherwise the loop is infinite). The type of our inner loop does not need to change. It used to be Exp -> Env -> (Int, Env), and it can stay that way. What will change is that we explicitly return an Env -> (Int, Env) now. Let's look at the Do case, as it's simple enough yet representative of the main mechanisms. In the naive tree walking approach, it is written as:  Do first rest -> do let (_, env1) = loop first env0 loop rest env1  If we want to cache the expression parsing, we first need to turn first and rest into functions, and then we can return a new function that calls those:  Do first rest -> let f = compile first r = compile rest in \env -> let (_, env1) = f env in r env1  If we change just one of the cases, we've just shuffled function arguments a bit. With Haskell being lazy and auto-currying, that wouldn't matter much. And with the signature unchanged, we could actually just change the Do case and keep everything else the same. However, if we change all the cases, we can build up a single function that has already done all of the tree walking and just waits for the environment. Here's the full version: compile_to_closure :: Exp -> () -> Int compile_to_closure e = let !c = compile e in \() -> (fst$ c mt_env)
where
compile :: Exp -> Env -> (Int, Env)
compile = \case
Lit v -> \env -> (v, env)
Var n -> \env -> (lookup env n, env)
Set n exp ->
let !f = compile exp
in \env ->
let (v1, env1) = f env
in (v1, insert env1 n v1)
Bin op e1 e2 ->
let !f1 = compile e1
!f2 = compile e2
in \env ->
let (v1, env1) = f1 env
(v2, env2) = f2 env1
in ((bin op) v1 v2, env2)
Do first rest ->
let !f = compile first
!r = compile rest
in \env ->
let (_, env1) = f env
in r env1
While condition body ->
let !cond = compile condition
!bod = compile body
!loop = \env ->
let (c, env1) = cond env
in if 1 == c
then let (_, env2) = bod env1
in loop (env2)
else (bottom, env1)
in loop


I must admit, when I wrote this, I wasn't quite sure whether that would help. As I just mentioned, with currying and laziness, this really is, in a sense, the same code as the naive_ast_walk function, and one could argue that a "sufficiently smart compiler" should see through the difference.

In a non-lazy, non-currying language like, say, Clojure, though, the two functions are semantically very different, so I was hoping that restructuring the code to that form could help the Haskell compiler generate more optimized code. Here are the results:

direct (3 runs): 2.98 ms (991 μs/run)
direct (30 runs): 30.17 ms (1005 μs/run)
naive_ast_walk (3 runs): 37.98 ms (12659 μs/run)
naive_ast_walk (30 runs): 375.84 ms (12527 μs/run)
twe_mon (3 runs): 77.71 ms (25902 μs/run)
twe_mon (30 runs): 770.68 ms (25689 μs/run)
compile_to_closure (3 runs): 40.58 ms (13526 μs/run)
compile_to_closure (30 runs): 389.53 ms (12984 μs/run)


So, basically no measurable impact. Which, again, makes sense. I just wish I knew of a way to ascertain whether that's because the compiler is smart enough to already do that optimization in the naive_ast_walk function, or because it's too lazy to take advantage of it in compile_to_closure. I want to lean towards the former, as adding (or removing) the "bangs" in the function bindings has no impact.

For comparison, in his talk, Neil reports a 30% overhead reduction by using this form (in Rust), and in my own experiment, on the Clojure side this transformation runs in 1.9 milliseconds, or about 2.7 times faster than the naive tree walk.

For reference, here is the Clojure code:

(defn naive-ast-walk
[expr]
(let [h (fn h [expr env]
(match expr
[:lit v] [v env]
[:var idx] [(get env idx) env]
[:set idx e] (let [[v env] (h e env)]
[v (assoc env idx v)])
[:bin op e1 e2] (let [f (bin op)
[v1 env] (h e1 env)
[v2 env] (h e2 env)]
[(f v1 v2) env])
(h tail env))
[:while e-condition e-body] (loop [env env]
(let [[condition env] (h e-condition env)]
(if (== condition 1)
(let [[_ env] (h e-body env)]
(recur env))
[nil env])))))]
(first (h expr {}))))

(defn compile-to-closure
[expr]
(let [h (fn h [expr]
(match expr
[:lit e] (fn [env] [e env])
[:var idx] (fn [env] [(get env idx) env])
[:set idx e] (let [f (h e)]
(fn [env]
(let [[v env] (f env)]
[v (assoc env idx v)])))
[:bin op e1 e2] (let [f (bin op)
f1 (h e1)
f2 (h e2)]
(fn [env]
(let [[v1 env] (f1 env)
[v2 env] (f2 env)]
[(f v1 v2) env])))
tail-body (h tail)]
(fn [env]
(tail-body env))))
[:while e-condition e-body] (let [f-condition (h e-condition)
f-body (h e-body)]
(fn [env]
(let [[condition env] (f-condition env)]
(if (== condition 1)
(let [[_ env] (f-body env)]
(recur env))
[nil env]))))))
cc (h expr)]
#(first (cc {}))))


### Continuations

When a function is called, at runtime, there is a notion of what happens after that function: it will return a value, and then something else will take that value and use it. There is a style of programming in which this notion is reified as something usually called a continuation. It's actually just a normal function, but it is used in a specific way. Let's illustrate with an example.

(defn add
[a b]
(+ a b))

(println (* x 3)))


In this code snippet, what happens after the call to add is that the result gets multiplied by 3, and then printed. In a sense, the continuation of that specific call could be written:

(defn continuation
[x]
(println (* x 3)))


and the original code could be rewritten

(continuation (add 1 2))


This is not buying us much. However, if we wanted, we could now give add control over its continuation, by having it take it as a parameter. The above would then be written as:

(add 1 2 continuation)


or, as a more complete snippet:

(defn add
[a b k]
(k (+ a b)))

(defn mul
[a b k]
(k (* a b)))

(add 1 2 (fn [x] (mul x 3 (fn [y] (println y)))))


In many cases, writing code in CPS style directly is painful and quite pointless. The idea, however, is important for (at least) three reasons:

1. Code written in "full" CPS style is always tail-recursive, and technically never "returns" from any function. These are properties that can be exploited by compilers in all sorts of interesting ways. Note that code does not need to be explicitly written in CPS: it is always possible to mechanically transform code into CPS form, so compilers can do it themselves if they want to work with that representation.
2. Manually writing code in CPS form can help with stack management: in languages that eliminate tail calls, it turns potentially stack-using code into non-stack-using code. In languages that do not eliminate tail calls, a transformation to CPS form can help implement trampolines.
3. Explicitly passing its continuation to a function gives that function the power to decide if, and when, to call that continuation. While the resulting code is not necessarily in full CPS form anymore (depending on how exactly you define that), having each function take its intended continuation as an explicit argument allows for user-level implementation of things like exceptions and green threads.

While all of those are great reasons to study CPS in more depth, in this article I'm not interested in any of them. Instead, I'm interested in one of the few cases where writing explicit CPS is actually easier and safer than the usual "direct" form: tree walking (in Haskell).

Let's go back to our simple tree walking inner loop. Its type was Exp -> Env -> (Int, Env), where the pair is really meant as an argument for the continuation. So let's take that explicitly and turn it into CPS. Our continuation will take in an environment and the result of the current step, and will return an integer. The type of our inner loop looks a bit more complicated:

  loop :: Exp -> Env -> (Env -> Int -> Int) -> Int


but it really just means the same as before: we're going to take in an expression and an environment, and we'll produce a new environment and an integer to pass down to our continuation. Except now we do that last step explicitly. Let's look, again, at the Do case to start with.

      Do first rest -> loop first env (\env _ -> loop rest env cont)


This reads as one would describe the order of operations: evaluate the first expression using the current environment, then evaluate the rest expression using the new environment, and then proceed with the continuation.

One very nice property of this approach, specifically in Haskell, is that it lets us use shadowing to rebind the env symbol, reducing the possibilities for errors. If we contrast with what we had before:

      Do first rest -> do
let (_, env1) = loop first env0
loop rest env1


at the point of calling loop on rest we have two environments in scope, and we have to pick the right one. One might naively think of trying to just rebind the name in the let, in this case:

  loop exp0 env0 = -- this env0 is shadowed more than expected
case exp0 of
...
Do first rest -> do
let (_, env0) = loop first env0
loop rest env0
...


but, in Haskell, that actually results in an infinite loop because let is lazy: the env0 on the left-hand side is indeed shadowing the env0 passed as an argument to the function definition in the loop exp0 env0 expression, but it is also shadowing the env0 the loop first env0 expression. There are uses for this behaviour, but it is a somewhat surprising default for people coming from another language.

Here is the full tree-walking interpreter using CPS:

twe_cont :: Exp -> Int
twe_cont e =
loop e mt_env (\_ r -> r)
where
loop :: Exp -> Env -> (Env -> Int -> Int) -> Int
loop exp env cont =
case exp of
Lit v -> cont env v
Var n -> cont env (lookup env n)
Set n exp -> loop exp env (\env v -> cont (insert env n v) v)
Bin op e1 e2 -> loop e1 env (\env v1 ->
loop e2 env (\env v2 ->
cont env ((bin op) v1 v2)))
Do first rest -> loop first env (\env _ -> loop rest env cont)
While condition body -> loop condition env (\env condition_value ->
if (1 == condition_value)
then loop body env (\env _ -> loop exp env cont)
else cont env bottom)


Haskell is one of those languages that do tail call elimination, so this should safe us a lot of back-and-forth building up and tearing down call frames on the stack. Adding this function to our benchmark, we get:

direct (3 runs): 2.85 ms (948 μs/run)
direct (30 runs): 30.28 ms (1009 μs/run)
naive_ast_walk (3 runs): 38.50 ms (12832 μs/run)
naive_ast_walk (30 runs): 380.42 ms (12680 μs/run)
twe_mon (3 runs): 65.72 ms (21908 μs/run)
twe_mon (30 runs): 632.60 ms (21086 μs/run)
compile_to_closure (3 runs): 39.88 ms (13294 μs/run)
compile_to_closure (30 runs): 400.59 ms (13353 μs/run)
twe_cont (3 runs): 27.58 ms (9193 μs/run)
twe_cont (30 runs): 267.38 ms (8912 μs/run)


which is a pretty nice improvement.

Another context in which one often needs to walk down a tree while rebinding variables is in writing state-carrying monads, specifically the exec function. Thus, using CPS is often a good idea there too, as I showed last week for this interpreter's monad.

Clojure is not a language with tail call elimination, meaning that trying this approach directly results in a StackOverflowError:

(defn twe-cont1
[expr]
(let [h (fn h [expr env cont]
(match expr
[:lit v] (cont env v)
[:var idx] (cont env (get env idx))
[:set idx e] (h e env (fn [env v] (cont (assoc env idx v) v)))
[:bin op e1 e2] (h e1 env
(fn [env v1]
(h e2 env
(fn [env v2]
(cont env ((bin op) v1 v2))))))
[:do head tail] (h head env (fn [env _] (h tail env cont)))
[:while e-condition e-body]
(h e-condition env
(fn [env c]
(if (== 1 c)
(h e-body env (fn [env _] (h expr env cont)))
(cont env nil))))))]
(h expr {} (fn [_ v] v))))

t.core=> (twe-cont1 ast)
Execution error (StackOverflowError) at t.core/twe-cont1$h$fn (core.clj:129).
null

t.core =>


As a possible saving grace, Clojure does have core language support for trampolines, meaning that turning this code into one that does not stack-overflow requires just one tiny adjustment:

(defn twe-cont2
[expr]
(let [h (fn h [expr env cont]
#(match expr
[:lit v] (cont env v)
[:var idx] (cont env (get env idx))
[:set idx e] (h e env (fn [env v] (cont (assoc env idx v) v)))
[:bin op e1 e2] (h e1 env
(fn [env v1]
(h e2 env
(fn [env v2]
(cont env ((bin op) v1 v2))))))
[:do head tail] (h head env (fn [env _] (h tail env cont)))
[:while e-condition e-body]
(h e-condition env
(fn [env c]
(if (== 1 c)
(h e-body env (fn [env _] (h expr env cont)))
(cont env nil))))))]
(h expr {} (fn [_ v] v))))


If you haven't spotted it, here is a zoomed-in version:

;           v
#(match expr
;           ^


This change makes each step return a zero-argument function instead of trying to compute the entire thing in one go. We can then pass this to the builtin trampoline function, which will keep evaluating the returned function until the return type is not a zero-argument function:

t.core=> (twe-cont2 ast)
#object[t.core$twe_cont2$h__3885$fn__3886 0x694e5b07 "t.core$twe_cont2$h__3885$fn__3886@694e5b07"]
t.core=> (trampoline (twe-cont2 ast))
-13
t.core=>


While it's nice that getting back to a non-crashing version is that easy, having to go through a trampoline is obviously not as good for performance as having tail call elimination. Trampolining twe-cont on our ast sample benchmarks at about 6.6 milliseconds, i.e. about 20% slower than the "direct" (i.e. non-CPS) approach.

### Conclusion & next steps

In this post we've seen two ways to use first-class functions in the host language in writing an interpreter: as continuations and as "compiled", reusable pieces of behaviour. We've also started measuring performance, and our most important finding is, in my opinion, how host-language dependent the benefits of each technique are.

This just reinforces the age-old mantra for performance optimization: measure. If you care about performance, looking at what other people have done is always good for inspiration, but don't take anything on faith: measure the impact in your language, on your use-case and using representative sample of your inputs.

Next, we'll be looking at stack machines.

1. In this specific case, deepseq could be replaced by seq from Prelude, because the return value is a simple Int.

2. Different people have different definitions of what makes a language "functional". I've recently written a series of posts explaining my stance on the subject, but for this specific context here having first-class functions is good enough.

Tags: cheap interpreter