Last week, I showed how to use host-language first-class functions as a way to improve the efficiency of an interpreter. The code changes were pretty small and local, which is good, but the performance benefits were not huge. In this post, we're going to explore a completely different approach: we're first going to design a stack machine, then write an interpreter for the stack machine language, and finally write a compiler from our toy language parse tree to stack machine instructions.

This series is based on Neil Mitchell's talk "Cheaply writing a fast interpeter". The talk compares a number of approaches to writing an interpreter and tries to find a good balance between complexity and interpreter overhead.

The following topics, while important, are out of scope:

• Producing assembly code: the definition of "cheap" that Neil uses in the talk is "maintainers are not required to know another language" (specifically assembly).
• Semantic optimizations (constant folding, algebraic transformations, etc.). The goal is to compare the interpretation overhead of various approaches, and semantic optimizations are considered orthogonal to that.
• JIT is not explicitly excluded in the talk, but it is completely absent. This is probably also a consequence of the "cheap" constraint.

### What is a stack machine?

A stack machine is one where the state is represented as a stack. Other types of "machines" in this sense include Turing machines, where the state is not structured, and register machines, where the state is split among a number of individually-referenceable slots.

What does it mean for the state to be represented as a stack? It means that the core operations the machine supports will take one of the following forms:

• push a literal on the top of the stack.
• Read the value at the top of the stack and remove it (pop).
• Combinations of the above: pop a number of values from the stack, do some computation with them, and push the result back on the stack.

If we compare that to our toy language definition, this gives us direct equivalents to Lit and Bin, but it's not clear how to represent the other operations. For reference, here is the source language:

data Exp
= Lit Int
| Var Int
| Set Int Exp
| Bin Op Exp Exp
| Do Exp Exp
| While Exp Exp
deriving Show


### Stack machine language

Let us first consider the issue of variables, because Neil actually explains how to handle that in his talk: if we assume we can also access the bottom of the stack, we can just read variable values from there. This suggests two more operations: get and set, which operate on indices based off the bottom of the stack.

This leaves us with having to choose representations for Do and While. We could mirror the Exp structure, but if I'm going to write a compiler, I want to take the opportunity to drop the level of abstraction, because that's where the potential for optimization comes from. With that in mind, here is the representation I chose:

data StackOp
= StackPush Int
| StackSet Int
| StackGet Int
| StackBin Op
| StackJump Int
| StackJumpIfZero Int
| StackEnd
deriving (Show)

compile_stack :: Exp -> [StackOp]


Of note:

• The compiler is returning a list of StackOp: I have explicitly chosen a "flat" list structure over the tree structure we had in the parse tree. There will be no explicit equivalent to the Do expression.
• There is no pop operation, because we never pop in a vacuum: popping the stack will be done as part of the operations that need to read something from the Stack (StackBin, StackJumpIfZero).
• I chose to represent looping using a combination of conditional and unconditional jumps. This should be good enough to represent any surface-language loops, should the language grow variations like do ... while or for. A smart compiler could even convert tail-recursive functions to a loop using those primitives, if we had functions in the surface language.
• I chose to put an explicit End case, because that means the execution code does not need to constantly check if it has reached the end of the list. Because of the jump-based looping logic, we're not going to just traverse the list of instructions once, but instead we're going to jump around based on indices, and comparing current index to length on each iteration is a bit cumbersome. Although we're not going to make use of that here, this also gives us the option of early termination.

### Stack abstraction

Having defined the data structure, I can now represent programs for my stack machine interpreter. But I still need to define the meaning of such programs, which I can do by writing out said interpreter in a programming language (assuming we can agree said programming language has, itself, a defined meaning).

First, let's define a stack abstraction. The stack is the core state of our machine, so it is worth spending a bit of time formalizing it. From the description of a stack machine, we know we need the following operations:

1. Look at the element at the top (and consume it).
2. Add an element at the top.
3. Look at arbitrary elements indexed from the bottom.
4. Replace an arbitrary element indexed from the bottom.

We're not going to worry about performance yet and define these five functions over a list, in order to get something simple up and running quickly:

pop :: [a] -> (a, [a])
pop ls = (head ls, tail ls)

push :: [a] -> a -> [a]
push ls a = a : ls

get :: [a] -> Int -> a
get ls pos = (reverse ls) !! pos

set :: [a] -> Int -> a -> [a]
set ls pos val =
let r = reverse ls
in reverse (take pos r ++ val : drop (pos+1) r)


With this core abstraction out of the way, we can look at the interpreter itself.1

### Stack interpreter

If we assume that the code has already been compiled, but still want to fit with our benchmarking harness, the signature for our stack machine interpreter will need to be [StackOp] -> () -> Int. Just like our source language interpreter, writing a simple interpreter for our stack machine language reduces to expressing the meaning of each operation.

We'll need an inner loop, which we'll represent as before by a recursive function called loop. This function will return an Int, and will need to keep track of the machine's state, which we claimed was a stack. That's not entirely true, though: there is one more piece of state we need to keep track of, which we've referred to already but not explicitly mentioned: where we currently are in the code. Think of a Turing machine's head position. So in our case the state is an integer and a list, and we get the signature Int -> [Int] -> Int, where the first argument is the position in the code, the second argument is the current stack, and the result is the return value of the program we're running.

Because the code is given as a flat list, we never have multiple recursions and don't need to return the current state, as we needed to do in the simple interpreter. This also means that the code will be naturally tail recursive, which we've seen may be good for performance.

The overall structure of the function will thus be:

exec_stack :: [StackOp] -> () -> Int
exec_stack code =
\() -> loop 0 []
where
loop :: Int -> [Int] -> Int
loop ip stack = case code !! ip of
StackPush v -> undefined
StackSet n -> undefined
StackGet n -> undefined
StackBin op -> undefined
StackJump i -> undefined
StackJumpIfZero i -> undefined
StackEnd -> undefined


And now all we need to do is fill in each case. The first three map pretty directly to our stack abstraction:

    StackPush v -> loop (ip + 1) (push stack v)
StackSet n -> let (v, s) = pop stack
in loop (ip + 1) (set s n v)
StackGet n -> loop (ip + 1) (push stack (get stack n))


In English words:

• We expect the argument to StackPush to be the value we want to push onto the stack.
• We expect the argument to StackSet to be the index at which we set a value. The value itself, however, is not part of the instruction, and instead is taken to implicitly be the top of the stack. This action consumes one element from the stack.
• We expect the argument to StackGet to be the index to read from the stack. Since our language is flat and there is no notion of "returning" a value to a "calling" instruction, the result of reading the variable is made available by pushing it back onto the top of the stack.
• Since none of these instructions deal explicitly with managing the "instruction pointer" ip (sometimes also called "program counter"), we simply move on to the next instruction by incrementing it by one in all cases.

The StackBin operation is a bit more involved as it contains more steps:

    StackBin op -> let (a1, s1) = pop stack
(a2, s2) = pop s1
in loop (ip + 1) (push s2 ((bin op) a2 a1))


There are a couple ways to write this, and one could argue a continuation-based approach could help straighten out the variable naming, but it's still simple enough to follow: we pick out the first two elements of the stack, remove them, compute whatever binary operation we embedded in the instruction, and then push the result back onto the stack.

The two "jump" instructions have to do with manipulating the instruction pointer, rather than the stack. The simpler of the two is the unconditional jump, which literally does nothing else than setting the pointer:

    StackJump i -> loop i stack


For the semantics of the StackJumpIfZero operation, I chose to read the top of the stack and execute the jump only if the value is zero:

    StackJumpIfZero i -> let (v, s) = pop stack
in if v == 0
then loop i s
else loop (ip + 1) s


Finally, the StackEnd instruction will return the top of the stack:

    StackEnd -> fst (pop stack)


### Stack machine compiler

Writing a (simple) compiler is very close to writing a (simple) interpreter: we walk the tree, and for each type of node, we do something appropriate. The only difference is that instead of directly executing the meaning of the current node, we produce code with an equivalent meaning.

As such, the type of our (outer) compilation function will be Exp -> [StackOp]. What will be the type of its inner loop? We'll need to walk down the tree and keep track of some "state", but this time it's not the interpreter state (i.e. no environment) but the "compiled code" state. Here is what we need to keep track of:

• For multi-recursive nodes (e.g. Bin), we need to know what code each side produced.
• When producing "jump" nodes, we need to know where to jump, so it may be important to keep track of how many instructions came before the current one.

Note that this second constraint is a choice we implicitly made when defining the meaning of our jump instructions: we have made them absolute jumps. We could instead have made them relative jumps, which is arguably better as it allows some chunks of code to be relocated.

In particular, in our case, we know that our source language is structured, and thus all jumps are constrained. This would matter in a richer language (e.g. with functions), if we wanted to try and optimize the [StackOp] produced. With our fairly poor language, this choice will not have much of an impact, but the reader is still encouraged to try and turn the given sample code to using relative jumps.

The overall structure of our compiler will look like:

compile_stack :: Exp -> [StackOp]
compile_stack exp =
loop 0 exp <> [StackEnd]
where
loop :: Int -> Exp -> [StackOp]
loop count = \case
Lit v -> undefined
Var n -> undefined
Set n e -> undefined
Bin op e1 e2 -> undefined
Do first rest -> undefined
While cond body -> undefined


and we can look at how to fill in each case.

The first two cases have a direct equivalent in our StackOp language, so we can just emit that:

    Lit v -> [StackPush v]
Var n -> [StackGet n]


Note that, as simple as these two cases are, they illustrate another choice we implicitly made earlier: in defining the type of our inner loop as returning a [StackOp], but not taking one in as an argument, we implicitly decided that it is the job of the parent node to combine the result of compiling its child nodes. Another choice could have been to consider the "list of code produced so far" as part of the state, and pass it down. This would have meant the job of the parent node was to correctly pass in the updated state from one child to the other, yielding a more monadic structure.

This observation leads us to the definition of the next case:

    Set n e -> loop count e <> [StackSet n]


The Set case almost has a direct equivalent in the StackSet node, except that the StackSet node expects its value to be on the stack. We take a leap of faith in recursion and assume that, if we simply run all of the code e compiles to first, we'll end up with a stack that has the result of e at its top.

Note that this is consistent with the choice we have made for Lit of pushing its value on top of the stack, and would also correspond to the effect of StackGet (from compiling Var). So at this point it looks like we could correctly compile x = 1 and x = y.

The next case is Bin, which as usual is not exactly more complicated than Set, but just "more of the same":

    Bin op e1 e2 ->
let c1 = loop count e1
c2 = loop (count + length c1) e2
in c1 <> c2 <> [StackBin op]


The main potential tripping point here is the need to thread through count from one child to the next: when we start compiling e2, we have to know how many instructions were needed for e1. Again, by simple concatenation of the results of compiling e1 and e2, we hope (based on recursive faith) that we'll end up with a stack where the top two values are the result of computing e1 and e2.

Besides forgetting to update the count for the second recursive call, there is another hard-to-track mistake that could happen here: the order of execution for c1 and c2. By putting c1 first, we will execute it first, which corresponds to what most people would expect in a strict language (left-to-right evaluation of arguments), but it also means that we end up with a stack where the result of c2 is above the result of c1. This needs to be taken into account in the implementation of StackBin, if Op has non-commutative members. This is not the case for our sample language here, as we only support addition and inequality. But if this language were a prototype for a richer language, this bug could lie dormant and only appear when someone adds a non-commutative operation. At that point, developers may assume that the existing code is correct, making this bug hard to properly diagnose.

As usual as well, Do is really just a simpler version of Bin:

    Do first rest ->
let c1 = loop count first
c2 = loop (count + length c1) rest
in c1 <> c2


The While case is a bit more involved, as it has more moving parts. Here is the code for it:

    While cond body ->
let cc = loop count cond
cb = loop (count + length cc + 1) body
in cc <> [StackJumpIfZero (count + length cc + 1 + length cb + 1)]
<> cb
<> [StackJump count]


In essence, it is just "even more of the same" compared to Bin; one just has to keep track of the correct count value. The two + 1 correspond to the two instructions we explicitly add here (StackJumpIfZero and StackJump) in addition to the result of compiling cond and body. They could obviously be written as a single +2, but I like having my count increments in the same order as the bits of code I'm concatenating.

And that's all there is to this compiler. For reference, running it on our sample code yields:

[StackPush 100,
StackSet 0,
StackPush 1000,
StackSet 1,
StackPush 0,
StackGet 1,
StackBin NotEq,
StackJumpIfZero 27,
StackGet 0,
StackPush 4,
StackGet 0,
StackPush 3,
StackSet 0,
StackGet 0,
StackPush 2,
StackPush 4,
StackSet 0,
StackPush (-1),
StackGet 1,
StackSet 1,
StackJump 4,
StackGet 0,
StackEnd]


and running our stack machine interpreter over that returns the expected value of -13.

### Next steps

This was an introduction to stack machines, as a general idea. I deliberately did not focus on performance here, and it shows: running our benchmark on this yields:

direct (3 runs): 3.48 ms (1161 μs/run)
direct (30 runs): 34.16 ms (1138 μs/run)
naive_ast_walk (3 runs): 42.60 ms (14201 μs/run)
naive_ast_walk (30 runs): 438.74 ms (14624 μs/run)
twe_mon (3 runs): 83.70 ms (27899 μs/run)
twe_mon (30 runs): 803.79 ms (26793 μs/run)
compile_to_closure (3 runs): 48.72 ms (16238 μs/run)
compile_to_closure (30 runs): 560.64 ms (18688 μs/run)
twe_cont (3 runs): 30.56 ms (10188 μs/run)
twe_cont (30 runs): 300.91 ms (10030 μs/run)
exec_stack (3 runs): 100.69 ms (33564 μs/run)
exec_stack (30 runs): 1.05 s (34871 μs/run)


which means all of this added complexity yielded an interpreter that is about three times as slow as the first one I described. So why mention stack machines in a series about cheap interpreter performance, when at first glance they seem rather expensive and slow? Because a stack machine interpreter can be made more performant, in part due to the linear structure of its code (as opposed to the tree-based structure of our parse tree). In the next part of this series, we'll be looking at how to make that happen.

1. Having defined those as the underlying "primitive" operations to make on the "state" of our stack machine makes them very good candidates for a stack machine monad, though we're not going to go there now.

Tags: cheap interpreter