Cheap interpreter, part 4: stack machines
Last week, I showed how to use host-language first-class functions as a way to improve the efficiency of an interpreter. The code changes were pretty small and local, which is good, but the performance benefits were not huge. In this post, we're going to explore a completely different approach: we're first going to design a stack machine, then write an interpreter for the stack machine language, and finally write a compiler from our toy language parse tree to stack machine instructions.
This series is based on Neil Mitchell's talk "Cheaply writing a fast interpreter". The talk compares a number of approaches to writing an interpreter and tries to find a good balance between complexity and interpreter overhead.
The following topics, while important, are out of scope:
- Parsing. We're assuming we start with a parse tree.
- Producing assembly code: the definition of "cheap" that Neil uses in the talk is "maintainers are not required to know another language" (specifically assembly).
- Semantic optimizations (constant folding, algebraic transformations, etc.). The goal is to compare the interpretation overhead of various approaches, and semantic optimizations are considered orthogonal to that.
- JIT is not explicitly excluded in the talk, but it is completely absent. This is probably also a consequence of the "cheap" constraint.
What is a stack machine?
A stack machine is one where the state is represented as a stack. Other types of "machines" in this sense include Turing machines, where the state is not structured, and register machines, where the state is split among a number of individually-referenceable slots.
What does it mean for the state to be represented as a stack? It means that the core operations the machine supports will take one of the following forms:
- Push a literal on the top of the stack (push).
- Read the value at the top of the stack and remove it (pop).
- Combinations of the above: pop a number of values from the stack, do some computation with them, and push the result back on the stack.
If we compare that to our toy language definition, this gives us direct equivalents to Lit and Bin, but it's not clear how to represent the other operations. For reference, here is the source language:
data Exp
= Lit Int
| Var Int
| Set Int Exp
| Bin Op Exp Exp
| Do Exp Exp
| While Exp Exp
deriving Show
Stack machine language
Let us first consider the issue of variables, because Neil actually
explains how to handle that in his talk: if we assume we can also
access the bottom of the stack, we can just read variable values from there.
This suggests two more operations: get and set, which operate on indices based off the bottom of the stack.
This leaves us with having to choose representations for Do and While. We could mirror the Exp structure, but if I'm going to write a compiler, I want to take the opportunity to drop the level of abstraction, because that's where the potential for optimization comes from. With that in mind, here is the representation I chose:
data StackOp
= StackPush Int
| StackSet Int
| StackGet Int
| StackBin Op
| StackJump Int
| StackJumpIfZero Int
| StackEnd
deriving (Show)
compile_stack :: Exp -> [StackOp]
Of note:
- The compiler is returning a list of StackOp: I have explicitly chosen a "flat" list structure over the tree structure we had in the parse tree. There will be no explicit equivalent to the Do expression.
- There is no pop operation, because we never pop in a vacuum: popping the stack will be done as part of the operations that need to read something from the stack (StackBin, StackJumpIfZero).
- I chose to represent looping using a combination of conditional and unconditional jumps. This should be good enough to represent any surface-language loops, should the language grow variations like do ... while or for. A smart compiler could even convert tail-recursive functions to a loop using those primitives, if we had functions in the surface language.
- I chose to put an explicit End case, because that means the execution code does not need to constantly check if it has reached the end of the list. Because of the jump-based looping logic, we're not going to just traverse the list of instructions once, but instead we're going to jump around based on indices, and comparing the current index to the length on each iteration is a bit cumbersome. Although we're not going to make use of that here, this also gives us the option of early termination.
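To get a feel for this instruction set before writing the compiler, here is a small hand-written program (not something the compiler produces; the choice of variable slot 0 is arbitrary) that computes 1 + 2, stores the result in variable 0, and returns it:

example :: [StackOp]
example =
  [ StackPush 1   -- stack: [1]
  , StackPush 2   -- stack: [2, 1]
  , StackBin Add  -- pop both, push their sum; stack: [3]
  , StackSet 0    -- pop the 3 and store it as variable 0
  , StackGet 0    -- read variable 0 and push it back on top
  , StackEnd      -- stop; the top of the stack is the result
  ]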
Stack abstraction
Having defined the data structure, I can now represent programs for my stack machine interpreter. But I still need to define the meaning of such programs, which I can do by writing out said interpreter in a programming language (assuming we can agree said programming language has, itself, a defined meaning).
First, let's define a stack abstraction. The stack is the core state of our machine, so it is worth spending a bit of time formalizing it. From the description of a stack machine, we know we need the following operations:
- Look at the element at the top (and consume it).
- Add an element at the top.
- Look at arbitrary elements indexed from the bottom.
- Replace an arbitrary element indexed from the bottom.
We're not going to worry about performance yet, and will define these four functions over a list, in order to get something simple up and running quickly:
pop :: [a] -> (a, [a])
pop ls = (head ls, tail ls)
push :: [a] -> a -> [a]
push ls a = a : ls
get :: [a] -> Int -> a
get ls pos = (reverse ls) !! pos
set :: [a] -> Int -> a -> [a]
set ls pos val =
let r = reverse ls
in reverse (take pos r ++ val : drop (pos+1) r)
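As a quick sanity check, here is how these helpers behave on a small example stack (remember that get and set index from the bottom, while pop and push work on the top):

-- pop  [2, 1]      == (2, [1])    -- 2 was the top of the stack
-- push [1] 2       == [2, 1]
-- get  [2, 1] 0    == 1           -- index 0 is the bottom element
-- set  [2, 1] 0 5  == [2, 5]      -- replace the bottom element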
With this core abstraction out of the way, we can look at the interpreter itself.1
Stack interpreter
If we assume that the code has already been compiled, but still want to fit with our benchmarking harness, the signature for our stack machine interpreter will need to be [StackOp] -> () -> Int. Just like our source language interpreter, writing a simple interpreter for our stack machine language reduces to expressing the meaning of each operation.
We'll need an inner loop, which we'll represent as before by a recursive function called loop. This function will return an Int, and will need to keep track of the machine's state, which we claimed was a stack. That's not entirely true, though: there is one more piece of state we need to keep track of, which we've referred to already but not explicitly mentioned: where we currently are in the code. Think of a Turing machine's head position. So in our case the state is an integer and a list, and we get the signature Int -> [Int] -> Int, where the first argument is the position in the code, the second argument is the current stack, and the result is the return value of the program we're running.
Because the code is given as a flat list, we never have multiple recursions and don't need to return the current state, as we needed to do in the simple interpreter. This also means that the code will be naturally tail recursive, which we've seen may be good for performance.
The overall structure of the function will thus be:
exec_stack :: [StackOp] -> () -> Int
exec_stack code =
\() -> loop 0 []
where
loop :: Int -> [Int] -> Int
loop ip stack = case code !! ip of
StackPush v -> undefined
StackSet n -> undefined
StackGet n -> undefined
StackBin op -> undefined
StackJump i -> undefined
StackJumpIfZero i -> undefined
StackEnd -> undefined
And now all we need to do is fill in each case. The first three map pretty directly to our stack abstraction:
StackPush v -> loop (ip + 1) (push stack v)
StackSet n -> let (v, s) = pop stack
in loop (ip + 1) (set s n v)
StackGet n -> loop (ip + 1) (push stack (get stack n))
In English words:
- We expect the argument to StackPush to be the value we want to push onto the stack.
- We expect the argument to StackSet to be the index at which we set a value. The value itself, however, is not part of the instruction, and instead is taken to implicitly be the top of the stack. This action consumes one element from the stack.
- We expect the argument to StackGet to be the index to read from the stack. Since our language is flat and there is no notion of "returning" a value to a "calling" instruction, the result of reading the variable is made available by pushing it back onto the top of the stack.
- Since none of these instructions deal explicitly with managing the "instruction pointer" ip (sometimes also called the "program counter"), we simply move on to the next instruction by incrementing it by one in all cases.
The StackBin operation is a bit more involved, as it contains more steps:
StackBin op -> let (a1, s1) = pop stack
(a2, s2) = pop s1
in loop (ip + 1) (push s2 ((bin op) a2 a1))
There are a couple ways to write this, and one could argue a continuation-based approach could help straighten out the variable naming, but it's still simple enough to follow: we pick out the first two elements of the stack, remove them, compute whatever binary operation we embedded in the instruction, and then push the result back onto the stack.
The two "jump" instructions have to do with manipulating the instruction pointer, rather than the stack. The simpler of the two is the unconditional jump, which literally does nothing else than setting the pointer:
StackJump i -> loop i stack
For the semantics of the StackJumpIfZero
operation, I chose to read the top
of the stack and execute the jump only if the value is zero:
StackJumpIfZero i -> let (v, s) = pop stack
in if v == 0
then loop i s
else loop (ip + 1) s
Finally, the StackEnd
instruction will return the top of the stack:
StackEnd -> fst (pop stack)
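Putting the cases together, here is a trace of the interpreter running the small hand-written program from earlier, showing the stack before and after each instruction:

-- ip=0  StackPush 1    []      -> [1]
-- ip=1  StackPush 2    [1]     -> [2, 1]
-- ip=2  StackBin Add   [2, 1]  -> [3]
-- ip=3  StackSet 0     [3]     -> [3]      (the 3 is now variable 0)
-- ip=4  StackGet 0     [3]     -> [3, 3]
-- ip=5  StackEnd       returns 3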
Stack machine compiler
Writing a (simple) compiler is very close to writing a (simple) interpreter: we walk the tree, and for each type of node, we do something appropriate. The only difference is that instead of directly executing the meaning of the current node, we produce code with an equivalent meaning.
As such, the type of our (outer) compilation function will be Exp -> [StackOp]. What will be the type of its inner loop? We'll need to walk down the tree and keep track of some "state", but this time it's not the interpreter state (i.e. no environment) but the "compiled code" state. Here is what we need to keep track of:
- For multi-recursive nodes (e.g. Bin), we need to know what code each side produced.
- When producing "jump" nodes, we need to know where to jump, so it may be important to keep track of how many instructions came before the current one.
Note that this second constraint is a choice we implicitly made when defining the meaning of our jump instructions: we have made them absolute jumps. We could instead have made them relative jumps, which is arguably better as it allows some chunks of code to be relocated.
In particular, in our case, we know that our source language is structured, and thus all jumps are constrained. This would matter in a richer language (e.g. with functions), if we wanted to try and optimize the [StackOp] produced. With our fairly poor language, this choice will not have much of an impact, but the reader is still encouraged to try rewriting the given sample code to use relative jumps.
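Should you try that, the interpreter side of the change is small. A minimal sketch, assuming the argument of both jump instructions is now an offset from the current instruction rather than an absolute index, could look like this; the compiler would then have to emit offsets relative to the position of the jump instruction itself:

StackJump d       -> loop (ip + d) stack
StackJumpIfZero d -> let (v, s) = pop stack
                     in if v == 0
                        then loop (ip + d) s
                        else loop (ip + 1) s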
The overall structure of our compiler will look like:
compile_stack :: Exp -> [StackOp]
compile_stack exp =
loop 0 exp <> [StackEnd]
where
loop :: Int -> Exp -> [StackOp]
loop count = \case
Lit v -> undefined
Var n -> undefined
Set n e -> undefined
Bin op e1 e2 -> undefined
Do first rest -> undefined
While cond body -> undefined
and we can look at how to fill in each case.
The first two cases have a direct equivalent in our StackOp
language, so we
can just emit that:
Lit v -> [StackPush v]
Var n -> [StackGet n]
Note that, as simple as these two cases are, they illustrate another choice we implicitly made earlier: in defining the type of our inner loop as returning a [StackOp], but not taking one in as an argument, we implicitly decided that it is the job of the parent node to combine the result of compiling its child nodes. Another choice could have been to consider the "list of code produced so far" as part of the state, and pass it down. This would have meant the job of the parent node was to correctly pass in the updated state from one child to the other, yielding a more monadic structure.
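For illustration, here is a minimal sketch of what that alternative shape could look like, threading the code produced so far as an accumulator; the name loopAcc is made up for this example and only two cases are shown:

loopAcc :: (Int, [StackOp]) -> Exp -> (Int, [StackOp])
loopAcc (count, acc) expr = case expr of
  Lit v -> (count + 1, acc <> [StackPush v])
  Var n -> (count + 1, acc <> [StackGet n])
  _     -> error "remaining cases elided in this sketch"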
Coming back to our compiler, this observation leads us to the definition of the next case:
Set n e -> loop count e <> [StackSet n]
The Set
case almost has a direct equivalent in the StackSet
node, except
that the StackSet
node expects its value to be on the stack. We take a leap
of faith in recursion and assume that, if we simply run all of the code e
compiles to first, we'll end up with a stack that has the result of e
at its
top.
Note that this is consistent with the choice we have made for Lit of pushing its value on top of the stack, and would also correspond to the effect of StackGet (from compiling Var). So at this point it looks like we could correctly compile x = 1 and x = y.
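Concretely, if we say x lives at index 0 and y at index 1 (an arbitrary assignment for this example), the inner loop would turn those two statements into:

-- Set 0 (Lit 1)   -- "x = 1"   becomes   [StackPush 1, StackSet 0]
-- Set 0 (Var 1)   -- "x = y"   becomes   [StackGet 1, StackSet 0]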
The next case is Bin, which as usual is not exactly more complicated than Set, but just "more of the same":
Bin op e1 e2 ->
let c1 = loop count e1
c2 = loop (count + length c1) e2
in c1 <> c2 <> [StackBin op]
The main potential tripping point here is the need to thread count through from one child to the next: when we start compiling e2, we have to know how many instructions were needed for e1. Again, by simple concatenation of the results of compiling e1 and e2, we hope (based on recursive faith) that we'll end up with a stack where the top two values are the results of computing e1 and e2.
Besides forgetting to update the count for the second recursive call, there is another hard-to-track mistake that could happen here: the order of execution for c1 and c2. By putting c1 first, we will execute it first, which corresponds to what most people would expect in a strict language (left-to-right evaluation of arguments), but it also means that we end up with a stack where the result of c2 is above the result of c1. This needs to be taken into account in the implementation of StackBin, if Op has non-commutative members. This is not the case for our sample language here, as we only support addition and inequality. But if this language were a prototype for a richer language, this bug could lie dormant and only appear when someone adds a non-commutative operation. At that point, developers may assume that the existing code is correct, making this bug hard to properly diagnose.
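A small worked example makes both points concrete. Compiling Bin Add (Var 0) (Lit 2) with a count of 0 (an arbitrary starting offset for this illustration) gives:

-- c1 = loop 0 (Var 0)        == [StackGet 0]
-- c2 = loop (0 + 1) (Lit 2)  == [StackPush 2]
-- result: [StackGet 0, StackPush 2, StackBin Add]

When StackBin executes, the top of the stack holds the result of e2 and the value below it the result of e1, which is why the interpreter applies (bin op) a2 a1: a1 is the first value popped (the result of e2) and a2 the second (the result of e1), so the operands end up in left-to-right order.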
As usual as well, Do is really just a simpler version of Bin:
Do first rest ->
let c1 = loop count first
c2 = loop (count + length c1) rest
in c1 <> c2
The While
case is a bit more involved, as it has more moving parts. Here is the
code for it:
While cond body ->
let cc = loop count cond
cb = loop (count + length cc + 1) body
in cc <> [StackJumpIfZero (count + length cc + 1 + length cb + 1)]
<> cb
<> [StackJump count]
In essence, it is just "even more of the same" compared to Bin; one just has to keep track of the correct count value. The two + 1 correspond to the two instructions we explicitly add here (StackJumpIfZero and StackJump) in addition to the result of compiling cond and body. They could obviously be written as a single + 2, but I like having my count increments in the same order as the bits of code I'm concatenating.
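To see the arithmetic in action, consider compiling While (Var 0) (Set 0 (Lit 0)) at a count of 0 (a made-up snippet, just to keep the numbers small):

-- cc = loop 0 (Var 0)                    == [StackGet 0]               -- length 1
-- cb = loop (0 + 1 + 1) (Set 0 (Lit 0))  == [StackPush 0, StackSet 0]  -- length 2
--
-- result:
--   [ StackGet 0         -- 0: evaluate the condition
--   , StackJumpIfZero 5  -- 1: if it is zero, jump past the loop
--   , StackPush 0        -- 2: body
--   , StackSet 0         -- 3: body
--   , StackJump 0        -- 4: jump back to the condition
--   ]
-- Index 5 is whatever comes after the loop (StackEnd if this is the whole program).

The StackJumpIfZero target is 0 + 1 + 1 + 2 + 1 = 5, the first instruction after the loop, while StackJump goes back to count, the start of the condition.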
And that's all there is to this compiler. For reference, running it on our sample code yields:
[StackPush 100,
StackSet 0,
StackPush 1000,
StackSet 1,
StackPush 0,
StackGet 1,
StackBin NotEq,
StackJumpIfZero 27,
StackGet 0,
StackPush 4,
StackBin Add,
StackGet 0,
StackBin Add,
StackPush 3,
StackBin Add,
StackSet 0,
StackGet 0,
StackPush 2,
StackBin Add,
StackPush 4,
StackBin Add,
StackSet 0,
StackPush (-1),
StackGet 1,
StackBin Add,
StackSet 1,
StackJump 4,
StackGet 0,
StackEnd]
and running our stack machine interpreter over that returns the expected value of -13.
Next steps
This was an introduction to stack machines, as a general idea. I deliberately did not focus on performance here, and it shows: running our benchmark on this yields:
direct (3 runs): 3.48 ms (1161 μs/run)
direct (30 runs): 34.16 ms (1138 μs/run)
naive_ast_walk (3 runs): 42.60 ms (14201 μs/run)
naive_ast_walk (30 runs): 438.74 ms (14624 μs/run)
twe_mon (3 runs): 83.70 ms (27899 μs/run)
twe_mon (30 runs): 803.79 ms (26793 μs/run)
compile_to_closure (3 runs): 48.72 ms (16238 μs/run)
compile_to_closure (30 runs): 560.64 ms (18688 μs/run)
twe_cont (3 runs): 30.56 ms (10188 μs/run)
twe_cont (30 runs): 300.91 ms (10030 μs/run)
exec_stack (3 runs): 100.69 ms (33564 μs/run)
exec_stack (30 runs): 1.05 s (34871 μs/run)
which means all of this added complexity yielded an interpreter that is about two and a half times as slow as the first one I described. So why mention stack machines in a series about cheap interpreter performance, when at first glance they seem rather expensive and slow? Because a stack machine interpreter can be made more performant, in part due to the linear structure of its code (as opposed to the tree-based structure of our parse tree). In the next part of this series, we'll be looking at how to make that happen.
Having defined those as the underlying "primitive" operations to make on the "state" of our stack machine makes them very good candidates for a stack machine monad, though we're not going to go there now.