Cheap interpreter, part 10: fastest one yet, then a hundred times faster
Last week we started looking at dynamic code generation for a (dynamic) host language, and how that can yield faster code than any interpreter we could write ourselves in that language. But the code we generated was still using the exact same logic.
In this final post in the series, we'll look into how we can improve on that, yielding much faster code than last week (and thus leaving our Haskell interpreter in the dust).
Finally, I'll show how breaking the "cheap" rule can, quite easily, result in yet again much faster code than even the best Clojure code we can generate.
This series is based on Neil Mitchell's talk "Cheaply writing a fast interpreter". The talk compares a number of approaches to writing an interpreter and tries to find a good balance between complexity and interpreter overhead.
The following topics, while important, are out of scope:
- Parsing. We're assuming we start with a parse tree.
- Producing assembly code: the definition of "cheap" that Neil uses in the talk is "maintainers are not required to know another language" (specifically assembly).
- Semantic optimizations (constant folding, algebraic transformations, etc.). The goal is to compare the interpretation overhead of various approaches, and semantic optimizations are considered orthogonal to that.
- JIT is not explicitly excluded in the talk, but it is completely absent. This is probably also a consequence of the "cheap" constraint.
Generating faster code
In order to generate faster code, we have to identify performance issues with the code we're currently generating. Looking at the code from last week, the most obvious issue is that we're still manipulating an array. That was arguably the fastest way to manage our registers when we were interpreting, but for our host language there is a much faster option: local variables.
Clojure does not have local variables, exactly, but we can use local, shadowing let-bindings to the same effect, provided we never capture a binding.
A less obvious potential speedup would be to generate a dense switch, rather than a sparse one. This is less reliable and harder to measure: how switch density affects JIT is, like most things JVM-JIT-related, a bit of a mystery, and somewhat JVM-version dependent. However, it's really easy to do when we're generating our own code, so we're going to do that too.
Keeping track of variables
Clojure's let binding allows rebinding. For example, the expression:
(let [a 5
      a (* 2 a)
      a (+ a 3)]
  a)
results in 13. The semantics are pretty clear: every right-hand expression uses the bindings from the lines above it, and then the left-hand name shadows the previous binding.
It should be pretty clear that the above is faster than
(let [regs (int-array 1)]
  (aset regs 0 5)
  (aset regs 0 (* 2 (aget regs 0)))
  (aset regs 0 (+ (aget regs 0) 3))
  (aget regs 0))
Benchmarking both expressions on my machine yields, respectively, 16 and 38 nanoseconds. On the one hand it's only a 22 nanosecond difference, but on the other hand it's a 2.5 times difference.
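For reference, here is a minimal sketch of how such micro-benchmarks can be run with the criterium library (the exact harness behind the numbers above may differ):

(require '[criterium.core :as crit])

;; let-based version
(crit/quick-bench
  (let [a 5
        a (* 2 a)
        a (+ a 3)]
    a))

;; array-based version
(crit/quick-bench
  (let [regs (int-array 1)]
    (aset regs 0 5)
    (aset regs 0 (* 2 (aget regs 0)))
    (aset regs 0 (+ (aget regs 0) 3))
    (aget regs 0)))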
In order to switch from the array-based approach to a let-based approach, we first need to create our variable names. Here's how we do that:
(defn vars-from-reg
  [code]
  (->> code
       (remove (comp #{:jump} first))
       (map second)
       set
       sort
       (map-indexed (fn [idx i]
                      [i [idx (gensym)]]))
       (into {})))
Almost all of our instructions write to a register, which appears in second position in our Clojure representation. The main exceptions are :jump, which we exclude, and :jump-if-zero and :return, which don't write but for which the second element is still a register.
If we started with malformed code, where we read from registers we've never written to, this piece of code could end up missing some registers. We'll assume that's not an issue here.
The result is a map from the register number as it appears in the code to a tuple of an index and a variable name. We'll have uses for both later on.
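As an illustration, here is what that looks like on a small, made-up code listing (the actual gensym names will differ from run to run):

;; Hypothetical input, just to show the shape of the result:
(vars-from-reg [[:loadl 0 100]
                [[:bin :add] 1 0 0]
                [:jump 0]])
;=> {0 [0 G__201], 1 [1 G__202]}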
Reindexing jumps
When I was writing that code, I had recently read something about switches being more efficient on the JVM if they were denser. But how much, exactly, does it matter? What kind of speedup are we talking about?
Let us consider two functionally-identical loops:
(loop [i 0
       j 1000]
  (if (zero? j)
    i
    (case i
      0 (recur 15 (dec j))
      15 (recur 957 (dec j))
      957 (recur 15376 (dec j))
      15376 (recur 1234567890 (dec j))
      1234567890 (recur 9876543210 (dec j))
      9876543210 (recur 1 (dec j))
      1 (recur 0 (dec j)))))
and
(loop [i 0
       j 1000]
  (if (zero? j)
    i
    (case i
      0 (recur 1 (dec j))
      1 (recur 2 (dec j))
      2 (recur 3 (dec j))
      3 (recur 4 (dec j))
      4 (recur 5 (dec j))
      5 (recur 6 (dec j))
      6 (recur 0 (dec j)))))
The only difference between the two is that the second one has been reindexed. Benchmarking the first one yields a runtime of about 25 microseconds, whereas the second one yields about 2.2 microseconds, or a speedup of about 10x.
That's good enough for me[1], so let's do it.
(defn make-index
  [code]
  (->> (find-entrypoints code)
       (map-indexed vector)
       (map (comp vec reverse))
       (into {})))
Recall that find-entrypoints returns a list of integers, which are the code indices we jump to. This gives us a map from indices in the code to indices we want in our case statement.
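For example, with hypothetical entrypoints (0 1 2 11), we would get:

;; Assuming (find-entrypoints code) returns (0 1 2 11):
(make-index code)
;=> {0 0, 1 1, 2 2, 11 3}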
Glue
We have some idea of what our code should look like, but there are still a few missing pieces. First, when generating our code, we'll need a way to insert either a constant (if the register we're reading was in :hoisted) or a variable name. Let's make a function for that:
(defn make-re-get
  [hoisted registers]
  (fn [r] (hoisted r (second (registers r)))))
This uses the fact that hoisted is a Clojure map, and can thus be used as a function, where the first argument is the key and the second is the value to return if the key is not found. registers is, similarly, a map, whose values are tuples of an index and a symbol, so second gives us the appropriate symbol.
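Here is a small, made-up example of how the returned function behaves:

;; Made-up inputs: register 2 was hoisted to the constant 1000,
;; registers 0 and 1 got the (hand-picked) variable names r0 and r1.
(def re-get (make-re-get {2 1000}
                         {0 [0 'r0], 1 [1 'r1]}))

(re-get 2) ;=> 1000  (a hoisted constant)
(re-get 0) ;=> r0    (a variable name)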
Next, we'll need a way to recurse. The general structure of our code will be:
(loop [ip 0 ; instruction pointer
       r1 0 ; register 1
       r2 0 ; register 2
       ...]
  (case ip
    0 (let [r2 (+ r1 2)
            ...]
        (recur ???))
    ...))
We need to be able to generate a recur form that transmits all of the updated values of the registers (that's easy enough: just repeat all the variable names in the right order) and updates the instruction pointer to the point we want to jump to. That last one will be an expression produced from the relevant jump instruction, so for now we'll just assume that we receive it as an argument, yielding:
(defn make-recur
  [registers]
  (fn [update-ip]
    `(recur ~update-ip
            ~@(->> registers
                   (map (fn [[i [idx sym]]] [idx sym]))
                   sort
                   (map second)))))
We use explicit currying here to receive the registers argument first, and return a function that captures it. We could improve the efficiency of that method by precomputing the list of registers, but this is compiler code so performance is not a consideration (in the context of this blog post).
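With the same kind of made-up register map, the returned function behaves like this:

;; Made-up register map: registers 3 and 7 got indices 0 and 1.
(def rec (make-recur {3 [0 'r0], 7 [1 'r1]}))

(rec '(int 2))
;=> (recur (int 2) r0 r1)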
Compiling segments
We now have all the pieces we need to compile a single segment. The first thing we need to note is that we have two "types" of instructions: the ones that change a register, and the ones that change control flow, i.e. both forms of jump and the :return instruction. We'll need to treat them separately.
Fortunately, per the definition of our segments (see last week's post for the definition of find-segments), we know that all of the instructions in a segment except the last one are of the first type, and all segments always end on an instruction of the second type.
We can thus write our code to compile a segment:
(defn compile-segment-dense
  [reindex re-get rec]
  (fn [[ep segment]]
    [(reindex ep)
     `(let [~@(->> (butlast segment)
                   (mapcat (fn [[_ to :as op]]
                             [(re-get to)
                              (match op
                                [:loadl _ v] v
                                [:loadr _ from] (re-get from)
                                [[:bin :add] _ r1 r2] `(unchecked-add
                                                         ~(re-get r1)
                                                         ~(re-get r2))
                                [[:bin :not=] _ r1 r2] `(if (== ~(re-get r1)
                                                                ~(re-get r2))
                                                          0 1))])))]
        ~(match (last segment)
           [:jump to] (rec `(int ~(reindex to)))
           [:jump-if-zero r to] (rec `(if (zero? ~(re-get r))
                                        (int ~(reindex to))
                                        (int ~(reindex (+ ep (count segment))))))
           [:return r] (re-get r)))]))
We expect reindex to be the result of make-index, re-get of make-re-get, and rec of make-recur. We use currying as a small syntax nicety for the following code.
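To make that concrete, here is a made-up example with plain maps standing in for the three helpers (and clojure.core/ prefixes elided from the output):

(def compile-one
  (compile-segment-dense {0 0, 2 1, 5 2}                ; reindex: code index -> case label
                         {0 'r0, 1 'r1}                 ; re-get: register -> symbol
                         (fn [ip] `(recur ~ip ~'r0 ~'r1))))

(compile-one [0 [[:loadl 0 100]
                 [:jump-if-zero 1 5]]])
;=> [0 (let [r0 100]
;        (recur (if (zero? r1) (int 2) (int 1)) r0 r1))]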
Putting it all together
At this point most of the work is done. We can just call all of the above functions together, and plug them into the overall shape we want.
(defn compile-rc-dense
  [{:keys [code hoisted]}]
  (let [registers (vars-from-reg code)
        reindex (make-index code)
        re-get (make-re-get hoisted registers)
        rec (make-recur registers)
        segments (->> (find-segments code)
                      (mapcat (compile-segment-dense reindex re-get rec)))
        ip (gensym)]
    `(fn []
       (loop [~ip (int 0)
              ~@(->> registers
                     (map (fn [[i [idx sym]]] [idx sym]))
                     sort
                     (mapcat (fn [[_ sym]] [sym `(long 0)])))]
         (case ~ip
           ~@segments)))))
Run over our sample code, this generates[2]:
t.core=> (compile-rc-dense (compile-register ast))
(fn []
  (loop [G__3594 (int 0)
         G__3588 (long 0)
         G__3589 (long 0)
         G__3590 (long 0)
         G__3591 (long 0)
         G__3592 (long 0)
         G__3593 (long 0)]
    (case G__3594
      0 (let [G__3588 100
              G__3589 1000
              G__3590 (if (== 0 G__3589) 0 1)]
          (recur (if (zero? G__3590) (int 3) (int 2))
                 G__3588 G__3589 G__3590 G__3591 G__3592 G__3593))
      1 (let [G__3590 (if (== 0 G__3589) 0 1)]
          (recur (if (zero? G__3590) (int 3) (int 2))
                 G__3588 G__3589 G__3590 G__3591 G__3592 G__3593))
      2 (let [G__3591 (unchecked-add G__3588 4)
              G__3592 (unchecked-add G__3591 G__3588)
              G__3588 (unchecked-add G__3592 3)
              G__3593 (unchecked-add G__3588 2)
              G__3588 (unchecked-add G__3593 4)
              G__3589 (unchecked-add -1 G__3589)]
          (recur (int 1)
                 G__3588 G__3589 G__3590 G__3591 G__3592 G__3593))
      3 (let [] G__3588))))
t.core=>
which is actually quite nice, apart from the horrible variable names.
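To get a callable function out of that form, we still need to hand it to the Clojure compiler via eval; a sketch of how the benchmark can then be run, again assuming criterium:

(require '[criterium.core :as crit])

(let [f (eval (compile-rc-dense (compile-register ast)))]
  (crit/quick-bench (f)))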
Benchmarking that one yields a runtime of 7.8µs on my machine. As a reminder, last week's version benched at about 32µs, while our best Haskell interpreter was around 40µs.
Bonus: Breaking the rules
At this point, we've gone through a lot of pain to build up a pipeline for generating code. Per our "cheap", i.e. "no need to know another language", rule, we've only been considering producing Clojure code.
If we ignore that rule, however, and we're in a context where we can consider actually compiling our code, we can instead generate any other language. If we're shooting for ugly-but-fast generated code, it's hard to beat C as a compilation target. Not only is it much easier to compile to C than to assembly, it also lets us reuse all of the tricks the C compiler knows, as well as giving us some portability.
And, in the spirit of our broken rule, there are still a lot more people who know C than assembly.
One of the very nice features of the C language, compared to Clojure as a target language, is that it actually does have jump instructions, in the form of label and goto. We can thus make a much more straightforward conversion: we'll use C variables (real, varying variables this time) to represent registers, and real jumps to represent jumps.
Because this is bonus content and this blog post is long enough already, I'm not going to explain this code in further detail. Hopefully, it should be easy enough to read at this point, as the principles are mostly the same as our Clojure code generator above.
(defn compile-rc-c
  [{:keys [code hoisted]} iter]
  (let [target-register (fn [op]
                          (match op
                            [:loadl to _] to
                            [:loadr to _] to
                            [[:bin :add] to _ _] to
                            [[:bin :not=] to _ _] to
                            [:jump _] nil
                            [:jump-if-zero _ _] nil
                            [:return _] nil))
        labels (->> code
                    (keep (fn [op]
                            (case (op 0)
                              :jump (op 1)
                              :jump-if-zero (op 2)
                              nil)))
                    set)
        max-index (max (->> hoisted keys (reduce max))
                       (->> code (keep target-register) (reduce max)))
        lines (fn [s] (->> s (interpose "\n") (apply str)))
        reg (fn [r]
              (or (hoisted r)
                  (str "r_" r)))
        body (->> code
                  (map-indexed
                    (fn [idx op]
                      (str (when (labels idx) (str "LABEL_" idx ":\n"))
                           (match op
                             [:loadl r v] (str (reg r) " = " v ";")
                             [:loadr to from] (str (reg to) " = " (reg from) ";")
                             [[:bin :add] to arg1 arg2]
                             (str (reg to) " = " (reg arg1) " + " (reg arg2) ";")
                             [[:bin :not=] to arg1 arg2]
                             (str (reg to) " = " (reg arg1) " == " (reg arg2) " ? 0 : 1;")
                             [:jump to] (str "goto LABEL_" to ";")
                             [:jump-if-zero r to] (str "if (" (reg r) " == 0) { goto LABEL_" to "; }")
                             [:return r] (str "return " (reg r) ";")))))
                  lines)
        c-code (lines
                 ["#include <stdio.h>"
                  "#include <time.h>"
                  ""
                  "long compiled_fn(void) {"
                  (->> (range (inc max-index))
                       (remove hoisted)
                       (map (fn [i] (str "long r_" i " = 0;")))
                       lines)
                  body
                  "}"
                  ""
                  "int main() {"
                  "printf(\"%ld\\n\", compiled_fn());"
                  "clock_t start = clock();"
                  (str "for (int i = " iter "; i --> 0; ) {")
                  "compiled_fn();"
                  "}"
                  "printf(\"%ld\\n\", ((clock() - start) * 1000) / CLOCKS_PER_SEC);"
                  "}"])
        tmp (-> (shell/sh "mktemp" "-d")
                :out
                string/trim)
        file (str tmp "/main.c")]
    (spit file c-code)
    tmp))
Running that generates code that looks like (with corrected indentation):
#include <stdio.h>
#include <time.h>

long compiled_fn(void) {
  long r_0 = 0;
  long r_1 = 0;
  long r_3 = 0;
  long r_5 = 0;
  long r_6 = 0;
  long r_9 = 0;
  r_0 = 100;
  r_1 = 1000;
LABEL_2:
  r_3 = 0 == r_1 ? 0 : 1;
  if (r_3 == 0) { goto LABEL_11; }
  r_5 = r_0 + 4;
  r_6 = r_5 + r_0;
  r_0 = r_6 + 3;
  r_9 = r_0 + 2;
  r_0 = r_9 + 4;
  r_1 = -1 + r_1;
  goto LABEL_2;
LABEL_11:
  return r_0;
}

int main() {
  printf("%ld\n", compiled_fn());
  clock_t start = clock();
  for (int i = 1000000; i --> 0; ) {
    compiled_fn();
  }
  printf("%ld\n", ((clock() - start) * 1000) / CLOCKS_PER_SEC);
}
Compiling and running that (cc main.c && ./a.out) yields a runtime of about 7.9µs on my machine, which is eerily close to our fastest Clojure code above.
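Since compile-rc-c only writes main.c into a temporary directory and returns that directory, the compile-and-run step can also be driven from the REPL; here is a sketch, reusing the shell alias for clojure.java.shell from the code above:

(let [tmp (compile-rc-c (compile-register ast) 1000000)]
  ;; naive compilation, no -O flags, as in the timing above
  (shell/sh "cc" (str tmp "/main.c") "-o" (str tmp "/a.out"))
  (println (:out (shell/sh (str tmp "/a.out")))))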
Clojure as fast as C?!
Hold on a sec, there. This is naive C compilation; notice the lack of -O3? If we turn that on, our C code vanishes and reports a runtime of zero, because the C compiler is smart enough to see that we're never using the results of our compiled_fn, so it removes the whole loop.
This can be seen using godbolt, which clearly indicates that the whole loop is thrown away.
If we try to trick the compiler into not throwing that loop away, say by computing the sum of all these iterations and printing it:
/*...*/
int main() {
  printf("%ld\n", compiled_fn());
  clock_t start = clock();
  long acc = 0;
  for (int i = 1000000; i --> 0; ) {
    acc += compiled_fn();
  }
  printf("%ld\n", acc);
  printf("%ld\n", ((clock() - start) * 1000) / CLOCKS_PER_SEC);
}
that still does not work on -O1, as the compiler is smart enough to run it once and then multiply by the number of iterations.
Interestingly, with -O2 and -O3, the compiler is suddenly no longer smart enough to run it once and multiply. The total reported runtime is, on my machine, around 0.07µs. That's a hundred times faster; it's fast enough to make you wonder what it's actually doing, while slow enough (compared to the reported 0ns of the -O1 version) that you might believe it's actually doing something. Also, doubling the number of iterations does double the reported time.
Looking at the compiled code, we can see that it is indeed running through all of those iterations. The content of compiled_fn has changed substantially, though: the loop index (within the function) now increases by 8 instead of 1, and the body of the loop has been changed to just two operations: shifting by 8 bits to the left, and then adding 3315. That makes sense: each source-level iteration amounts to r_0 -> 2*r_0 + 13, and composing eight of those gives r_0 -> 256*r_0 + 13*255 = 256*r_0 + 3315, i.e. exactly one shift and one add. That gets us down to about 0.25 operations per loop on average, explaining the huge speed bump.
What I take away from that is that C compilers are very smart, and if you need performance and your use-case allows for it, compiling to C can make a lot of sense.
Conclusion
This concludes my exploration of interpreters for now. You can find the whole series here.
As a quick review, here's what we covered:
- Basics of what an interpreter is, and how it compares to a compiler. And how they're actually very similar in a lot of ways, and the frontier between them is sometimes in the eye of the beholder.
- How to write a simple tree-walking interpreter. This is the simplest way to get going, and for a lot of applications this is all you need to know.
- Various techniques that may or may not improve the performance of a tree-walking interpreter, depending on the specifics of your use-case and programming language.
- Two types of lower-level abstract machines you might want to compile to, stack-based and register-based, as well as examples of how to do the compilation.
- How to make the interpreter for those machines really fast, compared to all of the tree-walking approaches.
- Why compiling to your host language may be a viable strategy in an interpreted language.
- How to compile to C.
This was a bit of a whirlwind tour and I'm not claiming that any of these topics was explored in depth, but it should be enough to get you started if you have a use-case for them.
When writing interpreters, as for any kind of code, always keep in mind that simpler is better. You should only consider moving down that list towards more performance if you actually have a need for that performance, as the complexity (and thus maintenance effort) also grows pretty quickly the further down you go.
[1] In a more realistic context, one should not only benchmark the specific technique, but also how much time is spent overall in the thing you're trying to optimize. For example, here, if we can make the jumps in the switch 10x faster but only 0.1% of our runtime is spent in jumps, that's not going to matter, and the optimization would not be worth the increase in code complexity. In this context, I just want to write this code, and nobody is ever going to have to maintain it, so the stakes are much lower.

[2] With added indentation and removed clojure.core/ prefixes.