Notes on Optimizing Clojure Code: Type Hints
Last week I wrote about how reflection works in Clojure and how to avoid it. The primary way to do that is through using appropriate type hints, but I didn't go very deep into how and what those are.
There's a surprising amount to know about them, and I don't know of a good unified resource on the topic, so here's my take on it.
The REL
Clojure has been designed for interactive use from the ground up. This is most
visible in the REPL. Many languages have a shell, but lisps chose to call
theirs a REPL because they have explicitly separate Read, Eval, Print and Loop
steps. Most important is the separation between read
and eval
: read
is a
function that takes in a string of characters and returns language-level data
structures, and eval
is defined as acting on data structures.
Many languages have an "interpreter" for their interactive shell that is separate from their "compiler" for final program production. Clojure does not have such a separation: interactive REPL usage always goes through compilation, and it is the same compiler running over files.
Ultimately this means that the compilation process is almost exactly the same as an interactive session, with the printing step removed — a REL, if you will.
What matters to us for this discussion is that the compiler, Clojure's eval
function, works on Clojure data structures.
Metadata
Clojure collections (as well as symbols) support a feature called metadata.
You can think of it as an extra field at the object level, one that is not used
for equality and hashcode computations, and is expected to point to either
nil
or a Clojure map.
You can attach metadata to an appropriate value with the with-meta
function,
and retrieve metadata with meta
:
t.core=> (def t1 {:a 1 :b 2 :c "hello"})
#'t.core/t1
t.core=> (def t2 (with-meta t1 {:copied-from "t1"}))
#'t.core/t2
t.core=> (= t1 t2)
true
t.core=> (meta t1)
nil
t.core=> (meta t2)
{:copied-from "t1"}
t.core=>
Metadata can also be altered with vary-meta
, which works like update
, and
can be printed by setting *print-meta*
to true
.
t.core=> (set! *print-meta* true)
true
t.core=> t1
{:a 1, :b 2, :c "hello"}
t.core=> t2
^{:copied-from "t1"} {:a 1, :b 2, :c "hello"}
t.core=>
The with-meta
function can be used when constructing values from within
Clojure code, but you may want to add metadata to forms at read
time. This
is especially useful for metadata intended for consumption by the Clojure
compiler itself. As illustrated above, the reader macro for attaching metadata
is ^
, and it supports a few common shortcuts. The most important one in our
context here is the one used for type hints: ^
followed by either a Java
class or a string will attach the corresponding class (resolved with
Class.forName
in the latter case) as the :tag
entry in the metadata map:
t.core=> (set! *print-meta* false)
false
t.core=> (let [m ^String {:value "hello"}]
#_=> [m (meta m)])
[{:value "hello"} {:tag java.lang.String}]
t.core=>
The more general ^{:key val...}
notation lets one attach the entire metadata
map at once, which is useful if you want to set multiple keys.
When the Clojure compiler sees a :tag
metadata entry, it uses it as a type
hint. But the values the compiler gets as inputs are not what we get directly
in our code, so it's important to understand how metadata gets consumed,
transformed and sometimes propagated by the compiler.
Compiling with metadata
How the compiler uses metadata really depends on what it's attached to. Metadata attached to nested values is just part of the surrounding value and is transparently transmitted through by the compiler. For example:
t.core=> (set! *print-meta* true)
true
t.core=> (def t {:outer ^java.util.ArrayList {:inner "map"}})
#'t.core/t
t.core=> (meta (:outer t))
{:tag java.util.ArrayList}
t.core=>
In such cases, there isn't a particularly good reason to prefer the reader
macro to a more explicit call to with-meta
, which has the advantage of making
it clear the annotation is not meant for the compiler (since the metadata is
then only created at evaluation time, i.e. after compilation, instead of by
the reader).
So when does the compiler look at metadata?
There are essentially three cases that matter for performance:
- Global definitions (
def
,defn
, etc.). - Local definitions (
let
,loop
, function arguments, etc.). - Arguments to the
.
special form, and thus by extension any.methodName
call.
Through this discussion it is important to remember that the compiler looks at
unevaluated forms. It does, however, resolve symbols. Specifically, every
symbol that appears (unquoted) in a form the compiler is trying to compile
needs to either be resolved the a local binding (introduced by let
,
function argument, etc.) or be resolved to a global, fully-qualified Var.
If a symbol is resolved to a global Var, the metadata on the Var is what the
compiler will look at for the :tag
field; if it is a local symbol, the
compiler will look at the metadata on the symbol itself. For function calls
(aka lists), the compiler will also look at the metadata on the first element.
Type hint propagation
When a local variable is defined, it can get a type hint in three ways:
- The symbol itself is type hinted at definition time.
- The initializer code form is type-hinted, in which case the type hint is propagated to the symbol.
- The initializer is a literal primitive value, in which case the local is automatically type-hinted with that type.
In the case of primitive-hinted loop
locals, the compiler will check that the
corresponding recur
arguments match the hinted type and warn if they don't
and *warn-on-reflection*
is set.
Here are a few examples (with somewhat elided error messages):
t.core=> (set! *warn-on-reflection* true)
true
t.core=> (let [s "hello there"] (.charAt s 3)) ; inferred
\l
t.core=> (let [s (apply str "hello")] (.charAt s 3)) ; not inferred
Reflection warning, call to method charAt can't be resolved.
\l
t.core=> (let [^String s (apply str "hello")] (.charAt s 3)) ; hinted
\l
t.core=> (let [s ^String (apply str "hello")] (.charAt s 3)) ; hinted
\l
t.core=> (loop [i 0 c 0]
#_=> (if (== i 5)
#_=> c
#_=> (recur (inc i) (+ i c)))) ; inferred
10
t.core=> (loop [i 0 c 0]
#_=> (if (== i 5)
#_=> c
#_=> (recur (inc i) (apply + i [c]))))
recur arg for primitive local: c is not matching primitive,
had: Object, needed: long
Auto-boxing loop arg: c
10
t.core=> (loop [i 0 c 0]
#_=> (if (== i 5)
#_=> c
#_=> (recur (inc i) (long (apply + i [c])))))
10
t.core=>
When a local is hinted, regardless of how it happened, the hint is propagated to all uses of that local within its scope.
Global definitions are a bit more complicated. In a def
form, the type hint
must be applied to the symbol, and the compiler will propagate it to the
corresponding Var:
t.core=> (def ^String s "hello")
#'t.core/s
t.core=> (meta #'s)
{:tag java.lang.String,
:line 1,
:column 1,
:file "...",
:name s,
:ns #object[clojure.lang.Namespace 0x782f8453 "t.core"]}
t.core=> (.charAt s 4)
\o
t.core=>
Uses of the Var then behave as if they had that type hint, hence the lack of
reflection warning on that last charAt
call.
Functions are a bit more complicated. The hints must be applied to the arguments: the argument vector as a whole for the return value, and each symbol in the argument list for the arguments themselves. For example:
t.core=> (defn sqrt-long ^double [^long n] (Math/sqrt n))
#'t.core/sqrt-long
t.core=> (meta #'sqrt-long)
{:arglists ([n]),
:line 1,
:column 1,
:file "...",
:name sqrt-long,
:ns #object[clojure.lang.Namespace 0x782f8453 "t.core"]}
t.core=> (meta (-> (meta #'sqrt-long) :arglists first))
{:tag double}
t.core=> (set! *print-meta* true)
true
t.core=> #'sqrt-long
^{:arglists (^double [^long n]),
:line 1,
:column 1,
:file "...",
:name sqrt-long,
:ns #object[clojure.lang.Namespace 0x782f8453 "t.core"]}
#'t.core/sqrt-long
t.core=> (set! *print-meta* false)
false
t.core=> (sqrt-long 4)
2.0
t.core=>
The "return type" hint is propagated by the compiler "upward", which is why in
the previous examples I had to use (apply str "hello")
and (apply + i [c])
,
to defeat the existing type hints on, respectively, str
and +
.
Note that the argument type hints are not checked at the call site:
t.core=> (sqrt-long (apply long [4]))
2.0
t.core=>
The argument hints are used to tag the arguments within the function body itself, though.
Finally, we get to the .
special form, and at this point there is not much
more to say about it. As explained last week, in order to generate fast code,
it needs that every single argument be inferred to the correct type, whether
that's because of an explicit type hint on a local binding, an inferred one on
a literal primitive (or String), or a propagated one from global Vars and
function calls.
Classes and class names
In order to set a type hint, one needs to know the name of the class to hint
for. In most cases this will be an explicit Java class, and in those cases it's
better to use the class itself as the hint, as in ^java.lang.String
or
java.util.List
(interfaces and abstract classes are valid "classes" too for
the purposes of this post). For brevity, you can import
the relevant classes
in your ns
declaration; java.lang.*
is always imported, which is why I
could get away with ^String
in the examples above.
However, not all Java types can be represented neatly by a Class instance like that. Specifically, primitives and arrays don't have literal class syntax in Java, and thus neither do they in Clojure.
Clojure does not support the full complement of Java primitives as type hints;
it only supports long
and double
. For those, the Clojure compiler will
accept the objects clojure.core/long
and clojure.core/double
(which are
actually Clojure functions) as valid values for the :tag
field. Since
clojure.core
is required with no alias by default, we were able to just use
^double
and ^long
in our Math/sqrt
wrapper above.
EDIT: As Alex Miller pointed out on Reddit, the above paragraph is incorrect. The symbols
long
,double
,int
, etc. are accepted by the Clojure compiler at type hints when resolving interop calls; it's only when it comes to defining Clojure functions that the list of supported primitives is limited todouble
andlong
.Moreover, there seems to be a difference with how type hints are resolved on Vars versus other places: in other places, they are just read, whereas on Vars they are also evaluated. Hence the
^long
tag will, in most places, not resolve to theclojure.core/long
function, but just remain the plainlong
symbol, which is recognized by the compiler as hinting for a primitivelong
.For the particular case of type-hinting functions, this is why it is recommended to hint the argument vector (read) instead of the Var (evaluated), as the compiler does not know what to do with
clojure.core/long
as a type hint.
Finally, arrays. The JVM has all sorts of weird rules around arrays, and
describing their classes is no exception to that. Clojure tries to make things
a little bit easier by providing readable shorthands for a few common cases.
For example, to type hint a binding as an array of doubles, one can use the
value doubles
.
But as soon as you get into multidimensional arrays, you have to fall back to
the actual JVM names for their classes, so it's worth knowing about them. The
gist of it is you need to use a string type hint (which will be resolved with
Class.forName
), and write it as a number of [
equal to your nesting level
followed by, for primitives, a single letter, or, for objects, the class name
for the type of the elements. Note that, unlike Java collections, Java arrays
are not type-erased, so you can have specific classes here.1
The easiest way to retrieve the class name for a type of array is to use the
make-array
function in a Clojure REPL:
t.core=> (.getName (class (make-array String 1 2 3)))
"[[[Ljava.lang.String;"
t.core=> (.getName (class (make-array java.util.function.Function 1 2 3)))
"[[[Ljava.util.function.Function;"
t.core=> (.getName (class (make-array Integer/TYPE 1 2 3)))
"[[[I"
t.core=> (.getName (class (make-array Long/TYPE 0 0 0)))
"[[[J"
t.core=>
Note that you need getName
here; getCanonicalName
will not give you a
name that works with Class.forName
.
The make-array
function takes a type and any number of "sizes" and returns an
array with the given dimensions. Note that the array type name is independent
of the sizes, so any (positive int
) value is fine; what matters here is
the number of arguments.
For arrays of primitives, we still have the issue that there is no good way to
represent the class of a primitive value (because they don't have one), but we
can use the special values of NonPrimitiveWrapper/TYPE
instead, as
illustrated for int
and long
.
Code generation
If you are generating code, whether through macros at compile time or for
explicit consumption by eval
at run time, you need to bear in mind that what
matters is the metadata on the forms seen by eval
, not necessarily the ones
seen by the reader. Therefore, sometimes the reader macro is not going to be
appropriate.
In particular, metadata reader macros do not play nicely with quasiquotes
(backticks), so you'll need to apply them explicitly with with-meta
in
generated code.
Here is an example, taken from code I wrote for last year's Advent of Code:
(defn to-bindings
([m] (to-bindings m {:bindings (), :rbindings {}}))
([m state]
(match m
[:return v] [v state]
[:bind ma f] (let [[v state] (to-bindings ma state)]
(to-bindings (f v) state))
[:expr expr & t] (if-let [sym (get (:rbindings state) expr)]
[sym state]
(let [sym (with-meta (symbol (str "r-" (count (:rbindings state))))
(when (seq t)
{:tag (first t)}))]
[sym (-> state
(update :bindings concat [sym expr])
(update :rbindings assoc expr sym))])))))
You can see the explicit construction of a symbol and its associated metadata.
The constructed pair will be later on used as a binding in a let
form.
Hints & inference
Up to Clojure 1.10 (this is changing in 1.11), you are not allowed to type hint a primitive local that has been inferred. For example:
t.core=> (let [arr ^longs (make-array Long/TYPE 3)
#_=> c (aget arr 1)]
#_=> (aset arr 0 ^long c))
Syntax error (UnsupportedOperationException) compiling at (...).
Can't type hint a primitive local
t.core=>
The issue here is that c
is already inferred to be a long
, because it is an
element of the array arr
which is itself hinted as an array of longs.
This is usually not a problem in manually-written code, where you typically want to have as few type-hints as possible anyway. In this case, one can just choose to leave out the annotation on the last line:
t.core=> (let [arr ^longs (make-array Long/TYPE 3)
#_=> c (aget arr 1)]
#_=> (aset arr 0 c))
0
t.core=>
and everything works out (without reflection).
But it can be annoying in generated code, where the type inference can come from optional code and therefore you have to choose between:
- Having your code only compile for some code paths, obviously not a great alternative.
- Having your code be slow on some code paths due to reflection. This works but is obviously annoying.
- Adding more code to your code generator to keep track of which bits of code you have emitted that could trigger inference, and only generate the type hint if it is actually needed. This can be a non-trivial amount of extra complexity compared to always emitting the type hint.
As of Clojure 1.11, it will be valid to type hint a primitive local, as long as the type hint matches the inferred type. Which arguably makes sense.
Conclusion
In most presentations, type hints are quickly skimmed over, because they're almost always a means to an end. And while they are overall not very complicated, there are some nuances to them. I hope this blog will help shed some light on the topic and give you a bit more confidence on how to approach writing them.
There is not much benefit to having a "real" type on an array if it is being used entirely from Clojure, because, in reflection-free code, all uses of the corresponding objects will be type-specific anyway. But if the goal is to pass that array to Java code, it may be important for it to have the specific, expected type. In that case on should also bear in mind that there is no hierarchy of array types (i.e. arrays of A is not a supertype of array of B regardless of the type relationship of A and B).
↩