Cuddly, Octo-Palm Tree: Notes on Optimizing Clojure Code: Type Hints

27 February 2022

Last week I wrote about how reflection works in Clojure and how to avoid it. The primary way to do that is with appropriate type hints, but I didn't go very deep into what those are or how they work.

There's a surprising amount to know about them, and I don't know of a good unified resource on the topic, so here's my take on it.

The REL

Clojure has been designed for interactive use from the ground up. This is most visible in the REPL. Many languages have a shell, but lisps chose to call theirs a REPL because they have explicitly separate Read, Eval, Print and Loop steps. Most important is the separation between read and eval: read is a function that takes in a string of characters and returns language-level data structures, and eval is defined as acting on data structures.

Many languages have an "interpreter" for their interactive shell that is separate from their "compiler" for final program production. Clojure does not have such a separation: interactive REPL usage always goes through compilation, and it is the same compiler running over files.

Ultimately this means that the compilation process is almost exactly the same as an interactive session, with the printing step removed — a REL, if you will.

What matters to us for this discussion is that the compiler, Clojure's eval function, works on Clojure data structures.
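
As a minimal sketch of that separation, the read and eval steps can be called by hand. The name `form` is only an illustration; `read-string` and `eval` are the standard clojure.core functions.

```clojure
;; read: text in, data structure out. Nothing is evaluated yet.
(def form (read-string "(+ 1 (* 2 3))"))

(list? form)            ; the form is a plain Clojure list
(symbol? (first form))  ; whose first element is the symbol +

;; eval: data structure in, result out.
(eval form)             ; => 7

;; Because eval works on data, a form can be built without any text at all.
(eval (list '+ 1 2 3))  ; => 6
```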

Metadata

Clojure collections (as well as symbols) support a feature called metadata. You can think of it as an extra field at the object level, one that is not used for equality and hashcode computations, and is expected to point to either nil or a Clojure map.

You can attach metadata to an appropriate value with the with-meta function, and retrieve metadata with meta:

t.core=> (def t1 {:a 1 :b 2 :c "hello"})
#'t.core/t1
t.core=> (def t2 (with-meta t1 {:copied-from "t1"}))
#'t.core/t2
t.core=> (= t1 t2)
true
t.core=> (meta t1)
nil
t.core=> (meta t2)
{:copied-from "t1"}
t.core=>

Metadata can also be altered with vary-meta, which works like update, and can be printed by setting *print-meta* to true.

t.core=> (set! *print-meta* true)
true
t.core=> t1
{:a 1, :b 2, :c "hello"}
t.core=> t2
^{:copied-from "t1"} {:a 1, :b 2, :c "hello"}
t.core=>
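
For completeness, here is a small sketch of vary-meta on the same kind of value. The `:reviewed` key and the name `t3` are made up for the example.

```clojure
(def t1 {:a 1 :b 2 :c "hello"})
(def t2 (with-meta t1 {:copied-from "t1"}))

;; vary-meta applies a function to the existing metadata map,
;; in the same way update applies a function to a map entry.
(def t3 (vary-meta t2 assoc :reviewed true))

(meta t3)   ; => {:copied-from "t1", :reviewed true}
(meta t2)   ; => {:copied-from "t1"}, t2 itself is unchanged
(= t1 t3)   ; => true, metadata still plays no part in equality
```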

The with-meta function can be used when constructing values from within Clojure code, but you may want to add metadata to forms at read time. This is especially useful for metadata intended for consumption by the Clojure compiler itself. As illustrated above, the reader macro for attaching metadata is ^, and it supports a few common shortcuts. The most important one in our context here is the one used for type hints: ^ followed by either a Java class or a string will attach the corresponding class (resolved with Class.forName in the latter case) as the :tag entry in the metadata map:

t.core=> (set! *print-meta* false)
false
t.core=> (let [m ^String {:value "hello"}]
    #_=>   [m (meta m)])
[{:value "hello"} {:tag java.lang.String}]
t.core=>

The more general ^{:key val...} notation lets one attach the entire metadata map at once, which is useful if you want to set multiple keys.
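
A short sketch of the map form next to the keyword shortcut may help here. All keys and Var names below are invented for illustration.

```clojure
;; ^:kw is shorthand for ^{:kw true}.
(def flagged ^:reviewed [1 2 3])
(meta flagged)    ; => {:reviewed true}

;; The full map form sets several keys in one go.
(def described ^{:source "import" :version 2} [1 2 3])
(meta described)  ; => {:source "import", :version 2}

;; Annotations can be stacked; the reader merges them into one map.
(def stacked ^:reviewed ^{:version 2} [1 2 3])
(meta stacked)    ; contains both :reviewed and :version
```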

When the Clojure compiler sees a :tag metadata entry, it uses it as a type hint. But the values the compiler gets as inputs are not what we get directly in our code, so it's important to understand how metadata gets consumed, transformed and sometimes propagated by the compiler.

Compiling with metadata

How the compiler uses metadata really depends on what it's attached to. Metadata attached to nested values is just part of the surrounding value and is passed through untouched by the compiler. For example:

t.core=> (set! *print-meta* true)
true
t.core=> (def t {:outer ^java.util.ArrayList {:inner "map"}})
#'t.core/t
t.core=> (meta (:outer t))
{:tag java.util.ArrayList}
t.core=>

In such cases, there isn't a particularly good reason to prefer the reader macro to a more explicit call to with-meta, which has the advantage of making it clear the annotation is not meant for the compiler (since the metadata is then only created at evaluation time, i.e. after compilation, instead of by the reader).

So when does the compiler look at metadata?

There are essentially three cases that matter for performance:

Global definitions (def, defn, etc.).
Local definitions (let, loop, function arguments, etc.).
Arguments to the . special form, and thus by extension any .methodName call.

Throughout this discussion it is important to remember that the compiler looks at unevaluated forms. It does, however, resolve symbols. Specifically, every symbol that appears (unquoted) in a form the compiler is trying to compile needs to either be resolved to a local binding (introduced by let, a function argument, etc.) or be resolved to a global, fully-qualified Var.

If a symbol is resolved to a global Var, the metadata on the Var is what the compiler will look at for the :tag field; if it is a local symbol, the compiler will look at the metadata on the symbol itself. For function calls (aka lists), the compiler will also look at the metadata on the first element.
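
The three lookups can be put side by side in a rough sketch. The names `greeting`, `local-length` and `greet` are hypothetical; with `*warn-on-reflection*` enabled, none of the `.length` calls below should produce a warning.

```clojure
;; 1. Global Var: the hint is read from the Var's metadata.
(def ^String greeting "hello")
(.length greeting)            ; => 5

;; 2. Local binding: the hint is read from the symbol itself.
(defn local-length [x]
  (let [^String s x]
    (.length s)))
(local-length "abcd")         ; => 4

;; 3. Function call: the compiler looks at the first element of the list,
;;    here the Var for greet, and uses its return hint.
(defn greet ^String [n]
  (str "hello, " n))
(.length (greet "Bob"))       ; => 10
```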

Type hint propagation

When a local variable is defined, it can get a type hint in three ways:

The symbol itself is type hinted at definition time.
The initializer code form is type-hinted, in which case the type hint is propagated to the symbol.
The initializer is a literal primitive value, in which case the local is automatically type-hinted with that type.

In the case of primitive-hinted loop locals, the compiler will check that the corresponding recur arguments match the hinted type and warn if they don't and *warn-on-reflection* is set.

Here are a few examples (with somewhat elided error messages):

t.core=> (set! *warn-on-reflection* true)
true
t.core=> (let [s "hello there"] (.charAt s 3)) ; inferred
\l
t.core=> (let [s (apply str "hello")] (.charAt s 3)) ; not inferred
Reflection warning, call to method charAt can't be resolved.
\l
t.core=> (let [^String s (apply str "hello")] (.charAt s 3)) ; hinted
\l
t.core=> (let [s ^String (apply str "hello")] (.charAt s 3)) ; hinted
\l
t.core=> (loop [i 0 c 0]
    #_=>   (if (== i 5)
    #_=>     c
    #_=>     (recur (inc i) (+ i c)))) ; inferred
10
t.core=> (loop [i 0 c 0]
    #_=>   (if (== i 5)
    #_=>     c
    #_=>     (recur (inc i) (apply + i [c]))))
recur arg for primitive local: c is not matching primitive,
  had: Object, needed: long
Auto-boxing loop arg: c
10
t.core=> (loop [i 0 c 0]
    #_=>   (if (== i 5)
    #_=>     c
    #_=>     (recur (inc i) (long (apply + i [c])))))
10
t.core=>

When a local is hinted, regardless of how it happened, the hint is propagated to all uses of that local within its scope.

Global definitions are a bit more complicated. In a def form, the type hint must be applied to the symbol, and the compiler will propagate it to the corresponding Var:

t.core=> (def ^String s "hello")
#'t.core/s
t.core=> (meta #'s)
{:tag java.lang.String,
 :line 1,
 :column 1,
 :file "...",
 :name s,
 :ns #object[clojure.lang.Namespace 0x782f8453 "t.core"]}
t.core=> (.charAt s 4)
\o
t.core=>

Uses of the Var then behave as if they had that type hint, hence the lack of reflection warning on that last charAt call.
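
The same applies when the hint is an interface rather than a concrete class. This is only a sketch, and the Var name `names` is made up.

```clojure
;; The hint goes on the symbol; the compiler copies it onto the Var.
(def ^java.util.List names (java.util.ArrayList. ["ada" "grace"]))

;; Both calls resolve against java.util.List at compile time.
(.size names)           ; => 2
(.get names 0)          ; => "ada"

;; On the Var, the tag has been evaluated to the actual Class object.
(:tag (meta #'names))   ; => java.util.List
```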

Functions are more complicated still. The hints go on the argument vector: a hint on the vector as a whole for the return value, and a hint on each symbol inside it for the arguments themselves. For example:

t.core=> (defn sqrt-long ^double [^long n] (Math/sqrt n))
#'t.core/sqrt-long
t.core=> (meta #'sqrt-long)
{:arglists ([n]),
 :line 1,
 :column 1,
 :file "...",
 :name sqrt-long,
 :ns #object[clojure.lang.Namespace 0x782f8453 "t.core"]}
t.core=> (meta (-> (meta #'sqrt-long) :arglists first))
{:tag double}
t.core=> (set! *print-meta* true)
true
t.core=> #'sqrt-long
^{:arglists (^double [^long n]),
  :line 1,
  :column 1,
  :file "...",
  :name sqrt-long,
  :ns #object[clojure.lang.Namespace 0x782f8453 "t.core"]}
#'t.core/sqrt-long
t.core=> (set! *print-meta* false)
false
t.core=> (sqrt-long 4)
2.0
t.core=>

The "return type" hint is propagated by the compiler "upward", which is why in the previous examples I had to use (apply str "hello") and (apply + i [c]), to defeat the existing type hints on, respectively, str and +.

Note that the argument type hints are not checked at the call site:

t.core=> (sqrt-long (apply long [4]))
2.0
t.core=>

The argument hints are used to tag the arguments within the function body itself, though.
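
Here is a small sketch of both halves of that statement, using a reference type instead of a primitive. The function `shout` is hypothetical.

```clojure
;; Inside the body, s is tagged as a String, so .toUpperCase is resolved
;; at compile time without reflection.
(defn shout ^String [^String s]
  (.toUpperCase s))

(shout "hello")   ; => "HELLO"

;; Nothing is checked where the function is called: this compiles fine
;; and only fails at run time, inside the body, when the value is cast.
(def result
  (try
    (shout 42)
    (catch ClassCastException _
      :class-cast-exception)))

result            ; => :class-cast-exception
```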

Finally, we get to the . special form, and at this point there is not much more to say about it. As explained last week, in order to generate fast code, it needs every single argument to be inferred to the correct type, whether that's because of an explicit type hint on a local binding, an inferred one on a literal primitive (or String), or a propagated one from global Vars and function calls.

Classes and class names

In order to set a type hint, one needs to know the name of the class to hint for. In most cases this will be an explicit Java class, and in those cases it's better to use the class itself as the hint, as in ^java.lang.String or ^java.util.List (interfaces and abstract classes are valid "classes" too for the purposes of this post). For brevity, you can import the relevant classes in your ns declaration; java.lang.* is always imported, which is why I could get away with ^String in the examples above.

However, not all Java types can be represented neatly by a Class instance like that. Specifically, primitives and arrays don't have literal class syntax in Java, and thus neither do they in Clojure.

Clojure does not support the full complement of Java primitives as type hints; it only supports long and double. For those, the Clojure compiler will accept the objects clojure.core/long and clojure.core/double (which are actually Clojure functions) as valid values for the :tag field. Since clojure.core is required with no alias by default, we were able to just use ^double and ^long in our Math/sqrt wrapper above.

EDIT: As Alex Miller pointed out on Reddit, the above paragraph is incorrect. The symbols long, double, int, etc. are accepted by the Clojure compiler as type hints when resolving interop calls; it's only when it comes to defining Clojure functions that the list of supported primitives is limited to double and long.
Moreover, there seems to be a difference with how type hints are resolved on Vars versus other places: in other places, they are just read, whereas on Vars they are also evaluated. Hence the ^long tag will, in most places, not resolve to the clojure.core/long function, but just remain the plain long symbol, which is recognized by the compiler as hinting for a primitive long.
For the particular case of type-hinting functions, this is why it is recommended to hint the argument vector (read) instead of the Var (evaluated), as the compiler does not know what to do with clojure.core/long as a type hint.
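
The read-versus-evaluated difference can be observed from the metadata itself. This sketch only inspects the tags; the names `answer` and `half` are invented for the example.

```clojure
;; Hint on the Var: the metadata map is evaluated, so the symbol long
;; ends up resolved to the clojure.core/long function.
(def ^long answer 42)
(fn? (:tag (meta #'answer)))

;; Hint on the argument vector: the metadata is only read, so the tag
;; stays the plain symbol double.
(defn half ^double [^long n]
  (/ n 2.0))
(-> #'half meta :arglists first meta :tag symbol?)

(half 3)   ; => 1.5
```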

Finally, arrays. The JVM has all sorts of weird rules around arrays, and describing their classes is no exception to that. Clojure tries to make things a little bit easier by providing readable shorthands for a few common cases. For example, to type hint a binding as an array of doubles, one can use the value doubles.
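
As an illustration of the doubles shorthand, here is a minimal sketch of a function summing an array of doubles. The function name `sum-doubles` is hypothetical.

```clojure
;; ^doubles tags xs as a double[], so aget and alength (used by areduce)
;; resolve to their primitive array versions.
(defn sum-doubles ^double [^doubles xs]
  (areduce xs i acc 0.0
           (+ acc (aget xs i))))

(sum-doubles (double-array [1.0 2.0 3.5]))   ; => 6.5
```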

But as soon as you get into multidimensional arrays, you have to fall back to the actual JVM names for their classes, so it's worth knowing about them. The gist of it is you need to use a string type hint (which will be resolved with Class.forName), and write it as a number of [ equal to your nesting level followed by, for primitives, a single letter, or, for objects, the class name for the type of the elements. Note that, unlike Java collections, Java arrays are not type-erased, so you can have specific classes here.¹

The easiest way to retrieve the class name for a type of array is to use the make-array function in a Clojure REPL:

t.core=> (.getName (class (make-array String 1 2 3)))
"[[[Ljava.lang.String;"
t.core=> (.getName (class (make-array java.util.function.Function 1 2 3)))
"[[[Ljava.util.function.Function;"
t.core=> (.getName (class (make-array Integer/TYPE 1 2 3)))
"[[[I"
t.core=> (.getName (class (make-array Long/TYPE 0 0 0)))
"[[[J"
t.core=>

Note that you need getName here; getCanonicalName will not give you a name that works with Class.forName.

The make-array function takes a type and any number of "sizes" and returns an array with the given dimensions. Note that the array type name is independent of the sizes, so any (non-negative int) value is fine, including 0 as in the last example; what matters here is the number of arguments.

For arrays of primitives, we still have the issue that there is no good way to write the class of a primitive directly in Clojure, but we can use the TYPE static field of the corresponding wrapper class instead (Integer/TYPE, Long/TYPE, and so on), as illustrated for int and long.
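
Putting the pieces together, here is a sketch of a string type hint for a two-dimensional array of longs. The names `grid-class`, `cell` and `grid` are made up for illustration.

```clojure
;; Retrieve the JVM name of long[][] and check that it round-trips.
(def ^Class grid-class (class (make-array Long/TYPE 0 0)))
(.getName grid-class)                  ; => "[[J"
(.getCanonicalName grid-class)         ; => "long[][]", not usable with Class.forName
(= grid-class (Class/forName "[[J"))   ; => true

;; The string hint tags grid as long[][]; the inner row still needs its
;; own ^longs hint, since reading a row yields a plain Object.
(defn cell ^long [^"[[J" grid ^long i ^long j]
  (aget ^longs (aget grid i) j))

;; into-array picks the element class from the first item, so an
;; array of long[] rows comes out as a long[][].
(def grid (into-array [(long-array [1 2 3])
                       (long-array [4 5 6])]))

(cell grid 1 2)                        ; => 6
```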

Code generation

If you are generating code, whether through macros at compile time or for explicit consumption by eval at run time, you need to bear in mind that what matters is the metadata on the forms seen by eval, not necessarily the ones seen by the reader. Therefore, sometimes the reader macro is not going to be appropriate.

In particular, metadata reader macros do not play nicely with quasiquotes (backticks), so you'll need to apply them explicitly with with-meta in generated code.

Here is an example, taken from code I wrote for last year's Advent of Code:

(defn to-bindings
  ([m] (to-bindings m {:bindings (), :rbindings {}}))
  ([m state]
   (match m
     [:return v] [v state]
     [:bind ma f] (let [[v state] (to-bindings ma state)]
                    (to-bindings (f v) state))
     [:expr expr & t] (if-let [sym (get (:rbindings state) expr)]
                        [sym state]
                        (let [sym (with-meta (symbol (str "r-" (count (:rbindings state))))
                                             (when (seq t)
                                               {:tag (first t)}))]
                          [sym (-> state
                                   (update :bindings concat [sym expr])
                                   (update :rbindings assoc expr sym))])))))

You can see the explicit construction of a symbol and its associated metadata. The constructed pair will later be used as a binding in a let form.
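
The same technique can be shown in a much smaller, self-contained sketch. The macro `string-length` below is hypothetical and not part of the code above; it builds a hinted symbol with with-meta and splices it into a syntax-quoted let.

```clojure
(defmacro string-length
  "Expands to a let that binds expr to a String-hinted local."
  [expr]
  ;; Attach the hint to the generated symbol explicitly, rather than
  ;; relying on the ^ reader macro inside the backtick form.
  (let [s (with-meta (gensym "s") {:tag 'String})]
    `(let [~s ~expr]
       (.length ~s))))

;; apply defeats the return hint on str, so the hint has to come
;; from the generated local.
(string-length (apply str "hello"))   ; => 5

;; The hint is visible on the binding symbol in the expansion.
(-> (macroexpand-1 '(string-length x)) second first meta)   ; => {:tag String}
```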

Hints & inference

Up to Clojure 1.10 (this is changing in 1.11), you are not allowed to type hint a primitive local that has been inferred. For example:

t.core=> (let [arr ^longs (make-array Long/TYPE 3)
    #_=>       c (aget arr 1)]
    #_=>   (aset arr 0 ^long c))
Syntax error (UnsupportedOperationException) compiling at (...).
Can't type hint a primitive local

t.core=>

The issue here is that c is already inferred to be a long, because it is an element of the array arr which is itself hinted as an array of longs.

This is usually not a problem in manually-written code, where you typically want to have as few type-hints as possible anyway. In this case, one can just choose to leave out the annotation on the last line:

t.core=> (let [arr ^longs (make-array Long/TYPE 3)
    #_=>       c (aget arr 1)]
    #_=>   (aset arr 0 c))
0
t.core=>

and everything works out (without reflection).

But it can be annoying in generated code, where the type inference can come from optional code and therefore you have to choose between:

Having your code only compile for some code paths, obviously not a great alternative.
Having your code be slow on some code paths due to reflection. This works but is obviously annoying.
Adding more code to your code generator to keep track of which bits of code you have emitted that could trigger inference, and only generate the type hint if it is actually needed. This can be a non-trivial amount of extra complexity compared to always emitting the type hint.

As of Clojure 1.11, it will be valid to type hint a primitive local, as long as the type hint matches the inferred type. Which arguably makes sense.

Conclusion

In most presentations, type hints are quickly skimmed over, because they're almost always a means to an end. And while they are overall not very complicated, there are some nuances to them. I hope this post will help shed some light on the topic and give you a bit more confidence in how to approach writing them.

¹ There is not much benefit to having a "real" type on an array if it is being used entirely from Clojure, because, in reflection-free code, all uses of the corresponding objects will be type-specific anyway. But if the goal is to pass that array to Java code, it may be important for it to have the specific, expected type. In that case one should also bear in mind that array types are strict: primitive array types are unrelated to one another, and an array of a supertype (say Object[]) is not accepted where an array of a subtype (say String[]) is expected.

Tags: clojure
