13 February 2022

Notes on Optimizing Clojure Code: Data Structures

If you want to write efficient code, it's important to understand the performance characteristics of the building blocks you're using to write that code.

clojure.core: abstractions and performance

Clojure is a polymorphic language: every single collection function in the standard library is written against an abstraction (think Java interface, though that's not quite how it's implemented), rather than a concrete data structure. But, when thinking about performance, it is useful to distinguish between two types of polymorphism. Specifically, for the rest of this discussion, we'll pretend that there is a distinction between the "abstract" seq interface and the "concrete" collections (even though they are also abstractions).

Many functions in the standard library are built around the seq abstraction. Specifically, it is best to think of these functions as first transforming their input into a seq, and then acting on that seq, rather than as functions that act on collections directly. Other functions work on "concrete" collections directly.

The main semantic difference is that collection functions tend to return collections of the same type as their input collection, whereas seq functions always return seqs. This is explicitly mentioned in the function docstrings, but the reader has to know to look for that information.

The core seq functions are first and next, but most collection-processing functions in the clojure.core namespace are built on the seq abstraction (map, filter, partition, cons, last, etc.).1

Collection functions include peek, pop, conj, assoc, disj, etc.
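
For example, a quick illustrative REPL check (not part of the benchmarks below) makes the distinction visible:

;; seq functions return a seq no matter what collection they are given
(class (map inc [1 2 3]))   ;; => clojure.lang.LazySeq
;; collection functions return the same kind of collection they were given
(class (conj [1 2 3] 4))    ;; => clojure.lang.PersistentVector
(class (conj '(1 2 3) 4))   ;; => clojure.lang.PersistentList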

When writing performance-sensitive code, you should always check that you are using an appropriate collection function. For example, on a vector, both peek and last will return the last element, but:

(ns t.core
  (:require [criterium.core :as crit])
  (:gen-class))

(defn bench
  "Returns the mean execution time of calling f, in seconds."
  [f]
  (->> (crit/benchmark (f) {}) :mean first))

(defn -main
  [& args]
  (doseq [[n f] [["last" last] ["peek" peek]]
          size [10 100 1000]]
    (let [v (vec (range size))]
      (println (format "%s (%4d): %.2e (%d)" n size (bench #(f v)) (f v))))))

prints:

last (  10): 2.49e-07 (9)
last ( 100): 1.93e-06 (99)
last (1000): 1.94e-05 (999)
peek (  10): 9.39e-09 (9)
peek ( 100): 9.40e-09 (99)
peek (1000): 9.45e-09 (999)

The difference between the two comes from the fact that last is a seq function, which means that it is transforming the vector into a seq and then traversing all of the elements. That makes it \(O(n)\), where \(n\) is the number of elements in the vector. In contrast, peek is a collection function which, on vectors, accesses the last element by index directly. Vectors maintain their size, so getting the index of the last element is a constant-time operation, and accessing any element in a vector is an effectively constant-time operation, so the total is effectively constant-time.

Effectively constant-time is a complexity class sometimes used in computer science to denote algorithms that are not, strictly speaking, constant-time (\(O(1)\)), but are, for all intents and purposes, constant time.

For example, accessing the elements of a Clojure vector requires traversing a tree (actually a trie), the depth of which varies with the size of the vector. But because the tree has a branching factor of 32, and thus a depth of \(\log_{32}n\) (where \(n\) is the size), for all practical sizes that fit in memory on modern machines the depth, though variable, is bounded by a small constant (~7).
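
To get a feel for how slowly that depth grows, here is a small back-of-the-envelope sketch (the approx-depth helper is purely illustrative):

;; approximate depth of a 32-way trie holding n elements
(defn approx-depth
  [n]
  (long (Math/ceil (/ (Math/log n) (Math/log 32)))))

(map (juxt identity approx-depth) [1000 1000000 1000000000])
;; => ([1000 2] [1000000 4] [1000000000 6])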

Clojure data structures

Clojure only has a handful of core data structures:

  • Vectors are ordered collections of contiguously-indexed elements. Accessing an element by index, replacing an element by index, adding an element at the end or removing an element at the end are all effectively constant-time operations. The subvec function returns a narrower view on an existing vector in constant-time.
  • Maps are associative collections. Generally speaking, small maps are implemented as array-maps (Object[] arrays of key, val, key, val, etc.), and thus have linear lookup. Larger maps are hash-based, meaning that element access is (in the absence of hash collisions) effectively constant-time, just like vectors; adding, removing, and updating a kv pair are likewise effectively constant. Like vectors, maps know their size, so getting the size is constant. (A quick REPL check of the array-map/hash-map switch is sketched just after this list.)
  • Sets are unordered collections with no index and no duplicates. They are effectively maps where, for each pair, the key is equal to the value; from a performance perspective, they behave exactly like maps.
  • Lists are exactly what you would expect from a linked list: any element access requires traversing the list, and changing an element requires rewriting the list up to that element (tails are shared). Adding to the front of a list is constant-time, as is removing the first element.
  • Queues have amortized constant-time push (conj), peek and pop operations. Other operations are probably linear; you should not use a queue if your usage is not restricted to these three operations. Moreover, queues are the only data structure in this list without a literal syntax.
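
As a quick illustration of the array-map/hash-map switch mentioned in the maps bullet, here is a REPL sketch; the exact threshold (8 entries in the Clojure versions I've looked at) is an implementation detail and may change:

;; repeated assoc onto an empty map stays an array-map while the map is small...
(type (reduce (fn [m i] (assoc m i i)) {} (range 8)))
;; => clojure.lang.PersistentArrayMap
;; ...and switches to a hash-map once it grows past the threshold
(type (reduce (fn [m i] (assoc m i i)) {} (range 9)))
;; => clojure.lang.PersistentHashMap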

Clojure strings are java.lang.String instances with no wrapper, so have the same performance characteristics.

You should choose data structures that are fast for the operations you need to do on them. Generally speaking, Clojure is pretty good at making the fast operations also be the convenient ones, so clear, concise code will often match fast code here.

Queues & stacks

There are many situations that require either a queue or a stack. In Clojure, both lists and vectors can act as stacks indistinguishably if accessed only through the conj (push), peek and pop functions, though lists are substantially faster at it. If accessed through other means, one can observe that lists grow the stack at the beginning while vectors grow it at the end.

The same three functions can also be used to interact with Clojure's semi-hidden persistent queues.

(ns t.core
  (:require [criterium.core :as crit])
  (:gen-class))

(defn bench
  "Returns the mean execution time of calling f, in seconds."
  [f]
  (->> (crit/benchmark (f) {}) :mean first))

(set! *unchecked-math* :warn-on-boxed)

(defn -main
  [& args]
  (doseq [[n mt] [["list" ()]
                  ["vector" []]
                  ["queue" clojure.lang.PersistentQueue/EMPTY]]
          size [10 100 1000]]
    (let [f #(loop [elems (range size)
                    stack-or-queue mt
                    total 0]
               (cond
                 ;; phase 1: push every element onto the stack/queue
                 elems
                 (recur (next elems) (conj stack-or-queue (first elems)) total)
                 ;; phase 2 is done: everything has been popped back off
                 (empty? stack-or-queue)
                 total
                 ;; phase 2: pop one element and fold it into the running total
                 :else
                 (recur nil (pop stack-or-queue) (+ (* 10 total)
                                                    (long (peek stack-or-queue))))))]
      (println (format "%s (%4d): %.2e (%d)" n size (bench f) (f))))))

yields:

list (  10): 5.80e-07 (9876543210)
list ( 100): 5.41e-06 (-5010226785451976982)
list (1000): 5.62e-05 (-5010226785451976982)
vector (  10): 1.23e-06 (9876543210)
vector ( 100): 1.29e-05 (-5010226785451976982)
vector (1000): 1.37e-04 (-5010226785451976982)
queue (  10): 1.46e-06 (123456789)
queue ( 100): 1.47e-05 (5010226785451976971)
queue (1000): 1.53e-04 (5010226785451976871)

The dominance of lists over vectors for this use case hinges on the access pattern being strictly tied to the stack abstraction: if your algorithm mostly uses that collection as a stack but also sometimes needs to look at values in the middle, vectors may yield better performance overall.
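
As an illustrative sketch of that tradeoff (not part of the benchmarks above): nth is effectively constant-time on a vector but linear on a list, so occasional "look in the middle" accesses are much cheaper on vectors.

(let [v (vec (range 1000000))
      l (apply list (range 1000000))]
  (time (nth v 500000))   ; effectively constant-time index lookup
  (time (nth l 500000)))  ; walks half a million list nodes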

Java data structures

Clojure is designed as a hosted language, which (among other things) means it makes it easy to use host data structures. In this series, I'm focusing on Java as a host, but the same idea applies to other hosts: when faced with a problem, do take some time to look through the host libraries for potential solutions.

This holds for any type of problem, but doubly so for performance: there's a good chance that the host platform libraries will be able to take advantage of lower-level features to offer better performance than pure-Clojure solutions.

I'm obviously not going to go through the entire set of Java libraries here; instead, I'll just mention a handful of classes I have found useful for performance in the past:

  • java.util.ArrayList: it's often possible to rewrite local vectors to ArrayLists without too much of a hassle, and the performance increase is significant, especially if there are a lot of changes being made (see the sketch after this list).
  • java.util.BitSet: when you need a set of integers, bitsets are hard to beat. A plain boolean[] can sometimes be even faster, but bitsets have a much nicer API if the total size is not known in advance.
  • java.util.PriorityQueue: many algorithms you'll find in the literature build upon a mutable priority heap. This is the standard Java implementation.
  • Arrays: the fastest types on the JVM are primitive types, and that holds just as true for collections as for numerics. Passing around arrays of numerics completely bypasses all of the boxing problems; even Object[] arrays can occasionally be useful. The bar is quite a bit higher though: whereas rewriting from a Clojure vector to an ArrayList is fairly easy (assuming the vector is used in a local, linear way), rewriting to an array is often a much more involved change.
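
As a sketch of the first point (the squares-upto function is hypothetical, purely for illustration), the "local, linear" ArrayList pattern looks something like this: mutate a java.util.ArrayList inside the function, then hand back an immutable vector so callers never observe the mutation.

(defn squares-upto
  [^long n]
  (let [acc (java.util.ArrayList.)]
    (dotimes [i n]
      ;; mutate the local ArrayList; it never escapes this function
      (.add acc (* i i)))
    ;; convert back to an immutable Clojure vector before returning
    (vec acc)))

(squares-upto 5)
;; => [0 1 4 9 16]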

In my personal experience, despite trying multiple times, I do not remember a single instance where replacing a Clojure map with a Java mutable map (e.g. java.util.HashMap) resulted in a significant performance boost, but it is still an option worth keeping in mind.

The obvious downside of all of these is that they give up on immutability, so there is definitely a tradeoff there. As always, when pushing for performance, do take a minute to think about whether the performance gain is worth the maintainability hit.

Finally, you should also consider looking for Java (host) libraries outside the standard library, if you have a generic enough problem that there may be existing implementations.

Conclusion

In many cases, Clojure makes it easy to use the appropriate data structure, in that the one with the most convenient API for the problem at hand is usually also the fastest one among Clojure collections. You should strive to internalize the performance characteristics of the core Clojure functions and data structures, and to write fast code by default in the many cases where there is no maintainability tradeoff.

In the hopefully rare cases where the Clojure data structures are not fast enough, dig around in your host libraries.

Appendix: Big-O: what is it and when is it relevant?

The standard way to discuss the performance characteristics of algorithms is the so-called "big-O" notation, which appears in such sentences as "sorting an array is \(O(n\log n)\)". This notation is standard because it is useful, but like any useful tool there are ways to misuse it. The best way to avoid that is to understand what it means.

Formally, \(O(f)\) is the set of all functions that grow no faster than \(f\), up to a multiplicative constant. That is:

\[ f \in O(g) \iff \exists k \in \mathbb{R}:\lim_{x\to +\infty}\frac{f(x)}{g(x)} = k \]

Note that infinity is explicitly excluded as a possible value of \(k\) because infinity is not a member of \(\mathbb{R}\). Also note that the above definition is the "computer science" definition, where we make a number of additional assumptions compared to the more generalized "mathematical" definition (e.g. both \(f\) and \(g\) are always positive).
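
For example, \(3x^2 + 5x \in O(x^2)\), because

\[ \lim_{x\to +\infty}\frac{3x^2 + 5x}{x^2} = 3 \]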

As a shorthand, people often abbreviate "\(f\) is a member of the set \(O(g)\)" to "\(f\) is \(O(g)\)", sometimes writing it directly as "\(f = O(g)\)". The notation makes a lot more sense if we remember this is about set membership, though.

Typeset math is cool, but what does that actually mean and why is that useful? One way to read the definition is to say that \(f\) is not growing more rapidly than \(kg\) for large values of \(x\).

Why is that useful? The core question that big-O notation attempts to answer is "how does my resource consumption change if I change my input? Specifically, if I make my input larger, how much larger does my resource consumption grow?" What it means to "make the input larger" will depend on the specific algorithm being studied, but is usually pretty obvious. Big-O notation generalizes to functions of multiple variables, which can be useful to describe some complex input types.

In most cases the resource we are concerned about is time (or, equivalently, CPU cycles), but the same tools can be used to analyze any other type of resource (open files, memory, network utilization, etc.).

For example, if you know that the time complexity of your algorithm is \(O(n^{2})\), you know that, if your input is large and you double its size, your run time is, at worst, going to be multiplied by four, because that's what would happen to \(kx^{2}\) if you double \(x\).
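
Spelled out for the quadratic case:

\[ \frac{k(2n)^2}{kn^2} = \frac{4kn^2}{kn^2} = 4 \]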

There are algebraic rules that apply to combinations of functions under big-O. They directly fall out of the definition, and are out of scope of this blog post, so I am not going to enumerate all of them here, but here are a couple examples:

  • Non-zero constant factors can be omitted: \(O(kf) = O(f)\).
  • In a sum, only the larger term matters; for example: \(O(x^2 + e^x) = O(e^x)\).

Here are a few additional notes:

  • If your input is small, big-O analysis does not tell you anything. (What "small" means is, again, dependent on the algorithm. And the resources available, I suppose.)
  • We call \(O(f)\) the complexity class of \(f\).
  • Big-O sets "nest": if a function is big-O of \(n^2\), it's also big-O of \(n^3\). In such a case, the constant \(k\) will be zero for all but the most nested complexity class that contains that function; that most nested ("tight") class is written \(\Theta(f)\). In general, people want to know \(\Theta\), but \(O\) is easier to compute and/or prove.
  • A few common big-O sets have names: \(O(1)\) is called "constant", \(O(n)\) is called "linear", \(O(n^2)\) is called "quadratic", \(O(n^3)\) is called "cubic", \(O(\log n)\) is called "logarithmic", \(O(n\log n)\) is sometimes called "log-linear", and \(O(c^n)\) is called "exponential".
  • In most contexts, anything above \(O(n^3)\) is bad.

Finally, keep in mind that big-O notation is a tool meant to help you understand how the behaviour of a particular algorithm changes when its input size changes. It is not, in general, a good way to compare different algorithms, especially for reasonably small inputs: because big-O ignores constant factors, you can have "bad" big-O algorithms be faster than "good" ones.

As a concrete example, this is why Clojure uses array-based maps for small maps. Most map operations are linear (\(O(n)\)) on array-based maps, but when the size of the map is small, they can still be faster than the hash-based ones (which are roughly \(O(1)\)).


  1. For the sake of simplicity, I am specifically ignoring the transducer-returning variants of all of these functions here. But it's a good rule of thumb that if a function has a transducer-returning variant, its non-transducer version acts on seqs rather than concrete collections.

Tags: clojure