29 August 2021

Data-first API design

I used to be a Clojure programmer. I've been using it as my main programming language from the first time I took a serious look at it in early 2012 up until the end of 2019, when I decided to check out the competition.

I joined a company with a strong static typing culture in what was meant to be a full-time Haskell position.1 For the past three years, I've tried really hard to understand the benefits of static typing and category theory, and, while I've learned a lot, I still tend to fall back to Clojure whenever I contemplate starting a new project. That got me thinking: why?

Today, I want to talk about one of the reasons, probably the biggest one. Ironically enough, it's not really language-specific, but the culture in the Clojure community seems to encourage it a lot more than other language communities I've seen.

As the title suggests, I am talking about data-first APIs.

What do you mean, data?

The software engineering community as a whole is suffering from a bad case of jargon fragmentation. What Clojure developers mean when they say "data" is not readily translated to what other communities hear. Or, rather, it is readily translated to a slightly different notion.

This is not unique to "data"; a lot of seemingly technical words have slightly (or sometimes completely) different meanings depending on the community you're a part of.

In Clojure, "data" (hereafter "Clojure data") means any combination of a handful of core data types:

  • Atomic data types (numbers, symbols, keywords, strings, booleans, nil).
  • Lists.
  • Vectors.
  • Maps.
  • Sets.

The latter four are collection types which can nest arbitrarily with each other and have heterogenous content.

Of course, the notion of a data type is itself somewhat language-dependent, but hopefully we can gloss over that in this discussion.

What makes those types so useful at representing Clojure data is the homoiconicity of the language: all of these types have a direct, unambiguous mapping to a literal (text-based) representation, which means that Clojure data, as in arbitrary combinations of those types, can be trivially serialized to text. This in turn means it can be saved to disk, sent over a network, and more importantly deserialized to the same "value" on the other side (even across Clojure versions).

They also have sane equality semantics that the programmer can rely on, as well as a host of functions in the Clojure standard library to manipulate them in useful ways. And because the majority of most Clojure programs consists of manipulating Clojure data, most Clojure programmers are pretty familiar with a lot of these functions and therefore find manipulating such data easy.

Data-first API

The Clojure community has an aphorism that goes:

Data > Function > Macro.

What this means is that, as a general rule, one should design APIs (be they network APIs or library APIs) to use data where possible, functions where data cannot be used, and macros only to achieve things that are impossible to build with functions and data. (This makes macros fairly rare in Clojure code overall.)

It also means that, ideally, if a function is provided, it's only a convenience wrapper around a data representation of the API, and if a macro is provided, it's only a convenience wrapper around a set of functions.

So the ideal API has one function that takes in a big blob of data, and then optionally a few functions to, say, simplify the API for some common, simple cases, and finally perhaps a macro or two to simplify usage of these functions in the most common cases.

Parsing CLI args

As a concrete example of the data-first approach, we can take a look at the tools.cli Clojure library. From its README, here is an example usage:

(ns my.program
  (:require [clojure.tools.cli :refer [parse-opts]])
  (:gen-class))

(def cli-options
  ;; An option with a required argument
  [["-p" "--port PORT" "Port number"
    :default 80
    :parse-fn #(Integer/parseInt %)
    :validate [#(< 0 % 0x10000) "Must be a number between 0 and 65536"]]
   ;; A non-idempotent option (:default is applied first)
   ["-v" nil "Verbosity level"
    :id :verbosity
    :default 0
    :update-fn inc] ; Prior to 0.4.1, you would have to use:
                   ;; :assoc-fn (fn [m k _] (update-in m [k] inc))
   ;; A boolean option defaulting to nil
   ["-h" "--help"]])

(defn -main [& args]
  (parse-opts args cli-options))

The library essentially exposes a single function (parse-opt), which takes in the actual parameters the user typed and a data representation of what options are expected. That function in turn returns data representing the parsed configuration.

How else would one go about designing a library to parse CLI arguments? Let's take a look at a Haskell one: optparse-applicative. The README also starts with an example usage:

import Options.Applicative
import Data.Semigroup ((<>))

data Sample = Sample
  { hello      :: String
  , quiet      :: Bool
  , enthusiasm :: Int }

sample :: Parser Sample
sample = Sample
      <$> strOption
          ( long "hello"
         <> metavar "TARGET"
         <> help "Target for the greeting" )
      <*> switch
          ( long "quiet"
         <> short 'q'
         <> help "Whether to be quiet" )
      <*> option auto
          ( long "enthusiasm"
         <> help "How enthusiastically to greet"
         <> showDefault
         <> value 1
         <> metavar "INT" )

main :: IO ()
main = greet =<< execParser opts
  where
    opts = info (sample <**> helper)
      ( fullDesc
     <> progDesc "Print a greeting for TARGET"
     <> header "hello - a test for optparse-applicative" )

greet :: Sample -> IO ()
greet (Sample h False n) = putStrLn $ "Hello, " ++ h ++ replicate n '!'
greet _ = return ()

At first glance, this looks very similar. The definition of the sample parser is similarly declarative and, like cli-options in Clojure, essentially lays out the options in a fairly readable format. As a bonus, we get to define the shape of our option map (here Sample), and get static type checking on it.

The main focus of the optparse-applicative library is to provide a nice API to create parser descriptions, and it succeeds at that. Using those descriptions to then parse the actual CLI args is fairly trivial. One does need to understand how the Applicative typeclass works in Haskell to use it, but that's a fair assumption for a Haskell library, given how central that typeclass is to the language.

Before we move on, let's look at a third library targeting the same space, this time in the Scala language, scopt:

import java.io.File
case class Config(
    foo: Int = -1,
 // ...
    kwargs: Map[String, String] = Map())

import scopt.OParser
val builder = OParser.builder[Config]
val parser1 = {
  import builder._
  OParser.sequence(
    programName("scopt"),
    head("scopt", "4.x"),
    // option -f, --foo
    opt[Int]('f', "foo")
      .action((x, c) => c.copy(foo = x))
      .text("foo is an integer property"),
    // more options here...
  )
}

// OParser.parse returns Option[Config]
OParser.parse(parser1, args, Config()) match {
  case Some(config) =>
    // do something
  case _ =>
    // arguments are bad, error message will have been displayed
}

Again, we have a layout that broadly resembles our previous examples. One of the big differences here is the action field, which contains a function that will be run with the option. In a language that does not track effects, this can be a dangerous (or useful) feature.

Also of note, we seem to need to call OParser.sequence to represent a sequence of things, depite the Scala language having many sequential types in its standard library.

Advantages of data-first

On the face of it, these three libraries may seem broadly similar. For the simple use-cases illustrated in the README, ignoring the syntactic differences between their host languages, there doesn't seem to be a huge differentiating factor between them.

One small advantage I could claim for the Clojure one is that it's only using the default Clojure syntax and types for Clojure data: as a programmer, I know how that syntax works and can focus on the semantics the library assigns to that data. On the Haskell and Scala sides I would have to understand all of the functions and objects involved, how they can or cannot be combined, etc. While I do believe there is some truth in that argument, I also do not think it's a very convincing one, especially given that it seems intricately tied to the lack of type safety of the Clojure approach. Depending on individual sensitivities, it's not clear that that tradeoff is always worth it.

But what if the use-case becomes more complex? Some applications can get away with a straight-up CLI definition like that, but some have different requirements.

The main advantage of a data-first approach is flexibility. Not so much for the API deisgner, but for programs using the API.

Say you needed to enforce a rule that no CLI argument can be longer than 16 characters. It's immediately obvious to any Clojure programmer how to write a function that walks down cli-options and checks that. Here it is:

(defn check-arg-length
  [max-length opts]
  (every? (fn [[_ long-opt]] (<= (count long-opt) 16))
          opts))

Is it obvious to you how to do that on either the Haskell or Scala one? How do you iterate over the options in a Parser Sample or an OParser[Config]?

The designer of the tools.cli package did not predict this use-case. It's not that they have better foresight than the designers of the other two, but by choosing a data-first approach, they have inherently given flexibility to their users.

More importantly, as a user, I did not need to predict that use-case either. I just used the API, which required me to have my configuration as data, and I can use data in all sorts of other ways than just to feed it into that library.

Environment variables

Most applications support (or want to support) more than one source of configuration. The typical set is environment variables, CLI args and configuration file, with well-defined priorities between them if the same option is set at different levels. Ideally, you'd want the set of options that can be configured through each of these three input vectors to be the same, with similar semantics and ideally similar names. Perhaps the environment variable is the SCREAM_CASE version of the kebab-case long-form CLI arg, and the config file uses snake_case because it's in JSON and someone forgot JSON is not JavaScript.

Ideally, in that situation, you would want one "source of truth" and then derive the other two (or all three) from that. Again, it's perfectly obvious to a Clojure programmer how to navigate cli-options and use it to scan environment variables, or map it to any number of configuration formats (YAML, JSON, etc.) It's not all that obvious to me how to do that with any of the other two.

In fact, a few months back I tried to do just that with a Scala application using scopt, and the only way it was even possible to get at the option names from a parser was to use undocumented (though thankfully not explicitly private) APIs and objects. The code looked something like:

private def parseConfig(args: collection.Seq[String]): Option[Config] =
   // was: configParser().parse(args, Config.Empty)
   configParser().parse(args, configFromEnv(getEnvVar))

trait ExposeData[C] {
    def optionDefs: Seq[scopt.OptionDef[_, C]]
}

// This existed before, but I added the `with` extension
private def configParser(): scopt.OptionParser[Config]
                            with ExposeData[Config] =
  new scopt.OptionParser[Config]("app-name") with ExposeData[Config] {
    // options, as per README above
    // ...

    // Added this:
    override def optionDefs = options.toSeq
    // options is, thankfully, protected and not private
}

// ...

private def configFromEnv(): Config = {
  val ods = configParser().optionDefs.map(_.optional)
  val cliFromEnvVars = ods
    .map(_.fullName)
    .filter(_.length > 0)
    .filter(_ != "--help")
    .map(cli => (cli, cli.toUpperCase
                         .replaceAll("-", "_")
                         .substring(2)))
    .map({ case (cli, n) => (cli, n, env(n))})
    .filter({ case (_, _, o) => o.isDefined})
    .flatMap({ case (cli, _, o) => Seq(cli, o.get)})
  val (config, _) = scopt.OParser.runParser(
      scopt.OParser(ods.head, ods.tail.toList),
      cliFromEnvVars,
      Config.Empty)

  config.getOrElse(Config.Empty)
}

In plainer English, here are the hoops I had to jump through:

  • The data is not exposed at all by default, but is thankfully held in a protected variable, so we can extend the object and expose it. If it'd been private we'd have needed to resort to reflection to publish it.
  • Make all arguments optional, because the runParser function returns an Option[C], i.e. if a mandatory argument is missing we get None, instead of a helpful representation of the arguments that were actually there. At this point you should probably wonder if that _.optional call is mutating anything.
  • cliFromEnvVars starts off sensible: we scan all the options (now that we have them), get their names, and turn that into environment-variable-friendly names. A minor annoyance here is that the names we get include the initial double-dash, so we have to remove that (that's what .substring(2) does).
  • cliFromEnvVars does not stop there, however. The only way for us to reuse the existing parser definition we have is to build up a string that looks like a CLI invocation, based on the values we find in the environment, and then run the parser on that, yielding (hopefully) a Config object.
  • Finally, we return said Config object to serve as the "default" value our actual CLI parsing starts from.

Note that if any environment variable contains an invalid value (say, a letter where a number is expected), this code will ignore all environment variables silently. One could of course do better here, but at the cost of more efforts and code complexity.

None of that is insurmontable. After all, I did get that code working. But, for someone who knows about the data-first approach, it is painfully obvious that this is all completely unnecessary work; it's completely self-inflicted pain. Granted, not a huge amount of pain, but still: why?

Going the other way around is also annoyingly hard: if, instead, the CLI definition is not the source of truth, how do you generate it?

In Clojure, assuming the source of truth is Clojure data (and contains all of the required information), or someting that can easily be turned into it (JSON, XML, etc.), it's trivial to write code that generates the representation tools.cli needs, and that will be done using generic Clojure functions to manipulate the same set of types you manipulate all the time. So either you already know how to do it, making it fast and easy to do, or you learn things that are broadly applicable to all Clojure code bases, regardless of whether they use tools.cli, or indeed do any CLI arg parsing at all.

It's probably easier to generate optparse-applicative and scopt definitions than to parse them, but it's still going to require a bit of research in order to gain knowledge that is completely specific to those two libraries and useful nowhere else.

Conclusion

It's not difficult to design a library API in a data-first way. It's a shift in perspective, but it's not necessarily harder. After all, the order of information is essentially the same in all three README examples. If you think that's important, you can still have DSL-looking function calls in order to generate the data, which could look exactly like the current definitions in Haskell and Scala. They would just be building up a data representation instead of the opaque objects they currently are.

The internal implementations of execParser and OParser.parse would need to change, but that's about it. And on the plus side, it would be a great excuse to write an interpreter.

This is, in my opinion, the strongest feature of the Clojure language. As a language, it makes it easy to design APIs in this way. In fact, it almost makes it hard to do anything else. As a community, most people do follow the "Data > Function > Macro" aphorism, which means that a lot of the APIs one interacts with are data-driven.

You don't need to work in Clojure to use this technique. This is very much an approach you can follow in any language. It also does not require the entire ecosystem to change in order to reap benefits: any one library designed this way is already a net win.

Do your users a service. Next time you design an API, think in terms of data first. Their code will be better for it.


  1. That did not exactly panned out as planned, but I still got a lot of exposure to Haskell, as well as a few opportunities to use it at work.

Tags: clojure