26 September 2021

Tools You Should Know About: jq

In a nutshell

From the official page

jq is a lightweight and flexible command-line JSON processor.

If you're familiar with tools like grep, sed, and awk, you can think of jq as being to JSON objects what those tools are to lines of text.

Why you should know about it

JSON is everywhere. For better or worse, it's become the de facto standard for data sharing. Being able to manipulate JSON on the command-line lets you bring all of your scripting tools to bear on JSON values. It also augments your scripting toolbelt with a new, very sharp toy.

If you deal with JSON, you need to know about jq. I don't care if you never use the command-line: if that's you, the command-line is worth learning just for jq.

Feature highlights

jq defines a small, specialized language for manipulating JSON values. The language is documented in the jq manual, but you should not start your jq journey by reading it: it's more of a reference document.

The main concept in the jq language is that of a filter. You can imagine your JSON values flowing through the jq command the same way lines of text flow through sed: the jq filter will be applied to each object in the current "flow" individually.

JSON objects can be compound, and thus some jq filters will produce multiple "output" objects for a single "input" object. In that case, you can chain multiple filters with the | operator, which should be a familiar concept if you are used to working at the command-line. The semantics of | is a bit like chaining flatMap operations: conceptually, you first run all of the inputs through the first filter, then collect all of the outputs (possibly more outputs than inputs), and run those through the second filter.
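
As a quick taste before we get to real invocations, here is a chain where the first two filters each produce more outputs than they receive (.[] emits every element of its input array), and the last one transforms each value:

$ echo '[[1, 2], [3, 4]]' | jq '.[] | .[] | . * 10'
10
20
30
40
$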

Pretty-printing

The easiest way to get started with jq is to use it for pretty-printing. Pretty-printing the output is the default behaviour, so all you need to do is pipe data through a jq invocation:

$ echo '{"a":1,"b":"hello","c":[1,2,3]}' | jq
{
  "a": 1,
  "b": "hello",
  "c": [
    1,
    2,
    3
  ]
}
$

This is implicitly equivalent to the '.' (identity) filter. When you state the filter explicitly, you can pass the input file as a second argument:

$ jq '.' <(echo '{"a":1, "b": "hello", "c": [1, 2, 3]}')
{
  "a": 1,
  "b": "hello",
  "c": [
    1,
    2,
    3
  ]
}
$

(Yes, that <() syntax is a file, sort of. At least as far as jq is concerned.)
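
If you're curious what jq actually sees there, you can ask the shell; the exact path will vary by shell and operating system:

$ echo <(echo '{}')
/dev/fd/63
$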

Generating new JSON values

jq can also be used to create brand new JSON values. The advantage here is that it will automatically JSON-escape values given on the command-line. For example:

$ jq -n --arg argName 'this has a " in it!' '{"title": $argName}'
{
  "title": "this has a \" in it!"
}
$

This is super useful if you have to collect (or generate) data and then include it in a JSON value. A real example from my job was collecting data from git commits and sending it to Slack. Sometimes people put weird characters in git commit messages.
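
As a sketch of that kind of pipeline (the payload shape and the git invocation here are illustrative, not the actual script):

$ jq -n --arg msg "$(git log -1 --pretty=%B)" '{"text": $msg}'

Whatever the latest commit message contains, quotes and newlines included, comes out as a single, correctly escaped JSON string.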

The -n flag tells jq not to read any input, i.e. in this case we are not applying a filter to an "external" JSON value. Note that we could still apply further filters to the value we construct:

$ jq -n --arg argName 'this has a " in it!' '{"title": $argName} | .title'
"this has a \" in it!"
$

but I can't come up with a great example of why you'd want to in this case.

Modifying JSON values

My deployment script for this blog keeps track of what is currently deployed in a JSON file. It's a small, simple Bash script, and it would not really be worth trying to write it in other languages, as it's mostly chaining together calls to lein, aws and terraform. Still, it was useful to be able to keep track of state in a structured manner, without having to come up with some Bash-friendly text format.

Here is the relevant bit:

jq -c --arg version "$VERSION" '. + [{version: $version, ami: null}]' \
  < tf/deployed \
  > tf/deployed.tmp
mv tf/deployed.tmp tf/deployed

The state of my deployment is a list of JSON objects with version (the version of my blog, a commit sha) and ami (the Amazon Machine Image used for that deployment). The $VERSION env var is set to the new version the script is deploying. This jq filter takes the existing content of the deployed file (which is a JSON list) and appends a new element to it.
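
For illustration, here is the same filter applied to a made-up state file:

$ echo '[{"version": "abc123", "ami": "ami-0042"}]' \
    | jq -c --arg version def456 '. + [{version: $version, ami: null}]'
[{"version":"abc123","ami":"ami-0042"},{"version":"def456","ami":null}]
$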

Merging JSON values

Whereas --arg allows you to take non-JSON input and turn it into a JSON-encoded string, --slurpfile will slurp in a whole JSON file and parse it as a JSON value.

$ jq --slurpfile p1 <(echo '{"name": "john", "age": 35}') \
     '[.[] | {owner: $p1, product: .}]' \
     <(echo '[{"id": 1},{"id": 2}]')
[
  {
    "owner": [
      {
        "name": "john",
        "age": 35
      }
    ],
    "product": {
      "id": 1
    }
  },
  {
    "owner": [
      {
        "name": "john",
        "age": 35
      }
    ],
    "product": {
      "id": 2
    }
  }
]
$

There's a little bit more going on with this filter. The .[] | part means that we assume the input is an array, and we want to operate on each element of the array. In the next part, {owner: $p1, product: .}, we are creating a new JSON object with two keys, owner and product, picking the value for owner from the file we slurped, and the value for product from the current element in the input array. (Note that --slurpfile always wraps the file's contents in an array, which is why owner is a one-element list here.) Finally, the wrapping [ ... ] specifies that we want to collect the results of processing all the elements of the initial input into an array.
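
It can help to see the .[] part in isolation. Note that it produces two separate output values, not an array; the wrapping [ ... ] is what collects them back into one:

$ echo '[{"id": 1},{"id": 2}]' | jq -c '.[]'
{"id":1}
{"id":2}
$ echo '[{"id": 1},{"id": 2}]' | jq -c '[.[]]'
[{"id":1},{"id":2}]
$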

Structured Bash state

Bash has arrays, but they're a bit clunky. Bash does not have dictionaries. But sometimes you do want a structured value, while it's still not quite worth moving up to a real language. jq can help here.

As an example, I have a script to help me manage Google Cloud instances. The problem it is solving is that, when you want to delete a machine, you have to know both the name of the machine and the zone it's in. I usually only know the name of the machine, so I'd have to look the zone up. It's annoying to have to go to the Google Console every time.

Another issue is that our machine names are a bit long, and I'd like to have autocomplete on them.

The solution I came up with was to have a small Zsh function that will query the API once, to get the list of all machines. After that, I can have fast, local auto-complete based on the machine name. But then I'd still have to look up the zone. So what I wanted was to store, locally, in my shell, a list of tuples (machine name, zone).

Here is how that works. First, I have a function refresh_machines that sets a (global to my current shell) variable machines:

refresh_machines() {
  machines=$(gcloud compute instances list --format=json [...] \
             | jq -c --arg prefix "$PREFIX" \
                  '[.[] | select(.name | startswith($prefix))
                        | {key: .name, value: (.zone | sub(".*/"; ""))}]
                   | from_entries')
}

Then, I have a function kill_machine:

kill_machine() {
  machine="$1"
  gcloud compute [...] --zone="$(echo "$machines" | jq -r --arg m "$machine" '.[$m]')"
}

and an associated auto-complete function that calls refresh_machines if $machines is not set when I try to auto-complete kill_machine.
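
That function is not worth reproducing in full, but a minimal sketch of the completion part, assuming Zsh's standard compdef machinery (names chosen for illustration), could look like this:

_kill_machine() {
  # lazily fetch the machine list the first time completion is requested
  [[ -z "$machines" ]] && refresh_machines
  # offer the keys of the JSON object (i.e. the machine names) as candidates
  compadd -- $(echo "$machines" | jq -r 'keys[]')
}
compdef _kill_machine kill_machine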

The refresh_machines filter in more detail:

  • .[]: run the rest of the filter once for each element of the input. The gcloud command will return a JSON array.
  • select (...): keep only elements from the original array that match a given condition (in this case, their .name field starts with a known prefix).
  • {key: ..., value: ...}: create (for each element in the input array) a new JSON object with just key and value as fields, respectively set to .name and the last part of the path in .zone of the input object.
  • from_entries: this turns an array into an object, expecting each element in the array to be an object with key and value fields.

In other words, the $machines variable contains a JSON object where field names are machine names, and the corresponding value is the zone that machine is in.

This lets us query the $machines object later on with just .["machine-name"].
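
To make that concrete, here is the same construction (minus the prefix filtering) and lookup on sample data:

$ machines=$(echo '[{"name": "dev-a", "zone": "projects/p/zones/us-east1-b"},
                    {"name": "dev-b", "zone": "projects/p/zones/europe-west1-d"}]' \
             | jq -c '[.[] | {key: .name, value: (.zone | sub(".*/"; ""))}]
                      | from_entries')
$ echo "$machines"
{"dev-a":"us-east1-b","dev-b":"europe-west1-d"}
$ echo "$machines" | jq -r '.["dev-a"]'
us-east1-b
$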

The -c option asks jq not to pretty-print, meaning the entire object is emitted on a single line with no extra whitespace (which is easier to deal with in Bash), and the -r option prints string results raw, without the surrounding quotes.
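
For a quick look at both options side by side:

$ echo '{"a": [1, 2]}' | jq '.a'
[
  1,
  2
]
$ echo '{"a": [1, 2]}' | jq -c '.a'
[1,2]
$ echo '{"a": "hello"}' | jq -r '.a'
hello
$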

Conclusion

This was a bit of a whirlwind tour. I don't expect you to come out of this with a deep understanding of how jq works or how to write your own jq filters. Rather, I hope you're coming out of it with:

  • The knowledge that jq exists. Maybe next time you have to deal with JSON values, you'll wonder if it can be applied (hint: yes).
  • Some idea of what jq can do.
  • Some idea of the use-cases for which it may be suited.

It can take a bit of time to learn, but it's tremendously useful and versatile. You won't learn it properly by sitting down and reading about it. Instead, you should make sure, right now, that you have it installed, and start using it immediately. You'll probably use it just for pretty printing at first, but that's good enough to keep it mentally nearby, and you'll find use-cases for it over time. That's when you'll actually learn: when you have a concrete task that you want to do with it.

There is an official tutorial; if you are convinced already, you can go read that now. Otherwise, you can refer to it later on, when you have a real use-case to anchor your learning.

Tags: tyska