The elements of git
This is not a tutorial. If you're looking for a quick, easy, "how to use git" kind of post, look elsewhere.
The goal of this post is to give you just enough understanding of the git internals that you can build up a correct intuition of what various git commands actually do under the hood.
The git behind the curtain
git is a fraud. You may think it's a complicated piece of code that does a lot of impressive things. After all, that's the command you invoke all the time.
It's actually a very simple piece of code: the git command only does one, very simple, thing: it looks at its first argument, arg, and invokes git-arg with the rest of the arguments.
There's a bit more to it as it also resolves the path to the proper git-arg, regardless of whether you have all the git-* commands in your PATH. But the point is: git is not one program. It's a collection of different programs, written by different people and sometimes in different languages, that all work together pretty seamlessly.
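The dispatch described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not git's actual implementation: the function name resolve_subcommand and its exec_path parameter are made up for illustration, and the real git executable handles more cases than this.

```python
import os
import shutil


def resolve_subcommand(argv, exec_path=None):
    """Map ['git', 'arg', ...] to the program name 'git-arg' and locate it.

    exec_path is a hypothetical directory searched before PATH.
    Returns (program_name, found_path_or_None, remaining_args).
    """
    if len(argv) < 2:
        raise ValueError("usage: git <command> [<args>]")
    program = "git-" + argv[1]
    rest = argv[2:]
    search = os.environ.get("PATH", "")
    if exec_path:
        search = exec_path + os.pathsep + search
    return program, shutil.which(program, path=search), rest


program, location, rest = resolve_subcommand(["git", "commit", "-m", "hello"])
print(program, rest)  # whether `location` is found depends on the machine
```

The only logic is string concatenation plus a path lookup, which is the point: the interesting work lives in the individual git-* programs.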
How do they achieve that kind of collaboration? Glad you asked.
The git data model
The secret is that git is, first and foremost, an on-disk data structure
(hereafter "git repo"). It's very well-defined and documented, which means that
anyone can write a new tool that processes a git repo. This is a great way to
organize software, as it allows each tool to do what it does best, and a
minimal amount of top-level glue (the git executable) to tie all of them together and present a seamless experience.
I believe that this "data-first" approach is what allowed git to beat mercurial, by enabling a team completely unrelated to the git project to build GitHub in a different programming language. By contrast, mercurial was architected primarily as a single program, with the internal data representation left as an unstable implementation detail. This gave it some level of unity in both code and user experience, but made it a lot harder to build other projects on top of it.
The other nice property of having a suite of tools built around a common data model is that, if one understands the data model, one can very quickly grasp what any new tool does with it. This is what the rest of this post is about.
Objects and refs
The git model is primarily composed of, on the one hand, four types of objects, and on the other hand, refs to said objects.
As a first approximation, a git repo is a folder with two subfolders: objects and refs.
Objects are stored in the objects directory following a content-based naming scheme in which every object is named by the SHA1 of its content [1]. (For filesystem efficiency reasons, the objects folder has 2-letter subfolders containing the first two letters of the SHA1, and the actual files are named with the other 38 letters of the SHA1.) In general, objects can have any content and are immutable: because they are named by their SHA1, if you change the content, you get a new name, and thus a new object. There may be circumstances in which you can then delete the old object, but we'll get to that later on.
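Here is a minimal sketch of that naming scheme in Python, using only the standard library. The helper names object_id and object_path are mine, not git's. The hash includes the type-and-size header mentioned in the footnote, which is why a plain sha1sum of the content does not give the same answer.

```python
import hashlib


def object_id(kind, content):
    """SHA1 over '<kind> <size>', a NUL byte, then the raw content."""
    header = f"{kind} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def object_path(sha):
    """Location of a loose object: objects/<first 2 chars>/<other 38 chars>."""
    return f"objects/{sha[:2]}/{sha[2:]}"


empty_blob = object_id("blob", b"")
print(empty_blob)
print(object_path("011c9eb4838434f7e431073feb592fce99f6ad74"))
```

Changing a single byte of the content gives a completely different name, which is what makes objects effectively immutable.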
Refs are stored in the refs folder. There are semantically-meaningful subfolders in refs, but for now let's just focus on the refs themselves: a ref is a text file whose content is exactly one SHA1. It is a reference to the git object named by that SHA1.
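Reading a ref is correspondingly simple. The sketch below builds a throwaway folder that mimics the layout, then reads one ref back. The read_ref helper is hypothetical, and it only handles refs stored as individual files, which is an assumption: a real repo may also keep refs in other places.

```python
import re
import tempfile
from pathlib import Path

SHA1_RE = re.compile(r"[0-9a-f]{40}")


def read_ref(repo, name):
    """Return the SHA1 stored in the ref file <repo>/<name>."""
    text = (Path(repo) / name).read_text().strip()
    if not SHA1_RE.fullmatch(text):
        raise ValueError(f"{name} does not contain a SHA1: {text!r}")
    return text


with tempfile.TemporaryDirectory() as repo:
    # Fake repo folder containing a single ref file.
    ref_file = Path(repo) / "refs" / "heads" / "main"
    ref_file.parent.mkdir(parents=True)
    ref_file.write_text("331c11ee2048a2ec0a1120ccba17085c053eefa0\n")
    main_sha = read_ref(repo, "refs/heads/main")

print(main_sha)
```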
There are four object types and three ref types; we're now going to take a look at each of those in order.
blob objects
The simplest type of git object is the blob. A blob is just a delimited stream of bytes, with no further structure as far as git is concerned. They are opaque, atomic units of content. In practice, blob objects will represent the content of files in your git worktree.
tree objects
A tree object represents a directory. It is (conceptually) a simple text file where each line represents an entry in the directory; entries can themselves be of type tree for nested directories, or of type blob for files in the current directory.
Here is an example of a tree object in the git repo for this blog:
$ git cat-file -p 011c9eb4838434f7e431073feb592fce99f6ad74
100644 blob b1bef1e6634f68e14f3e45f5accc9a6ead0852d5 config.edn
040000 tree c6d62294a2323c2047b43787d55a44ff8c94f565 css
040000 tree 07ef3db049157a829ad3b10f2f6e7936ab5aba47 md
$
As previously explained, that file is stored in .git/objects/01/1c9eb4838434f7e431073feb592fce99f6ad74. The file itself is a compressed representation of an encoding of that text, which is why we're using git-cat-file with the -p (pretty print) option to take a look.
Note that the tree does not know its own name: the name of a tree is determined by the parent. This is good if you have two subfolders with the same content in different places in your repo (or at different times): if only the name of a folder changes, all git needs to do is rewrite the parent tree, but the tree itself and all of its content is completely unchanged.
The same goes for blobs: blobs don't know their own name, so renaming a file only requires changing the parent tree, not the blob itself. (And recursively all parent trees, so renaming a file deeply nested down is more "costly", though that's a negligible cost on most filesystems.)
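To make the renaming argument concrete, here is a sketch that encodes a tree and hashes it. It assumes the raw (non-pretty-printed) tree encoding is, for each entry sorted by name, the mode, a space, the name, a NUL byte, then the 20 raw bytes of the SHA1. The function names and the file names are made up for the example.

```python
import hashlib


def object_id(kind, content):
    """SHA1 over '<kind> <size>', a NUL byte, then the content."""
    header = f"{kind} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def encode_tree(entries):
    """Encode (mode, name, sha_hex) entries as raw tree bytes."""
    def sort_key(entry):
        mode, name, _ = entry
        # Assumption: directories sort as if their name ended with '/'.
        return name.encode() + (b"/" if mode == "40000" else b"")

    body = b""
    for mode, name, sha in sorted(entries, key=sort_key):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(sha)
    return body


blob = object_id("blob", b"hello\n")
before = object_id("tree", encode_tree([("100644", "old-name.txt", blob)]))
after = object_id("tree", encode_tree([("100644", "new-name.txt", blob)]))
empty_tree = object_id("tree", encode_tree([]))
print(before != after)  # True: the tree changes, the blob id is reused as-is
```

The blob id is computed once and reused in both trees: the file name only ever appears in the parent tree.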
commit objects
This is the one that ties everything together. Like tree objects, commit objects can be thought of as text files. Their structure is a little bit more complicated, though: they have a number of headers, some of which are optional, followed by the content (the commit message), much like HTTP requests have a set of headers followed by a body.
Let's start with looking at an example:
$ git cat-file -p 331c11ee2048a2ec0a1120ccba17085c053eefa0
tree 8a634ffbf8d921e17de4b0202c7c2c289b89e467
parent c00dd30f7eb28d6153b7ea393eca0f79f8cb963b
author Gary Verhaegen <gary.verhaegen@gmail.com> 1632001353 +0200
committer Gary Verhaegen <gary.verhaegen@gmail.com> 1632001353 +0200
blog: copy DNS records from OVH
$
We can see four headers here. The tree header references a tree object which will serve as the root of the worktree for this commit. The parent header (which can appear any number of times) represents the set of parents for this commit. In most commits, you'll have one, but merge commits may have two or more, and the initial commit(s) will have none.
author and committer are useful metadata; each appears exactly once per commit. It's important to realize that they are just text in a text file and thus very easy to forge. If that is a concern for you, you should sign your commits.
Note that git is a blockchain. A blockchain is a succession of "blocks" that are identified by a hash of [their content and a reference to their parent], which is how the "chain" is formed.
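The chaining is easy to see in a sketch. The commit_id function below is a hypothetical helper that assembles the text of a commit (headers, blank line, message) and hashes it with the same header scheme as other objects. The identity string is made up, and the tree SHA1 is borrowed from the example above.

```python
import hashlib


def commit_id(tree, parents, author, committer, message):
    """Assemble the commit text and return its SHA1-based name."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = ("\n".join(lines) + "\n").encode()
    header = f"commit {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


WHO = "A U Thor <author@example.com> 1632001353 +0200"  # made-up identity
TREE = "8a634ffbf8d921e17de4b0202c7c2c289b89e467"

root = commit_id(TREE, [], WHO, WHO, "first")
child = commit_id(TREE, [root], WHO, WHO, "second")

# Tamper with the root: its id changes, and so does the id of any child
# built on top of it, even though the child's own tree and message are equal.
forged_root = commit_id(TREE, [], WHO, WHO, "first, edited")
child_of_forged = commit_id(TREE, [forged_root], WHO, WHO, "second")
print(child != child_of_forged)  # True
```

Because each commit's name depends on its parents' names, rewriting any ancestor forces new names for every descendant.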
tag objects
Tag objects have a similar structure to commit objects, but a different set of possible headers. In particular, tags do not have a parent, and can point to any type of object. Most often, tags will point to a commit, but you could tag a specific tree or blob (or, I suppose, a tag).
The git UX makes it sometimes hard to differentiate between tag objects and tag refs. See the "tag ref" section below for more information.
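As an illustration of the "headers, then content" shape, here is a sketch that formats and parses a tag object. The header names (object, type, tag, tagger) are the ones I understand annotated tags to use; treat that as an assumption and check with git cat-file -p on a real tag. The function names, tag name, and tagger identity are invented.

```python
def format_tag(target, target_type, name, tagger, message):
    """Assemble the text of a tag object: headers, blank line, message."""
    return (
        f"object {target}\n"
        f"type {target_type}\n"
        f"tag {name}\n"
        f"tagger {tagger}\n"
        "\n"
        f"{message}\n"
    )


def parse_headers(text):
    """Split 'key value' header lines from the body at the first blank line."""
    head, _, body = text.partition("\n\n")
    headers = [tuple(line.split(" ", 1)) for line in head.split("\n")]
    return headers, body


tag_text = format_tag(
    "331c11ee2048a2ec0a1120ccba17085c053eefa0",
    "commit",
    "v1.0",
    "A U Thor <author@example.com> 1632001353 +0200",  # made-up identity
    "first release",
)
headers, body = parse_headers(tag_text)
print([key for key, _ in headers])  # ['object', 'type', 'tag', 'tagger']
```

The same parse_headers function works on the commit example shown earlier, since both object types share the header-then-body layout.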
branch ref
Branch refs are text files that contain a SHA1. They can represent either local or remote branches, respectively under .git/refs/heads and .git/refs/remotes. They are generally managed through the HEAD special ref.
The "name of the branch" is simply the name of the file. You can have slashes in a branch name; the file will then be nested in subfolders under .git/refs/heads.
Branch refs always point to a commit.
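The mapping between branch names and files can be sketched as follows. The list_branches helper is hypothetical and only looks at ref files in a fake refs/heads folder. The branch name feature/login is made up.

```python
import tempfile
from pathlib import Path


def list_branches(repo):
    """Return local branch names found as files under <repo>/refs/heads."""
    heads = Path(repo) / "refs" / "heads"
    return sorted(
        p.relative_to(heads).as_posix() for p in heads.rglob("*") if p.is_file()
    )


with tempfile.TemporaryDirectory() as repo:
    for name in ["main", "feature/login"]:
        ref = Path(repo) / "refs" / "heads" / name
        # A slash in the branch name becomes a nested folder.
        ref.parent.mkdir(parents=True, exist_ok=True)
        ref.write_text("c00dd30f7eb28d6153b7ea393eca0f79f8cb963b\n")
    branches = list_branches(repo)

print(branches)  # ['feature/login', 'main']
```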
tag ref
Tag refs are stored under .git/refs/tags. They are, by themselves, no different from branch refs; they differ only in how the special ref HEAD handles them. Tag refs, like tag objects, can point to any type of object.
Note that a tag ref is a simple ref: it does not allow one to associate any metadata with the tag (beyond its name). You create a tag ref by using the git tag <tagname> <object> command. If you want to associate extra metadata with a tag (for example, a signature), you need a tag object, which is created by the git tag -a <tagname> <object> command.
Special refs
In addition to tags and branches, git keeps track of a number of "special" refs, directly under .git. The most important one is .git/HEAD, which points to the current commit.
Special refs can be direct or symbolic. A direct reference for HEAD would mean that the .git/HEAD file contains a SHA1. This is generally not what you want. A symbolic reference for HEAD means that .git/HEAD contains the path to a tag or branch.
If HEAD contains a direct reference or a symbolic reference to a tag (or a remote branch), your repo is in what's known as "detached HEAD" state. What this means is that the next git commit command will:
- create a new commit object.
- mutate the .git/HEAD file to point to the new commit's SHA as a direct ref.
This may not look bad, but contrast that to what the git commit command does when HEAD is a symbolic ref to a local branch:
- create a new commit object.
- resolve HEAD to the branch file.
- mutate the branch file to point to the new commit.
Note that, in this case, HEAD is unchanged. Instead, git updates the branch file, which is why branches give this illusion of continuity while working "on a branch". This also means that you can switch to another commit or branch without losing your work.
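The two behaviours can be contrasted in a toy model where the repo is just a dict mapping file paths to file contents. The commit function is a made-up stand-in for the ref-updating part of git commit; it assumes a symbolic HEAD is written as "ref: " followed by the path, and the SHA1s are fake placeholders.

```python
def commit(repo, new_sha):
    """Update refs the way the two lists above describe."""
    head = repo["HEAD"]
    if head.startswith("ref: "):
        # Symbolic HEAD: follow it and move the branch; HEAD is untouched.
        repo[head[len("ref: "):]] = new_sha
    else:
        # Detached HEAD: HEAD itself is overwritten with the new SHA.
        repo["HEAD"] = new_sha


OLD, NEW = "c0" * 20, "c1" * 20  # fake 40-character ids

on_branch = {"HEAD": "ref: refs/heads/main", "refs/heads/main": OLD}
commit(on_branch, NEW)

detached = {"HEAD": OLD, "refs/heads/main": OLD}
commit(detached, NEW)

print(on_branch["refs/heads/main"] == NEW)  # True: the branch moved
print(detached["refs/heads/main"] == OLD)   # True: no branch knows about NEW
```

In the detached case, the only thing pointing at the new commit is HEAD itself, so moving HEAD elsewhere leaves nothing pointing at it.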
The other special refs are mostly for the internal workings of various git commands, and we're not going to go into more details in this post.
How this all fits together
In the simplest case, a git repo will have:
- HEAD as a symbolic ref to refs/heads/main.
- refs/heads/main as a direct ref to a current commit C0.
- C0 pointing to a tree T0, and to a parent C1.
- C1 pointing to a tree T1, and possibly more parents.
- T0 may point to blobs B0 and B1 as well as a subfolder T2.
- T1 may point to blobs B0 and B2 as well as a subfolder T2, assuming B2 is the one file that was changed in-between C0 and C1.
There are a few consequences of that approach. First, git does not store diffs. Git computes diffs at the time of showing them. This means that the performance of a git diff between two commits is a function of how different their content is, not how far apart they are in the history DAG. Reusing trees makes for very efficient diffing, as the algorithm knows it does not need to descend into common trees at all.
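Here is a sketch of why shared trees make diffing cheap. It models the T0/T1/T2 example with plain dicts; the file names (a, b, sub, c) and the blob B3 are invented, and diff_trees is an illustrative function rather than git's actual diff algorithm.

```python
# Toy object store: tree id -> {entry name: (kind, id)}.
TREES = {
    "T0": {"a": ("blob", "B0"), "b": ("blob", "B1"), "sub": ("tree", "T2")},
    "T1": {"a": ("blob", "B0"), "b": ("blob", "B2"), "sub": ("tree", "T2")},
    "T2": {"c": ("blob", "B3")},
}
visited = []  # records which tree pairs we actually had to open


def diff_trees(old, new, prefix=""):
    """Return the paths whose entries differ between two tree ids."""
    if old == new:
        return []  # same id means same content: no need to descend
    visited.append((old, new))
    old_entries, new_entries = TREES.get(old, {}), TREES.get(new, {})
    changed = []
    for name in sorted(old_entries.keys() | new_entries.keys()):
        a, b = old_entries.get(name), new_entries.get(name)
        if a == b:
            continue
        path = prefix + name
        if a and b and a[0] == b[0] == "tree":
            changed += diff_trees(a[1], b[1], path + "/")
        else:
            changed.append(path)
    return changed


changes = diff_trees("T1", "T0")
print(changes)  # ['b']
print(visited)  # [('T1', 'T0')]
```

The shared subfolder T2 is never opened: comparing its id in both parents is enough to know nothing below it changed.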
Not storing diffs also means that any time you change a file, git stores a completely new copy of that file. This may seem fairly wasteful in terms of storage space, and to our first approximation it would be. Fortunately, there is an easy solution to that.
git packs
If you look into .git/objects, you'll see a bunch of two-letter folders, as described above, but you'll also see two folders named pack and info.
Keeping all the objects as separate files can lead to a lot of filesystem accesses. git was designed specifically for managing source code, which means it generally expects lots of small files, which in turn means lots of file accesses, which means lots of kernel calls, and that's slow.
So git can also store its objects in what is known as a "pack". The details are not important here; what matters to our size concerns is that a pack is a compressed representation, which means that for fairly transparent file formats like plain text it's likely the compression algorithm will take care of not repeating large chunks of identical text too many times.
You can explicitly ask git to "repack" all of its objects with the git repack command.
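The storage argument can be illustrated with plain zlib from the Python standard library. This is not the pack format, whose details are deliberately skipped here; it only shows that compressing two nearly identical versions of a file together costs much less than compressing them separately. The file content is synthetic.

```python
import hashlib
import zlib

# Deterministic, poorly compressible "file": 200 lines of hex digests.
lines = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(200)]
version_1 = "\n".join(lines).encode()

# Second version: the same file with a single line edited.
lines[100] = "this line was edited"
version_2 = "\n".join(lines).encode()

separate = len(zlib.compress(version_1)) + len(zlib.compress(version_2))
together = len(zlib.compress(version_1 + version_2))
print(separate, together)  # `together` should be noticeably smaller
```

Stored as two independent object files, the versions pay the full price twice; stored side by side in one compressed stream, the second version is mostly "same as before".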
git garbage
In "normal" git operations, all objects should be saved forever. When you create a new commit, it points to its parent, which means that you can't throw away the parent. Creating new commits, including merge commits, is generally a purely-additive operation.
Sometimes, though, some operations are destructive. The main user-level command for that is git rebase, though git reset and git commit --amend can also be used to create new commits that ignore part of the existing history.
More simply, you can just decide to throw away a branch after having made a few commits that don't end up leading where you wanted. Or you can accidentally work in a detached HEAD state, and then switch away to some other commit, unintentionally creating unreachable commits (and potentially losing the associated work).
When that happens, you can end up with "orphan" git objects, defined as objects that cannot be reached from either HEAD or any of the user refs (branches and tags). You can ask git to scan its object files (and packs) for such orphan objects and remove them with the git gc command.
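Finding orphans is a graph reachability problem, which can be sketched as follows. The object store is a toy dict whose ids extend the C/T/B naming used earlier (C2, T3 and B4 are invented to play the abandoned work), and reachable is an illustrative function: the real git gc is more conservative than this.

```python
def reachable(objects, roots):
    """Walk from the root ids and return every object id that can be reached."""
    seen, stack = set(), list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(objects.get(oid, []))
    return seen


# Toy store: object id -> ids it references
# (a commit references its tree and parents, a tree references its entries).
objects = {
    "C1": ["T1"], "C0": ["T0", "C1"],
    "C2": ["T3", "C1"],  # abandoned commit: no ref points to it
    "T0": ["B0", "B1"], "T1": ["B0", "B2"], "T3": ["B4"],
    "B0": [], "B1": [], "B2": [], "B4": [],
}
refs = {"HEAD": "C0", "refs/heads/main": "C0"}

orphans = sorted(set(objects) - reachable(objects, refs.values()))
print(orphans)  # ['B4', 'C2', 'T3']
```

Note that C1 survives even though C2 also points to it: an object is kept as long as at least one path from a ref leads to it.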
Why git does not work with large binary files
By this point, you should be able to deduce why git repos would struggle with large binary files: as every change is stored as a separate file, binary files (which typically don't compress well even for apparently small changes) will quickly take up quite a bit of space in either object or pack form.
But it's more than just storage. Because git is designed to work with text and be quite fast, it assumes it's dealing with small files, and some of its algorithms reflect that. For example, the diff algorithm will (try to) load both blobs in memory to compare them.
Conclusion
You can use git every day for years without ever feeling like you need to know any of this. But a lot of people express some level of frustration with git and its various subcommands, and I hope that knowing this may help alleviate at least some of that frustration.
For myself, I have found that knowing this has made working with git both more pleasant and a lot safer, as I'm able to identify which commands might be dangerous and how to properly use them without losing work.
[1] This is a good enough approximation to understand what's going on, but if you actually try to sha1sum the object files in a git repo, you'll see that the results do not match. The reason is that git prepends a header indicating the type and size of the object before computing the hash, and the file on disk holds a compressed encoding of that header and content rather than the raw bytes. For this blog post we're going to ignore all that and discuss the conceptual values of objects.