What is functional programming?
Last week I wrote a concise, informal introduction to Turing machines and the associated programming model, imperative programming. In this post, I present a similarly informal introduction to lambda calculus and the associated "functional programming" model.
Lambda calculus
Lambda calculus is formally defined as performing reduction operations on lambda terms, where lambda terms are built from the following rules:
- There is an alphabet of symbols representing variables (usually single letters).
- An expression of the form \((\lambda x. M)\), where \(x\) is a variable and \(M\) is a term, is a function definition, also called an abstraction.
- An expression of the form \((M N)\), where both \(M\) and \(N\) are terms, is the application of the function \(M\) to the argument \(N\).
Here are a few examples of lambda terms:
- \(a\)
- \((\lambda x. x) a\)
- \((\lambda f. ((\lambda g. (g g))(\lambda g. (f (\lambda a. ((g g) a))))))\)
Note that our definition of lambda terms implicitly mandates full parenthesisation (that's probably not a word, but I'm sure you know what I mean). In practice, however, most parentheses are dropped, under the convention that function application associates to the left: \(M N P\) is read as \(((M N) P)\). Strictly speaking, \(((\lambda x. B[x])(\lambda y. C[y]))a\) and \((\lambda x. B[x])((\lambda y. C[y])a)\) are different terms, although, as we will see below, they can reduce to the same expression. Here I am using \(B[x]\) as a shorthand for "some lambda term in which \(x\) appears"; this is not part of lambda calculus syntax, just a shorthand for generalisation.
The above expressions would thus more commonly be written:
- \(a\)
- \((\lambda x. x) a\)
- \(\lambda f. (\lambda g. g g)(\lambda g. f (\lambda a. g g a))\)
I mentioned "evaluation rules"; you may be wondering what those are. I was referring to the reduction operations mentioned above, which are:
- \(\lambda x. M[x]\) can be rewritten to \(\lambda y. M[y]\). This is called \(\alpha\)-conversion. The notation \(M[x]\) here means the variable \(x\) appears as a free variable in the term \(M\). This rule exists mostly to help humans avoid name collisions; in an ideal world, every variable would be unique throughout the entire expression being evaluated.[1]
- \(((\lambda x. M[x]) E) \to (M[x := E])\), called \(\beta\)-reduction, consists of replacing all of the bound instances of \(x\) in \(M\) with the term \(E\).
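For instance, \((\lambda x. f\,x\,x)\,a \to f\,a\,a\): every occurrence of \(x\) bound by the abstraction is replaced by the argument \(a\).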
The notions of free and bound variables stem from the ambiguity of using letters for variable names. For example, in the lambda term \(\lambda x. (\lambda x. x)\), using only the rules we have discussed so far, we cannot resolve the innermost \(x\). This is a well-known issue in almost any programming language, and it is resolved by the introduction of scoping rules. Lambda calculus being arguably the first ever programming language, the notion of scoping rules was not immediately obvious[2], but the original intent was that it should work as lexical scope is commonly understood these days. That is, that term is equivalent to \(\lambda x_1.(\lambda x_2.x_2)\).
Informally, in \(\lambda x. \lambda y . xyz\), we say that \(x\) and \(y\) are bound, because they correspond to the argument of an enclosing abstraction, whereas \(z\) is free, because it is not bound by an enclosing abstraction. Similarly, in \(\lambda x. ((\lambda y. xyz)y)\), the innermost \(y\) is bound, but the last one is not, and it is important to realize that they are not the same \(y\).
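\(\alpha\)-conversion is what keeps such name collisions from corrupting a \(\beta\)-reduction. Consider \((\lambda x. \lambda y. x)\,y\): naively substituting \(y\) for \(x\) would yield \(\lambda y. y\), wrongly capturing the free \(y\). Renaming the bound \(y\) first gives \((\lambda x. \lambda w. x)\,y \to \lambda w. y\), which is the correct result.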
Let's walk through an example of applying those reductions. Consider the term:
\[(\lambda x. x)(\lambda x. x)z\]
It can be rewritten to just \(z\) through the following sequence of steps:
- We start with: \((\lambda x. x)(\lambda x.x) z\)
- \(\alpha\)-conversion to rename \(x\) to \(y\) in the second term, mostly for clarity: \((\lambda x. x)(\lambda y.y) z\)
- \(\beta\)-reduction on \(y\): \((\lambda x. x) z\)
- \(\beta\)-reduction on \(x\): \(z\)
Note, however, that this first sequence read the term as \((\lambda x. x)((\lambda x.x) z)\), associating to the right. Under the standard left-associative convention, the unparenthesised term instead means \(((\lambda x. x)(\lambda x.x)) z\), which reduces as follows:
- We start with: \(((\lambda x. x)(\lambda x.x)) z\)
- \(\alpha\)-conversion to rename \(x\) to \(y\) in the second term, mostly for clarity: \(((\lambda x. x)(\lambda y.y)) z\)
- \(\beta\)-reduction on \(x\): \((\lambda y. y) z\)
- \(\beta\)-reduction on \(y\): \(z\)
Note that how we parenthesise is not the only source of choice when reducing a lambda term. Even for a fully parenthesised term of the form \((M E)\), there is nothing in the definition of lambda calculus that prescribes whether one should first completely reduce \(E\) (or \(M\)) before doing the top-level \(\beta\)-reduction. Fortunately, the Church-Rosser theorem guarantees that this choice cannot lead to conflicting answers: if a term can be reduced to a form with no reduction left to apply (a normal form), that normal form is unique.
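For a concrete case where the order does matter, consider \(\Omega = (\lambda x. x\,x)(\lambda x. x\,x)\), a term that \(\beta\)-reduces to itself forever. In \((\lambda x. y)\,\Omega\), reducing the outer application first immediately yields \(y\), whereas insisting on fully reducing the argument \(\Omega\) first never terminates. Reduction strategies can thus differ in whether they reach the normal form at all, even though they can never reach different ones.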
Extending lambda calculus
It should be pretty obvious that the lambda calculus, as a model, is a lot simpler than the Turing machine. It is composed of fewer moving parts, and, more importantly, while it follows a different set of rules from normal arithmetic, it does look like "normal" computation. Compare the above evaluation with:
- We start with: \(1 + 2 * 3\)
- evaluating \(*\): \(1 + 6\)
- evaluating \(+\): \(7\)
Simplicity is nice, but only if it comes without a loss of power. I've already claimed that lambda calculus is exactly as powerful as Turing machines, but at this point it may be hard to see how that can be the case.
We can extend lambda calculus, through a bit of mathematical trickery, without compromising its intrinsic qualities and without making the model more complex: we just define additional shorthand notations that always have a clear mapping to the underlying "low-level" lambda calculus operations.
For example, one can define numbers by first mapping the numeric symbols to lambda terms:
- \(0\to\lambda f.\lambda x.x\)
- \(1\to\lambda f.\lambda x. f x\)
- \(2\to\lambda f.\lambda x. f (f x)\)
- \(3\to\lambda f.\lambda x. f (f (f x))\)
- \(\dots\)
and then mapping common arithmetic operations in such a way that they work as expected across those representations. For example, addition is defined as:
- \(+\to\lambda m.\lambda n.\lambda f.\lambda x.mf(nfx)\)
Coming up with that definition is definitely not trivial, and it took mathematicians a lot of effort to figure this out. But now that it's there we can just look it up and use it.
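To make this less abstract, here is a sketch of these definitions transcribed into Scheme (a language we will meet properly below); the names zero, one, two, plus, and church->int are my own, and every lambda takes exactly one argument so as to mirror the lambda terms:

    ;; Church numerals: the number n is a function that applies f to x, n times.
    (define zero (lambda (f) (lambda (x) x)))         ; 0 = λf.λx. x
    (define one  (lambda (f) (lambda (x) (f x))))     ; 1 = λf.λx. f x
    (define two  (lambda (f) (lambda (x) (f (f x))))) ; 2 = λf.λx. f (f x)

    ;; + = λm.λn.λf.λx. m f (n f x)
    (define plus
      (lambda (m)
        (lambda (n)
          (lambda (f)
            (lambda (x)
              ((m f) ((n f) x)))))))

    ;; Recover an ordinary integer by counting how many times f is applied.
    (define (church->int n) ((n (lambda (k) (+ k 1))) 0))

    (church->int ((plus two) one)) ; => 3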
By hand, we can compute \(2 + 1\) like this:
- First, write it with parentheses: \((+\, 2)\, 1\)
- replacing \(+\): \(((\lambda m.\lambda n.\lambda f.\lambda x.m f (n f x))\, 2)\,1\)
- \(\beta\)-reduction on \(m\): \((\lambda n. \lambda f. \lambda x. 2 f (n f x))\,1\)
- \(\beta\)-reduction on \(n\): \(\lambda f. \lambda x. 2 f (1 f x)\)
- At this point, if we were to replace \(2\) or \(1\) with their definitions, we might end up with confusing uses of \(f\) and \(x\). But we do not have to name them that: because of \(\alpha\)-conversion, we can pick any names. Replacing \(f\) and \(x\) in the definition of \(2\) with \(g\) and \(y\), we get: \(\lambda f. \lambda x. (\lambda g.\lambda y. g (g y)) f (1 f x)\)
- Similarly, replacing \(f\) and \(x\) in \(1\) with \(h\) and \(z\): \(\lambda f. \lambda x. (\lambda g. \lambda y. g (g y)) f ((\lambda h. \lambda z. h z) f x)\)
- \(\beta\)-reduction on \(h\): \(\lambda f. \lambda x. (\lambda g. \lambda y. g (g y)) f ((\lambda z. f z) x)\)
- \(\beta\)-reduction on \(z\): \(\lambda f. \lambda x. (\lambda g. \lambda y. g (g y)) f (f x)\)
- \(\beta\)-reduction on \(g\): \(\lambda f. \lambda x. (\lambda y. f (f y)) (f x)\)
- \(\beta\)-reduction on \(y\): \(\lambda f. \lambda x. f (f (f x))\), which is exactly our representation for \(3\).
You may think that the "replacing X" steps are cheating, as lambda calculus as defined does not have any mechanism for "global variables". We can work around that. Observe that, in the original expression above, \(+\), \(1\), and \(2\) are free variables. We can simply bind them by enclosing the original expression in an abstraction and applying it. Thus, the original expression \((+2)1\) can be seen as a shorthand notation for what would formally be:
\[(\lambda +. (\lambda 2. (\lambda 1. ((+ 2) 1))))(\lambda m.\lambda n.\lambda f.\lambda x.mf(nfx))(\lambda f.\lambda x. f f x)(\lambda f.\lambda x. f x)\]
or, less formally:
\[(\lambda +. \lambda 2.\lambda 1. + 2 1)(\textrm{def. of }+)(\textrm{def. of } 2)(\textrm{def. of }1)\]
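Incidentally, this wrapping trick is exactly how let works in Scheme, the language used below: a let binding is defined as shorthand for the application of a lambda. A one-line sketch:

    ;; (let ((n 2)) (* n n)) is, by definition, shorthand for:
    ((lambda (n) (* n n)) 2) ; => 4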
Programming languages
Just like we can recover numbers, we can recover booleans and an if-then-else construct. Additionally, it is possible to define anonymous recursion using the Y combinator. This means that, through simple rewrite rules, we can extend the notation for lambda calculus, without adding anything to the underlying model, to become a programming language with:
- Global, potentially recursive function definitions.
- Integers and arithmetic.
- Conditionals.
It should be pretty clear that integers are enough to encode any other data type we want, which means that this yields a usable programming language[3]. The Lisp family of languages is directly inspired by lambda calculus, and Scheme provides a nice ASCII notation for our extended lambda calculus. For example:
(define factorial
  (lambda (n)
    (if (= 1 n)
        1
        (* n (factorial (- n 1))))))
and, because this can be seen as just a thin syntactic veneer on top of lambda calculus, this evaluates in pretty much the same way:
- We start with: (factorial 4)
- replacing factorial with its definition: ((lambda (n) (if (= 1 n) 1 (* n (factorial (- n 1))))) 4)
- \(\beta\)-reduction: (if (= 1 4) 1 (* 4 (factorial (- 4 1))))
- evaluating (= 1 4): (if false 1 (* 4 (factorial (- 4 1))))
- applying if: (* 4 (factorial (- 4 1)))
- evaluating (- 4 1): (* 4 (factorial 3))
- replacing factorial with its definition: (* 4 ((lambda (n) (if (= 1 n) 1 (* n (factorial (- n 1))))) 3))
- \(\beta\)-reduction: (* 4 (if (= 1 3) 1 (* 3 (factorial (- 3 1)))))
- evaluating (= 1 3): (* 4 (if false 1 (* 3 (factorial (- 3 1)))))
- applying if: (* 4 (* 3 (factorial (- 3 1))))
- evaluating (- 3 1): (* 4 (* 3 (factorial 2)))
- replacing factorial with its definition: (* 4 (* 3 ((lambda (n) (if (= 1 n) 1 (* n (factorial (- n 1))))) 2)))
- \(\beta\)-reduction: (* 4 (* 3 (if (= 1 2) 1 (* 2 (factorial (- 2 1))))))
- evaluating (= 1 2): (* 4 (* 3 (if false 1 (* 2 (factorial (- 2 1))))))
- applying if: (* 4 (* 3 (* 2 (factorial (- 2 1)))))
- evaluating (- 2 1): (* 4 (* 3 (* 2 (factorial 1))))
- replacing factorial with its definition: (* 4 (* 3 (* 2 ((lambda (n) (if (= 1 n) 1 (* n (factorial (- n 1))))) 1))))
- \(\beta\)-reduction: (* 4 (* 3 (* 2 (if (= 1 1) 1 (* 1 (factorial (- 1 1)))))))
- evaluating (= 1 1): (* 4 (* 3 (* 2 (if true 1 (* 1 (factorial (- 1 1)))))))
- applying if: (* 4 (* 3 (* 2 1)))
- evaluating (* 2 1): (* 4 (* 3 2))
- evaluating (* 3 2): (* 4 6)
- evaluating (* 4 6): 24
Like lambda calculus, functional code is evaluated by applying these three reduction operations on an expression:
- Replacing a name by its definition.
- \(\beta\)-reduction of a lambda term (function application).
- Evaluation of higher-level abstractions (if-then-else, +, etc.).
There is no state to update, no need to keep track of "variable values": everything is embedded in the expression itself, at every step of evaluation.
In a very real way, an expression in a (purely) functional language is a lambda term. Functional programming is lambda calculus, just with a slightly enriched notation.
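To close the loop with the combinator from the beginning of this post, the anonymous recursion mentioned above can also be made concrete. Here is a sketch of \(\lambda f. (\lambda g. g g)(\lambda g. f (\lambda a. g g a))\) (an applicative-order variant of the Y combinator) transcribed into Scheme; the names Y and fact are my own:

    ;; Y = λf.(λg. g g)(λg. f (λa. g g a))
    (define Y
      (lambda (f)
        ((lambda (g) (g g))
         (lambda (g) (f (lambda (a) ((g g) a)))))))

    ;; factorial defined without ever referring to its own name:
    (define fact
      (Y (lambda (self)
           (lambda (n)
             (if (= 1 n) 1 (* n (self (- n 1))))))))

    (fact 4) ; => 24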
Why lambda calculus?
Last week we looked at Turing machines, which I argued were important because they can be built. In contrast, lambda calculus is important because it is composable, and that means it can be used to manage complexity, which is at the heart of everything we want to do in software engineering.
In this context, composable refers to the fact that a lambda term is built out of other lambda terms, yielding a tree where every subtree is itself a valid lambda term. One can reason about individual subtrees without considering the entirety of the program. One can also easily reuse subtrees if one is known to do a specific task, in the same way we added numbers and \(+\) to our language above. Once a given lambda term is known to be useful, it is trivial to use it in constructing other lambda terms.
In other words, reasoning about lambda calculus is equivalent to reasoning about a single lambda term at a time. If you think of a lambda term as a tree, you can take any subtree and still be able to reason about it, and its current state at this point in the computation, in isolation. This decomposability is where the power comes from.
In contrast, in order to reason about the current state of a Turing machine, you need to know the entire program (\(\delta\)), the entire current content of the tape, the current position of the head, and the current state of the machine. There is no obvious, general way to think about a subset of a running Turing machine. If you find a useful Turing machine, such as the 2-bit adder I showed in the previous post, it is completely unclear how to reuse it within another Turing machine.
Lambda calculus may seem like it starts a lot smaller and, in a way, poorer than a Turing machine (because it does), but it is fundamentally extensible through composition, and that makes it a lot more powerful[4] in practice.
Up next
In the next post in this series, I will take a look at the relationship between lambda calculus and Turing machines, and what that relationship means for everyday programming (yes, there is a point to all this).
[1] There is actually an alternative notation for lambda calculus, De Bruijn indices, that uses unambiguous indices instead of variable names. It's great for machines, but it gets a bit tricky for humans.
[2] In fact, the original Lisp implementation was an attempt at implementing lambda calculus, but used the wrong scoping rules, commonly known as "dynamic" scope.
[3] Well, except for the part where it's pure; it's a programming language that can be "pleasantly" and "productively" used to compute any function, but which cannot be used for any program.
[4] I have mentioned multiple times that Turing machines and lambda calculus are, from a theoretical perspective, exactly as powerful as each other, by which I mean that they can compute the exact same set of numbers, or alternatively that they can represent the exact same set of functions. In this context, when I write that lambda calculus is more powerful, I really mean "more pleasant to use for a human", i.e. that the vast majority of these functions are easier to represent in lambda calculus than using Turing machines, even though they can be represented with both.