Latest Posts

Monte Carlo Integration – Part 3

The last post explored the error analysis for Monte Carlo integration in detail.  Numerical experiments indicated that the error in the algorithm goes as $1/\sqrt{N}$, where $N$ is the number of Monte Carlo trials used to make the estimate.  Further analysis of the sampling statistics led to a fundamental formula for the estimated error of the estimate $\langle I \rangle$ of the integral,

\[ \sigma_{\langle I \rangle } = \frac{\sigma}{\sqrt{N}} \; , \]

where $\sigma$ is the standard deviation of the random process used to evaluate the integral.  In other words, we could report the estimated value of some integral as

\[ I = \int_a^{b} \, dx \, f(x) \approx \frac{1}{N} \sum_{i} \frac{f(x_i)}{p(x_i)} \pm \frac{\sigma}{\sqrt{N}}  \; , \]

where $p(x)$ is some known probability distribution from which the sample points $x_i$ are drawn.  We report this estimate with the understanding that the last term is an estimated $1\sigma$ error of the sampling mean, which is expected to ‘look’ Gaussian as a consequence of the central limit theorem.

While it is attractive to think that the appropriate strategy for improving the estimate of $I$ is simply to crank up the number of Monte Carlo trials until the error is as small as desired, doing so completely ignores the other ‘knob’ that can be turned in the error equation, namely the value of $\sigma$.

At first, it isn’t obvious how to turn this knob or why we should even bother figuring out how.  Motivation comes when we reflect on the fact that in the most important and interesting cases the integrand $f(x)$ may be extremely expensive to evaluate.  In those cases, we might look for a way to make every Monte Carlo trial count as much as it possibly can.  Doing so may dramatically cut down the run time needed to reach a desired accuracy.

Given that motivation, the next question is how.  An illustrative example helps.  Consider the integral of a constant integrand, say $f(x)=2$ over the range $0$ to $1$ sampled with a uniform probability distribution.  The estimate of the integral becomes

\[ I = \int_0^{1} \, dx 2 \approx \frac{1}{N} \sum_{i} 2 = \frac{2N}{N} = 2 \; , \]

regardless of the number of trials.  In addition, since each trial returns the same value, the sample standard deviation is $\sigma = 0$; the estimate is exact with no statistical uncertainty.

Our strategy for the general $f(x)$ is quite clear; we want to pick the probability measure such that the ratio $f(x)/p(x) = \textrm{constant}$.  This strategy is termed importance sampling because the samples are preferentially placed in regions where $f(x)$ is bigger. This isn’t always possible but let’s examine another case where we can actually do this. 

Let’s estimate the integral

\[ I = \int_{0}^{1} \, dx \left(1 - \frac{x}{2} \right)  = \frac{3}{4} \; .\]

The integrand is linear across the range, taking on the value $1$ when $x=0$ and $1/2$ when $x=1$.  We want our probability distribution to be linear over this range as well, but normalized.  A reasonable choice is simply to normalize the integrand directly, giving

\[ p(x) = \frac{4}{3}\left( 1 - \frac{x}{2} \right) \; .\]

The ratio $f(x)/p(x) = \frac{3}{4}$ is a constant, so our expectation is that the Monte Carlo integration should give the exact result.

The only detail missing is how to produce a set of numbers taken from that probability distribution.  To answer this question let’s follow the discussion in Chapter 8 of Koonin’s Computational Physics and rewrite the original integral in a suggestive way

\[ \int_0^{1} \, dx f(x) = \int_0^{1} \, dx \, p(x) \frac{f(x)}{p(x)} \; . \]

Now, define the cumulative distribution as

\[ y(x) = \int_0^x \, dx' \, p(x') \; \]

and make a change of variables from $x$ to $y$.  Using Leibniz’s rule,

\[ \frac{dy}{dx} = p(x) \; , \]

which gives $dx = dy/p(x)$.  Turning to the limits of the integral, we find $y(x=0) = 0$ and $y(x=1) = 1$ and the new form of the integral is

\[ \int_0^{1} \, dy \frac{f(y)}{p(y)} \; .\]

We can now tackle this integral by sampling $y$ uniformly over the interval $[0,1]$ as if the integral were originally written this way.   

Since $y$ is uniformly distributed, if an anti-derivative exists for $p(x)$ and we can invert the result to express $x=x(y)$, we will automatically get realizations of $x$ distributed according to $p(x)$.  For the case at hand, $p(x) = \frac{4}{3}\left(1 - \frac{x}{2}\right)$.  The expression for $y$ is then

\[ y = \frac{4}{3} \left(x - \frac{x^2}{4} \right) \; .\]

This quadratic equation can be solved for $x$ (keeping the root that lies in the interval $[0,1]$) to give

\[ x = 2 - (4 - 3y)^{1/2} \; .\]

The following figure shows the distribution of $x$ as a random variable, given 100,000 uniformly distributed random numbers representing $y$.
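That distribution is easy to reproduce with a few lines of numpy and matplotlib.  The snippet below is a minimal sketch (the variable names are my own): it draws uniform values of $y$, pushes them through the inverse transform $x = 2 - \sqrt{4-3y}$, and histograms the result, which should fall off linearly just like $p(x)$.

import numpy as np
import matplotlib.pyplot as plt

y = np.random.random(100_000)            # uniform samples of y on [0,1]
x = 2.0 - np.sqrt(4.0 - 3.0*y)           # inverse transform: x distributed according to p(x)

grid = np.linspace(0, 1, 100)
plt.hist(x, bins=50, density=True, label='sampled x')
plt.plot(grid, (4.0/3.0)*(1.0 - grid/2.0), label='p(x) = 4/3 (1 - x/2)')
plt.xlabel('x')
plt.legend()
plt.show()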

All of this machinery is easily implemented in python to provide a side-by-side comparison between sampling the original integral directly with a uniform distribution and using importance sampling with the linear distribution.

import numpy  as np
import pandas as pd

Ns = [10,20,50,100,200,500,1_000,2_000,5_000,10_000,100_000]

def f(x):      return 1.0 - x/2.0                  # the integrand
def p(x):      return (4.0 - 2.0*x)/3.0            # importance-sampling weight p(x) = 4/3 (1 - x/2)
def x_of_y(y): return 2.0 - np.sqrt(4.0 - 3.0*y)   # inverse transform: x distributed according to p(x)

rows = []
for N in Ns:
    y  = np.random.random((N,))   # uniform trials on [0,1]
    Fu = f(y)                     # direct sampling of f with the uniform distribution
    X  = x_of_y(y)                # trials distributed according to p(x)
    Fx = f(X)
    Px = p(X)
    rows.append({'N':N,
                 'I_U':np.mean(Fu),    'sig_I_U':np.std(Fu)/np.sqrt(N),
                 'I_L':np.mean(Fx/Px), 'sig_I_L':np.std(Fx/Px)/np.sqrt(N)})
df = pd.DataFrame(rows)
print(df)

The following table shows the results from this experiment (i.e. the contents of the pandas dataframe).

$N$       $I_U$      $\sigma_{I_U}$   $I_L$   $\sigma_{I_L}$
10        0.811536   0.052912         0.75    0.0
20        0.787659   0.033866         0.75    0.0
50        0.768951   0.020441         0.75    0.0
100       0.755852   0.014472         0.75    0.0
200       0.752533   0.010497         0.75    0.0
500       0.749750   0.006580         0.75    0.0
1000      0.749777   0.004602         0.75    0.0
2000      0.745935   0.003184         0.75    0.0
5000      0.748036   0.002035         0.75    0.0
10000     0.751155   0.001438         0.75    0.0
100000    0.749597   0.000457         0.75    0.0

It is clear that the estimate of the integral $I_U$ slowly converges to the exact value of $3/4$ but that by matching the importance sampling weight $p(x)$ such that $f(x)/p(x)$ is a constant, we get an exact value $I_L$ of the integral with no uncertainty.

Of course, this illustrative example is not meant to be considered a triumph, since we had to know the exact value of the integral to determine $p(x)$ in the first place.  But it does point toward applications where we either don’t know or can’t evaluate the integral exactly.

In the next post, we’ll explore some of these applications by looking at more complex integrals and discuss some of the specialized algorithms designed to exploit importance sampling.

Monte Carlo Integration – Part 2

The last post introduced the basic concept of Monte Carlo integration, which built on the idea of sampling the integrand in a structured (although random) way such that the average of the samples, appropriately weighted, estimates the value of a definite integral.  This method, while not as accurate for lower-dimensional integrals (for a given invested amount of computer time) as the traditional quadrature techniques (e.g., Simpson’s rule), really shines for problems in 4 or more dimensions.  (Note: The method also works well for multiple integrals in 2 and 3 dimensions when the boundary is particularly complex.)  This observation is based on the fact that the error in the Monte Carlo estimation goes as $1/\sqrt{N}$, $N$ being the number of samples, independent of the number of dimensions.

This post explores the error analysis of the Monte Carlo method and demonstrates the origin of the $1/\sqrt{N}$ dependence.

The first step is simply to look at how the error in the estimation of the integral varies as a function of the number of samples (hereafter called the number of Monte Carlo trials).  The example chosen is the polynomial

\[ f_p(x) = 3.4 x^3 -5.9 x^2 + 0.3 x -15.2 \; \]

used in the previous post.  The limits of integration are $x_{min} = -2$ and $x_{max} = 3.5$ and the exact value of this integral is $I_{p} = -68.46354166666669$.  The following table shows the error between the exact value and the value estimated using the Monte Carlo sampling method for numbers of trials ranging from 100 to 100,000,000.  All the data were obtained using a Jupyter notebook running python and the numpy/scipy ecosystem.

Num Trials    $I_p$         Error ($\epsilon_p$)
100           -96.869254    28.405712
1,000         -67.729511    0.734030
5,000         -69.960783    1.497241
10,000        -66.881870    1.581672
50,000        -67.721144    0.742398
100,000       -68.327480    0.136062
500,000       -68.747538    0.283996
1,000,000     -68.494555    0.031013
10,000,000    -68.548658    0.085116
100,000,000   -68.468060    0.004519

Even though the error generally trends downward as $N$ increases, there are fluctuations associated with the random nature of the sampling process.  The figure below shows a log-log scatter plot of these data along with a best fit line to the points.

The value of the slope returned by the linear regression was -0.529, very close to the expected value of $-1/2$; the $r^2$ value of the fit was 0.85, indicating that the linear fit explains the trend in the points quite well, although it is important to realize that subsequent runs of this numerical experiment will produce different results.
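For readers who want to repeat the experiment, a minimal sketch along these lines is shown below (the exact value, integrand, and limits are those quoted above; the use of scipy.stats.linregress for the fit is my own choice).  Because the sampling is random, the slope and $r^2$ will change from run to run.

import numpy as np
from scipy.stats import linregress

def f(x): return 3.4*x**3 - 5.9*x**2 + 0.3*x - 15.2

x_min, x_max = -2.0, 3.5
I_exact      = -68.46354166666669

Ns, errs = [], []
for N in [100, 1_000, 5_000, 10_000, 50_000, 100_000, 500_000, 1_000_000]:
    x    = x_min + (x_max - x_min)*np.random.random(N)   # uniform samples over the range
    I_mc = (x_max - x_min)*np.mean(f(x))                 # Monte Carlo estimate of the integral
    Ns.append(N)
    errs.append(abs(I_mc - I_exact))

fit = linregress(np.log10(Ns), np.log10(errs))           # slope should come out near -1/2
print(fit.slope, fit.rvalue**2)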

While this result is compelling, it falls short in two ways.  First, due to the random nature of the sampling, the trend can never be nailed down to be exactly an inverse square root.  Second, we had an exact value to compare against and so could quantitatively gauge the error.  In practice, we want to use the method to estimate integrals whose exact value is unknown, and so we need an intrinsic way of estimating the error.

To address these concerns, we expand on the discussion in Chapter 10 of An Introduction to Computer Simulation Methods, Applications to Physical Systems, Part 2 by Gould and Tobochnik.  

In their discussion, the authors considered estimating

\[ I_{circ} = \int_0^1 dx \, 4 \sqrt{1-x^2}  = \pi \;  \]

multiple times using a fixed number of trials for each estimation.  Comparing the spread in the values of the estimations then gives a measure of the accuracy of the estimation.  To be concrete, consider 10 estimations performed using 100,000 trials each.  The following table summarizes these cases, with the first column being the average and the second the error against the exact value of $\pi$ (a comparison which, in general, we will not be able to make).  The third column is the standard deviation of the sample used in the estimation, a measure of its spread, introduced with the hope that it can fill the role of an intrinsic estimate of the error.

$I_{circ}$   Error ($\epsilon$)   Standard Deviation
3.143074     0.001481             0.892373
3.146796     0.005203             0.888127
3.142626     0.001034             0.895669
3.139204     0.002389             0.893454
3.145947     0.004355             0.891186
3.142209     0.000616             0.892600
3.137025     0.004568             0.896624
3.136014     0.005578             0.895747
3.140397     0.001196             0.892305
3.140693     0.000900             0.892604

We can perform some statistical reduction of these data that will serve to direct us further.  First of all, the mean estimate (mean of the means) is $\langle I_{circ} \rangle = 3.1413985717765454$ with a standard deviation (sigma of the means) of $\sigma_{\langle I_{circ} \rangle} = 0.003304946237283195$.  The average of the standard deviations (mean of the sigmas) is $\langle \sigma \rangle = 0.8930688485193065$.  Finally, the average error is $\langle \epsilon \rangle = 0.0027319130714524853$.  Clearly, $\sigma_{\langle I_{circ}\rangle} \approx \langle \epsilon \rangle$, but both are more than two orders of magnitude smaller than $\langle \sigma \rangle$.  A closer look reveals that the ratio $\langle \sigma \rangle / \sigma_{\langle I_{circ}\rangle}$ is approximately $300 \approx \sqrt{100{,}000}$.  This isn’t a coincidence.  Rather it reflects the central limit theorem.
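The reduction described above can be scripted in a few lines; the sketch below (the array shape and variable names are my own choices) performs the $K=10$ estimations and prints the relevant statistics, with the final ratio expected to land near $\sqrt{N}$.

import numpy as np

K, N   = 10, 100_000
x      = np.random.random((K, N))       # K independent samples of N uniform trials
g      = 4.0*np.sqrt(1.0 - x**2)        # the integrand evaluated at every trial
means  = g.mean(axis=1)                 # K independent estimates of pi
sigmas = g.std(axis=1)                  # standard deviation within each sample

print(means.mean())                     # mean of the means   ~ pi
print(means.std())                      # sigma of the means  ~ sigma/sqrt(N)
print(sigmas.mean())                    # mean of the sigmas  ~ sigma
print(sigmas.mean()/means.std())        # ratio               ~ sqrt(N)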

The central limit theorem basically says that if we pull $K$ independent samples, each of size $N$, from an underlying population, then regardless of the structure of the population distribution (subject to some minor restrictions) the distribution of the $K$ computed averages will tend toward a Gaussian.

This point is more easily understood with specific examples.  Below is a plot of the distribution of values of $4 \sqrt{1 -x^2}$ for a single Monte Carlo estimation with $N=100,000$ trials.  The histogram shows a distribution that strongly looks like an exponential distribution.

However, when this estimation is repeated so that $K=1,000$ and the distribution of those thousand means are presented in a histogram, the distribution looks strongly like a normal (aka, Gaussian) distribution.
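Both histograms are easy to regenerate; the following sketch (parameters as quoted above) plots the distribution of the trial values from a single estimation next to the distribution of $K=1{,}000$ sample means, the latter of which should look convincingly Gaussian.

import numpy as np
import matplotlib.pyplot as plt

N, K   = 100_000, 1_000
single = 4.0*np.sqrt(1.0 - np.random.random(N)**2)    # trial values from one estimation
means  = np.array([4.0*np.sqrt(1.0 - np.random.random(N)**2).mean()
                   for _ in range(K)])                # K independent sample means

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(single, bins=50)    # skewed, decidedly non-Gaussian
ax2.hist(means, bins=50)     # approximately Gaussian, per the central limit theorem
plt.show()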

The power behind this observation is that we can estimate the error in the original estimation in terms of the standard deviation of the sample itself.  In other words,

\[ \sigma_{\langle I_{circ} \rangle} =  \frac{<\sigma>}{\sqrt{N}} \; .\]

The proof of the generalized relation (Appendix 10B of Gould & Tobochnik) goes as follows.

Start with the $K$ independent samples each of size $N$ and form the $K$ independent means as

\[ M_k = \frac{1}{N} \sum_{n=1}^{N} x^{(k)}_n \; ,\]

where $x^{(k)}_n$ is the value of the $n^{th}$ trial of the $k^{th}$ sample.

The mean of the means (i.e. the mean over all $K \cdot N$ trials) is then given by

\[ {\bar M} = \frac{1}{K} \sum_{k=1}^{K} M_k = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{N} \sum_{n=1}^{N} x^{(k)}_n \; .\]

Next define two measures of the difference between the individual measures and the mean of the means: 1) the $k^{th}$ residual, which measures the deviation of an $M_k$ from ${\bar M}$ and is given by

\[ e_k = M_k-{\bar M} \; \]

and 2) the deviation, which is the difference between a single trial and ${\bar M}$, is given by

\[  d_{k,n} = x^{(k)}_n - {\bar M} \; . \]

The residuals and deviations are related to each other as

\[ e_k = \frac{1}{N}\sum_{n=1}^{N} x^{(k)}_n - {\bar M} = \frac{1}{N} \sum_{n=1}^{N} \left( x^{(k)}_n - {\bar M} \right) =  \frac{1}{N} \sum_{n=1}^{N} d_{k,n} \; .\]

Averaging the squares of the residuals gives the variance of the ensemble of means

\[ {\sigma_M}^2 = \frac{1}{K} \sum_{k=1}^{K} {e_k}^2 \;. \]

Substituting in the relation between the residuals and the deviations gives

\[ {\sigma_M}^2 = \frac{1}{K} \sum_{k=1}^{K} \left( \frac{1}{N} \sum_{n=1}^{N} d_{k,n} \right) ^2 \; .\]

Expanding the square (making sure to have independent dummy variables for each of the inner sums) yields

\[ {\sigma_M}^2 = \frac{1}{K} \sum_{k=1}^{K} \left( \frac{1}{N} \sum_{m=1}^{N} d_{k,m} \right) \left( \frac{1}{N} \sum_{n=1}^{N} d_{k,n} \right) \; .\]

Since the individual deviations are independent of each other, there should be as many positive as negative terms, and averaging over the terms where $m \neq n$ should result in zero.  The only surviving terms are when $m = n$ giving

\[ {\sigma_M}^2 = \frac{1}{K} \frac{1}{N^2} \sum_{k=1}^{K} \sum_{n=1}^{N} d_{k,n}^2 = \frac{1}{N} \left[ \frac{1}{KN} \sum_{k=1}^{K} \sum_{n=1}^N (x^{(k)}_n - {\bar M} )^2 \right] \; . \]

The quantity in the brackets is just the variance $\sigma^2$ of the individual trials (regardless of which sample they belong to), and so we get the general result

\[ {\sigma_M} = \frac{\sigma}{\sqrt{N}} \; . \]

Next post will examine the implications of this result and will discuss importance sampling as a mechanism to drive the estimated error to a minimum.

Monte Carlo Integration – Part 1

It is a peculiar fact that randomness, when married with a set of deterministic rules, often leads to richness that is absent from the deterministic rules by themselves.  (Of course, there is no richness in randomness by itself, just noise.)  An everyday example can be found in the richness of poker, which exceeds that of the mechanistic (albeit complex) rules of chess.  This behavior carries over into the realm of numerical analysis, where Monte Carlo integration presents a versatile method of computing the value of definite integrals.  Given the fact that most integrands don’t have a closed-form anti-derivative, this technique can be very powerful indeed.

To understand how it works, consider a function $g(x)$, which is required to be computable over the range $(a,b)$, but nothing else.  We will now decompose this function in a non-obvious way

\[ g(x) = \frac{f(x)}{p(x)} \; , \]

where $p(x)$ is a normalized function on the same interval

\[ \int_a^b dx p(x) = 1 \; . \]

We draw $N$ realizations from $p(x)$,

\[ x_1, x_2, \ldots, x_N \; \]

and we evaluate $g(x)$ at each value

\[ g_1, g_2, \ldots, g_N \; , \]

where $g_i \equiv g(x_i)$.

On one hand, the expected value is

\[ E[g(x)] = \frac{1}{N} \sum_i g_i \equiv {\bar g} .\]

This is the empirical value of the sampling of $g(x)$ according to $p(x)$. 

The theoretical value is

\[ E[g(x)] = \int_a^b dx p(x) g(x) = \int_a^b dx p(x) \frac{f(x)}{p(x)} = \int_a^b dx f(x) \; . \]

So, we can evaluate the integral

\[ I = \int_a^b dx f(x) \; \]

by sampling and averaging $g(x)$ with a sampling realization of $p(x)$.

The simplest probability distribution to consider is uniform on the range

\[ p(x) = \frac{1}{b -a} \; \]

and with $g(x) = (b-a) f(x)$ we can evaluate

\[ I = \int_a^b dx f(x) = \frac{1}{N} \sum_i g_i = \frac{b-a}{N} \sum_i f_i \; , \]

where $f_i \equiv f(x_i) $.

Before pressing on with the theory, let’s take a look at some specific examples.

For the first example, let’s look at the polynomial

\[ f_p(x) = 3.4 x^3 - 5.9 x^2 + 0.3 x - 15.2 \; . \]

This function has the antiderivative

\[ F_p(x) = \frac{3.4}{4} x^4 - \frac{5.9}{3} x^3 + \frac{0.3}{2} x^2 - 15.2 x \; . \]

The integral of $f_p$ from $x_{min} = -2$ to $x_{max} = 3.5$ has the exact value

\[ \int_{x_{min}}^{x_{max}} dx f_p = F_p(x_{max}) - F_p(x_{min}) = -68.46354166666669 \; . \]

A simple script in python using numpy, scipy, and a Jupyter notebook was implemented to explore numerical techniques to perform the integral.  The scipy.integrate.quad function, which uses traditional quadrature based on the Fortran QUADPACK routines, returned the value of -68.46354166666667, which is exactly (to within machine precision) the analytic result.  The Monte Carlo integration, which is implemented with the following lines

import numpy as np

def f(x): return 3.4*x**3 - 5.9*x**2 + 0.3*x - 15.2   # the polynomial f_p(x)
x_min, x_max = -2.0, 3.5
N = 100_000
rand_x = x_min + (x_max - x_min)*np.random.random(size=N)   # uniform samples over [x_min, x_max]
rand_y = f(rand_x)
I_poly_est  = (x_max - x_min)*np.average(rand_y)

returned a value of -68.57203483190794 using 100,000 samples (generally called a trial – a terminology that will be used hereafter) with a relative error of 0.00158.  This is not bad agreement but clearly orders of magnitude worse than the deterministic algorithm.  We’ll return to this point later.

One more example is worth discussing.  This time the function is a nasty, nested function given by

\[ f_n(x) = 4 \sin \left( 2 \pi e^{\cos x^2} \right) \; , \]

which looks like the following over the range $[-2,3.5]$.

Even though this function doesn’t have a closed-form anti-derivative, its integral clearly exists since the function is continuous with no divergences.  The deterministic algorithm found in scipy.integrate.quad generates the value of -2.824748541066965, which we will take as truth.  The same Monte Carlo scheme as above yields a value of -2.8020549889975164, a relative error of 0.008034.
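A sketch of that comparison, with the quadrature call included for reference, might look like the following (the function and limits are as above; the exact numbers will vary from run to run).

import numpy as np
from scipy.integrate import quad

def f_n(x): return 4.0*np.sin(2.0*np.pi*np.exp(np.cos(x**2)))

x_min, x_max = -2.0, 3.5
I_quad, quad_err = quad(f_n, x_min, x_max, limit=200)   # deterministic reference value

N      = 100_000
rand_x = x_min + (x_max - x_min)*np.random.random(N)
I_mc   = (x_max - x_min)*np.mean(f_n(rand_x))           # Monte Carlo estimate

print(I_quad, I_mc, abs((I_mc - I_quad)/I_quad))        # relative error of order 1e-2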

So, we can see that the method is versatile and reasonably accurate.  However, there are two questions that one might immediately pose: 1) why use it if there are more accurate methods available and 2) can the estimates be made more accurate?  The remainder of this post will deal with answering question 1 while the subsequent post will answer the second question.

To answer question 1, we will cite, without proof, the answer to question 2, which states that the error $\epsilon$ in the estimate of the integral goes to zero with the number of samples according to

\[ \epsilon \sim \frac{1}{\sqrt{N}} \; \]

independent of the number of dimensions. 

To determine the error for a conventional quadrature method (e.g. trapezoidal or Simpson’s), we follow the argument from Steve Koonin’s Computational Physics, p. 201.  Assume that we will evaluate the integrand $N$ times (corresponding to the $N$ times it will be evaluated in the Monte Carlo scheme).  The $d$-dimensional region of integration has some volume $V$.  Distributing the $N$ points over this volume means that the grid spacing goes as $h \sim N^{-1/d}$.

Koonin states that an analysis similar to those used for the usual one-dimensional formulae gives the error in each cell as going like $O\left(h^{d+2}\right)$.  The total error of the quadrature over the $N$ cells thus goes as

\[ N O\left(h^{d+2}\right) = O\left(N^{-2/d}\right) \; .\]

When $d \geq 4$, the error in estimating the integral via Monte Carlo becomes no worse than that for standard quadrature techniques for a given computing time invested, and for $d > 4$ it is strictly smaller.  This reason alone justifies why the Monte Carlo technique is so highly favored.  In particular, Monte Carlo proves invaluable in propagating variations through dynamical systems whose dimensionality precludes using a gridded method like traditional quadrature.

The next post will derive why the Monte Carlo error goes as $1/\sqrt{N}$.

Paradigms and Constraints

To start, this month’s column brings the experimental exploration of category theory from Lawvere’s and Schanuel’s book to an end.  Why?  Well, up to this point, the invested effort seems to be far larger than the gained benefit.  Category theory promises deep insight into more complex ideas (otherwise why would anyone study it?), but the amount of effort required to get to that place in the text by Lawvere and Schanuel seems overwhelming.  There is no middle ground in their work.  Either it is painfully pedantic or deeply complex.

So, as a change of pace, this column will be returning to simpler, one-off topics.  For this month, I thought it would be interesting to share a thought about the overlap between philosophy and computer programming that came about from an interesting video.  In this video, Dave Farley of Constant Development discusses the history of computing paradigms, with particular attention to the tribal standoff between the object-oriented crowd and the functional programming adherents over which paradigm is best.

It is both comforting and amusing that Farley declares both positions to be ‘rubbish’ (with a delightful accent, no less).  I heartily agree with his basic point that one should pick the right paradigm for the right job.  Along those lines, the most insightful thing he had to say is an observation he attributed to “Uncle” Bob Martin.  Martin, a computing pundit of some note, says that the purpose of a programming paradigm is to constrain us, to take away a degree of freedom that we had otherwise been allowed.

After reflecting on this point on more than one occasion, I realized, even though Farley didn’t say this, that the idea of constraining the world as part of a paradigm is both an ontological and epistemological metaphysical process whose scope is far larger than just computer programming.  A paradigmatic way of thinking constrains the way we perceive reality by constraining how we divvy up the world around us into pieces we understand.

One of the earliest texts covering the idea of chopping up the world into useful chunks is found in the philosophical work of Aristotle entitled the Categories.  In this work, Aristotle lists 10 separate categories that cover all the possible things that can be the subject of a proposition.  The work centers around language and how parts of speech reflect the ideas we hold, and this ten-fold division was meant to encompass everything a person can think (or perhaps just talk) about.  While the work is useful, because it deals with common observations and modes of thought we all share, it is dense and sometimes difficult.  Mortimer Adler, in his book Aristotle for Everybody, calls it uncommon common sense.  Fortunately, the key point for this discussion is that Aristotle offers a paradigm (perhaps hard to grasp) that constrains how we think about (or describe) the world.

A more prosaic example comes from our everyday experience of communicating with our fellow beings.  For example, many of us would consider the word ‘automobile’ to be straightforward to define and we expect that the dictionary entry would be non-controversial.  However, exactly how each of us thinks about ‘automobile’ depends on the paradigm being used. 

For a thermal engineer, whose livelihood centers on the study of work and heat, an automobile may conjure the concept of a gasoline or diesel engine, a drive chain, and an exhaust.  He has constrained the world to one dealing with how energy is stored, how much of it can be stored in a given volume, and how it moves from place to place.  A systems engineer, whose livelihood derives from understanding the interplay of various technologies, would mostly see a car as a set of subsystems designed to interact appropriately.  Does the engine provide the necessary power to match the steering and suspension?  Do the fuel tank and mileage match the intended purpose?  And so on.  Our systems engineer has constrained his view of the automobile in such a way that it ignores the details of thermodynamics and deals only with the abstract ways in which such subsystems interface with each other.  To an urban planner, the automobile represents an atomic entity whose characteristics and behaviors are constrained to be at the driving level.  Doing so allows the urban planner to deal with matters of traffic congestion, proper placement of tolls, and the like without concerning himself with the way in which the automobile moves or how well it moves.

Now let’s return to the question of computer programming paradigms.  Certainly, a programmer must know, when modeling an automobile, at what level his solution must operate.  Is he programming a finite-element thermal model, an abstract entity-relationship model, or a stochastic Monte Carlo simulation to match the fictitious engineers of the previous discussion?  However, the interesting aspect of a seasoned programmer is that when thinking about programming paradigms in the abstract he is actually engaging in metaphysics.  He is not asking how to describe how a thing can be an automobile but how a thing can be.  The comparison of various paradigms naturally leads to the study of being itself.  The computer programmer has to deal not only with the problem of deciding into what category the thing he is interested in falls but also with the meta-problem of how to categorize the categorizations: how to pick between the various ways of abstracting and how to apply them to the ontology that is the world.

Farley raises another point that is worth pondering.  He believes that human beings are intrinsically classifying animals, even if he doesn’t phrase it quite that way.  For that reason, he believes that the object-oriented paradigm is more naturally matched to human thought.  However, his definition of object-oriented programming as the marriage of data and behaviors, with polymorphism as the constraining agent (each object owns its specific response to a common message like fall), does lead to some difficult ontological questions when modeling the physical world.  Does a rock know how to behave under the influence of gravity because it knows how to interpret the message sent from the Earth, does the Earth know how to make the rock fall because of a message that the rock sends, or is the message loop (i.e. the gravitational interaction) the key point?  Even though an exploration of this question is beyond the scope of this post, the ability to pose it shows the power of thinking about paradigms as constraints.

It is for this reason that I think programming will become an increasingly important component of our day-to-day lives; not just because of the practical applications involving our smartphones and tablets, our Xboxes and PlayStations, and our computers working to solve this problem or that, but because it will trigger in us a way to look at the world anew.  If only Aristotle could have lived to participate in these debates.

Category Theory 6 – Coding Up Mappings

This month’s article is a bit of a departure from the previous installments on category theory in that it deals with the computational side of things rather than the theoretical side of things.  But, as the recent columns have shown, it is typically the case that working with specific examples facilitates a basic understanding of the general.

As a case in point, consider the set of all possible mappings between the sets $A$, $B$, and $C$, an example of which is shown below

where $A$ has three elements $\left\{ a_1, a_2, a_3\right\}$, $B$ has two elements $\left\{ b_1, b_2 \right\}$, and $C$ has three elements $\left\{ c_1, c_2, c_3\right\}$.  Although we’ve focused on automorphisms wherein we identified $A$ and $C$ as the same set, the general case will be what we tackle here. 

Even though these sets are small, the combinatorics associated with all the possible mappings quickly outstrips paper and pencil.  There are 8 distinct mappings $f:A \rightarrow B$ and 9 distinct mappings $g:B \rightarrow C$, leaving 72 mappings in total for the composition.  Obviously this situation calls for some basic computing.

Despite the simplicity of the discussion to this point in Lawvere and Schanuel, one thing they didn’t do is adopt terminology well-suited to writing computer code.  The word ‘set’, of course, is perfect for describing the collection, so that’s okay, but what to call the contents of each set?  The typical mathematical terminology would be members or elements, but the former has an object-oriented connotation that would tend to obscure things while the latter is simply too generic.  In addition, both of those are rather long to type and really don’t speak to what the diagram looks like.  For the purposes here, things like $a_3$ or $b_1$ will be called dots since they are represented as such in the diagrams.  At this point, their terminology fails entirely: the thing that connects two dots doesn’t even have a name in their bestiary.  Since it is drawn as an arrow in the diagrams, that is the name that will be used in the code.  The end of the arrow without the point will be called the outgoing side; the end with the point is the incoming side.

Using this terminology, we then say that:

  • each set is made up of dots,
  • each dot in the domain supports exactly one outgoing arrow,
  • each dot in the codomain can receive any number of incoming arrows, including zero, and
  • a mapping is a collection of arrows.

The python programming language is an excellent choice and everything that follows is done in version 3.x within the Jupyter notebook framework.  The sets discussed above are represented as python lists:

  •  Aset = ['a1','a2','a3']
  •  Bset = ['b1','b2']
  •  Cset = ['c1','c2','c3']

The first function

def generate_arrows(set1,set2):
    arrows = []
    for dot1 in set1:
        for dot2 in set2:
            arrows.append(dot1+'->'+dot2)
    return arrows

returns a list of all possible ways to draw arrows from set1 to set2.  Applying this function to sets $A$ and $B$ gives:

generate_arrows(Aset,Bset)
['a1->b1', 'a1->b2', 'a2->b1', 'a2->b2', 'a3->b1', 'a3->b2']

This helper function is used by the larger function

def generate_mappings(set1,set2):
    import itertools as it

    # a valid mapping uses exactly len(set1) arrows -- one leaving each dot of set1
    arrows      = generate_arrows(set1,set2)
    mapping_gen = it.combinations(arrows,len(set1))
    mappings    = []
    for candidate in mapping_gen:
        # count how many of the candidate's arrows leave each dot of set1;
        # with exactly len(set1) arrows in the candidate, every count must be
        # one, so it suffices to reject candidates that leave a dot uncovered
        count = []
        for i,dot1 in zip(range(len(set1)),set1):
            count.append(0)
            for arrow in candidate:
                if dot1 in arrow:
                    count[i] += 1
        if min(count) != 0:
            mappings.append(candidate)
    return mappings

The function generate_mappings depends on the itertools module in python to determine all the possible combinations to make a mapping.  When applied to $A$ and $B$, the output of

A_to_B = generate_mappings(Aset,Bset) 

[('a1->b1', 'a2->b1', 'a3->b1'),
 ('a1->b1', 'a2->b1', 'a3->b2'),
 ('a1->b1', 'a2->b2', 'a3->b1'),
 ('a1->b1', 'a2->b2', 'a3->b2'),
 ('a1->b2', 'a2->b1', 'a3->b1'),
 ('a1->b2', 'a2->b1', 'a3->b2'),
 ('a1->b2', 'a2->b2', 'a3->b1'),
 ('a1->b2', 'a2->b2', 'a3->b2')]

is a list 8 elements in length, as expected.  Each line is one possible mapping between the two sets.  Likewise, the total listing of all mappings from $B$ to $C$ (B_to_C = generate_mappings(Bset,Cset))

[('b1->c1', 'b2->c1'),
 ('b1->c1', 'b2->c2'),
 ('b1->c1', 'b2->c3'),
 ('b1->c2', 'b2->c1'),
 ('b1->c2', 'b2->c2'),
 ('b1->c2', 'b2->c3'),
 ('b1->c3', 'b2->c1'),
 ('b1->c3', 'b2->c2'),
 ('b1->c3', 'b2->c3')]

has 9 possibilities, also as expected.

In order to compose functions, we need a way of combining expressions like ‘a1->b1’ and ‘b1->c3’ to get ‘a1->c3’.  This is provided by

def compose_arrows(arrow1,arrow2):
    head1,targ1 = arrow1.split('->')
    head2,targ2 = arrow2.split('->')
    if targ1 == head2:
        return head1+'->'+targ2
    else:
        return False

Note that if the middle dots do not match, the function returns False.

The functionality is rounded out with the

def compose_mappings(mappings1,mappings2):
    mappings = []
    for f in mappings1:
        for g in mappings2:
            # try to chain every arrow of f with every arrow of g and
            # keep only the compositions that actually connect
            compositions = []
            for f_arrow in f:
                for g_arrow in g:
                     compositions.append(compose_arrows(f_arrow,g_arrow))
            while(False in compositions):
                compositions.remove(False)
            mappings.append(tuple(compositions))
    return mappings

which composes two lists of mappings.  Plugging in the previous two lists of mappings (mappings = compose_mappings(A_to_B,B_to_C)) gives a list with 72 elements.  Of course, given the sizes of $A$ and $C$ there can be, at most, only 27 unique ones.  The repetitions are trivially removed by casting the list to a set and then casting it back to a list.  Doing this gives the following list of unique mappings:

[('a1->c1', 'a2->c2', 'a3->c1'),
 ('a1->c2', 'a2->c3', 'a3->c2'),
 ('a1->c3', 'a2->c2', 'a3->c3'),
 ('a1->c2', 'a2->c1', 'a3->c2'),
 ('a1->c1', 'a2->c3', 'a3->c1'),
 ('a1->c3', 'a2->c3', 'a3->c1'),
 ('a1->c2', 'a2->c2', 'a3->c3'),
 ('a1->c1', 'a2->c1', 'a3->c1'),
 ('a1->c3', 'a2->c1', 'a3->c1'),
 ('a1->c3', 'a2->c2', 'a3->c2'),
 ('a1->c3', 'a2->c1', 'a3->c3'),
 ('a1->c1', 'a2->c1', 'a3->c3'),
 ('a1->c3', 'a2->c3', 'a3->c3'),
 ('a1->c1', 'a2->c3', 'a3->c3'),
 ('a1->c2', 'a2->c1', 'a3->c1'),
 ('a1->c1', 'a2->c1', 'a3->c2'),
 ('a1->c2', 'a2->c2', 'a3->c1'),
 ('a1->c3', 'a2->c3', 'a3->c2'),
 ('a1->c2', 'a2->c3', 'a3->c3'),
 ('a1->c2', 'a2->c2', 'a3->c2'),
 ('a1->c1', 'a2->c2', 'a3->c2')]

There are 21 in all, 6 less than the 27 possible mappings between $A$ and $C$.  The missing ones correspond to the one-to-one mappings between $A$ and $C$ (i.e., $a_1 \rightarrow c_1$, $a_2 \rightarrow c_2$, $a_3 \rightarrow c_3$ and permutations).  Of course, these results drive home the point that, since the size of $B$ is smaller than $A$ and $C$, there is simply no way to support a one-to-one mapping.

One can experiment with replacing $B$ with a 4-element set $D$.  Doing so gives the expected 27 unique mappings from $A$ to $C$ (at the expense of calculating 5184 possible combinations of going from $A$ to $C$ with a layover at $D$).  Obviously nobody would want to do this by hand.
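As a sketch of that experiment using the functions defined above (the list name Dset and the 4 labels are mine), the counts can be checked directly:

Dset   = ['d1','d2','d3','d4']
A_to_D = generate_mappings(Aset,Dset)                 # 4**3 = 64 mappings
D_to_C = generate_mappings(Dset,Cset)                 # 3**4 = 81 mappings
composites = compose_mappings(A_to_D,D_to_C)
print(len(A_to_D), len(D_to_C), len(composites))      # 64 81 5184
print(len(set(composites)))                           # 27 unique mappings from A to C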

So there you have it:  while not theoretically glamorous, functionality like this really allows for an exploration that couldn’t be done by hand.

Category Theory 5 – Retractions, Sections, and a New Definition for Inverse

With a clearer picture of determination and choice established in the last post, we now turn to refining the picture of the ‘division problem’ within the category of sets.  The basic mechanic in both of these problems is trying to find the missing side of the triangle where the base is the direct mapping $h: A \rightarrow C$ and the two sides are the mappings $f:A \rightarrow B$ and $g:B \rightarrow C$ composed together, with the set $B$ acting as a waypoint.  The only difference between the two problems is which of the two sides is considered given and which is being solved for.  In other words, the idea is to ‘divide’ both sides by $f$ to solve $g? \circ f = h$ in the determination case, or to ‘divide’ both sides by $g$ to solve $g \circ f? = h$ in the choice case, knowing full well that in certain circumstances an answer simply does not exist.

Lawvere and Schanuel, without explicitly saying why, pay particular attention to a smaller subset of the determination and choice problems in which the base of the triangle is the identity automorphism $I_A: A \rightarrow A$.  In this special case, the solutions to the determination and choice problems take the special names retraction and section, respectively.

The two mathematical relationships that summarize these special cases are $r \circ f = I$ and $f \circ s = I$, where $f$ is the given mapping in whichever of the two problems is being considered (although, to be clear, $f$ will sometimes be called $g$ when it is the second leg of the composition and the context makes that clearer).

Of course, no matter which leg of the waypoint is specified as $f$, once a solution is found, both sides enter a dual relationship with the other and it then becomes a matter of taste as to which is the retraction and which is the section. Thus, $f$ can be regarded as the section to the retraction $r$ in the determination case and as the retraction to the section $s$ in the choice case. The following diagram illustrates this duality for a particular set of mappings between a two-element set (the label $A$ being suppressed for clarity) and the three-element set $B$.

The green lines correspond to the specific components of the mapping being solved for.  The dotted lines correspond to arrows that could be chosen differently (e.g. $b_2$ could point to either element in $A$). 

Once a section or a retraction have been found, other, more general ‘division’ problems can be solved.  In some sense, a retraction or a section are the primitives that unlock these problems.

The first of these general cases is built upon the retraction problem, where the mapping $y$, which goes from $A$ to a new set $T$, is given and the question is whether there exists a mapping $x$ from $B$ to $T$.  The ‘division problem’ we are trying to solve is defined by

\[ x \circ f = y \; . \]

The triangle portion of the problem is summarized on the left, where the ‘circular arrow’ connecting $A$ to $A$ is meant to remind us of the automorphism.

Lawvere and Schanuel point out that this problem always has a solution when $x \equiv y \circ r$, since

\[ x \circ f = (y \circ r ) \circ f = y \circ ( r \circ f ) = y \circ I_A = y \; . \]

A specific instance of this, for the two- and three-element sets above, is given by the following diagram,

where the components of $y$ are the blue arrows, and the black and purple arrows are the original $f$ and $r$ dual pair of the retraction problem, respectively. The action of the retraction $r$ is to bring an element of $B$ back to $A$ where it can use the mapping $y$ to have access to the set $T$. The specific pieces of $x$ are:

\[ x(b_1) = (y \circ r) (b_1) = y (a_1) = t_1 \; , \]

\[ x(b_2) = (y \circ r) (b_2) = y (a_2) = t_4 \; , \]

and

\[ x(b_3) = (y \circ r) (b_3) = y (a_2) = t_4 \; . \]
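These little calculations can also be checked mechanically by writing the maps as Python dictionaries.  The sketch below assumes a particular $f$ (namely $f(a_1)=b_1$ and $f(a_2)=b_2$, consistent with $r$ being a retraction but otherwise my own choice) along with the $r$ and $y$ used in the computations above.

f = {'a1':'b1', 'a2':'b2'}                 # assumed injective map A -> B
r = {'b1':'a1', 'b2':'a2', 'b3':'a2'}      # retraction: r(f(a)) = a for every a in A
y = {'a1':'t1', 'a2':'t4'}                 # the given map A -> T

x = {b: y[r[b]] for b in r}                # x = y o r
print(x)                                   # {'b1': 't1', 'b2': 't4', 'b3': 't4'}
print(all(x[f[a]] == y[a] for a in f))     # True, so x o f = y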

In an analogous way, the other general case is built upon the section problem, where the mapping $y$, which now goes from a new set $T$ to $A$, is given and the question is whether there exists a mapping $x$ from $T$ to $B$.  The ‘division problem’ we are now trying to solve is defined by

\[ f \circ x = y \; . \]

The triangle portion of the problem is now summarized on the right where, again, the ‘circular arrow’ connecting $A$ to $A$ is meant to remind us of the automorphism.

This problem always has a solution when $x \equiv s \circ y$, since

\[ f \circ x = f \circ (s \circ y) = (f \circ s) \circ y = I_A \circ y = y \; . \]

A specific instance of the section problem, again for the two- and three-element sets above, is given by the following diagram,

where the components of $y$ are still the blue arrows, and the black and green arrows are the original $f$ and $s$ dual pair of the section problem, respectively.  The action of the section is to connect the action of the mapping $y: T \rightarrow A$ to the set $B$, thereby putting each element of $T$ into the appropriate correspondence with elements in set $B$.  The specific pieces of $x$ are:

\[ x(t_1) = (s \circ y) (t_1) = s (a_1) = b_1 \; , \]

\[ x(t_2) = (s \circ y) (t_2) = s (a_1) = b_1 \; , \]

\[ x(t_3) = (s \circ y) (t_3) = s (a_1) = b_1 \; , \]

and

\[ x(t_4) = (s \circ y) (t_4) = s (a_2) = b_3 \; . \]

One final note about the retraction and section problems.  By making the base mapping of the triangle problem an automorphism, we are forced to supply additional requirements not imposed in the more general case.  As was seen in the last post, an essential ingredient is that the left-side mapping from $A$ to $B$ be injective and the right-side mapping from $B$ to $A$ be surjective.  It is instructive to look at how the number of retractions varies with the size of the waypoint set $B$.  The following diagram illustrates the retraction problem for the three cases with one fewer element, the same number of elements, or one more element in $B$ than in $A$ (where $f$ and $g$ are used to specify the left and right sides of the triangle).

Likewise, the following diagram illustrates how the number of sections varies through the same progression in the set $B$.

In both cases, there is a sense that the size of $B$ (i.e.  the number of the elements) is central to whether or not the problem has a solution and, if so, how many solutions exist.  This observation leads to two conclusions.

First, the definition of an inverse mapping, which was originally specified by the two conditions $g \circ f = I_A$ and $f \circ g = I_B$ (see this post for details), can now be recast as:

A map $f:A \rightarrow B$ is an isomorphism if there exists a unique inverse map $f^{-1}$ which is both a retraction and a section for $f$ such that $f \circ f^{-1} = I_B$ and $f^{-1} \circ f = I_A$.

Since the inverse must be both injective and surjective (since it is both a retraction and a section), we arrive at the familiar result that an inverse must be bijective. 

Second, one may wonder why all this machinery is needed to arrive at a result already familiar to mathematics centuries earlier.  According to Lawvere and Schanuel, all this hard work will pay off when tackling categories where the objects are promoted from simple finite sets to richer sorts.  In particular, infinite sets and dynamical systems are cited, but the jury is still out in this exploration as to whether their argument will ultimately be convincing.

Category Theory Part 4 – Determination and Choice

This month’s post on category theory picks up where the last post left off, with a look at the choice problem, the complementary situation in map composition to the determination problem considered there.  However, before digging into the choice problem directly, this narrative will begin by revisiting the determination problem.  This go-back is one of the consequences of just following where the roots of discovery lead without having trampled down a path.  After having a month to reflect on the determination problem, it became clear that a discussion of the choice problem would greatly benefit from a little more detail in the former.

One of the major failures of Lawvere’s and Schanuel’s book is not sufficiently connecting the combinatorics associated with composing maps with both problems.  While it is true that the text emphasizes the formal analogies between multiplication of ordinary numbers and composition of maps, the usefulness of the analogy is never clearly stated.  Their exposition goes from the very simplistic (e.g. considering how to tell when two sets have the same number of elements) to the very complicated (retractions and sections as a way of defining inverse mappings) with no intermediary steps.  The purpose of this post is to provide those waypoints.  Along the way, it will become clearer why the authors focus on automorphisms.

To lay the groundwork, consider the finite sets $A$, $B$, and $C$ as shown in the diagram below, 

along with known mappings $f:A \rightarrow B$ and $h:A \rightarrow C$ and the yet-to-be-determined mapping $g:B \rightarrow C$.  Set $A$ has two elements, as does $C$, while $B$ has only one.  In the last post, the importance of the mapping $f$ being injective was raised but not sufficiently explained.  A clear distinction wasn’t made as to whether or not a non-injective map could actually solve the determination problem.  We will detail the various mappings possible between the three sets shown in the figure and will show that non-injective mappings can solve the determination problem, but only in special contrived cases and never when the mapping between $A$ and $C$ is an automorphism.

Before diagramming each of the individual cases, let’s count the number of diagrams needed: one for each possible set of mappings $\{f,g,h\}$ (whether or not the set solves the determination problem).  The number of mappings between any two sets is equal to the number of elements in the target or codomain set raised to the power of the number of elements in the domain set.  Since $B$ has only one element, all mappings from any other set must look the same; that is, every element of the domain maps to the singleton in $B$.  The number of mappings from $B$ to $C$ is equal to the number of elements in $C$, which in this case is two.  A similar computation shows that the number of distinct mappings from $A$ to $C$ is four.  Already we can see that there is likely to be a problem in trying to solve the determination problem, as there are only two distinct ways to go from $A$ to $C$ with a waypoint at $B$ whereas there are four ways in which to go from $A$ to $C$ directly.  To see which of the four versions of $h$ are incompatible, let’s look at the diagrams for the cases where $h$ is not injective, followed by the cases where it is.
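That counting rule (the number of maps equals the size of the codomain raised to the size of the domain) is easy to confirm by brute force; the short sketch below (the list names are mine) simply enumerates every assignment of a codomain element to each domain element.

import itertools as it

def count_mappings(domain, codomain):
    # one codomain choice per domain element: len(codomain)**len(domain) assignments
    return sum(1 for _ in it.product(codomain, repeat=len(domain)))

A, B, C = ['a1','a2'], ['b1'], ['c1','c2']
print(count_mappings(A, B))   # 1 mapping  from A to B
print(count_mappings(B, C))   # 2 mappings from B to C
print(count_mappings(A, C))   # 4 mappings from A to C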

In the case where the mapping $h:A \rightarrow C$ is not injective, there is always a solution to the determination problem, despite the fact that the mapping $f:A \rightarrow B$ also fails to be injective.  The figure below shows the four possible sets of $\{f,g,h\}$ in this case, with the left side showing the correct choice for the mapping $g$ and the right showing the wrong way to do it.

However, when the mapping $h$ is injective, that is to say that each element in $A$ maps to a unique element in $C$ (more generally, every point in the codomain has, at most, one incoming arrow), then no solution to the determination problem can be found.  This is shown in the following diagram for the two distinct possibilities for $h$, one of which is an automorphism (with the identifications $a_1 = c_1$ and $a_2 = c_2$) and the other a permutation.

If we restrict ourselves to only the automorphism between $A$ and $C$, then we have to give up all hope of contriving a solution if $f$ is not injective, because otherwise we would have to force some doubly-targeted element (i.e. $b_1$ in this case) into trying to do the impossible: to simultaneously point to two separate elements in its codomain.

The special case of a mapping $g$ that solves the determination problem when $h=I_A$ is called a retraction $r$ for $f$, and it obeys the relation $r \circ f = I_A$.

We are now in a better position to understand the choice problem and the role of the automorphisms in it.  The choice problem is the analog of the determination problem where the mappings $h: A \rightarrow C$ and $g:B \rightarrow C$ are known and where the map being sought is $f:A \rightarrow B$.  We’ve already cataloged all the possible sets of maps in the examples above, and we know that the choice problem is only compatible with the case where $h$ is non-injective, but this doesn’t give us enough information to understand the properties of $g$.  So, to really appreciate the choice problem we are going to have to change the composition of the sets.  The simplest approach is to enlarge $B$ to have 3 elements (the reason for skipping a 2-element $B$ should become obvious afterwards).

Since we are interested in automorphisms as a special case of the 4 possible mappings from $A$ to $C$, we can substantially decrease the tedium of working through all the possible combinatorics.  In this special case, a solution to the choice problem is called a section and is usually denoted as $s$.

Despite this simplification, there are still nine possible mappings into $B$ and eight possible mappings from it (for a total of 72 possibilities).  Fortunately, our job will get even easier after we examine the following diagram which focuses on one specific mapping from $B$. 

Of the nine possibilities that map the 2-element set (call it $A$ or $C$ as you wish) to $B$, none of them work because each element of $B$ aims at the same element in $A$.  The remedy is to make sure that the mapping $g$ sends at least one arrow to each element in $C$, that is, that $g$ is surjective.

It doesn’t matter exactly how the rest of the mapping is specified (for example, $b_3$ could point to the other element).  The key feature is simply that all elements in $A$ are covered by at least one arrowhead.

These two special cases are summarized quite nicely in the following diagram (adapted from the one on page 73 of Lawvere and Schanuel).

Next month, we’ll explore how to relate retractions and sections of a given map $f$ to a variety of ‘division problems’ in the category of finite sets, and we’ll explore a new insight into inverse mappings $f^{-1}$ that they provide.

Category Theory Part 3 – Determination and Division

Last month’s column talked about the mathematical and philosophical aspects of that one special mapping out of all the ones that exist: the isomorphism.  An isomorphism is special because only this mapping possesses the property that it can be ‘run in reverse’ and, like watching a movie played backwards, the end state (the codomain) returns exactly to the domain.  When this occurs, the original mapping has an inverse.

Lawvere and Schanuel (see here for details) emphasize that much like the arithmetic inverse of 2 is 1/2, so too can we interpret the mapping inverse as a sort of division problem.  Along those lines they introduce the reader to two types of ‘division’ problems – known as the determination and choice problems, respectively – within the context of mappings and category theory.  The fact that there are two types of ‘division mappings’ that one must contend with, in contrast to only one ‘division number’ in the arithmetic case, is a consequence of the fact that we don’t require, in general, an equal number of elements in the two sets being mapped.  The authors fail to explore this last point sufficiently, which leads to their discussion being somewhat impenetrable (a cursory search on the web leads me to conclude that this is a universal finding of people who study their text).  They also add insult to injury by interleaving the discussion of the two problems without much of an organized plan and with several smaller pedagogical no-nos. 

So, to address these shortcomings, this post will look at the first of these two problems, the determination problem, while deferring the choice problem to next month’s column.  In each case, the abstract definition of the problem will be given first, followed by a concrete example.  The only remaining question is what specific concrete example to use.  Combining a bit of whimsy with the fact that I just introduced a friend to the 1979 movie The Warriors led me to pull my examples from it.

For those unfamiliar with the movie, the plot centers around a group from a gang called the Warriors who must fight their way through a host of rival gangs as they try to travel from the Bronx to Coney Island.  Each of the rival gangs has a particular set of colors and a trademark approach to their violence.  Police officers and ordinary people also show up, but in limited roles.  The finite sets based on this story are: a set of some of the gangs, a set of some of the characters, a set of some of the uniforms/colors worn by the gangs, and a set of some of the weapons.  (Credit is due to The Warriors Gangs – The Warriors Movie Site, which provided some of the finer details about the gangs.)

Division Problem 1 – Determination Problem

Abstractly, the determination problem asks the following question.  If $A$, $B$, and $C$ are sets and $f:A \rightarrow B$ and $h:A \rightarrow C$ are given mappings, do we have enough information to determine what mapping, if any, exists between $B$ and $C$?

Pictorially, the question amounts to asking if a mapping $g$ exists such that the following triangle picture closes.

For our concrete example, let’s take the set $A$ to be three members of the titular gang, the set $B$ to be each member’s weapon of choice, and the set $C$ the gang each member most hates. Each member has his own unique weapon of choice (as seen by the mapping $f:A \rightarrow B$): $f(Ajax)=Bat$, $f(Cochise)=Pipe$, and $f(Swan)=Knife$. However, two of the Warriors rank the Rogues at the top of their most-hated lists ($h(Cochise)=Rogues=h(Swan)$ while another member views the Baseball Furies as easily the most detestable ($h(Ajax)=Furies$). Pictorially, the situation looks like:

The question is: can we generate a mapping $g$ from the set of preferred weapons to the set of most-hated gangs such that $g \circ f = h$?  In the parlance of category theory, finding $g$ means that ‘$h$ is a function of $f$’.

A little bit of thought should convince the reader that a mapping $g$ defined in the following diagram fits the bill.

Of course we can check each case separately as well by the following list:

  • $ h(Ajax) = Furies = g \circ f (Ajax) = g(Bat) = Furies$
  • $h(Cochise) = Rogues = g \circ f (Cochise) = g(Pipe) = Rogues$
  • $h(Swan) = Rogues = g \circ f (Swan) = g(Knife) = Rogues$

Since we’ve exhausted all the elements of the set $A$ and, in each case, found that $h = g \circ f$, we can say that we’ve found a solution to the determination problem for our little gang-warfare scenario.
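For readers who like to tinker, this check is also a one-liner once the maps are written as Python dictionaries (a representation of my own choosing, with the values taken from the lists above):

f = {'Ajax':'Bat', 'Cochise':'Pipe', 'Swan':'Knife'}        # weapon of choice
h = {'Ajax':'Furies', 'Cochise':'Rogues', 'Swan':'Rogues'}  # most-hated gang
g = {'Bat':'Furies', 'Pipe':'Rogues', 'Knife':'Rogues'}     # the candidate solution

print(all(g[f[member]] == h[member] for member in f))       # True, so g o f = h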

Before we interpret that solution in plain language, let’s look at a related situation where a solution can never be found.  We imagine that Cochise now prefers the Bat as his weapon of choice but that he maintains his deep and abiding hatred for the Rogues.  Let’s make the same list as above and see if anything is now amiss:

  • $h(Ajax) = Furies = g \circ f (Ajax) = g(Bat) = Furies$
  • $h(Cochise) = Rogues = g \circ f (Cochise) = g(Bat) = Rogues$
  • $h(Swan) = Rogues = g \circ f (Swan) = g(Knife) = Rogues$

By the time we get to the second line, we see that we have a problem.  Ajax’s case requires that $g(Bat) = Furies$ while Cochise’s situation requires $g(Bat)=Rogues$.  Under the rules of set mapping, one, and only one, arrow can start on an element in the domain set.  Pictorially, we have a situation where reconciling Ajax’s mode de guerre with that of Cochise requires two arrows starting from their favorite weapon, the baseball bat (note that extraneous lines have been omitted for clarity).

The mathematical/philosophical machinery invoked leaves us with a clear picture of why a solution exists in the former case but not in the latter.  The key difference is that in the first scenario $f$ had each element in $A$ point to a unique element in $B$.  This requirement makes $f$ an injective mapping, and injectivity will be the central aspect of all mappings that possess a retraction.  An important corollary is that the size of set $B$ must be at least as big as $A$ so that an injection can be set up (compare with the case if there were only two types of weapons; then one of the Warriors would, necessarily, need to favor the same weapon as another member).

Interestingly, while the category theory explanation of the contrast between the two cases is clear, finding plain language to describe the difference is quite hard.  One might think that the successfully constructed mapping $g$ in the first case would mean that we could conclude with certainty that if we see Ajax reach for a Bat then he’s coming after one of the Furies.  But a good defense lawyer could argue that he merely wants to go to the batting cage.  Likewise, the lack of such a mapping might seem daunting in the second case, but if we find a Rogue beaten up with a baseball bat, any NYPD detective would have probable cause to talk to Cochise.  It seems that this aspect of category theory doesn’t adapt itself very well to real-world situations with nearly unlimited possibilities, uncertainties, and confounding variables, despite Lawvere’s and Schanuel’s attempts at doing so.  But the jury is still out in this exploration and maybe new ideas will emerge that force a reassessment.

Next month we’ll look at the choice problem but we’ll drop the gang violence in favor of more prosaic sets.

Category Theory Part 2 – Basics of Isomorphisms

This month’s installment continues the journey through Lawvere’s and Schanuel’s book on category theory (see last month’s installment).  In that article, the basic idea of category theory emerged as a set of ‘meta-rules’ that focused not on the elements of two sets but on the mappings between those sets.  In addition, the authors emphasized analogies with scientific thought and with ordinary arithmetic.

That section was rather short, though, and in this post the machinery gets to be a bit more complicated.  The central theme for the second section of the book (Article II and its accompanying Sessions) centers on isomorphisms.  This month’s installment will focus only on the basics of isomorphisms, with hooks to future posts on the topics of choice (i.e., sections of a map) and determination (i.e., retractions of a map).

It is interesting to note that a suspicion voiced in last month’s column seems to be on target: namely, that much of this work is philosophical rather than just mathematical.  The first indication of this is that Lawvere and Schanuel try to motivate the concept of isomorphisms by imagining a point in humanity’s past where the formal concept of numbers had yet to be invented.  The question that they pose is: just how did man know when two things were equal?

The answer that they provide is a simple one.  Humanity devoid of formal numbers would not be able to count the number of elements in a set and compare that total to the total from another set to establish that the two sets have the same size.  But they could try to put the elements of each set into a one-to-one correspondence.

The example they construct consists of a nuclear family (father, mother, and child) and a collection of items, each of which belongs to or is identified with one of them. 

The idea is that no matter how unsophisticated the family mathematics may be, each member of this group can claim a treasure all his own and, in doing so, realize that the family group is the same size as the collection of personal treasures.  This example is a bit too ‘tribal’ for my taste, but it did get me thinking of those old heist movies where, after having made off with the loot, the criminals divvy up the cash using the ‘one for you, one for me’ scheme.

Be that as it may, there is an even more subtle notion lurking under the surface of this basic concept of equality.  Not only does each person have a treasure to call their own, but each treasure has an owner.  This bidirectionality is more special than the rules discussed in the category of sets.  In the latter, the only condition on a proper mapping is that each element in the set labeled as the domain has an arrow departing from it that connects with one and only one element in the codomain.  There was no condition placed on each element of the codomain connecting it back to one and only one element in the domain. 

A bit of reflection should, in general, convince one to reject a requirement that an element in the codomain must connect back to the domain.  Imposing such a requirement would mean that entire propositions would be impossible to express.  For example, consider the set of students {Andy, Barbara, Carl, Denise} who have just taken a test, and the set of possible grade outcomes {A,B,C,D,F}. 

The statement “No student failed the exam” could not be encoded into the language of category theory if every grade level needed to link to a student (and only one student).  (It is comical to think about a world where such a requirement were imposed.  Classes could never have any number of students other than 5, and then, even if the scores on a 100-point exam were 1, 2, 3, 4, and 5, the top scorer would walk away with an A.)
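
In code terms (my own toy illustration, not the book’s), the grade assignment is simply a map of finite sets whose codomain may contain values that are never hit:

# A perfectly legal map of sets: every student gets exactly one grade,
# but nothing requires every grade to be assigned to some student.
grades = {"Andy": "B", "Barbara": "A", "Carl": "C", "Denise": "B"}

possible_grades = {"A", "B", "C", "D", "F"}
assigned = set(grades.values())

print("No student failed the exam:", "F" not in assigned)    # True
print("Grades never assigned:", possible_grades - assigned)  # {'D', 'F'} (order may vary)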

So, the idea that the lines connecting the domain to the codomain can, in certain classes of mapping, be reversed is quite remarkable.  When a mapping possesses this bidirectionality it is called an isomorphism, and the formal definition is as follows:

A map $f: A \rightarrow B$ is called an isomorphism or invertible map, if there is a map $g:B \rightarrow A$ such that $g \circ f = I_A$ and $ f \circ g = I_B $.

Two special maps, $I_A$ and $I_B$, appear, which are the identity maps that take elements in the sets $A$ and $B$ back into themselves.  The map $g$ is also special in that it is unique, and so it can be denoted unambiguously as $f^{-1}$.

It might be tempting to think that having both composition rules is redundant (e.g., $g \circ f = I_A$ has the same information as $ f \circ g = I_B$) but, as the following example shows, both are needed.

For these mappings, $g \circ f = I_A$ but $f \circ g \neq I_B$. 
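
To see why both conditions really are needed, here is a small Python check (my own example, standing in for the book’s diagram): when $A$ has fewer elements than $B$, it is easy to find a $g$ with $g \circ f = I_A$ even though $f \circ g \neq I_B$.

def compose(second, first):
    """Return 'second follows first' for finite maps stored as dicts."""
    return {x: second[first[x]] for x in first}

def identity(X):
    return {x: x for x in X}

A = {1, 2}
B = {"p", "q", "r"}

f = {1: "p", 2: "q"}              # A -> B, injective but not onto
g = {"p": 1, "q": 2, "r": 1}      # B -> A, a one-sided inverse

print(compose(g, f) == identity(A))   # True:  g o f = I_A
print(compose(f, g) == identity(B))   # False: f o g sends r to p, not back to r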

It’s also easy to look past this definition and think that more must be needed, but the authors soon dispel that idea as well by showing that the following associated properties follow from it – namely that the relation ‘is isomorphic to’ is:

  • Reflexive: $A$ is isomorphic to $A$
  • Symmetric: If $A$ is isomorphic to $B$, then $B$ is isomorphic to $A$
  • Transitive: If $A$ is isomorphic to $B$, and $B$ is isomorphic to $C$, then $A$ is isomorphic to $C$

These associated properties, and a host of other ones as well, are relatively easy to prove, but can look strange to one not used to the manipulations.  The primary strategy involves introducing the identity maps (either $I_A$ or $I_B$) at opportune moments.   

For example, the proof of the transitive property goes as follows (a small computational check in Python appears after the list):

  1. $A$ isomorphic to $B$ means $f:A \rightarrow B$ and that there is a $g: B \rightarrow A$ such that $f \circ g = I_B$ and $g \circ f  = I_A$.
  2. $B$ isomorphic to $C$ means $h:B \rightarrow C$ and that there is a $k: C \rightarrow B$ such that $h \circ k = I_C$ and $k \circ h  = I_B$.
  3. Then the map $h \circ f: A \rightarrow C$ (by composition).
  4. We now need to show that there exists an inverse map $ (h \circ f)^{-1}:C \rightarrow A$ such that $(h \circ f)^{-1}  \circ (h \circ f)  = I_A$ and $(h \circ f) \circ (h \circ f)^{-1} = I_C$
  5. From the definition of the inverse, the natural candidate is $(h \circ f)^{-1} = g \circ k$.
  6. Putting it all together we see that:
    1. $(h \circ f)^{-1} \circ (h \circ f) = (g \circ k) \circ (h \circ f) = g \circ (k \circ h) \circ f = g \circ I_B \circ f = g \circ f = I_A$
    2. $(h \circ f) \circ (h \circ f)^{-1} = (h \circ f) \circ (g \circ k) = h \circ (f \circ g) \circ k = h \circ I_B \circ k = h \circ k = I_C$
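
As a sanity check on the algebra, here is a quick Python verification with small concrete sets (the particular maps are my own invented example): if $g$ inverts $f$ and $k$ inverts $h$, then $g \circ k$ does indeed invert $h \circ f$.

def compose(second, first):
    """Return 'second follows first' for finite maps stored as dicts."""
    return {x: second[first[x]] for x in first}

def identity(X):
    return {x: x for x in X}

A, B, C = {1, 2, 3}, {"a", "b", "c"}, {10, 20, 30}

f = {1: "a", 2: "b", 3: "c"}      # A -> B, an isomorphism
g = {"a": 1, "b": 2, "c": 3}      # its inverse
h = {"a": 10, "b": 20, "c": 30}   # B -> C, an isomorphism
k = {10: "a", 20: "b", 30: "c"}   # its inverse

hf  = compose(h, f)               # A -> C
inv = compose(g, k)               # C -> A, the candidate inverse

print(compose(inv, hf) == identity(A))   # True
print(compose(hf, inv) == identity(C))   # True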

Next month we’ll look at an equivalent definition of an isomorphism that has analogies to division in arithmetic using the concepts of sections and retractions of maps. 

Category Theory Part 1 – An Introduction

Since it is fashionable to start a new year with a new resolution, I thought it might be fun to try something new within the domain of this blog.  So, starting this month, the content and approach will be a little different.  I’ll be working through Conceptual Mathematics: A first introduction to categories, by F. William Lawvere and Stephen H. Schanuel. 

I had picked up this book some time back based on a whim and the promise on the back cover (obviously not quite a wing and a prayer but one improvises).  The authors claimed that categories lead to “remarkable unification and simplification of mathematics” and such a bold claim easily caught my attention and sparked my curiosity.   Thumbing through the pages made it clear that this work sits as much in the realm of philosophy as it does mathematics and that moved it from a possible purchase to an immediate sale.  However, life being life, I only got through the first chapter or so before overriding pressures shoved the book to the back burner, where it remained for many years. 

Now the time is right to return to this work, but the process will be a little different.  Rather than digesting the material first and writing about it afterward, I’ll be chronicling my progress through material that is largely foreign to me, presented in a work that is itself substantially unfamiliar.  It should prove an interesting journey, though perhaps not a very successful one in the long run; only time will tell.

This first installment covers the introductory material in the book starting with some basic ontology and epistemology of physics and ending with a distillation of the basic properties concerning the category of finite sets.

Like any other mathematical scheme, the category of finite sets consists of two components: the basic objects or data comprising the category and a set of rules for how to relate them.  The curious thing about categories is that it seems that the rules are, in some sense, meta-rules.   

To better understand this point, consider a diagram from early in the book showing a standard picture of the trajectory of an object (left), in this case the flight of a bird, and the ontological breakdown of the motion mathematically (right).

At each instant in time, the bird is at a specific location in space.  The physicist’s standard approach is to consider the trajectory as a mapping $f$ from time to space which can symbolically be written as

\[ f:t \rightarrow {\mathbb R}^3 \equiv f:t \rightarrow {\mathbb R} \times {\mathbb R}^2 \; , \]

where the Cartesian product in the last relation means that ‘space’ ${\mathbb R}^3$ can be decomposed into ‘line’ ${\mathbb R}$ and ‘plane’ ${\mathbb R}^2$. 

I think most physicists stop there with the job accomplished.  Being one myself, I pondered for a while what was so important about this point that it deserved a special place in the introduction – the part of the book that a prospective buyer would be most likely to read before purchasing.  While I am not completely sold on my interpretation, I believe that the authors were trying to say that the mappings are the basic objects in categories rather than simply tools for understanding the more ‘physical’ objects of time and space.

To further support this interpretation, I’ll point out that the authors list (p. 17) the basic ingredients (as they call them) needed to have a category or ‘mathematical universe’ (again, their words).  According to this terminology, these ingredients are:

Data for the category:

  • Objects/sets:  $A$, $B$, $C$, $\ldots$
  • Maps between sets: $f:A \rightarrow B$ or, synonymously, $A \xrightarrow{f} B$.  The maps have only one stipulation: every element in the domain $A$ goes to exactly one element in the range (or codomain) $B$.
  • Identity Maps (one per object): $A \xrightarrow{i_A} A$
  • Composition of Maps: assigns to each pair of compatible maps (say a map $f:A \rightarrow B$ followed by another map $g:B \rightarrow C$) a single map that accomplishes the job in one go (a map directly from $A$ to $C$, $h: A \rightarrow C$) with $h \equiv g \circ f$ read as ‘$g$ follows $f$’

Rules for the category:

  • The identity laws
    • If $A \xrightarrow{i_A} A \xrightarrow{g} B$, then $A \xrightarrow{g \circ i_A = g} B$
    • If $A \xrightarrow{f} B \xrightarrow{i_B} B$, then $A \xrightarrow{i_B \circ f = f} B$
  • The associative law
    • If $A \xrightarrow{f} B \xrightarrow{g} C \xrightarrow{h} D$, then $A \xrightarrow{h \circ (g \circ f) = (h \circ g) \circ f} D$

They emphasize that the rules are important, even crucial.  I take it that they mean that the key insight lies in the action of the maps and not in the objects the maps relate, and that is why I characterize the rules for categories as meta-rules.

These meta-rules center on the composition of the maps.  The identity laws are fairly straightforward, with one caveat: both identity map rules are required since the maps are not required to be bijections, unlike the usual case in physics where the maps almost always are.  This extra freedom makes it a little more difficult to think about the associativity rule.  The following diagram, adapted from the exercises starting on page 18, shows that associativity holds even in the absence of a clear inverse for each map; if one starts with some element in the first set it has to go somewhere in the last set, and it doesn’t matter whether one groups the intermediate steps one way or the other.
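
For concreteness, here is a small Python check (my own example, not the book’s exercise) showing that the identity and associative laws hold even when none of the maps is a bijection.

def compose(second, first):
    """Return 'second follows first' for finite maps stored as dicts."""
    return {x: second[first[x]] for x in first}

def identity(X):
    return {x: x for x in X}

A, B, C, D = {1, 2, 3}, {"x", "y"}, {"u", "v"}, {0}

f = {1: "x", 2: "x", 3: "y"}      # A -> B, not injective
g = {"x": "u", "y": "u"}          # B -> C, not injective
h = {"u": 0, "v": 0}              # C -> D, collapses everything

# Identity laws: composing with an identity map changes nothing.
print(compose(f, identity(A)) == f)        # True
print(compose(identity(B), f) == f)        # True

# Associative law: the grouping of intermediate steps does not matter.
print(compose(h, compose(g, f)) == compose(compose(h, g), f))   # True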

The final observation is that the authors make the point that there is a strong analogy between the multiplication of real numbers and the composition of maps.  This analogy derives from the combinatorics associated with the ways in which one can relate $n$ elements in one set to $m$ elements in another.

Mathematically, the number of different maps from $A$ to $B$ is given in terms of the number of elements in each (denoted by $N_A$ and $N_B$, respectively) by

\[ {N_B}^{N_A} \; , \]

which for the case in the diagram above is $ 2^3 = 8$ as expected.
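
A quick way to convince oneself of this count is to enumerate every possible map with a few lines of Python (the three-element and two-element sets here are just stand-ins for the sets in the diagram):

from itertools import product

A = ["a1", "a2", "a3"]
B = ["b1", "b2"]

# A map A -> B is exactly one choice of an image in B for each element of A.
all_maps = [dict(zip(A, images)) for images in product(B, repeat=len(A))]

print(len(all_maps))        # 8
print(len(B) ** len(A))     # 8, i.e. N_B ** N_A = 2 ** 3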

The exact insights that this analogy provides remain a mystery for now, but there is always next month.