Author Archive: Conrad Schiff

Time Series 3 – Exponential Smoothing and Holt-Winter

In the last post, we examined the Holt-Winter scheme for sequentially tracking the level, trend, and seasonal variations in a time series, using some synthetic data designed to illustrate the algorithm as cleanly as possible.  In this post, we’ll try the Holt-Winter method against real-world data for US housing sales and will set some of the context for why the method works by comparing it to a related technique called the moving average.

The data analyzed here were obtained from RedFin (https://www.redfin.com/news/data-center/) but it isn’t clear for how long RedFin will continue to make these data public as they list the data as being ‘temporarily released’.  As a result, I’ve linked the data file I’ve used here.

We’re going to approach these data in two ways.  The first is by taking a historical look at the patterns in the data from the vantage point of hindsight on the entire span of home sales having been collected.  In the second approach, we imagine what an agent working in the past thinks as the data come in one record at a time. 

The historical look starts with an overview of the number of homes sold in the time period starting in Feb 2012 and ending at May 2023.

These data show both seasonal and overall trend variations, so our expectation might be that Holt-Winter would do a good job, but note two things.  First, with the exception of the first pandemic year of 2020, each of the years shows the same pattern: sales are low in the winter months and strong in the summer ones.  Second, the trend (most easily seen by focusing on the summer peak) shows four distinct regions: a) from 2012-2017 there is an overall upward trend, b) from 2017-2020 the trend is downward with a much shallower slope, c) the start of the pandemic lockdowns in 2020 breaks the smoothness of the trend and then the trend again has a positive slope over 2020-2021, and d) the trend is strongly downward afterwards.  These data exhibit a real-world richness that the contrived data used in the last post did not, and they should prove a solid test for a time series analysis agent/algorithm.

Depending on how ‘intelligent’ we want our analysis agent to be we could look at a variety of other factors to explain or inform these features.  For our purposes, we’ll content ourselves with looking at one other parameter, the median home sales price, mostly to satisfy our human curiosity.

These data look much more orderly in their trend and seasonal variation over the time span from 2012-2020.  Afterwards, there isn’t a clear pattern in terms of trend and season. 

Our final historical analysis will be to try to understand the overall pattern of the data using a moving average defined as:

\[ {\bar x}_{k,n} = \frac{1}{n} \sum_{i =k-n/2}^{k+n/2} x_i \; . \]

The index $k$ specifies the point of the underlying data on which the average is centered and $n$ the number of points to be used in the moving average.  Despite the notation, $n$ is best taken odd so that there are as many points before the $k$th one as there are after; this prevents the moving average from introducing a bias that shifts a peak in the average away from the place in the data where it occurs.  In addition, there is an art in the selection of the value of $n$: too small and it fails to smooth out unwanted fluctuations, too large and it smears out the desired patterns.  For these data, $n = 5$.  The resulting moving average (in solid black overlaying the original data in the red dashed line) is:

Any agent using this technique would clearly be able to describe the data as having a period of one year with a peak in the middle and perhaps an overall upward trend from 2012 to 2022 followed by a sharp decline.  But two caveats are in order.  First, and most important, an agent employing this technique to estimate a smoothed value at the $k$th time step must wait until at least $n/2$ additional future points have come in.  This requirement usually precludes being able to perform predictions in real time.  The second is that the moving average is computationally burdensome when $n$ is large.
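For concreteness, here is a minimal sketch of a centered moving average in Python/numpy.  It is not the exact code used to produce the plots in this post, and the function name and windowing convention are mine:

```python
import numpy as np

def centered_moving_average(x, n=5):
    """Centered moving average with an odd window n (illustrative sketch)."""
    if n % 2 == 0:
        raise ValueError("n should be odd so the window is centered on point k")
    half = n // 2
    x = np.asarray(x, dtype=float)
    # only indices with a full window on both sides receive a smoothed value
    return np.array([x[k - half:k + half + 1].mean()
                     for k in range(half, len(x) - half)])
```

Note that the first and last half-window of points receive no smoothed value at all, which is exactly the real-time limitation just described.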

By contrast, the Holt-Winter method can be used by an agent needing to analyze in real time and it is computationally clean.  At the heart of the Holt-Winter algorithm is the notion of exponential smoothing where the smoothed value at the $k$th step, $s_k$, is determined by the previous smoothed value $s_{k-1}$ and the current raw value $x_k$ according to

\[ s_k = \alpha x_k + (1-\alpha) s_{k-1} \; . \]

Since $s_{k-1}$ was determined from a similar expression at the time point $k-1$, one can back-substitute to eliminate all the smoothed values $s$ on the right-hand side in favor of the raw ones $x$ (taking $s_0 = x_0$) to get

\[ s_k  = \alpha x_k + \alpha(1-\alpha)x_{k-1} + \alpha(1-\alpha)^2 x_{k-2} + \cdots + (1-\alpha)^k x_0 \; . \]

This expression shows that the smoothed value $s_k$ is a weighted average of all the previous points, making it analogous to the sequential averaging discussed in a previous post, but the exponential decay of the weights, going as $(1-\alpha)^j$ for a point $j$ steps in the past, makes the resulting sequence $s_k$ look more like the moving average.  In some sense, exponential smoothing straddles the sequential and moving averages, giving the computational convenience of the former while providing the latter’s ability to follow variations and trends.
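As a sketch (again, not the exact code behind the plots below), the recursion can be implemented in a few lines of Python; only the previous smoothed value needs to be retained:

```python
def exponential_smoothing(x, alpha, s0=None):
    """Sequentially apply s_k = alpha*x_k + (1 - alpha)*s_{k-1}.
    s0 defaults to the first raw value, matching the expansion above."""
    s = x[0] if s0 is None else s0
    smoothed = [s]
    for xk in x[1:]:
        s = alpha * xk + (1 - alpha) * s
        smoothed.append(s)
    return smoothed
```

Unlike the moving average, each update touches only the newest raw value and the last smoothed value, which is what makes the scheme attractive for a real-time agent.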

How closely the exponentially smoothed sequence matches a given $n$-point moving average depends on the selection of the value of $\alpha$.  For example, with $\alpha = 0.2$ the exponentially smoothed curve gives

whereas $\alpha = 0.4$ gives

Of the two of these, the one with $\alpha=0.4$ much more closely matches the 5-point moving average used above. 

The Holt-Winter approach uses three separate applications of exponential smoothing, hence the need for the three specified parameters $\alpha$, $\beta$, and $\gamma$.  Leslie Major presents a method for optimizing the selection of these three parameters in her video How to Holts Winters Method in Excel & optimize Alpha, Beta & Gamma.

We’ll skip this step and simply use some values informed by the best practices that Major (and other YouTubers) note.

The long-term predictions given by our real time agent are pretty good in the time span 2013-2018.  For example, a 24-month prediction made in February 2013 looks like

Likewise, a 24-month prediction in June 2017 looks like

Both show good agreement with a few areas of over- or under-estimation.  The most egregious error is the significant overshoot in 2019, which is absent in the 12-month prediction made a year later.

All told, the real time agent does an excellent job of predicting in the moment but it isn’t perfect as is seen by how the one-month predictions falter when the pandemic hit.

Time Series 2 – Introduction to Holt-Winter

In the last blog, I presented a simple sequential way of analyzing a time series as data are obtained.  In that post, the average of any moment $x^n$ was obtained in real time by simply tracking the appropriate sums and number of points seen.  Of course, in a real-world application, there would have to be a bit more intelligence built into the algorithm to allow an agent employing it to recognize when a datum is corrupted or bad or missing (all real-world problems) and to exclude these points both from the running sums and from the number of points processed.

This month, we look at a more sophisticated algorithm for sequentially analyzing trends and patterns in a time series and for projecting that analysis into the future using these patterns.  The algorithm is a favorite in the business community because, once an initial ‘training’ set has been digested, the agent can update trends and patterns with each new datum and then forecast into the future.  The algorithm is called Holt-Winter triple exponential smoothing, and it has been used in the realm of business analytics for forecasting the number of home purchases, the revenue from soft drink sales, ridership on Amtrak, and so on, based on a historical time series of data.

Being totally unfamiliar with this algorithm until recently, I decided to follow and expand upon the fine video by Leslie Major entitled How to Holts Winters Method in Excel & optimize Alpha, Beta & Gamma.  In hindsight, this was a very wise thing to do because there are quite a few subtle choices for initializing the sequential process, and the business community focuses predominantly on using the algorithm rather than explaining the rationale for the choices being made.

For this first foray, I am using the ‘toy’ data set that Major constructs for her tutorial.  The data set is clean and well-behaved but, unfortunately, is not available from any link associated with the video, so I have reconstructed it (with a lot of pauses) and made it available here.

The data are sales $W$ of an imaginary item which, in deference to decades of tradition, I call a widget.  The time period is quarterly, and a plot of the quarterly sales $W$ from the first quarter of 2003 (January 1, 2003) through the fourth quarter of 2018 (October 1, 2018) shows both seasonal variations (with a monotonic ordering from the first quarter as the lowest to the fourth quarter as the highest)

as well as a definite upward trend for each quarter.

The Holt-Winter algorithm starts by assuming that the first year of data is initially available and that a first guess for a set of three parameters $\alpha$, $\beta$, and $\gamma$ (associated with the level $L$, the trend $T$, and the seasonal $S$ values for the number of widget sales, respectively) is known.  We’ll talk about how to revise these assumptions after the fundamentals of the method are presented.

The algorithm consists of 7 steps.  Steps 1-4 prescribe how to initialize the data structures tracking the level, trend, and seasonal values; step 5 is a bootstrap initialization needed to start the sequential algorithm proper once the first new datum is obtained; and steps 6-7 comprise the iterated loop used for all subsequent data, wherein an update to the current interval is made (step 6) and then a forecast into the future is made (step 7).

In detail these steps require an agent to:

  1. Determine the number $P$ of intervals in the period; in this case $P = 4$ since the data are collected quarterly.
  2. Gather the first period of data (here a year) from which the algorithm can ‘learn’ how to statistically characterize it.
  3. Calculate the average $A$ of the data in the first period.
  4. Calculate the ratio of each interval $i$ in the first period to the average to get the seasonal scalings $S_i = \frac{V_i}{A}$.
  5. Once the first new datum $V_i$ ($i=5$) (for the first interval in the second period) comes in, the agent then bootstraps by estimating:
    a. the level in the first interval of the second period by making a seasonal adjustment.  Since the seasonal ratio for this interval is not yet known, the agent uses the seasonal value from the corresponding interval of the previous period: $L_i = \frac{V_i}{S_{i-P}}$;
    b. the trend of the first interval in the second period using $T_{i} = \frac{V_i}{S_{i-P}} - \frac{V_{i-1}}{S_{i-1}}$.  This odd-looking formula is basically the finite difference between the first interval of the second period and the last interval of the first period, each seasonally adjusted.  Again, since the seasonal ratio is not yet known, the agent uses the seasonal value from the corresponding earlier interval for the first interval of the second period;
    c. the seasonal ratio of the first interval in the second period using $S_i = \gamma \frac{V_i}{L_i} + (1-\gamma) S_{i-P}$.
  6. Now, the agent can begin sequential updating in earnest by using weighted or blended averages of the data in current and past intervals to update:
    a. the level using $L_i = \alpha \frac{V_i}{S_{i-P}} + (1-\alpha)(L_{i-1} + T_{i-1})$
    b. the trend using $T_i = \beta ( L_i - L_{i-1} ) + (1-\beta) T_{i-1}$
    c. the seasonal ratio (again) using $S_i = \gamma \frac{V_i}{L_i} + (1-\gamma) S_{i-P}$
  7. Finally, the agent can forecast as far forward as desired using $F_{i+k} = (L_i + k \cdot T_i) S_{i - P + k}$, where $k$ is an integer representing the number of ‘intervals’ ahead to be forecasted (a sketch of steps 6 and 7 in code appears below).  There are some subtleties associated with the finite number of historical seasonal ratios available to the agent, so, depending on application, specific approximations for $S_{i - P + k}$ must be made.
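As a minimal Python sketch of steps 6 and 7, assuming the level, trend, and seasonal lists have already been initialized per steps 1-5 (the function and variable names are mine, not Major’s, and this is the multiplicative form of the update described above):

```python
def holt_winters_update(V_i, L_prev, T_prev, S_prev, alpha, beta, gamma):
    """Step 6: blend the new datum V_i into the level, trend, and seasonal ratio.
    S_prev is the seasonal ratio from the corresponding interval one period ago."""
    L_i = alpha * (V_i / S_prev) + (1 - alpha) * (L_prev + T_prev)
    T_i = beta * (L_i - L_prev) + (1 - beta) * T_prev
    S_i = gamma * (V_i / L_i) + (1 - gamma) * S_prev
    return L_i, T_i, S_i

def holt_winters_forecast(L_i, T_i, seasonals, k, P=4):
    """Step 7: forecast k intervals ahead.  seasonals[-1] is the most recent
    seasonal ratio; for k > P the last full period of ratios is reused, which
    is one of the approximations mentioned in step 7."""
    return (L_i + k * T_i) * seasonals[-P + ((k - 1) % P)]
```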

Using the widget sales data, the first 4-quarter forecast compares favorably to the actuals as seen in the following plot.

When forecasting only a quarter ahead, as Major herself does, the results are also qualitatively quite good.

Finally, a note about selecting the values of $\alpha$, $\beta$, and $\gamma$.  There is a general rule of thumb for initial guesses, but the way to nail down the best values is to use an optimizer to minimize the RMS error between a forecast and the actuals.  Major discusses all of these points in her video and shows how Excel can be used to get even better agreement.

Next month, we’ll talk about the roots of the Holt-Winter algorithm in the exponential smoother; since the smoothing is applied to three quantities (level, trend, and seasonal ratio), the algorithm is often called triple exponential smoothing.

Time Series 1 – Sequential Averaging

A hallmark of intelligence is the ability to anticipate and plan as events occur over time.  Evaluating and judging each event sequentially is how each of us lives but, perhaps ironically, this is not how most of us are trained to evaluate a time series of data.  Typically, we collect quantitative and/or qualitative data for some time span, time-tagging the value of each event, and then we analyze the entire data set as one large batch.  This technique works well for careful, albeit leisurely, exploration of the world around us.

Having an intelligent agent (organic or artificial) working in the world at large necessitates using both techniques – real time evaluation in some circumstances and batch evaluation in others – along with the ability to distinguish when to use one versus the other.

Given the ubiquity of time series in our day-to-day interactions with the world (seasonal temperatures, stock market motion, radio and TV signals) it seems worthwhile to spend some space investigating a variety of techniques for analyzing these data.

As a warmup exercise that demonstrates the nuances involved in real-time versus batch processing, consider random numbers pulled from a normal distribution with mean $\mu = 11$ and standard deviation $\sigma = 1.5$.

Suppose that we pull $N=1000$ samples from this distribution.  We would expect the sample mean, ${\bar x}$, and sample deviation $S_x$ to match the population mean and standard deviation to within about 3% (given that $1/\sqrt{1000} \approx 0.03$).  Using numpy, we can put that expectation to the test, finding for one particular set of realizations that the batch estimation over the entire 1000-sample set is ${\bar x} = 11.08$ and $S_x = 1.505$.
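A minimal numpy sketch of that batch estimate looks something like the following (the exact numbers quoted above depend on the particular realization, so your values will differ):

```python
import numpy as np

rng = np.random.default_rng()              # unseeded: results vary run to run
samples = rng.normal(loc=11.0, scale=1.5, size=1000)

x_bar = samples.mean()                     # batch sample mean
S_x = samples.std(ddof=1)                  # batch sample standard deviation
print(x_bar, S_x)                          # near 11 and 1.5, within a few percent
```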

But how would we estimate the sample mean and standard deviation as points come in?  With the first point, we would be forced to conclude that ${\bar x} = x_0$ and $S_x = 0$.  For subsequent points, we don’t need to hold onto the values of the individual measurements (unless we want to).  We can develop an iterative computation starting with the definition of an average

\[ {\bar x} = \frac{1}{N} \sum_{i=1}^N x_i \; .\]

Separating out the most recent point $x_N$ and multiplying the sum over the first $N-1$ points by $1 = (N-1)/(N-1)$ gives

\[ {\bar x} = \frac{N-1}{N-1} \frac{1}{N} \left( \sum_{i=1}^{N-1} x_i \right) + \frac{1}{N} x_N \; .\]

Recognizing that the first term contains the average ${\bar x}^{(-)}$ over the first $N-1$ points, this expression can be rewritten as

\[ {\bar x} = \frac{N-1}{N} {\bar x}^{(-)} + \frac{1}{N} x_N \; .\]

Some authors then expand the first term and collect factors proportional to $1/N$ to get

\[ {\bar x} = {\bar x}^{(-)} + \frac{1}{N} \left( x_N - {\bar x}^{(-)} \right) \; .\]

Either of these iterative forms can be used to keep a running tally of the mean.  But since the number of points in the estimation successively increases as a function of time, we should expect the difference between the running average and the final batch estimation to also be a function of time.
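In code, the second form amounts to a one-line update; the sketch below (my own naming, not from any particular library) shows how an agent might apply it as points arrive:

```python
def update_mean(x_new, mean_prev, N):
    """Running mean after the Nth point: x_bar = x_bar_prev + (x_N - x_bar_prev)/N."""
    return mean_prev + (x_new - mean_prev) / N

running_mean = 0.0
for N, x in enumerate(samples, start=1):   # samples from the numpy sketch above
    running_mean = update_mean(x, running_mean, N)
```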

While there is definitive theory that tells us the difference between the running tally and the population mean, there isn’t any theory that characterizes how it should rank relative to the batch estimation, other than the generic expectation that as the number of points used in the running tally approaches the number in the batch estimation the two should converge.  Of course, the final values must exactly agree since no approximations were made in the algebraic manipulations above.

To characterize the performance of the running tally, we look at a variety of experiments.

In the first experiment, the running tally fluctuates about the batch mean before settling in and falling effectively on top.

But in the second experiment, the running tally starts far from the batch mean and (with only the exception of the first point) stays above the batch mean until quite late in the evolution.

Looking at the distribution of the samples on the left-hand plot shows that there was an overall bias relative to the batch mean with a downward trend, illustrating how tricky real data can be.

One additional note: the same scheme can be used to keep a running tally of the average of $x^2$, allowing for a real-time update of the standard deviation from the relation $\sigma^2 = {\bar {x^2}} - {\bar x}^2$.
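A sketch of that extension keeps both running tallies and applies the relation above (this simple form is fine for illustration, though numerically more careful update formulas exist):

```python
def update_moments(x_new, mean_prev, mean_sq_prev, N):
    """Running tallies of x-bar and x^2-bar; the standard deviation follows
    from sigma^2 = mean(x^2) - mean(x)^2."""
    mean = mean_prev + (x_new - mean_prev) / N
    mean_sq = mean_sq_prev + (x_new**2 - mean_sq_prev) / N
    sigma = (mean_sq - mean**2) ** 0.5
    return mean, mean_sq, sigma
```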

As the second case demonstrates, our real time agent may have a significantly different understanding of the statistics than the agent who can wait and reflect over a whole run of samples (although there is a philosophical point as to what ‘whole run’ means).  In the months to come we’ll explore some of the techniques in both batch and sequential processing.

The Measure of Man

Recently I started re-reading Umberto Eco’s The Name of the Rose.  The action is set in the early part of the 14th century (November of 1327 to be precise) at a Benedictine abbey whose name, to quote the fictitious narrator, “it is only right and pious now to omit.”  One of the many themes of this novel, reflecting the scholastic philosophy ascendant at that time, deals with the nature of being human.  Answering that ontological/anthropological question includes, as correlative concepts, the licitness of laughter and a categorization of the essential nature of what separates us from the lower animals, which philosophers often will call the beasts.

While being removed from our time by approximately 800 intervening years, the basic question as to ‘what is man’ is still as operative today as it was then or as it ever was.  The yardsticks have changed but the question remains.

The aforementioned Scholastics crafted an image of the universe similar to what is shown below.

The universe consists of an abundant material world (contained in the blue triangle) with a hierarchy of complexity that I think a modern biologist would agree with (after he, of course, added all the still controversial divisions associated with microscopic life).  At the lowest rung is the inanimate matter that makes up the bulk of the world around us in the form of solid, liquid, gas, and plasma.  The next rung up (again skipping microscopic life, which, while important, was unknown prior to the late 1600s) consists of what was often called the vegetable kingdom, comprised of the plants and plant-like life around us.  Animals of all sorts, excepting man (philosophers typically call this subset ‘the beasts’ for brevity), comprise the next level up.  The pinnacle of this material world is occupied by humans, a point that, although some among us would wish it not to be true, is difficult to refute.

The universe also consists of an abundant spiritual world (contained in the green triangle) with a hierarchy of complexity that is far more controversial because it elusively remains beyond our physical means of detection.  For a Christian philosopher in the Middle Ages, the top of the hierarchy is the triune God composed of the Father, Son, and Holy Spirit as 3 persons in one being.  Below the Trinity, the Scholastics imagined a world teeming with angels, the composition of which is traditionally divided into three choirs each comprised of three species according to the analysis of Thomas Aquinas.  Finally, the lowest level of the spiritual realm is occupied by human beings.

Thus, a scholastic philosopher recognizes that the nature of man is a unique union between material and the spiritual, but the measure of man – what exactly separates him at that intersection from the more numerous species belonging entirely to either world – isn’t so clear. 

One might think that as Western thought transitioned from the Middle Ages into the Renaissance and eventually through the Enlightenment and into our current Industrial/Digital age that the question would lose all of its force, but it stubbornly remains; only the trappings have changed.  Where once an operative question was ‘how many angels could dance on a pinhead’ (and rightly so) we now ask questions about how many of those spiritual beings we label as ‘AI’ are needed to replace a screenwriter or a lawyer. 

So, let’s examine some of the various activities that have been put forth as ways that Man separates himself both from the beasts and, where appropriate, from the AI.

For our first activity, there is Man’s propensity to make or build.  One need only glance at one of the world’s greatest cities, say Manhattan, to be impressed with the size, scale, and variety of construction that is present.  But construction is not unique to Man, as a variety of insects, for example termites, build elaborate communal living structures.  And generative AI often returns structures far more elaborate than envisioned by most of the world’s architects (Gaudi notwithstanding).

Others have argued that language and communication are what separate Man, but then what does one do with gorilla communication or parrots who clearly ‘talk’?  And where do ChatGPT’s dialogs and replies fit into this schema?

The ability to reason is often proffered as a possibility, and the impressive amount of reasoning that has been produced, particularly in the fields of science and mathematics, seems to reflect the unique nature of Man.  But as Peter Kreeft points out in his lectures entitled Ethics: A History of Moral Thought, beasts also show a degree of reasoning.  He cites an example of a dog pursuing a hare who comes to a three-fold fork in the road.  After sniffing two of the trails with no success, the dog immediately pursues the hare down the third without bothering to sniff.  And, of course, expert systems and symbolic logic programs have been ‘reasoning’ for years and remain important components in many fields.

The list could go on, but the point should already be clear.  Much like one of Plato’s Socratic dialogs, this argument over what separates Man from the other autonomous agents that inhabit the world around him (beasts and AI) is not easy to resolve.  Clearly there is some character in Man that sets him apart as he seems to be the only entity that exhibits the ability in the material world to make moral judgements based on The Good, The True, and The Beautiful but distilling what that ability is remains elusive. 

This elusive, ineffable nature of man is symbolized in Eco’s arguments in The Name of the Rose by the continued debate within the monastic community over the nature of laughter.  The following snippet of dialog between Jorge of Burgos and William of Baskerville gives a flavor of that debate:

“Baths are a good thing,” Jorge said, “and Aquinas himself advises them for dispelling sadness, which can be a bad passion when it is not addressed to an evil that can be dispelled through boldness. Baths restore the balance of the humors. Laughter shakes the body, distorts the features of the face, makes man similar to the monkey.”

“Monkeys do not laugh; laughter is proper to man, it is a sign of his rationality,” William said.

“Speech is also a sign of human rationality, and with speech a man can blaspheme against God. Not everything that is proper to man is necessarily good.”

Wrestling with this question is by no means restricted to ‘high-browed’ historical fiction.  Consider the following snippet from the Guardian of Piri episode of Space: 1999 (aired Nov of 1975), wherein Commander Koenig gives his own answer for what sets man apart.

Clearly, we recognize that there is something innately special about Man even if we can’t nail down precisely what that means.  The debate surrounding this point, which has no doubt existed as long as Man himself has had the self-awareness to wonder about his own nature, is likely to continue for as long as Man lasts.

Mahalanobis distance

In last month’s column, we looked at the Z-score as a method of comparing members of disparate populations.  Using the Z-score, we could support the claim that Babe Ruth’s home run tally of 54 in 1920 was a more significant accomplishment than Barry Bonds’s 73 dingers in 2001.  This month, we look at an expansion of the notion of the Z-score due to P. C. Mahalanobis in 1936, called the Mahalanobis distance.

To motivate the reasoning behind the Mahalanobis distance, let’s consider the world of widgets.  Each widget has a characteristic width and height, of which we have 100 samples.  Due to manufacturing processes, widgets typically have a spread in the variation of width and height about their standard required values.  We can assume that this variation in width and height, which we will simply refer to as Delta Width and Delta Height going forward, can be either positive or negative with equal probability.  Therefore, we expect that the centroid of our distribution in Delta Width and Delta Height to essentially be at the origin.  We’ll denote the centroid’s location by a red dot in the plots that follow.

Now suppose that we have a new widget delivered.  How do we determine if this unit’s Delta Width and Delta Height are consistent with the others we’ve seen in our sample?  If we denote the new unit’s value by a green dot, we can visualize how far it is from the centroid.

For comparison, we also plotted one of our previous samples (shown in blue) that has the same Euclidean distance from the centroid as does the green point (31.8 for both in the arbitrary units used for Delta Width and Delta Height).  Can we conclude that the green point is representative of our other samples?

Clearly the answer is no, as can be seen by simply adding the other samples, shown as black points.

We intuitively feel that the blue point (now given an additional decoration of a yellow star) is somehow closer to the cluster of black points, but a computation of Z-scores doesn’t seem to help.  The Z-scores for width and height for the blue point are -2.49 and -2.82, respectively, while the corresponding values for the green point are -2.51 and 2.71.

The problem is that Delta Width and Delta Height are strongly correlated.  One strategy is to follow the methods discussed in the post Multivariate Sampling and Covariance and move to a diagonalized basis.  Diagonalizing the data leaves the variations expressed in an abstract space spanned by the variables X and Y, which are linear combinations of the original Delta Width and Delta Height values.  The same samples plotted in these coordinates delivers the following plot.

Using the X and Y coordinates as our measures, we can calculate the corresponding Z-scores for the blue point and the new green one.  The X and Y Z-scores for the blue point are now: -2.74 and 0.24.  These values numerically match our visual impression that the blue point, while on the outskirts of the distribution in the X-direction lies close to the centroid in the Y-direction.  The corresponding X and Y Z-scores for the green point are:  0.00 and 10.16.  Again, these numerical values match our visual impression that the green point is almost aligned with the centroid in the X-direction but is very far from the variation of the distribution along the Y-direction.

Before moving on to how the Mahalanobis distance handles this for us, it is worth noting that the reason the situation was so ambiguous in W-H coordinates was that when we computed the Z-scores in the W- and H-directions, we ignored the strong correlation between Delta Width and Delta Height.  In doing so, we were effectively judging Z-scores in a bounding box corresponding to the maximum X- and Y-extents of the sample rather than seeing the distribution as a tightly grouped scatter about one of the bounding box’s diagonals.  By going to the X-Y coordinates we were able to find the independent directions (i.e., eigendirections).

The Mahalanobis distance incorporates these notions, observations, and strategies by defining a multidimensional analog of the Z-score that is aware of the correlation between the variables.  Its definition is

\[ d_M = \sqrt{ \left({\mathbf O}  – {\bar {\mathbf O}} \right)^T S^{-1} \left({\mathbf O}  – {\bar {\mathbf O}} \right) } \; , \]

where ${\mathbf O}$ is a column array of the values for the current point (i.e., the new, green one), ${\bar {\mathbf O}}$ is the average across the sample, and $S^{-1}$ is the inverse of the covariance matrix.  The quantity under the radical is equivalent to the square of a Z-score, which is easily seen from the fact that $S^{-1}$ has units that are inverses of observations squared (e.g., in this case $1/\mathrm{length}^2$).  This observation also explains why the square root is needed at the end.

For the sample used in this example, the average values of W and H were:  -0.172600686508044 and 0.17708029312465146.  Using these gives the array

\[ {\bar {\mathbf O}} = \left[ \begin{array}{c} -0.172600686508044 \\ 0.17708029312465146 \end{array} \right] \; . \]

The covariance matrix was given by

\[ S = \left[ \begin{array}{cc} 78.29662631 & 63.08196815 \\ 63.08196815 & 67.35298221 \end{array} \right] \; . \]

The blue point’s observation vector is

\[ {\mathbf O}(blue) = \left[ \begin{array}{c} -22.109910673364368 \\ -22.83313653428484 \end{array} \right] \;  \]

and its Mahalanobis distance from the centroid is

\[ d_M(blue) = 2.805 \; . \]

In contrast, the green point has an observation vector of

\[ {\mathbf O}(green) = \left[ \begin{array}{c} -22.30184428079569 \\ 22.306323887412297 \end{array} \right] \;  \]

giving a Mahalanobis distance from the centroid of

\[ d_M(green) = 10.142 \; . \]
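These two distances are easy to check with a few lines of numpy using the sample statistics quoted above (a sketch for verification only, not the code that generated the figures):

```python
import numpy as np

S = np.array([[78.29662631, 63.08196815],
              [63.08196815, 67.35298221]])        # sample covariance matrix
centroid = np.array([-0.172600686508044, 0.17708029312465146])

def mahalanobis(obs, mean, cov):
    """d_M = sqrt( (O - O_bar)^T S^{-1} (O - O_bar) )"""
    d = obs - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

blue  = np.array([-22.109910673364368, -22.83313653428484])
green = np.array([-22.30184428079569, 22.306323887412297])

print(mahalanobis(blue, centroid, S))    # ~2.805
print(mahalanobis(green, centroid, S))   # ~10.14
```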

Note that the magnitudes of the components of the blue and green observation vectors are almost identical; the critical difference is the sign on the H-component.  That sign difference reflects the large deviation away from the correlation that exists in the W-H components, which shows up in the large difference in the Mahalanobis distances.

Finally, note that the Mahalanobis distance for both the blue and green point is the root-sum-square of the individual Z-scores in X-Y coordinates; this is an important point that only holds in the diagonalized coordinate system.

By exploiting the coordinate invariance of a bilinear form, the Mahalanobis distance provides the basic machinery for calculating Z-scores in the diagonalized coordinate system without the bother of actually having to carry out the diagonalization.

Baseball and Giraffes

While it may or may not be true, as Alfred Lord Tennyson penned in his poem Locksley Hall, that “In the Spring a young man’s fancy lightly turns to thoughts of love”, it is definitely true that “In the Summer a sports fan’s fancy turns heavily to thoughts of baseball”.  Long a staple of the warmer portion of the year, America’s pastime is as much an academic pursuit into obscure corners of statistical analysis as it is an athletic endeavor.  Each season, thousands upon thousands of die-hards argue about thousands upon thousands of statistics, using batting averages, on-base percentages, earned run averages, slugging percentages, and so on to defend their favorite players or to attack someone else’s favorites.  And no testament could stand more clearly for the baseball enthusiast’s love of the controversy stirred up by statistics than the movie 61*.

For those who aren’t used to seeing ‘special characters’ in their movie titles and who don’t know what significance the asterisk holds, a brief explanation is in order.  Babe Ruth, one of the best and most beloved baseball players, set many of baseball’s records, including a single-season record of 60 home runs in 1927.  As the Babe’s legend grew and the years since 1927 faded into memory without anyone mounting much of a challenge, so too grew the belief that no one could ever break his record.  In 1961, two players from the New York Yankees, Mickey Mantle and Roger Maris, hit home runs at a pace (yet another baseball statistic) to make onlookers believe that either of them might break a record dating back to before World War II and the Great Depression.  Eventually, Roger Maris broke the record in the 162nd game of the season, but since Ruth reached his mark of 60 home runs when baseball only played 154 games, the Commissioner qualified the Maris record with an ‘*’ as a means of protecting the mystique of Babe Ruth.

And for years afterwards, fans argued whether Maris really did break the record or not.  Eventually, Maris’s record would also fall, first to Mark McGwire, who hit 70 ‘dingers’ (baseball lingo for home runs) in 1998, and then to Barry Bonds, who hit 73 in 2001.  What’s a baseball fan to do?  Should McGwire and Bonds receive an ‘*’?  How do we judge?

At first glance, one might argue that the best way to do it would be to normalize for the number of games played.  For example, Ruth hit 60 HRs (home runs – an abbreviation that will be used hereafter) in 154 games, so his rate was 0.3896.  Likewise, Bonds’s rate is 0.4506.  And so, can we conclude that Bonds is clearly the better home run hitter?  Not so fast, the purist will say: Bonds and McGwire hit during the steroids era in baseball, when everyone, from the players on the field to the guy who shouts ‘Beer Here!’ in the stands, was juicing (okay… so maybe not everyone).  Should we put stock in the purist’s argument or has Bonds really done something remarkable?

This is a standard problem in statistics when we try to compare two distinct populations, be they separated in time (1920s, 1960s, 1990s) or geographically (school children in the US v. those in Japan) or even more remotely separated, for example by biology.  The standard solution is to use a Z-score, which normalizes the attributes of a member of a population to its population as a whole.  Once the normalization is done, we can then compare individuals in these different populations to each other.

As a concrete example, let’s compare Kevin Hart’s height to that of Hiawatha the Giraffe.  The internet lists Kevin Hart as being 5 feet, 2 inches tall, and let us suppose that Hiawatha is 12 feet, 7 inches tall.  If we are curious which of them would be able to fit through a door with a standard US height of 6 feet, 8 inches, then a simple numerical comparison shows us that Hart will while Hiawatha won’t.  However, this is rarely what we want.  When we ask which is shorter or taller, it is often the case that we want to know if Hart is short as a man (he is) and if Hiawatha is short as a giraffe (she is as well).  So, how do we compare?

Using the Z-score (see also the post K-Means Data Clustering), we convert the individual’s absolute height to a population-normalized height by the formula

\[ Z_{height} = \frac{height - \mu_{height}}{\sigma_{height}} \; , \]

where $\mu_{height}$ is the mean height for a member of the population and $\sigma_{height}$ is the corresponding standard deviation about that mean.

For the US male, the appropriate statistics are a $\mu_{height}$ of 5 feet, 9 inches and a $\sigma_{height}$ of 2.5 inches.  Giraffe height statistics are a little hard to come by, but according to the Denver Zoo, females range between 14 and 16 feet.  Assuming this to be a 3-sigma bound, we can reasonably assume a $\mu_{height}$ of 15 feet with a $\sigma_{height}$ of 4 inches.  A simple calculation then yields a $Z_{height} = -2.8$ for Hart and a $Z_{height} = -7.25$ for Hiawatha, showing clearly that, relative to her own population, Hiawatha is shorter than Kevin Hart, substantially shorter.
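Working in inches, the arithmetic is a one-liner; a quick sketch to reproduce the two numbers above:

```python
def z_score(value, mu, sigma):
    """Population-normalized value: (value - mean) / standard deviation."""
    return (value - mu) / sigma

# heights converted to inches
print(z_score(62, 69, 2.5))   # Kevin Hart (5'2" vs mean 5'9", sigma 2.5"): -2.8
print(z_score(151, 180, 4))   # Hiawatha (12'7" vs mean 15', sigma 4"):     -7.25
```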

Now, returning to baseball, we can follow a very interesting and insightful article by Randy Taylor and Steve Krevisky entitled ‘Using Mathematics And Statistics To Analyze Who Are The Great Sluggers In Baseball’.  They looked at a mix of hitters from both the American and National leagues of Major League Baseball, scattered over the time span from 1920-2002, and recorded the following statistics (which I’ve ordered by date).

Year  Hitter                               League  HR  Mean HR  HR Standard Deviation  Z
1920  Babe Ruth                            AL      54  4.85     7.27                   6.76
1921  Babe Ruth                            AL      59  6.05     8.87                   5.97
1922  Rogers Hornsby                       NL      42  6.31     7.18                   4.97
1927  Babe Ruth                            AL      60  5.42     10.05                  5.43
1930  Hack Wilson                          NL      56  10.86    11.2                   4.03
1932  Jimmie Foxx                          AL      58  8.59     10.32                  4.79
1938  Hank Greenberg                       AL      58  11.6     11.88                  3.91
1949  Ralph Kiner                          NL      54  10.87    9.78                   4.41
1954  Ted Kluszewski                       NL      49  14.01    11.7                   2.99
1956  Mickey Mantle                        AL      52  13.34    9.39                   4.12
1961  Roger Maris                          AL      61  15.01    12.34                  3.73
1967  Carl Yastrzemski / Harmon Killebrew  AL      44  11.87    8.99                   3.57
1990  Cecil Fielder                        NL      51  11.12    8.74                   4.56
1998  Mark McGwire                         NL      70  15.44    12.48                  4.37
2001  Barry Bonds                          NL      73  18.03    13.37                  4.11
2002  Alex Rodriguez                       NL      57  15.83    10.51                  3.92

Interestingly, the best HR season, in terms of Z-score, is Babe Ruth’s 1920 season, where he ‘only’ hit 54 HRs.  That year, he stood over 6.5 standard deviations from the mean, making it far more remarkable than Barry Bonds’s 73-HR season.  Sadly, Roger Maris’s 61*-HR season is one of the lower HR championships on the list.

To be clear, it is unlikely that any true, dyed-in-the-wool baseball fan will be swayed by statistical arguments.  After all, where is the passion and grit in submitting an opinion to the likes of cold, impartial logic?  Nonetheless, the Z-score is not only useful in its own right but also serves as a launching pad for more sophisticated statistical measures.

To Winograd or not to Winograd

Since its inception, a common theme in this column has been the nuances and ambiguities of natural language.  There are several reasons for this focus, but the two most important ones are that being able to handle linguistic gray areas is a real test for machine intelligence and that by looking at how computer systems struggle with natural language processing we gain a much better appreciation of how remarkable the human capacity to speak really is.

Past columns have focused mostly on equivocation in various forms, with an emphasis on humor (Irish Humor, Humorous Hedging, Yogi Berra Logic, and Nuances of Language) and context-specific inference (Teaching a Machine to Ghoti and Aristotle on Whiskey).  But the ‘kissing-cousin’ field of the Winograd Schema remained untouched because it had remained unknown.  A debt of gratitude is thus owed to Tom Scott for his engaging video The Sentences Computers Can’t Understand, But Humans Can, which opened this line of research into natural language processing by machine to me.

When Hector Levesque proposed it in 2011, he designed the Winograd Schema Challenge (WSC) to address perceived deficiencies in the Turing test by presenting a challenge requiring ‘real intelligence’ rather than the application of trickery and brute force.  The accusation of trickery came particularly into view due to the Eugene Goostman chatbot, a system that portrayed itself as a 13-year-old boy from Odesa, Ukraine, having fooled roughly 30% of the human judges in a large Turing Test competition in 2014.  To achieve this, the Wikipedia article maintains that the bot used ‘personality quirks and humor to misdirect’, which basically means that the judges were conned by the creators.  The idea of a confidence man pulling the wool over someone’s eyes probably never occurred to Alan Turing nor to the vast majority of computer scientists, but anyone who’s seen a phishing attempt is all too familiar with that style of chicanery.

The essence of the WSC is to ask a computer intelligence to resolve a linguistic ambiguity that requires more than just a grammatical understanding of how the language works (syntax) and the meaning of the individual words and phrases (semantics).  Sadly, the WSC doesn’t focus on equivocation (alas, the example below will not be anything more than incidentally humorous) but rather on what linguists call an anaphora, an expression whose meaning must be inferred from an earlier part of the sentence, paragraph, etc.

A typical example of an anaphora involves the use of a pronoun in a sentence such as

John arrived at work on time and entered the building promptly but nobody claimed to have seen him enter.

Here the pronoun ‘him’ is the anaphora and is understood to have John as its antecedent. 

The fun of the WSC is in creating sentences in which the antecedent can only be understood contextually because the construction is ambiguous.  One of the typical examples used in explaining the challenge reads something like

The councilmen refused to grant the protestors a permit to gather in the park because they feared violence.

The question, when posed to a machine, would be to ask if the ‘they’ referred to the councilmen or the protestors.  Almost all people would find no ambiguity in that sentence because they would argue that the protestors would be the ones making the ruckus and that the councilmen, either genuinely worried about their constituents or cynically worried about their reputations (or a mix of both), would be the ones to fear what might happen.

Note that the sentence easily adapts itself to other interpretations with only the change of one word.  Consider the new sentence

The councilmen refused to grant the protestors a permit to gather in the park because they threatened violence.

Once again, a vast majority of people would now say that the ‘they’ referred to the protestors because the councilmen would not be in the position to threaten violence (although things may be changing on this front).

The idea here is that the machine would need not just the ability to analyze syntax with a parser and the ability to look up words with a dictionary; it would also need to reason, and that reasoning would need to be broad rather than narrowly focused.  The relationships between concepts would be varied and far-ranging, as with the sentence

The trophy couldn’t fit into the suitcase because it was too large.

Here the ontology would center on spatial reasoning, the ideas of ‘big’ and ‘little’, and the notion that suitcases usually contain other objects. 

These types of ambiguous sentences seem to be part and parcel of day-to-day interactions.  For example, the following comes from Book 5 of the Lord of the Rings

The big orc, spear in hand, leapt after him. But the tracker, springing behind a stone, put an arrow in his eye as he ran up, and he fell with a crash. The other ran off across the valley and disappeared.

This scene, which takes place after Frodo and Sam have managed to escape the tower of Cirith Ungol, is between a large fighter orc and a smaller tracker.  Simple rules of syntax might lead the machine to believe that the ‘he’ in the second sentence would have ‘the tracker’ as its antecedent.  I doubt any human reader was fooled.

The complexity of the WSC is not limited to only two objects.  Consider the following example taken from the Commonsense Reasoning ~ Pronoun Disambiguation Problems database:

I asked Dave to get me my sweater from the other side of the yacht. While he was gone, I rested my arm on the rail over there and suddenly it gave way.

The machine is to pick from the multiple choices (a) sweater (b) yacht (c) arm (d) rail.  Again, I doubt that any person of sufficient age would be confused by this example, and so it is worth wondering why.  We are literally surrounded with ambiguous expressions every day.  Casual speech thrives on these corner-cutting measures.  Even our formal writings are not immune; there are numerous examples of these types of anaphoras in all sorts of literature, with the epistles of St. Paul ranking up there for complexity and frequency.  Humans manage quite nicely to deal with them – a tribute to the innate intelligence of the species as a whole.

But knowing humans to be intelligent and observing them being able to deal with these types of ambiguity does not mean the converse is true.  Being able to pass the WSC does not mean the agent is necessarily smart.  The argument for this conclusion comes, disappointingly, from the fact that the challenge was overcome by various algorithms within 7 years of its proposal.  Sampling the associated papers, the reader will soon find that much of the magic comes from a different flavor of statistical association, indicating that the real intellect resides in the algorithm designer.  This is a point raised in The Defeat of the Winograd Schema Challenge by Vid Kocijan, Ernest Davis, Thomas Lukasiewicz, Gary Marcus, and Leora Morgenstern.  To quote from section 4 of their paper:

The Winograd Schema Challenge as originally formulated has largely been overcome. However, this accomplishment may in part reflect flaws in its formulation and execution. Indeed, Elazar et al. (2021) argue that the success of existing models at solving WSC may be largely artifactual. They write:

We provide three explanations for the perceived progress on the WS task: (1) lax evaluation criteria, (2) artifacts in the datasets that remain despite efforts to remove them, and (3) knowledge and reasoning leakage from large training data.

In their experiments, they determined that, when the form of the task, the training regime, the training set, and the evaluation measure were modified to correct for these, the performance of existing language models dropped significantly.

At the end of the day, I think it is still safe to say that researchers are clever in finding ways to mimic intelligence in small slices of experience but that nothing still approaches the versatility and adaptability of the human mind.

Visions of Clustering

In a column from a few months ago (Fooling a Neural Network), I discussed the fact that human beings, regardless of mental acuity, background, or education, see the world in a fundamentally different way than neural networks do.  Somehow, we are able to perceive the entire two-dimensional structure of an image (ignoring for the time being the stereoscopic nature of our vision that leads to depth perception) mostly independent of the size, orientation, coloring, and lighting.  This month’s column adds another example of the remarkable capabilities of human vision when compared to a computer ‘vision’ algorithm in the form of an old algorithm with a new classification.

The Hoshen-Kopelman algorithm was introduced by Joseph Hoshen and Raoul Kopelman in their 1976 paper Percolation and Cluster Distribution. I. Cluster Multiple Labeling Technique and Critical Concentration Algorithm.  The original intent of the algorithm was to identify and label linked clusters of cells on a computational grid as a way of finding spanning clusters in the percolation problem.  A spanning cluster is a set of contiguous cells, linked through orthogonal nearest neighbors, that reaches from one boundary to the opposite side.  The following image shows a computational 7×7 grid where the blue squares represent occupied sites and the white squares vacant ones.  Within this grid is a spanning cluster with 24 linked cells that reaches from the bottom of the grid to the top and which also spans the left side of the grid to the right.

The original intention of percolation models was to study physical systems such as fluid flow through irregular porous materials like rock or the electrical conduction of a random arrangement of metals in an insulating matrix.  But clearly the subject has broader applicability to various problems where the connection topology of a system of things is of interest, such as a computer network or board games.  In fact, the configuration shown above would be a desirable one for the charming two-player game The Legend of Landlock.

Now the human eye, beholding the configuration above, has no problem finding, with a few moments of study, that there are 4 distinct clusters: the large 24-cell one just discussed plus two 3-cell, ‘L’-shaped ones in the upper- and lower-left corners and a 2-cell straight one along the top row.  This is a simple task to accomplish for almost anyone older than a toddler.

Perhaps surprisingly, ‘teaching a machine’ (yes this old algorithm is now classified – at least according to Wikipedia – as a machine learning algorithm) to identify clusters like these is a relatively complicated undertaking requiring rigorous analysis and tedious bookkeeping.  The reasons for this speak volumes for how human visual perception differs sharply from machine vision.  However, once the machine is taught to find clusters and label them, it tirelessly and rapidly tackles problems humans would find boring and painful.

To understand how the algorithm works and to more clearly see how it differs essentially from how humans perform the same task, we follow the presentation of Harvey Gould and Jan Tobochnik from their book An Introduction to Computer Simulation Methods: Applications to Physical Systems – Part 2. The grid shown above is taken from their example in Section 12.3 cluster labeling (with some minor modifications to suit more modern approaches using Python and Jupyter). 

Assuming that the grid is held in a common 2-dimensional data structure, such as a numpy array, we can decorate the lattice with row and column labels that facilitate finding a grid cell.

We start our cluster labeling at 1 and we use a cluster-label variable to keep track of the next available label when the current one is assigned.  The algorithm starts in the cell in the upper left corner and works its way, row by row, across and down, looking at the cells immediately up and to the left to see if the current one is linked to cells in the previous row or column.  This local scanning may seem simple enough, but generically there will be linkages later on that coalesce two distinct clusters into a new, larger, single cluster, and this possibility is where the subtlety arises.  This will become much clearer as we work through the example.

Since the cell at grid (0,0) is occupied, it is given the label of ‘1’ and the cluster-label variable is incremented to ‘2’.  The algorithm then moves to cell (0,1).  Looking left, it finds that the current cell is occupied and that it has a neighbor to the left, and so it too is labeled with cluster label ‘1’.  The scan progresses across and down the grid until we get to cell (1,5).

Note that at this point, the human eye clearly sees that the cells at (1,4) and (1,5) are linked neighbors and that the assignment of the cluster label ‘4’ to cell (1,4) was superfluous since it really is a member of the cluster with label ‘3’ but the computer isn’t able to see globally. 

The best we can do with the machine is to locally detect when this happens and keep a ‘second set of books’ in the form of an array that lets the algorithm know that cluster label ‘4’ is really a ‘3’.  Gould and Tobochnik call the cluster label ‘4’ improper and the label ‘3’ proper, but raw and proper are better terms.

When the algorithm has finished scanning the entire grid, the raw cluster labels stand as shown in the following image.

Note that the largest cluster label assigned is ‘9’ even though there are only 4 clusters.  The association between the raw labels and the proper ones is given by:

{1: 1, 2: 2, 3: 3, 4: 3, 5: 3, 6: 3, 7: 7, 8: 3, 9: 3}

Any raw labels that are identical with their proper labels are distinct and proper with no additional work.  Those raw labels that are larger in value than their proper labels are part of the cluster with the smaller label.  The final step for this example is cosmetic and involves relabeling the raw labels ‘4’, ‘5’, ‘6’, ‘8’, and ‘9’ as ‘3’ and then renumbering the ‘7’ label as ‘4’.  All of this is done with some appropriate lookup tables.

One thing to note: because ‘8’ linked to ‘3’ in row 5, ‘8’ had ‘3’ as a proper label before ‘9’ linked to ‘8’, and thus, when that linkage was discovered, the algorithm immediately assigned a proper label of ‘3’ to ‘9’.  Sometimes the linkages can be multiple layers deep.

Consider the following example.

By the time the algorithm has reached cell (5,8) it has assigned cluster label ‘10’ to cell (5,6) but has associated with it a proper label of ‘9’ based on its subsequent discovery in cell (5,7).  Only when the machine visits cell (5,8) does it discover that cluster label ‘9’ is no longer proper and should now be associated with the proper label ‘6’.  Of course, the algorithm ‘knows’ this indirectly in that it has the raw label ‘10’ associated with the label ‘9’, which is now associated with the proper label ‘6’.  Thus the association between raw and proper labels is now a two-deep tree (the key entries being 10: 9 and 9: 6):

{1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 1, 9: 6, 10: 9, 11: 11, 12: 12}
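Resolving such a chain only requires following the links until a label maps to itself.  A minimal sketch of that lookup in Python (the full implementation lives in the Kaggle notebook linked at the end of this post):

```python
def proper_label(raw, links):
    """Follow the raw-to-proper links until a label maps to itself."""
    while links[raw] != raw:
        raw = links[raw]
    return raw

links = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 1, 9: 6, 10: 9, 11: 11, 12: 12}
print(proper_label(10, links))   # follows 10 -> 9 -> 6 and prints 6
```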

Again, a human being simply doesn’t need to go through these machinations (at least not consciously) further emphasizing the gap between artificial intelligence and the real one we all possess.

(Code for my implementation of the Hoshen-Kopelman algorithm in python can be found in the Kaggle notebook here.)

Is True Of – Part 2 Linguistic Analog

This month’s post is part 2 of 2 of the exploration of the YouTube video entitled Russell’s Paradox – a simple explanation of a profound problem by Jeffery Kaplan.  What we discussed last month was the summary of Russell’s paradox in which we found that the set of all ordinary sets defined as

\[ {\mathcal Q} = \{ x | x \mathrm{\;is\;a\;set\;that\;does\;not\;contain\;itself} \} \; \]

creates a paradox.  If we assume ${\mathcal Q}$ does not contain itself (i.e., it is ordinary) then the membership comprehension ‘is a set that does not contain itself’ instructs us that ${\mathcal Q}$ in fact does contain itself.  Alternatively, if we assume that ${\mathcal Q}$ does contain itself (i.e., it is extraordinary) then membership comprehension instructs us that it doesn’t. 

This type of self-referential paradox mirrors other well-known paradoxes that arise in linguistics such as the liar’s paradox or the concept of the self-refuting idea.  What makes Kaplan’s analysis interesting (whether or not it originates with him) is the very strong formal analogy that he draws between the common act of predication that we all engage in nearly continuously, and the more abstruse structure of Russell’s paradox that few among us know or care about.

The heart of Kaplan’s analogy is the explicit mapping of the ‘contains’ idea from set theory – that sets contain members or elements, some of which are sets, including themselves – with the ‘is true of’ idea of predication. 

To understand what Kaplan means by the ‘is true of’, we will again follow the structure of his examples with some minor verbal modifications to better suit my own taste.

Predication is the act of saying something about the subject of a sentence by giving a condition or attribute belonging to the subject.  In the following sentence

Frodo is a brave hobbit.

The subject of the sentence is “Frodo” and the predicate is “is a brave hobbit”.  The predicate “is a brave hobbit” is true of Frodo, as anyone who’s read Lord of the Rings can attest.  Kaplan then points out that the first basic rule of naïve set theory, which he states as

Rule #1 of sets: there is a set for any imaginable collection of a thing or of things.

has, as its formal analogy in predication, the following:

Rule #1 of predication: there is a predicate for any imaginable characteristic of a thing.

The two key rules of set theory that lead to Russell’s paradox have their analogs in predication as well.   

Rule #10 of sets, which allows us to have sets of sets, is mirrored by Rule #10 of predication that tells us we can predicate things about predicates.  As an example of this, consider the following sentence:

“Is a Nazgul” is a terrifying thing to hear said of someone.

The predicate “Is a Nazgul” is the subject of that sentence and “is a terrifying thing to hear said of someone” is the predicate.

Rule #11 of sets, which allows sets to contain themselves (i.e., self-reference), finds its analog in Rule #11 of predication that tells us that predicates can be true of themselves.

Here we must proceed a bit more carefully.  Let’s start with a simple counterexample:

“Is a hobbit” is a hobbit.

This sentence is clearly false: the subject, the predicate “Is a hobbit”, is not itself a hobbit; it is a predicate.  But now consider the following sentence, which Kaplan offers:

“Is a predicate” is a predicate.

This sentence is clearly true: the subject, the predicate “Is a predicate”, is indeed a predicate.  And, so, Rule #11 of predication works.

Kaplan then constructs a table similar to the following (again, only minor verbal tweaks are made, for the predicates that are not true of themselves, to suit my own taste):

Predicates not true of themselves | Predicates true of themselves
“is a brave hobbit”               | “is a predicate”
“is a Nazgul”                     | “is a string of words”
“keeps his oaths”                 | “typically comes at the end of a sentence”

Note that the predicate “is true of itself” is a predicate that is true of all the predicates that are true of themselves, that is to say, of all the predicates that can be placed in the right column of the table above.  The next step is then to ask what predicate is true of all the predicates that can be placed in the left column of the table.  A little reflection should satisfy us that the predicate “is not true of itself” fits the bill.

The final step is to ask in which of the two columns “is not true of itself” falls, or, in other words,

is “is not true of itself” true of itself?

If we assume that it is true of itself, then the content found between the quotes tells us that it is not true of itself.  Equally vexing, if we assume that it is not true of itself, then, since that assumption matches the content found between the quotes, the predicate is true of itself.  In summary: if it is then it isn’t, and if it isn’t then it is.  And we’ve generated the predicate analog to Russell’s paradox.
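
A purely illustrative way of watching the argument chase its own tail (my own sketch, not Kaplan’s) is to model a predicate as a Python function that takes a predicate and returns True or False:

# “Is a predicate” is true of itself; asking whether “is not true of
# itself” is true of itself never settles on an answer.
def is_a_predicate(p):
    return callable(p)

def is_not_true_of_itself(p):
    return not p(p)

print(is_a_predicate(is_a_predicate))   # True – it sits in the right column

try:
    is_not_true_of_itself(is_not_true_of_itself)
except RecursionError:
    print("no consistent answer – the evaluation never terminates")

The RecursionError is, of course, only a computational stand-in for the logical fact that neither truth value can be consistently assigned.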

Of course, this is just a form of the well-known Liar’s Paradox, so we might be willing to shrug it off as a quirk of language, but I think Kaplan is making a deeper point that is worth considering carefully.  At the root of his analysis is the realization that there are objective rules (or truths), that these rules generate self-referential paradoxes, and, so, one is forced to recognize that paradoxes are an essential ingredient not just of language but of thought itself.  And no amount of patching, such as was done to naïve set theory, can rescue us from this situation.  This observation, in turn, has the profound philosophical implication that there is only so far that logic can take us.

Is True Of – Part 1: Russell Redux

This month’s post and the next one are based on the YouTube video entitled Russell’s Paradox – a simple explanation of a profound problem by Jeffery Kaplan.  In that video, Kaplan links the well-known mathematical paradox found by Bertrand Russell with familiar linguistic paradoxes in what a friend characterized as a “stimulating alternative perspective”.

Russell’s paradox shook the foundations of mathematics by casting doubt on whether it is possible to logically establish mathematics as an objective discipline with rules that reflect something beyond human convention or construction.  And while, in the aftermath, a variety of patches were proposed that sidestep the issue by eliminating certain constructions, Kaplan’s essential point is that the same mental processes that lead to Russell’s paradox lead to logical paradoxes in natural language.  These natural language paradoxes, in turn, reflect something deeper in how we think and, as a result, we can’t sidestep these processes in everyday life the way they are currently and narrowly sidestepped in mathematics.  Paradoxes seem to be built into the fabric of human existence.

To be clear, his essential point is not novel, and the connections that exist in human thought between formal logic, mathematical logic, and linguistics have been covered within a variety of contexts, including the liar’s paradox.  What is intriguing about Kaplan’s analysis is the method he employs to make this point using the basic function of predication within natural language.  I don’t know whether his argument is truly one of his own making or whether it originates elsewhere, but it is a clever and particularly clear way of seeing that logic can only carry one so far in the world.

Following Kaplan, we’ll start with a review of Russell’s paradox.  The paradox arises in naive set theory as a function of three basic ideas.  First is the notion of a set as a collection of any objects of our perception or of our thought.  Second is the idea that set composition is unlimited in scope: members can be concrete or intangible, real or imagined, localized or spread over time and space.  Kaplan calls this idea Rule Number 1 – Unrestricted Composition.  Third is the idea that what matters for a set is what elements it contains, not how those elements are labeled.  Set membership can be determined by a specific listing or by specifying some comprehension rule that describes the members globally.  Kaplan calls this idea Rule Number 2 – Set Identity is Determined by Membership.

Using Rules 1 and 2, Kaplan then outlines a set of rules 3-11, each springing in some way from Rules 1 or 2 as a parent, which is indicated inside the parentheses:

  • Rule 3 – Order Doesn’t Matter (Rule 2)
  • Rule 4 – Repeats Don’t Change Anything (Rule 2)
  • Rule 5 – Description Doesn’t Matter (Rule 2)
  • Rule 6 – The Union of Any Sets is a Set (Rule 1)
  • Rule 7 – Any Subset is a Set (Rule 1)
  • Rule 8 – A Set Can Have Just One Member (Rule 1)
  • Rule 9 – A Set Can Have No Members (Rules 1 & 2)
  • Rule 10 – You Can Have Sets of Sets (Rule 1)
  • Rule 11 – Sets Can Contain Themselves (Rule 1)

Kaplan walks the viewer, in an amusing way, through increasingly complex set construction examples using these 11 rules, although I’ll modify the elements used in the examples to be more to my liking.  His construction starts with an example of a finite, listed set:

\[ A = \{ \mathrm{Frodo}, \mathrm{Sam}, \mathrm{Merry}, \mathrm{Pippin} \} \;  .\]

 Employing Rule 2 allows us to rewrite set $A$ in an equivalent way as

\[ A = \{ x | x \mathrm{\;is\;a\;hobbit\;in\;the\;Fellowship\;of\;the\;Ring} \} \; .\]
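
As a loose concrete illustration (my own, not Kaplan’s), Python’s frozenset respects a few of these rules out of the box, although it cannot capture Rule 11 since a Python set is not allowed to contain itself:

# Rule 3 (order doesn’t matter) and Rule 4 (repeats change nothing)
A = frozenset({"Frodo", "Sam", "Merry", "Pippin"})
print(A == frozenset({"Pippin", "Merry", "Sam", "Frodo"}))           # True
print(A == frozenset({"Frodo", "Frodo", "Sam", "Merry", "Pippin"}))  # True

# Rule 10 (you can have sets of sets): frozensets are hashable, so one
# can be a member of another
B = frozenset({A})
print(A in B)                                                        # True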

Kaplan then gives the famous example of a set used by Gottlob Frege and Bertrand Russell as a candidate for the fundamental definition of what the number 1 is:

\[ \{ x | x \mathrm{\;is\;a\;singleton\;set} \} \; .\]

Here we have a set that, if listed explicitly, might start as

\[ \{ \{\mathrm{\scriptsize Frodo}\}, \{\mathrm{\scriptsize Sam}\}, \{\mathrm{\scriptsize Merry}\}, \{\mathrm{\scriptsize Pippin}\}, \{\mathrm{\scriptsize Chrysler\;Building}\} \ldots \} \; .\]

All of this seems plausible, if not particularly well motivated, but the wheels fall off when we look at Rule 11.  That rule is deceptively simple in that it is easy to say the words ‘sets can contain themselves’ but ultimately difficult, perhaps impossible, to understand what those words mean, as there exists no constructive way to actually build a set that contains itself.  Membership is specified by a comprehension rule expressed as a sentence; any sentence will do and, quoting Kaplan: “If you can think of it, you can throw it in a set.”  We’ll return to that sentiment next month when we talk about Kaplan’s language-based analog to Russell’s paradox.  We’ll call such a Rule-11 set an extraordinary set and any set not containing itself ordinary.  The next step is then to define the set of all extraordinary sets as

\[ \{ x | x \mathrm{\;is\;a\;set\;that\;contains\;itself} \} \; .\]

This set, while having no constructive way of being listed, is still considered a valid set by naive set theory.  As preposterous as this set’s existence may be, it generates no paradoxes.  The real heartbreaker is the set defined as

\[ \{ x | x \mathrm{\;is\;a\;set\;that\;does\;not\;contain\;itself} \} \; .\]

This set of all ordinary sets has no well-defined answer to the question of whether it contains itself.  If we assume it does contain itself, then it must obey the comprehension rule ‘does not contain itself’, so it doesn’t contain itself.  Alternatively, if we assume it does not contain itself, then it satisfies that very comprehension rule and so must be a member of itself; that is, it does contain itself.  And, thus, the paradox is born.
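
In symbols (my compact restatement using standard membership notation, not notation from Kaplan’s video), writing ${\mathcal Q}$ for this set of all ordinary sets, the two halves of the argument collapse into a single biconditional:

\[ {\mathcal Q} = \{ x | x \notin x \} \quad \Rightarrow \quad {\mathcal Q} \in {\mathcal Q} \iff {\mathcal Q} \notin {\mathcal Q} \; . \]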

Mathematicians have patched set theory by ‘changing the rules’ (as Kaplan puts it).  They developed different systems wherein Rule 11 or some equivalent is expressly forbidden (e.g., Zermelo-Fraenkel with the Axiom of Choice). 

But, Kaplan objects, the rules of set theory are not made up but are objective rules that reflect the objective, common rules of thought and language that we all use.  He gives a linguistic argument based on the act of predication in natural language to make this point.  That analysis is the subject of next month’s posting.