Analog Logic in the Digital Age

Latest Posts

Time Series 6 – Recursive Least Squares

Conrad SchiffMay 24, 2024

The last post introduced the notion of the recursive least squares estimation approach as a natural precursor to the Kalman filter and derived the Kalman-like gain that governed the state estimation update equation. Before moving on to a simple example of the Kalman filter, it seemed prudent to show the recursive least squares filter in action with a numerical example.

The example involves a mass falling in an unknown uniform gravitational field with unknown initial conditions. At regular time intervals, a measurement device measures the height of the mass above the ground but only to within certain precision due to inherent noise in it measuring processes. The challenge is then to deliver the best estimate of the unknowns (initial height, initial speed, and the uniform gravitational acceleration). We will assume that, beyond a measurement of the initial height, the other initial conditions are inaccessible to the measurement device since it only measures range. This example is adapted from the excellent article Introduction to Kalman Filter: Derivation of the Recursive Least Squares Method by Aleksander Haber.

To prove that the recursive least squares method does generate a real least squares estimation, let’s first look at how a batch least squares approach to this problem would be performed.

The dynamics of the falling mass are given by

\[ d(t) = d_0 + v_0 t + \frac{1}{2} a t^2 \; , \]

where $d(t)$ is the height above the ground at a given instant.

We assume that the above equation is exact; in other words, there are no unknown forces in the problem. In the language of Kalman-filtering and related optimal estimation, there is said to be no process noise, with it understood that the word ‘process’ is a genetic umbrella meaning the dynamics by which the state evolves in time, for which Newton’s laws are an important subset.

By design, measurements come in every $\Delta t$ seconds and so the time evolution of the fall of the mass is sampled at times $t_k = k \Delta t$. At measurement time $t_k$ the amount the mass has dropped is

\[ x_k = x_0 + v_0 k \Delta t + \frac{1}{2} k^2 \Delta t^2 a \; \]

and the corresponding measurement is

\[ z_k = x_k + \eta_k \; , \]

where $\eta_k$ is the noise the measuring device introduces in performing its job.

The batch least squares estimation begins by writing the evolution equation in a matrix form

\[ z_k = \left[ \begin{array}{ccc} 1 & k \Delta t & \frac{1}{2} k^2 \Delta t^2 \end{array} \right] \left[ \begin{array}{c} x_0 \\ v_0 \\ a \end{array} \right] + \eta_k \equiv H_{k,p} x_p + \eta_k \; . \]

The array $x_p$ contains the constant initial conditions and unknown acceleration that is being estimated.

For this numerical experiment, $x_0 = -4m$, $v_0 = 2 m/s$ (positive being downward), and $a = 9.8 m/s^2$ the usual value for surface of the Earth (give or take regional variations). Measurements are taken every $\Delta t = 1/150 s$ over a time span of 14.2 seconds. To illustrate the meaning of $H_{k,p}$, the value of the matrix at $k=0$ is

\[ H_{k=0,p} = \left[ \begin{array}{ccc} 1 & 0 & 0 \end{array} \right] \; \]

corresponding to an elapsed time of $t_k = 0$ and at $k = 75$

\[ H_{k=0,p} = \left[ \begin{array}{ccc} 1 & 0.5 & 0.125 \end{array} \right] \; \]

corresponding to an elapsed time of $t_k = 0.5$.

The merit of this matrix approach is that by putting every measurement into a batch we can write the entire measured dynamics in a single equation as:

\[ z = H x + \eta \; . \]

The array $z$ has $M$ elements (for the number of measurement), where $M$ is the maximum value of the time index $k$ (for this specific example $M$ = 2131). The matrix $H$ is a $M \times n$ where $n$ is the number of estimate parameters ($n$ = 3, in this case) and $x$ is the array of parameters to be estimated). The array $\eta$ represents the $M$ values of random noise (assumed to be stationary) that were present in the measurements. The values of $\eta$ used in this numerical example were drawn from a normal distribution with a zero mean and a standard deviation of 1; the latter value determines the value of $R_k = 1$.

Since, in the matrix equation above, the matrix $H$ isn’t square, it may not be obvious how to solve for $x$. However, there is a well-known technique using what is known as the Moore-Penrose pseudoinverse.

In short, to solve for $x$ in a least squares sense, we set $\eta$ to zero (i.e., we ignore the noise assuming the classical argument that with enough measurements the influence of the noise will average to zero). We then left-multiply by the transpose of $H$ to get

\[ H^T z = (H^T H) x \; .\]

The quantity $H^T H$ is a $(n \times K) \cdot (K \times n) = n \times n $ matrix (in this case $3\times 3$, which is invertible. We then arrive at the compact form of the estimate for $x$ as

\[ x_{est} = (H^T H)^{-1} H^T z \; .\]

For this numerical experiment, the resulting estimate is

\[ x_{est} = \left[ \begin{array}{c} -3.95187658 \\ 1.99300377 \\ 9.80040125 \end{array} \right] \; .\]

This is the same value one gets by using a package like numpy’s linalg.lstsq.

In the recursive method, all of the data are not stored and handled at once but rather each point is taken and processed as it comes in.

The algorithm developed in the last post is summarized as:

Start each step with the current time $t_k$, measurement $z_k$, and the previous estimates of the state $x_{k-1}$ and the covariance $P_{k-1}$
Perform the $k$-th calculation of the gain: $K_k = P_{k-1} \cdot H_k^T \cdot S_k^{-1}$, where $S_k = H_k \cdot P_{k-1} \cdot H_k^T + R_k$
Make the $k$-th estimate of the unknowns: ${\hat x}_k = {\hat x}_{k-1} + K_k \left( z_k – H_k {\hat x}_{k-1} \right)$
Make the $k$-th update of the covariance matrix: $P_k = \left(I – K_k H_k \right)P_{k-1} $

Using this algorithm, the time evolution of the estimated initial position is

with a final value of -3.95170034.

Likewise, the time evolution of the estimated initial speed is

with a final value of 1.99295453.

And, finally, the time evolution of the estimated acceleration is

with a final value of 9.80040698.

These estimated values, up to the numerical noise introduced by a different set of floating point operations, are identical with the batch least squares numbers.

Uncategorized

May

2024

Time Series 5 – Deriving a Kalman-like Gain

Conrad SchiffApril 26, 2024

Last month’s blog presented one of the most common approaches to handling time series data in real time known as the Kalman gain. As noted, the form of the state update equations is formally

\[ x_{new} = x_{old} + K ( obs_{new} – obs_{pred} ) \; , \]

where $x$ is the state being estimated and $obs$ is some observation made of the state. The subscripts on the observation terms are ‘new’ for the latest actual observation (made by some instrument) and ‘pred’ for what the expected predicted observation would be before we made it.

For the sequential or running average discussed in the first of these series of posts, this general form specifies as

\[ {\bar x} = {\bar x}^{-} + \frac{1}{N} \left( x_N – {\bar x}^{-} \right) \; , \]

where ${\bar x}$ is the estimated average. For exponential smoothing (and, by extension Holt-Winter), discussed in the third of these series of posts, the general form becomes

\[ s_k = s_{k-1} + \alpha \left( x_k – s_{k-1} \right) \; , \]

where $s_k$ is the smoothed form of some time-varying quantity $x_k$. For Kalman filtering, discussed in the fourth of these series of posts, the general form takes on the appearance of

\[ {\hat x}_k = {\hat x}_k^{-} + K_k \left( z_k – H {\hat x}_k^{-} \right) \; , \]

where ${\hat x}_k$ is the revised estimated state at time $t_k$ based on the difference between the current measurement $z_k$ and the a priori estimated state ${\hat x}_k^{-}$ at the same time.

The ‘magic’ in all these cases lies firmly in determining the ‘gain’ that sits outside the difference between the new observation datum and the expected value of the same based on the previous estimate.

In running average and Holt-Winter cases, it is fairly easy to deduce the form of the gain without any formal mathematical manipulations. A few rearrangements and some intuition serve. In the case of the Kalman filter a lot more effort is involved.

As a stepping-stone exercise which has almost all of the ingredients found in the Kalman filter but is both easier to understand and slightly easier to manipulate is the recursive least squares estimator. The presentation below is strongly influenced by the excellent article Introduction to Kalman Filter: Derivation of the Recursive Least Squares Method by Aleksander Haber with mathematical simplifications gleaned from the Wikipedia entry on the Kalman Filter.

Haber’s system of equations models a process in which a constant set of parameters are unknown but can be estimated from a time series of measurements of a time-varying state whose evolution is solely determined by the set of parameters. The measurement equation is given by

\[ z_k = H_k x + v_k \; . \]

The time varying matrix $H_k$ serves to map the constant but unknown state $x$ forward in time. The term $v_k$ represents the measurement noise, which is assumed to be zero-mean Gaussian distributed with a covariance given by

\[ E[v_k v_k^T ] = R \; .\]

The concrete example that Haber provides is an object being subjected to constant acceleration starting from an unknown initial position and speed. The object’s motion obeys

\[ x(t) = x_0 + v_0 t + \frac{1}{2} a t^2 = \left[ \begin{array}{ccc} 1 & t & \frac{1}{2} t^2 \end{array} \right] \left[ \begin{array}{c} x_0 \\ v_0 \\ a \end{array} \right] \equiv H(t) x \; .\]

The conversion from the continuous process matrix $H(t)$ to the discrete time-sampled $H_k$ is trivially done by noting that the time $t_k = k \Delta t$.

Haber claims that a least squares estimation of $x$ can be determined recursively by the process

\[ x_k = x_{k-1} + K_k ( z_k – H_k x_{k-1} ) \; , \]

where $x_k$ is the current estimate of $x$, $z_k$ is the current measurement, and in which the gain $K_k$ will be given by the optimality condition that we want to minimize the error.

To find the explicit form of the gain, we first define the error in state estimation as

\[ \epsilon_k = x – {\hat x}_k \; . \]

Since the parameter set is generically multi-dimensional we will want to minimize the total error

\[ W = E[(\epsilon_k)_1^2] + E[(\epsilon_k)_2^2]+ \ldots + E[(\epsilon_k)_i^2] \; , \]

where the expectation is over all possible realizations of the random error in the measurement. If we form the covariance matrix of the error

\[ P_k = E[ \epsilon_k \epsilon_k^T] \; \]

then the total error is trace of the covariance matrix: $W = Tr(P)$. Taking the derivative of this trace and setting it equal to zero provides the explicit form for the gain as follows.

First, re-express the current estimation error in terms of the previous time estimation and via the update equation

\[ \epsilon_k = x – x_{k-1} – K_k(z_k – H_k x_{k-1}) \; .\]

Using the measurement equation, current estimation error becomes

\[ \epsilon_k = x – x_{k-1} – K_k( H_k x + v_k – H_k x_{k-1} ) \; .\]

This last form re-emphasizes that, in the absence of the measurement noise $v_k$, the determination of the unknown parameter set $x$ would be exact since all the quantity in the parentheses would be identically zero.

Regrouping the terms so that the state variables are gathered separately from the noise yields

\[ \epsilon_k = (1 – K_k H_k) (x – x_{k-1}) – K_k v_k \equiv T_k (x – x_{k-1}) – K_k v_k \; . \]

Now it relatively easy to right multiply the above expression by its transpose

\[ \epsilon_k \epsilon_k^T = T_k (x – x_{k-1} ) (x – x_{k-1} )^T T_k^T + K_k v_k v_k^T K_k^T \\ – K_k v_K (x – x_{k-1} )^T T_k^T – T_k (x – x_{k-1} ) v_k^T K_k^T \; \]

and take an expectation of it to arrive at the covariance matrix

\[P_k = E[ (T_k (x – x_{k-1}) (x – x_{k-1})^T T_k^T] – E[K_k v_k v_k ^T K_k^T] \; . \]

The cross-terms between the state and the measurement noise vanish under the assumption that the noise is zero-mean distributed (i.e., unbiased).

Since $K_k$ and $H_k$ only depend on the time-independent value of the noise covariance $R$, they can be factored out of the expectation and we arrive at

\[ P_k = (1 – K_k H_k) P_{k-1} (1 – K_k H_k)^T + K_k R K_k^T \; , \]

which is the Joseph form of covariance propagation relating the covariance matrix at time $t_k$ to one at an earlier time $t_{k-1}$.

The last step is to take the trace and minimize it. While there are a number of formulae for doing this step that engineers and mathematicians favor, it is easier to use index notation since it frees one from the bother of finding the correct formula and then applying it.

To proceed, expand the Joseph form term-by-term and define $S_k = H_k P_{k-1} H_k^T + R$, to get

\[ P_k = P_{k-1} – K_k H_k P_{k-1} – P_{k-1} H_k^T K_k^T + K_k S_k K_k^T \; . \]

Now, we can suppress the indices on the right since we know that every matrix bears a $k$ subscript except any covariance matrix, which must have a $k-1$ subscript.

The trace is then given by

\[ Tr(P_k) = P_{ii} – K_{ij} H_{j\ell} P_{\ell i} – P_{ij} C_{j\ell}^T K_{\ell i}^T + K_{ij} S_{j \ell} K_{\ell i} \; .\]

The least squares minimization is expressed as

\[ \frac{\partial Tr(P_k)}{\partial K_{st}} = 0 \; . \]

The derivative on the left can be expressed as

\[ \frac{\partial Tr(P_k)}{\partial K_{st}} = – H_{t \ell} P_{\ell s} – P_{sj}H_{jt}^T + S_{t \ell} K_{\ell s}^T + K_{sj} S_{jt} \; .\]

Re-arranging the terms and indices (subject to the transposes) to return to a matrix-multiplication form, yields

\[ \frac{\partial Tr(P_k)}{\partial K} = – 2 P_{k-1} H_k^T + K_k (S_k + S_k^T) \; , \]

where the $k$ index has been restored once the matrix indices have been suppressed.

The last step is to recognize that

\[ S^T = (H P H + R)^T = (H^T)^T P^T H^T + R^T = H P H^T + R \; \]

since both the state and noise covariance matrices are symmetric.

The final step is to set all of that equal to zero and solve to $K_k$. Doing so gives

\[ K_k S_k = P_{k-1} H_k^T \; , \]

which immediately solves to

\[ K_k = P_{k-1} H_k^T S_k^{-1} = P_{k-1} H_k^T \left[ H_k P_{k-1} H_k^T + R \right]^{-1} \; .\]

Next blog, we’ll explore a numerical example of this method and establish that it does deliver the least squares estimation of the parameter set $x$.

Uncategorized

Apr

2024

Time Series 4 – Introduction to Kalman Filtering

Conrad SchiffMarch 29, 2024

This month we turn from the exponential smoothing and moving average techniques for analyzing time series and start a sustained look at the Kalman filter. The Kalman filter has been hailed as one of the most significant mathematical algorithms of the 20^th century. The primary reason for this praise is that the Kalman filtering enables a real time agent to make mathematically precise and supported estimates of some quantity as each data point comes in rather than having to wait until all the data have been collected. This enabling behavior garners Kalman filtering a lot of attention and Russell and Norvig devote part of their chapter on probabilistic reasoning in their textbook Artificial Intelligence: A Modern Approach to it.

Before diving into the mathematical details, it is worth noting that the thing that sets the Kalman filter apart from the Holt-Winters algorithm is that in the latter, we needed to recognize and characterize a reoccurring pattern in our data. For example, in the Red Fin housing sales, the data are conveniently regularly spaced in monthly intervals and seasonal patterns are readily apparent if not always predictable. The Kalman filter imposes no such requirements, hence its usefulness. It is often used in navigation activities within the aerospace field where it is applied to powered flight of a missile as readily as to a low-Earth orbiting spacecraft like Starlink to a libration orbiting spacecraft like JWST.

There are an enormous number of introductions to the Kalman filter each of them with their own specific strengths and weakness. I tend to draw on two works: An Introduction to the Kalman Filter by Greg Welch and Gary Bishop and Poor Man’s Explanation of Kalman Filtering or How I stopped Worrying and Learned to Love Matrix Inversion by Roger M. du Plessis. The latter work is particularly difficult to come by, but I’ve had it around for decades. That said, I also draw on a variety of other references which will be noted when used.

I suppose there are four core mathematical concepts in the Kalman filter: 1) the state of the system can be represented by a list of physically meaningful quantities, 2) these quantities vary or evolve in time in some mathematically describable way, 3) some combination (often nonlinear) of the state variables may be measured, 4) the state variables and the measurements of them are ‘noisy’.

Welch and Bishop describe these four mathematical concepts as follows. The state, which is described by a $n$-dimensional, real array of variables

\[ x \in \mathbb{R}^n \; \]

evolves according to the linear stochastic differential equation

\[ x_k = A x_{k-1} + B u_{k-1} + w_{k-1} \; , \]

where $A$, which is known as the $n \times n$ real-valued process matrix, provides the natural time evolution of the state at earlier time $t = k-1$ to the state at time $t = k$. The quantity $u \in \mathbb{R}^{\ell} $ is a control or forcing term that is absent in the natural dynamics but can be impose externally on the system. The $n \times \ell$, real-valued matrix $B$ maps the controls into the state evolution. The term $w \in \mathbb{R}^n$ is the noise in the dynamics; it typically represents those parts of the natural evolution or of the control that are either not easily modeled or are unknown.

Measurement of the state is represented an $m$-dimensional, real-valued vector $z \in \mathbb{R}^m$ and is related to the state by the equation

\[ z_k = H x_k + v_k \; , \]

where $H$ is a $m \times n$ real-valued matrix and $v \in \mathbb{R}^m$ is the noise in the measurement.

In addition, both noise terms are assumed to be normally distributed with their probabilities described by

\[ p(w_k) = N(0,Q_k) \; \]

and

\[ p(v_k) = N(0,R_k) \; , \]

where $Q_k \in \mathbb{R}^{n \times n}$ is the process noise covariance matrix and $R_k \in \mathbb{R}^{m \times m}$ is the measurement noise covariance matrix. We further assume that they are uncorrelated with each other. Generally, these noise covariances are time-varying since their values typically depend on the state.

Note that both the state and measurement equations are linear. More often than not, Kalman filtering is applied to nonlinear systems but in those cases, the estimation process is linearized and the above equations are used. Whether that is a good idea or not is a discussion for another day.

The role of the Kalman filter is to produce the best estimate of the state at a given time given the above equations. To do so, the filter algorithm draws heavily on concepts from Bayesian statistics to produce a maximum likelihood estimate.

In terms of Bayesian statistics, the algorithm assumes both an a priori state estimate ${\hat x}^{-}_{k}$ and an a posteriori one ${\hat x}_{k}$. The transition from the a priori to the a posteriori is made with

\[ {\hat x}_k = {\hat x}^{-}_{k} + K_k (z_k – H {\hat x}^{-}_k ) \; , \]

where the ‘magic’ all happens due to the Kalman gain $K_k$ defined as

\[ K_k = P^{-}_k H^T \left( H P^{-}_k H^T + R \right) ^{-1} \; , \]

where ‘$T$’ indicates a matrix transpose.

Note that this equation is functionally similar to the running average form calculated as

\[ {\bar x} = {\bar x}^{-} + \frac{1}{N} \left( x_N – {\bar x}^{-} \right) \; . \]

We’ll return to that similarity in a future post.

For now, we will round out the menagerie of mathematical beasts by noting that $P_k$ is the state covariance, a $n \times n$, real-valued matrix, that summarizes the current statistical uncertainty in the state estimation. Formally, it is defined as

\[ P_k = E \left[ (x_k – {\hat x}_k ) (x_k – {\hat x}_k )^T \right] \; , \]

where the ‘$E$’ stands for expectation over the state distribution.

The state covariance can be associated with the a priori estimate, where it is denoted as $P^{-}_k$ or with the a posteriori estimate, where it is denoted by $P_k$. The two are related to each other by

\[ P_k = A P^{-}_k A^T + Q \; . \]

The Kalman algorithm is best summarized using the following flow diagram adapted from the one in Introduction to Random Signals and Applies Kalman Filtering 4^th Edition by Brown and Hwang.

In future blogs, we’ll explore how to derive the magical Kalman gain, some applications of it to agent-based time series estimation, and a variety of related topics. It is worth noting that the notation is fluid, differing from author to author and application to application.

Uncategorized

Mar

2024

Time Series 3 – Exponential Smoothing and Holt-Winter

Conrad SchiffFebruary 23, 2024

In the last post, we examined the Holt-Winter scheme for tracking the level, trend, and seasonal variations in a time series in a sequential fashion with some synthetic data designed to illustrate the algorithm in as clean a way as possible. In this post, we’ll try the Holt-Winter method against real world data for US housing sales and will set some of the context for why the method works by comparing it to a related technique called the moving average.

The data analyzed here were obtained from RedFin (https://www.redfin.com/news/data-center/) but it isn’t clear for how long RedFin will continue to make these data public as they list the data as being ‘temporarily released’. As a result, I’ve linked the data file I’ve used here.

We’re going to approach these data in two ways. The first is by taking a historical look at the patterns in the data from the vantage point of hindsight on the entire span of home sales having been collected. In the second approach, we imagine what an agent working in the past thinks as the data come in one record at a time.

The historical look starts with an overview of the number of homes sold in the time period starting in Feb 2012 and ending at May 2023.

These data show both seasonal and overall trend variations and so our expectation might be that Holt-Winter would do a good job but note two things: First, with the exception of the first pandemic year of 2020, each of the years shows the same pattern: sales are low in the winter months and strong in the summer ones. Second the trend (most easily seen by focusing on the summer peak) shows four distinct regions: a) from 2012-2017 there is an overall upward trend, b) from 2017-2020 the trend in now downward with a much shallower slope, c) the start of the pandemic lockdowns in 2020 breaks the smoothness of the trend and then the trend again has a positive slope over 2020-2021, and d) the trend is strongly downward afterwards. These data exhibit a real-world richness that the contrived data used in the last post did not and they should prove a solid test for a time series analysis agent/algorithm.

Depending on how ‘intelligent’ we want our analysis agent to be we could look at a variety of other factors to explain or inform these features. For our purposes, we’ll content ourselves with looking at one other parameter, the median home sales price, mostly to satisfy our human curiosity.

These data look much more orderly in their trend and seasonal variation over the time span from 2012-2020. Afterwards, there isn’t a clear pattern in terms of trend and season.

Our final historical analysis will be to try to understand the overall pattern of the data using a moving average defined as:

\[ {\bar x}_{k,n} = \frac{1}{n} \sum_{i =k-n/2}^{k+n/2} x_i \; . \]

The index $k$ specifies to which point of the underlying and $n$ the number of points to be used in the moving average. Despite the notation, $n$ is best when odd so that there are as many points before the $k$th one as there are after as this prevents the moving average from introducing a bias which shifts a peak in the average off of the place in the data where it occurs. In addition, there is an art in the selection of the value of $n$ between it being too small, thereby failing to smooth out unwanted fluctuations, and being too large which smears out the desired patterns. For these data, $n = 5$. The resulting moving average (in solid black overlaying the original data in the red dashed line) is:

Any agent using this technique would clearly be able to describe the data as having a period of one year with a peak in the middle and perhaps an overall upward trend from 2012 to 2022 but then a sharp decline afterwards. But two caveats are in order. First and the most important one, the agent employing this technique to estimate a smoothed value on the $k$th time step must wait until at least $n/2$ additional future points have come in. This requirement usually precludes being able to perform predictions in real time. The second is that the moving average is computationally burdensome when $n$ is large.

By contrast, the Holt-Winter method can be used by an agent needing to analyze in real time and it is computationally clean. At the heart of the Holt-Winter algorithm is the notion of exponential smoothing where the smoothed value at the $k$th step, $s_k$, is determined by the previous smoothed value $s_{k-1}$ and the current raw value $x_k$ according to

\[ s_k = \alpha x_k + (1-\alpha) s_{k-1} \; . \]

Since $s_{k-1}$ was determined from a similar expression at the time point $k-1$, one can back substitute to eliminate all the smoothed values $s$ on the right-hand side in favor of the raw ones $x$ to get

\[ s_k = \alpha x_k + (1-\alpha)x_{k-1} + (1-\alpha)^2 x_{k-2} + \cdots + (1-\alpha)^k x_0 \; . \]

This expression shows that the smoothed value $s_k$ is a weighted average of all the previous points making it analogous to the sequential averaging discussed in a previous post but the exponential weighting by $(1-\alpha)^n$ makes the resulting sequence $s_k$ look more like the moving average. In some sense, the exponential smoothing straddles the sequential and moving averages giving the computational convenience of the former while providing the latter’s ability to follow variations and trends.

How closely the exponentially smoothed sequence matches a given $n$-point moving average depends on the selection of the value of $\alpha$. For example, with $\alpha = 0.2$ the exponentially smoothed curve gives

whereas $\alpha = 0.4$ gives

Of the two of these, the one with $\alpha=0.4$ much more closely matches the 5-point moving average used above.

The Holt-Winter approach using three separate applications of exponential smoothing, hence the need for the three specified parameters $\alpha$, $\beta$, and $\gamma$. Leslie Major presents an method for optimizing the selection of these three parameters in her video How to Holts Winters Method in Excel & optimize Alpha, Beta & Gamma.

We’ll skip this step and simply use some values informed by the best practices that Major (and other YouTubers) note.

The long-term predictions given by our real time agent are pretty good in the time span 2013-2018. For example, a 24-month prediction made in February 2013 looks like

Likewise, a 24-month prediction in June 2017 looks like

Both have good agreement with a few areas of over or under-estimation. The most egregious error is the significant overshoot in 2019 which is absent in the 12-month prediction made a year later.

All told, the real time agent does an excellent job of predicting in the moment but it isn’t perfect as is seen by how the one-month predictions falter when the pandemic hit.

Uncategorized

Feb

2024

Time Series 2 – Introduction to Holt-Winter

Conrad SchiffJanuary 26, 2024

In the last blog, I presented a simple sequential way of analyzing a time series as data are obtained. In that post, the average of any moment $x^n$ was obtained in real-time by simply tracking the appropriate sums and number of points seen. Of course, in a real world application, there would have to be a bit more intelligence built into the algorithm to allow an agent employing it to recognize when a datum is corrupted or bad or missing (all real world problems) and to exclude these point both from the running sums and from the number of points processed.

This month, we look at a more sophisticated algorithm for analyzing trends and patterns in a time series and for projecting that analysis into the future sequential using these patterns. The algorithm is a favorite in the business community because, once an initial ‘training’ set has been digested, the agent can update trends and patterns with each new datum and then forecast into the future. The algorithm is called the Holt-Winter triple exponential smoothing and it has been used in the realm of business analytics for forecasting the number of home purchases, the revenue from soft drink sales, ridership on Amtrak, and so on, based on a historical time series of data.

Being totally unfamiliar with this algorithm until recently, I decided to follow and expand upon the fine video by Leslie Major entitled How to Holts Winters Method in Excel & optimize Alpha, Beta & Gamma. In hindsight, this was a very wise thing to do because there are quite a few subtle choices for initializing the sequential process and the business community focuses predominantly on using the algorithm and not explaining rationale for the choices being followed.

For this first foray, I am using the ‘toy’ data set that Majors constructs for this tutorial. The data set is clean and well-behaved but, unfortunately, is not available from any link discernable associated with the video but I have reconstructed it (with a lot a pauses) and make it available here.

The data are sales $W$ of an imaginary item, which, in deference to decades of tradition, I call a widget. The time period is quarterly and a plot of the definitive sales $W$ from the first quarter of 2003 (January 1, 2003) through the fourth quarter of 2018 (October 1, 2018) shows both seasonal variations (with a monotonic ordering from the first quarter as the lowest to the fourth quarter as the highest)

as well as a definite upward trend for each quarter.

The Holt-Winter algorithm starts by assuming that the first year of data are initially available and that a first guess for a set of three parameters $\alpha$, $\beta$, and $\gamma$ (associated with the level $L$, the trend $T$, and the seasonal $S$ values for the number of widget sales, respectively) are known. We’ll talk about how to revise these assumptions after the fundamentals of the method are presented.

The algorithm consists of 7 steps. The prescribed way to initialize the data structures tracking the level, trend, and seasonal values is found in steps 1-4, step 5 is a boot-strap step initialization needed to start the sequential algorithm proper once the first new datum is obtained and steps 6-7 comprise the iterated loop step used for all subsequent data, wherein an update to the current interval is made (step 6) and then a forecast into the future is made (step 7).

In detail these steps require an agent to:

Determine the number $P$ of intervals in the period; in this case $P = 4$ since the data are collected quarterly.
Gather the first period of data (here a year) from which the algorithm can ‘learn’ how to statistically characterize it.
Calculate the average $A$ of the data in the first period.
Calculate the ratio of each interval $i$ in the first period to the average to get the seasonal scalings $S_i = \frac{V_i}{A}$.
Once the first new datum $V_i$ ($i=5$) (for the first interval in the second period) comes in, the agent then bootstraps by estimating:
1. the level in the first interval by making a seasonal adjustment. Since the seasonal level is not yet known, the agent uses the seasonal value in the previous period in the first interval using $L_i = \frac{V_i}{S_{i-P}}$;
1. the trend of the first interval in the second period using $T_{i} = \frac{V_i}{S_{i-P}} – \frac{V_{i-1}}{S_{i-1}}$. This odd looking formula is basically the finite difference between the first interval of the second period and the last interval of the first period, each seasonally adjusted. Again, since the seasonal level is not yet know, the agent uses the seasonal value from the corresponding earlier interval for the first interval of the second period;
1. the seasonal ratio of the first interval in the second period using $S_i = \gamma \frac{V_i}{L_i} + (1-\gamma) S_{i-P}$.
Now, the agent can begin sequential updating in earnest by using all of the weighted or blended averages of the data in current and past intervals to update:
1. the level using $L_i = \alpha \frac{V_i}{S_{i-P}} + (1-\alpha)(L_{t-1} + T_{t-1})$
1. the trend using $T_i = \beta ( L_i – L_{i-1} ) + (1-\beta) T_{t-1}$
1. the seasonal ratio (again) using $S_i = \gamma \frac{V_i}{L_i} + (1-\gamma) S_{i-P}$
Finally, the agent can forecast as far forward as desired (using $F_{i+k} = (L_i + k \cdot T_i) S_{i – P + k}$. where $k$ is an integer representing the number of ‘intervals’ ahead to be forecasted. There are some subtleties associated with what the agent can do with a finite number of historical levels of the seasonal ratios, so, depending on application, specific approximations for $S_{i – P + k}$ must be made.

Using the widget sales data, the first 4-quarter forecast compares favorable to the actuals as seen in the following plot.

When forecasting only a quarter ahead, as Major herself does, the results are also qualitatively quite good.

Finally, a note about selecting the values of $\alpha$, $\beta$, and $\gamma$. There is a general rule of thumb for initial guesses but that the way to nail down the best values is to use an optimizer to minimize the RMS error between a forecast and the actuals. Majors discusses all of these points and shows how Excel can be used in her video to get even better agreement.

For next month, we’ll talk about the roots of the Holt-Winter algorithm in the exponential smoother (since it is applied to three parameters – level, trend, and seasonal ratio – that is why the algorithm is often called triple exponential smoothing).

Uncategorized

Jan

2024

Time Series 1 – Sequential Averaging

Conrad SchiffNovember 24, 2023

A hallmark of intelligence is the ability to anticipate and plan as events occur as time elapses. Evaluating and judging each event sequentially is how each of us lives but, perhaps ironically, this is not how most of us are trained to evaluate a time series of data. Typically, we collect quantitative and/or qualitative data for some time span, time tagging the value of each event, and then we analyze the entire data set in as one large batch. This technique works well for careful, albeit leisurely, exploration of the world around us.

Having an intelligent agent (organic or artificial) working in the world at large necessitates using both techniques – real time evaluation in some circumstances and batch evaluation in others – along with the ability to distinguish when to use one versus the other.

Given the ubiquity of time series in our day-to-day interactions with the world (seasonal temperatures, stock market motion, radio and TV signals) it seems worthwhile to spend some space investigating a variety of techniques for analyzing these data.

As a warmup exercise that demonstrates the differences between real time and batch nuances involved, consider random numbers pulled from a normal distribution with a mean ($\mu = 11$) and standard deviation ($\sigma = 1.5$).

Suppose that we pull $N=1000$ samples from this distribution. We would expect that the sample mean, ${\bar x}$, and sample deviation $S_x$ to be equal to the population mean and standard deviation to within 3% (given that $1/\sqrt(1000) \approx 0.03$). Using numpy, we can put that expectation to the test, finding that for one particular set of realizations that the batch estimation over the entire 1000-sample set is ${\bar x} = 11.08$ and $S_x = 1.505$.

But how would we estimate the sample mean and standard deviation as points come in? With the first point, we would be forced to conclude that ${\bar x} = x_0$ and $S_x = 0$. For subsequent points, we don’t need to hold onto the values of the individual measurements (unless we want to). We can develop an iterative computation starting with the definition of an average

\[ {\bar x} = \frac{1}{N} \sum_{i=1}^N x_i \; .\]

Separating out the most recent point $x_N$ and multiplying the sum over the first $N-1$ points by the $1 = (N-1)/(N-1)$ gives

\[ {\bar x} = \frac{N-1}{N-1} \frac{1}{N} \left( \sum_{i=1}^{N-1} x_i \right) + \frac{1}{N} x_N \; .\]

Recognizing that the first term contains the average ${\bar x}^{(-)}$ over the first $N-1$ points, this expression can be rewritten as

\[ {\bar x} = \frac{N-1}{N} {\bar x}^{(-)} + \frac{1}{N} x_N \; .\]

Some authors then expand the first term and collect factors proportional to $1/N$ to get

\[ {\bar x} = {\bar x}^{(-)} + \frac{1}{N} \left( x_N – {\bar x}^{(-)} \right) \; .\]

Either of these iterative forms can be used to keep a running tally of the mean. But since the number of points in the estimation successively increases as a function of time, we should expect that difference between the running average and the final batch estimation to also be a function of time.

While there is definitive theory that tells us the difference between the running tally and the population mean there isn’t any theory that characterizes how it should rank relative to the batch estimation other than the generic expectation that as the number of points used in the running tally approaches the number in the batch estimation that the two should converge. Of course, the final value must exactly agree as there were no approximations made in the algebraic manipulations above.

To characterize the performance of the running tally, we look at a variety of experiments.

In the first experiment, the running tally fluctuates about the batch mean before settling in and falling effectively on top.

But in the second experiment, the running tally starts far from the batch mean and (with only the exception of the first point) stays above the until quite late in the evolution.

Looking at the distribution of the samples on the left-hand plot shows that there was an overall bias relative to the batch mean with a downward trend, illustrating how tricky real data can be.

One additional note, the same scheme can be used to keep a running tally on the average of $x^2$ allowing for a real time update of the standard deviation from the relation $\sigma^2 = {\bar {x^2}} – {\bar x}^2$.

As the second case demonstrates, our real time agent may have a significantly different understanding of the statistics than the agent who can wait and reflect over a whole run of samples (although there is a philosophical point as to what ‘whole run’ means). In the months to come we’ll explore some of the techniques in both batch and sequential processing.

Uncategorized

Nov

2023

The Measure of Man

Conrad SchiffOctober 27, 2023

Recently I started re-reading Umberto Eco’s The Name of the Rose. The action is set in the early part of the 14^th century (November of 1327 to be precise) at a Benedictine abbey whose name, to quote the fictitious narrator, “it is only right and pious now to omit.” One of the many themes of this novel, reflecting the scholastic philosophy ascendant at that time, deals with the nature of being human. Answering that ontological/anthropological question includes, as correlative concepts, the licitness of laughter and a categorization of the essential nature of what separates us from the lower animals, which philosophers often will call the beasts.

While being removed from our time by approximately 800 intervening years, the basic question as to ‘what is man’ is still as operative today as it was then or as it ever was. The yardsticks have changed but the question remains.

The aforementioned Scholastics crafted an image of the universe similar to what is shown below.

The universe consists of an abundant material world (contained in the blue triangle) with a hierarchy of complexity that I think a modern biologist would agree with (after he, of course, added all the still controversial divisions associated with microscopic life). At the lowest rung is the inanimate matter that makes bulk of the world around us in the form of solid, liquid, gas, and plasma. The next rung up (again skipping microscopic life, which, while important, was unknown prior to the late 1600s) consists of what was often called the vegetable kingdom comprised of the plants and plant-like life around us. Animals of all sorts, excepting man (philosophers typically call this subset ‘the beasts’ for brevity), comprise the next level up. The pinnacle of this material world is occupied by humans, a point that, although some among us would wish it not to be true, is difficult to refute.

The universe also consists of an abundant spiritual world (contained in the green triangle) with a hierarchy of complexity that is far more controversial because it elusively remains beyond our physical means of detection. For a Christian philosopher in the Middle Ages, the top of the hierarchy is the triune God composed of the Father, Son, and Holy Spirit as 3 persons in one being. Below the Trinity, the Scholastics imagined a world teaming with angels, the composition of which is traditionally divided into three choirs each comprised of three species according to the analysis of Thomas Aquinas. Finally, the lowest level of the spiritual realm is occupied by human beings.

Thus, a scholastic philosopher recognizes that the nature of man is a unique union between material and the spiritual, but the measure of man – what exactly separates him at that intersection from the more numerous species belonging entirely to either world – isn’t so clear.

One might think that as Western thought transitioned from the Middle Ages into the Renaissance and eventually through the Enlightenment and into our current Industrial/Digital age that the question would lose all of its force, but it stubbornly remains; only the trappings have changed. Where once an operative question was ‘how many angels could dance on a pinhead’ (and rightly so) we now ask questions about how many of those spiritual beings we label as ‘AI’ are needed to replace a screenwriter or a lawyer.

So, let’s examine some of the various activities that have been put forth as ways that Man separates himself both the beasts and, where appropriate, from the AI.

For our first activity, there is Man’s propensity to make or build. One need only glance at one of the world’s greatest cities, say Manhattan, to be impressed with the size, scale, and variety of construction that is present. But construction is not unique to Man as a variety of insects, for example termites, build elaborate communal living structures. And generative AI often returns structures far more elaborate than envisioned by most of the world’s architects (Gaudi not withstanding).

Others have argued that language and communication are what separate Man but then what does one do with gorilla communication or parrots who clearly ‘talk’? And where does ChatGPT’s dialogs and replies fit into this schema?

The ability to reason is often proffered as a possibility and the impressive amount of reasoning that has been produced, particularly in the fields of the science and mathematics, seems to reflect the unique nature of Man. But at Peter Kreeft points out in his lectures entitled Ethics: A History of Moral Thought, beasts also show a degree of reasoning. He cites an example of a dog pursuing a hare who comes to a three-fold fork in the road. After sniffing two of the trails with no success, the dog immediately pursues the hare down the third without bothering to sniff. And, of course, expert systems and symbolic logic programs have been ‘reasoning’ for years and remain important components in many fields.

The list could go on, but the point should already be clear. Much like one of Plato’s Socratic dialogs, this argument over what separates Man from the other autonomous agents that inhabit the world around him (beasts and AI) is not easy to resolve. Clearly there is some character in Man that sets him apart as he seems to be the only entity that exhibits the ability in the material world to make moral judgements based on The Good, The True, and The Beautiful but distilling what that ability is remains elusive.

This elusive, ineffable nature of man is symbolized in Eco’s arguments in The Name of the Rose by the continued debate within the monastic community over the nature of laughter. The following snippet of dialog between Jorge of Burgos and William of Baskerville give a flavor of that debate:

“Baths are a good thing,” Jorge said, “and Aquinas himself advises them for dispelling sadness, which can be a bad passion when it is not addressed to an evil that can be dispelled through boldness. Baths restore the balance of the humors. Laughter shakes the body, distorts the features of the face, makes man similar to the monkey.”

“Monkeys do not laugh; laughter is proper to man, it is a sign of his rationality,” William said.

“Speech is also a sign of human rationality, and with speech a man can blaspheme against God. Not everything that is proper to man is necessarily good.”

Wrestling with this question is by no means restricted to ‘high-browed’, historic fiction. Consider the following snippet from Guardian of Piri episode of Space 1999 (aired Nov of 1975) wherein Commander Konig gives his own answer for what sets man apart.

Clearly, we recognize that there is something innately special about Man even if we can’t nail down what precisely what that means. The debate surrounding this point, which has no doubt existed as long as Man himself has had the self-awareness to wonder about his own nature, is likely to continue for as long Man lasts.

Uncategorized

Oct

2023

Mahalanobis distance

Conrad SchiffSeptember 29, 2023

In last month’s column, we looked at the Z-score as a method of comparing members of disparate populations. Using the Z-score, we could support the claim that Babe Ruth’s home run tally of 50 in 1920 was a more significant accomplishment than Barry Bonds’s 70 dingers in 2001. This month, we look at an expansion of the notion of the Z-score due to P. C. Mahalanobis in 1936, called the Mahalanobis distance.

To motivate the reasoning behind the Mahalanobis distance, let’s consider the world of widgets. Each widget has a characteristic width and height, of which we have 100 samples. Due to manufacturing processes, widgets typically have a spread in the variation of width and height about their standard required values. We can assume that this variation in width and height, which we will simply refer to as Delta Width and Delta Height going forward, can be either positive or negative with equal probability. Therefore, we expect that the centroid of our distribution in Delta Width and Delta Height to essentially be at the origin. We’ll denote the centroid’s location by a red dot in the plots that follow.

Now suppose that we have a new widget delivered. How do we determine if this unit’s Delta Width and Delta Height were consistent with others we’ve seen in our sample. If we denote the new unit’s value by a green dot, we can visualize how far it is from the centroid.

For comparison, we also plotted one of our previous samples (shown in blue) that has the same Euclidean distance from the centroid as does the green point (31.8 for both in the arbitrary units used for Delta Width and Delta Height). Can we conclude that the green point is representative of our other samples?

Clearly the answer is no, as can be seen by simply adding the other samples, shown as black points.

We intuitively feel that the blue point (now given an additional decoration of a yellow star) is somehow closer to cluster of black points but a computation of Z-scores doesn’t seem to help. The Z-scores for width and height for the blue point are: -2.49 and -2.82, respectively, while the corresponding values for the green point are -2.51 and 2.71.

The problem is that Delta Width and Delta Height are strongly correlated. One strategy is to follow the methods discussed in the post Multivariate Sampling and Covariance and move to a diagonalized basis. Diagonalizing the data leaves the variations expressed in an abstract space spanned by the variables X and Y, which are linear combinations of the original Delta Width and Delta Height values. The same samples plotted in these coordinates delivers the following plot.

Using the X and Y coordinates as our measures, we can calculate the corresponding Z-scores for the blue point and the new green one. The X and Y Z-scores for the blue point are now: -2.74 and 0.24. These values numerically match our visual impression that the blue point, while on the outskirts of the distribution in the X-direction lies close to the centroid in the Y-direction. The corresponding X and Y Z-scores for the green point are: 0.00 and 10.16. Again, these numerical values match our visual impression that the green point is almost aligned with the centroid in the X-direction but is very far from the variation of the distribution along the Y-direction.

Before moving onto how the Mahalanobis distance handles this for us, it is worth noting that the reason the situation was so ambiguous in W-H coordinates was that when we computed the Z-scores in the W- and H- directions, we ignored the strong correlation between Delta Width and Delta Height. In doing so, we were effectively judging Z-scores in a bounding box corresponding to the maximum X- and Y-extents of the sample rather than seeing the distribution as a tightly grouped scatter about one of the bounding box’s diagonals. By going to the X-Y coordinates we were able to find the independent directions (i.e. eigendirections).

The Mahalanobis distance incorporates these notions, observations, and strategies by defining a multidimensional analog of the Z-score that is aware of the correlation between the variables. Its definition is

\[ d_M = \sqrt{ \left({\mathbf O} – {\bar {\mathbf O}} \right)^T S^{-1} \left({\mathbf O} – {\bar {\mathbf O}} \right) } \; , \]

where ${\mathbf O}$ is a column array of the values for the current point (i.e. the new, green one), ${\bar {\mathbf O}}$ is the average across the sample, and $S^{-1}$ is the inverse of the covariance matrix. Forming the radical is equivalent to calculating the square of the Z-score, which is easily seen by fact the $S^-1$ has units that are inverses of observations squared (e.g., in this case $1/length^2$). This observation also supports why the square root is needed at the end.

For the sample used in this example, the average values of W and H were: -0.172600686508044 and 0.17708029312465146. Using these gives the array

\[ {\bar {\mathbf O}} = \left[ \begin{array}{c} -0.172600686508044 \\ 0.17708029312465146 \end{array} \right] \; . \]

The covariance matrix was given by

\[ S = \left[ \begin{array}{cc} 78.29662631 & 63.08196815 \\ 63.08196815 & 67.35298221 \end{array} \right] \; . \]

The blue points observation vector is

\[ {\mathbf O}(blue) = \left[ \begin{array}{c} -22.109910673364368 \\ -22.83313653428484 \end{array} \right] \; \]

and its Mahalanobis distance from the centroid is

\[ d_M(blue) = 2.805 \; . \]

In contrast the green point as an observation vector of

\[ {\bar {\mathbf O}} = \left[ \begin{array}{c} -22.30184428079569 \\ 22.306323887412297 \end{array} \right] \; \]

giving a Mahalanobis distance from the centroid of

\[ d_M(green) = 10.142 \; . \]

Note that the magnitude of the components of blue and green observation vectors are almost identical; the critical difference being the sign on the H-component. That sign difference reflects the large deviation away from the correlation that exists in the W-H components, which shows up in the large differences in the Mahalanobis distance.

Finally, note that the Mahalanobis distance for both the blue and green point is the root-sum-square of the individual Z-scores in X-Y coordinates; this is an important point that only holds in the diagonalized coordinate system.

By exploiting the coordinate invariance of an bilinear form, the Mahalanobis distance provides the basic machinery of calculating Z-scores in the diagonalized coordinate system without the bother of actually having to carry out the diagonalization.

Uncategorized

Sep

2023

Baseball and Giraffes

Conrad SchiffAugust 25, 2023

While it may or may not be true, as Alfred Lord Tenneson penned in his poem Locksley Hall, that “In the Spring a young man’s fancy lightly turns to thoughts of love” it is definitely true that “In the Summer a sports fan’s fancy turns heavily to thoughts of baseball”. Long a staple of the warmer portion of the year, America’s pastime is as much an academic pursuit into obscure corners of statistical analysis as it is an athletic endeavor. Each season, thousands upon thousands of die-hards argue about thousands upon thousands of statistics using batting averages, on base percentage, earned run averages, slugging percentages and so on to defend their favorite players or to attack someone else’s favorites. And no testament could stand more clearly for the baseball enthusiasts love of the controversy stirred up by statistics than the movie 61*.

For those who aren’t used to seeing ‘special characters’ in their movie titles and who don’t know what significance the asterisk holds a brief explanation is in order. Babe Ruth, one of the best and most beloved baseball players, set many of baseball’s records including a single season record of 60 home runs in 1927. As the Babe’s legend grew and the years from 1927 faded into memory without anyone mounting much of a challenge so did the belief that no one could ever break his record. In 1961, two players from the New York Yankees, Mickey Mantle and Roger Maris, hit home runs at a pace (yet another baseball statistic) to make onlookers believe that either of them might break a record dating back to before World War II and the Great Depression. Eventually, Roger Maris broke the record in the 162^nd game of the season, but since Ruth reached his mark of 60 home runs when baseball only played 154 games, the Commissioner qualified the Maris record with an ‘*’ as a means of protecting the mystique of Babe Ruth.

And for years afterwards, fans argued whether Maris really did break the record or not. Eventually, Maris’s record would also fall, first to Mark McGwire, who hit 70 ‘dingers’ (baseball lingo for home runs) in 1998 and then Barry Bonds, who hit 73 in 2001. What’s a baseball fan to do? Should McGwire’s and Bonds receive an ‘*’? How do we judge?

At first glance, one might argue that the best way to do it would be to normalize for the number of games played. For example, Ruth hit 60 HRs (home runs – an abbreviation that will be used hereafter) in 154 games so his rate was 0.3896. Likewise, Bonds’s rate is 0.4506. And so, can we conclude that Bonds is clearly the better home run hitter. But not so fast the purist will say, Bond’s and McGwire hit during the steroids era in baseball, when everyone, from the players on the field to the guy who shouts ‘Beer Here!’ in the stands, was juicing (okay… so maybe not everyone). Should we put stock in the purists argument or has Bonds really done something remarkable?

This is a standard problem in statistics when we try to compare two distinct populations be they be separated in time (1920s, 1960s, 1990s) or geographically (school children in the US v. those in Japan) or even more remotely separated, for example by biology. The standard solution is to use a Z-score, which normalizes the attributes of a member of a population to its population as whole. Once the normalization is done, we can then compare individuals in these different populations to each other.

As a concrete example, let’s compare Kevin Hart’s height to that of Hiawatha the Giraffe. The internet lists Kevin Hart as being 5 feet, 2 inches tall and let us suppose that Hiawatha is 12 feet, 7 inches tall. If we are curious which of them would be able to fit through a door with a standard US height of 6 feet, 8 inches then a simple numerical comparison shows us that Hart will while Hiawatha won’t. However, this is rarely what we want. When we ask which is shorter or taller, it is often the case where we want to know if Hart is short as a man (he is) and if Hiawatha is short as a giraffe (she is as well). So, how do we compare?

Using the Z-score (see also the post K-Means Data Clustering), we convert the individual’s absolute height to a population-normalized height by the formula

\[ Z_{height} = \frac{height – \mu_{height}}{\sigma_{height}} \; , \]

where $\mu_{height}$ is the mean height for a member of the population and $\sigma_{height}$ is the corresponding standard deviation about that mean.

For the US male, the appropriate statistics are a $\mu_{height}$ of 5 feet, 9 inches and a $\sigma_{height}$ of 2.5 inches. Giraffe height statistics are a little hard to come but according to the Denver Zoo, females range between 14 and 16 feet. Assuming this to be a 3-sigma bound, we can reasonably assume a $\mu_{height}$ of 15 feet with a $\sigma_{height}$ of 4 inches. A simple calculation then yields a $Z_{height} = -2.8$ for Hart and $Z_{height} = -7.25$ showing clearly that Hiawatha is shorter than Kevin Hart, substantially shorter.

Now, returning to baseball, we can follow a very interesting and insightful article by Randy Taylor and Steve Krevisky entitled Using Mathematics And Statistics To Analyze Who Are The Great Sluggers In Baseball’. They looked at mix of hitters from both the American and National leagues of Major League Baseball, scattered over the time span from 1920-2002, and record the following statistics (which I’ve ordered by date).

Year	Hitter	League	HR	Mean HR	HR Standard Deviation	Z
1920	Babe Ruth	AL	54	4.85	7.27	6.76
1921	Babe Ruth	AL	59	6.05	8.87	5.97
1922	Rogers Hornsby	NL	42	6.31	7.18	4.97
1927	Babe Ruth	AL	60	5.42	10.05	5.43
1930	Hack Wilson	NL	56	10.86	11.2	4.03
1932	Jimmie Foxx	AL	58	8.59	10.32	4.79
1938	Hank Greenberg	AL	58	11.6	11.88	3.91
1949	Ralph Kiner	NL	54	10.87	9.78	4.41
1954	Ted Kluszewski	NL	49	14.01	11.7	2.99
1956	Mickey Mantle	AL	52	13.34	9.39	4.12
1961	Roger Maris	AL	61	15.01	12.34	3.73
1967	Carl Yastrzemski Harmon Killebrew	AL	44	11.87	8.99	3.57
1990	Cecil Fielder	NL	51	11.12	8.74	4.56
1998	Mark McGwire	NL	70	15.44	12.48	4.37
2001	Barry Bonds	NL	73	18.03	13.37	4.11
2002	Alex Rodriguez	NL	57	15.83	10.51	3.92

Interestingly, the best HR season, in terms of Z-score, is Babe Ruth’s 1920 season where he ‘only’ hit 54 HRs. That year, he stood over 6.5 standard deviations from the mean making it far more remarkable than Barry Bonds 73-HR season. Sadly, Roger Maris’s 61*-HR season is one of the lower HR championships on the list.

To be clear, it is unlikely that any true, dyed-in-the-wool, baseball fan will be swayed by statistical arguments. Afterall, where is the passion and grit in submitting an opinion to the likes of cold, impartial logic. Nonetheless, the Z-score is not only in its own right but serves also as a launching pad for more sophisticated statistical measures.

Uncategorized

Aug

2023

To Winograd or not to Winograd

Conrad SchiffJuly 28, 2023

Since its inception, a common theme that has appeared frequently in this column is the nuances and ambiguities in natural language. There are several reasons for this focus but the two most important ones are that being able to handle linguistic gray areas is a real test for machine intelligence and that by looking at how computer systems struggle with natural language processing we gain a much better appreciation how remarkable the human capacity to speak really is.

Past columns have focused mostly on equivocation is various forms, with an emphasis on humor (Irish Humor, Humorous Hedging, Yogi Berra Logic, and Nuances of Language) and context-specific inference (Teaching a Machine to Ghoti and Aristotle on Whiskey). But the ‘kissing-cousin’ field of the Winograd Schema remained untouched because it had remained unknown. A debt of gratitude is thus owed to Tom Scott for his engaging video The Sentences Computers Can’t Understand, But Humans Can which opened this line of research into natural language processing by machine to me.

When Hector Levesque proposed it in 2011, he designed the Winograd Schema Challenge (WSC) to address perceived deficiencies in the Turing test by presenting a challenge requiring ‘real intelligence’ rather than the application of trickery and brute force. The accusation of trickery apparently particularly into view due to the Eugene Goostman chatbot, a system that portrayed itself as a 13-year old boy from Odesa in the Ukraine, having fooled roughly 30% of the human judges in large Turing Test competition in 2014. To achieve this, the Wikipedia article maintains that the bot used ‘personality quirks and humor to misdirect’, which basically means that the judges were conned by the creators. The idea of a confidence man pulling the wool over someone eyes probably never occurred to Alan Turing nor to the vast majority of computer scientists but anyone who’s seen a phishing attempt is all too familiar with that style of chicanery.

The essence of the WSC is to ask a computer intelligence to resolve a linguistic ambiguity that requires more than just a grammatical understanding of how the language works (syntax) and the meaning of the individual words and phrases (semantics). Sadly the WSC doesn’t focus on equivocation (alas the example below will not be anything more than incidentally humorous) but rather on what linguists call an anaphora, which is the formal meaning attached to an expression whose meaning must be inferred from an earlier part of the sentence, paragraph, etc.

A typical example for an anaphora involves the use of a pronoun in a sentence such as

John arrived at work on time and entered the building promptly but nobody claimed to have seen him enter.

Here the pronoun ‘him’ is the anaphora and is understood to have John as its antecedent.

The fun of the WSC is in creating sentences in which the antecedent can only be understood contextually as the construction is ambiguous. One of the typically examples used in explaining the challenge reads something like

The councilmen refused to grant the protestors a permit to gather in the park because they feared violence.

The question, when posed to a machine, would be to ask if the ‘they’ referred to the councilmen or the protestors. Almost all people would find no ambiguity in that sentence because they would argue that the protestors would be the ones making the ruckus and that the councilmen, either genuinely worried about their constituents or cynically worried about their reputations (or a mix of both), would be the ones to fear what might happen.

Note that the sentence easily adapts itself to other interpretations with only the change of one word. Consider the new sentence

The councilmen refused to grant the protestors a permit to gather in the park because they threatened violence.

Once again, a vast majority of people would now say that the ‘they’ referred to the protestors because the councilmen would not be in the position to threaten violence (although things may be changing on this front).

The idea here is that the machine would need not just the ability to analyze syntax with a parser and the ability to look up words with a dictionary but it would also need to reason and that reasoning would be broad rather than narrowly focused. Relationships between concepts would be varied and far-ranging with the sentence

The trophy couldn’t fit into the suitcase because it was too large.

Here the ontology would center on spatial reasoning, the ideas of ‘big’ and ‘little’, and the notion that suitcases usually contain other objects.

These types of ambiguous sentences seem to be part and parcel of day-to-day interactions. For example, the following comes from Book 5 of the Lord of the Rings

The big orc, spear in hand, leapt after him. But the tracker, springing behind a stone, put an arrow in his eye as he ran up, and he fell with a crash. The other ran off across the valley and disappeared.

This scene, which takes place after Frodo and Sam have managed to escape the tower of Cirith Ungol, is between a large fighter orc and a smaller tracker. Simple rules of syntax might lead the machine to believe that the ‘he’ in the second sentence would have ‘the tracker’ as its antecedent. I doubt any human reader was fooled.

The complexity of the WSC is not limited to only two objects. Consider the following example taken from the Commonsense Reasoning ~ Pronoun Disambiguation Problems database:

I asked Dave to get me my sweater from the other side of the yacht. While he was gone, I rested my arm on the rail over there and suddenly it gave way.

The machine is to pick from the multiple choices (a) sweater (b) yacht (c) arm (d) rail. Again, I doubt that any person, of sufficient age, would be confused by this example and so wonder why. We are literally surrounded with ambiguous expressions every day. Casual speech thrives on these corner-cutting measures. Even our formal writings are not immune; there are numerous examples of these types of anaphoras in all sorts of literature with the epistles of St. Paul ranking up there for complexity and frequency. Humans manage quite nicely to deal with them – a tribute to the innate intelligence of the species as a whole.

But knowing humans to be intelligent and observing them being able to deal with these types of ambiguity does not mean the converse if true. Being able to pass the WSC does not mean the agent is necessarily smart. The argument for this conclusion comes, disappointingly, from the fact that it has already been overcome by various algorithms 7 years after its proposal. Sampling the associated papers, the reader will soon find that much of the magic comes from a different flavor of statistical association, indicating that the real intellect resides in the algorithm designer. This is a point raised by The Defeat of the Winograd Schema Challenge by Vid Kocijan, Ernest Davis, Thomas Lukasiewicz, Gary Marcus, Leora Morgenstern. To quote from section 4 of their paper:

The Winograd Schema Challenge as originally formulated has largely been overcome. However, this accomplishment may in part reflect flaws in its formulation and execution. Indeed, Elazar et al. (2021) argue that the success of existing models at solving WSC may be largely artifactual. They write:

We provide three explanations for the perceived progress on the WS task: (1) lax evaluation criteria, (2) artifacts in the datasets that remain despite efforts to remove them, and (3) knowledge and reasoning leakage from large training data.

In their experiments, they determined that, when the form of the task, the training regime, the training set, and the evaluation measure were modified to correct for these, the performance of existing language models dropped significantly.

At the end of the day, I think it is still safe to say that researchers are clever in finding ways to mimic intelligence in small slices of experience but that nothing still approaches the versatility and adaptability of the human mind.

Uncategorized

Jul

2023

Latest Posts

Post navigation