Latest Posts

A Cross-Country A*

Update 6/29/18: Please note that there is an error in this post.  The reckoning of the cost of a space should include the total movement cost from the starting space to the space in question.  Please consult the Tweaks and Kinks in the A* post for details.

Last month’s column introduced and explored the basic A* algorithm within the cheeky context of a snake, a man, and a path to revenge.  In this month’s post, the algorithm is extended to include terrain costs, which reflect the fact that different types of spots may be more expensive to traverse based on their nature (e.g., mountain, river, forest, desert, etc.).  As in the simpler case already studied, the A* seeks to find the shortest path between a starting point and the goal when the latter is known but the best way to travel is not.

It is important to remember that at the core of the A* algorithm is the idea of an open spot.  An open spot is a location adjacent to the current spot that the algorithm is visiting.  From the vantage of the current spot, the neighboring spots can be examined and scored based on two criteria:  the cost to move into that spot and a heuristic guess as to how far the spot is from the goal.  These actions open the spot, which is placed on a list for future exploration.  Once the A* has finished examining the neighborhood of the current spot, it closes that spot and selects the next spot to visit from among the lowest scoring spots on the list of open spots.  By keeping pointers from open spots to the parent spots that opened them, the A* is able to back-track from the goal (once it has found it) to the beginning spot, thus determining the shortest path.  The reader is encouraged to read the earlier post (An A* Drama) for more details.

As in the previous post, the material here is strongly influenced by Chapter 7 of AI for Game Developers, by David M. Bourg and Glenn Seemann.

This month’s story begins with a wayfarer, who is journeying in the wilderness, looking for the shortest path home to his city.

He weighs his options for a path based not just on distance but on how well he can travel across each type of terrain.   Normalizing his speed to the plain (a blank or unfilled space), the wayfarer estimates that he can move 2 times faster across the plain than through the forest, 5 times faster than crossing the water, 7 times faster than moving across the desert, and 10 times faster than climbing over the mountains.  To evaluate his options he records his travel-time estimates and then lays a grid over his map to assist with his planning.

The mountain range to his north-east lies in a direct, as-the-crow-flies path from his current location at grid point (4,2) to the city at (9,6), but taking this route may not be the most time-optimal approach.  He starts looking for ways to go around rather than over the mountain range.

By going through the desert to the east and south-east, he can head in roughly the ‘right’ direction and skirt the mountains completely, but the desert is not much cheaper than the mountains.  He also considers going due north along the river course and then turning east in the forest, but a river trip is 5 times slower than one on the plain.  Finally he considers the possibility of going in the completely ‘wrong’ direction by heading to the south-west, where a ford allows him to cross the river on foot.  Once on the other side, he can follow the flow upstream (to the north-east) until he enters the forest and can approach the city from the east.  He scores these different possibilities and finds that the last route is the cheapest.

Given the same information, will the A* also find this over-the-river-and-through-the-woods path?

The first step to answering this question is to add the terrain cost to the A* algorithm.  This is done simply by applying the time-scaling the wayfarer had already noted as the local score when a spot is opened.  For example, when the search begins, the A* starts at (4,2) and opens the 8 locations surrounding that spot.  Grid locations (4,3), (3,3), and (3,2) are assigned a local cost of 5 since they are river spots.  The remaining five spots ((3,1), (4,1), (5,1), (5,2), and (5,3)) are assigned a local cost of 1 since they are plain spots.
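To make this concrete, here is a minimal Python sketch of the scoring of an opened spot (the names and the terrain-map representation are my own; the costs follow the wayfarer’s estimates, and, per the update above, a full A* would also accumulate the movement cost from the starting spot):

# Hypothetical sketch: the terrain map and function names are assumptions;
# the cost table follows the wayfarer's travel-time estimates.
TERRAIN_COST = {'plain': 1, 'forest': 2, 'water': 5, 'desert': 7, 'mountain': 10}

def local_cost(terrain, spot):
    """Cost to move into a spot, based purely on its terrain type."""
    return TERRAIN_COST[terrain[spot]]

def heuristic(spot, goal):
    """As-the-crow-flies guess: diagonal moves allowed, so the larger gap."""
    return max(abs(spot[0] - goal[0]), abs(spot[1] - goal[1]))

def open_spot(terrain, spot, goal):
    """Score an opened spot as described in the text: local plus heuristic."""
    return local_cost(terrain, spot) + heuristic(spot, goal)

# Example: the river spot (4,3) with the city at (9,6) scores 5 + 5 = 10.
terrain = {(4, 3): 'water'}
print(open_spot(terrain, (4, 3), (9, 6)))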

The heuristic scoring for the remaining cost is calculated as it was in the previous case as simply the as-the-crow-flies distance; that is to say, no regard is paid to terrain cost or even to whether the spot is available for traversal.  It turns out that as long as all spots are scored on an even footing with a common heuristic, the details of the scoring don’t matter.

Initially, the A* flirts with moving to the north-east since the local costs are low (being plains) and the total cost is low since the path is moving in the correct direction.  But quickly, the A* loses interest in these paths since their score, once opened, places them near the bottom of the open list.

After some exploration, the A* settles in on the path across the river, up along the western shore, through the forest to the north, and straight on to the east, thus bypassing the mountains and desert completely.

At this point, the A* can pick either (7,5) or (7,6) as the next spot with no penalty, since both take the same amount of time in this geometry.  The selection among equally-scored spots is random, so the A* can meander a bit, picking any of the following paths:

  • (6,6) -> (7,6) -> (8,6) -> goal (9,6)
  • (6,6) -> (7,6) -> (8,5) -> goal (9,6)
  • (6,6) -> (7,5) -> (8,5) -> goal (9,6)
  • (6,6) -> (7,5) -> (8,6) -> goal (9,6),

since they are all equivalent.  For simplicity, I’ve coaxed the A* into the first of the listed paths in the movie below but, as far as our wayfarer is concerned, he gets home as safely by relying on the A* as he does by his own skills as a world-weary traveler.

An A* Drama

Update 6/29/18: Please note that there is an error in this post.  The reckoning of the cost of a space should include the total movement cost from the starting space to the space in question.  Please consult the Tweaks and Kinks in the A* post for details.

I suppose this blog post started first with some analysis I had decided to do on the A* algorithm and then it just grew from there.  I had employed the A* algorithm on a scheduling problem some years back.  This application was advocated by one of the developers on my team but I didn’t have the time to properly learn it during development and deployment.  I ended up using it as a black box with a silent vow to get around to learning it one of these days.

Years passed, and my son was learning it as part of a course he was taking, and I thought now was the time.  So, I purchased AI for Game Developers, by David M. Bourg and Glenn Seemann, and dug into Chapter 7, entitled A* Pathfinding.  If you happen to read their book, you’ll find that the approach here is highly influenced by and patterned after their material.  To my credit, I slogged through a complete example and then animated it in a nice little video.  If you are interested in that, it can be found at the bottom of this post.

The premise of the A* algorithm is that the starting point and ending point for some sort of journey are known but that the specific path from A-to-B, as it were, is not.  The personal picture I like to entertain is embarking on a car trip from New York to Los Angeles.  I know where I am starting and I know where I am to finish, but should I drive through Pennsylvania or Maryland or further south?  Does Kentucky figure into my route?  What about Idaho?  And so on.  The hope is to find the shortest path that connects the start with the end with only a general notion of how far apart the two are.

For simplicity, let’s consider a much simpler problem of a snake trying to find the man who poked him while he dozed on a warm rock in the summer sun.

Our slithering protagonist abstracts the terrain and lays down a grid, with himself starting at the grid point (4,2).  He sees that the man is at the grid point (11,4) so he begins to formulate his revenge.  However, being impatient to return to his sun-bathed nap our hero wants to dispatch the man as soon as possible.  He needs the shortest path.  Being mathematically minded and knowledgeable about algorithms, he decides on using the A* algorithm.

The A* algorithm consists of the following steps:

  1. Assume that the starting spot is open
  2. Take a look around at the nearest neighbor spots
  3. If they can be occupied, open them by
    1. Putting them on the open list for subsequent visit
    2. Score them with a perceived cost to move onto them
    3. Score them with an estimated cost representing how far they are from the destination square
    4. Total the two scores to give a total cost for the spot
    5. Assign a pointer from the open spot to the parent spot
  4. If they can’t be occupied (e.g., they are boundaries or forbidden spots) ignore them
  5. Close the starting spot
  6. Pick the lowest cost open spot on the list (picking arbitrarily if there are ties)
  7. Open and score any unopened neighboring spots
  8. Close the spot and repeat

If one of the open spots is the destination spot, don’t bother scoring it (i.e. give it a score of 0) and back-track from it to the start, following all the pointers.  The resulting path can then be traversed in the forward fashion.

For the simple example examined here, the perceived cost will be taken to be 1 for all possible spots, meaning the speed with which any spot can be crossed is the same.  In general, spots own a terrain cost that may make them more or less expensive than neighboring spots (think of the difference between a highway and a surface street, each with a dramatically different speed limit).

The estimated cost is evaluated using a heuristic that counts the spots between the open spot and the final spot with no regard to barriers or terrain.
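Putting the steps together, a bare-bones Python sketch of the loop might look like the following (the grid representation and the passable callback are my own assumptions; per the update at the top of this post, a full A* would also fold the accumulated movement cost from the start into each score):

import heapq

def a_star(start, goal, passable):
    """Sketch of the A* as described in this post; 'passable' is an assumed
    callback returning True for spots that can be occupied."""
    def heuristic(spot):
        # Diagonal moves count as one, so take the larger coordinate gap.
        return max(abs(spot[0] - goal[0]), abs(spot[1] - goal[1]))

    open_list = [(heuristic(start), start)]   # step 1: the start is open
    parents = {start: None}                   # pointers back to parent spots
    closed = set()

    while open_list:
        _, current = heapq.heappop(open_list)  # step 6: lowest-cost open spot
        if current == goal:                    # back-track along the pointers
            path = []
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]                  # the forward path
        if current in closed:
            continue
        closed.add(current)                    # steps 5 and 8: close the spot
        x, y = current
        for dx in (-1, 0, 1):                  # step 2: the 8 nearest neighbors
            for dy in (-1, 0, 1):
                spot = (x + dx, y + dy)
                if spot == current or spot in parents or not passable(spot):
                    continue                   # step 4: ignore forbidden spots
                parents[spot] = current        # step 3.5: pointer to the parent
                score = 1 + heuristic(spot)    # steps 3.2-3.4: cost of 1 plus
                heapq.heappush(open_list, (score, spot))  # the heuristic guess
    return None                                # no path exists

With a passable callback encoding the barriers, a_star((4,2), (11,4), passable) reproduces the hand-worked scores below; the spot (5,1), for instance, scores 1 + 6 = 7.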

Returning to our intrepid reptile, his spot at (4,2) is now shaded blue, meaning it is in the process of being closed.  He scores the spots at (3,2), (3,1), (4,1) and (5,1) and finds that the total costs are 9, 9, 8, and 7 respectively (the heuristic cost for (4,1) is 7 because it takes seven moves, horizontally, vertically, or diagonally, to get to (11,4); adding the local cost of 1 gives the total of 8).

Of the four spaces he has now opened, he selects (5,1) to next consider as its cost of 7 is the lowest of the bunch.   Spot (4,2) is now closed (shaded gray) and examining spot (5,1) then opens spots at (6,1) and (6,2).

Each of these has a score of 6 so he arbitrarily picks (6,1) to examine, opening the additional spots at (7,1) and (7,2).  Again, he picks arbitrarily between these two spots, since both own a score of 5 and are the cheapest members of the open list.

His full adventure can be found in the video below, including a frame-by-frame look at just how he tries to put his revenge in motion.  Spoiler:  the man gets what’s coming to him in the end.

To Great Lengths

It is well known that developers will go to great lengths to speed up algorithms or optimize code.  Performance, after all, is king.  In general, accelerating throughput comes in three flavors.

The first category covers all those really sophisticated ideas that come from a clear and insightful understanding of an algorithm or a physical model.

The examples that spring to mind in the case of algorithms are the development of solutions to the traveling salesman problem and the invention of the Fast Fourier Transform.  In these cases, solutions to the problem existed but generalizing them so that they would perform well in an arbitrary setting proved to be challenging and elusive.  It took decades of many dedicated people’s time to wrangle out the techniques needed to transform the elementary examples into a working technique.

A keen example of how optimization for a physical model comes only from a deep understanding of the physics is the story that Peter Lepage relayed at a seminar that I attended.  The story is powerful and sticks with me to this day.  At the beginning of the seminar he took a moment to set aside a laptop (1990s vintage) on which he started a ‘job’.  He then proceeded to talk about the evolution of Quantum Chromodynamics (QCD) in the early days after the theory had been proposed.  Every few years, the team he was on would go to the NSF requesting more powerful super-computers with which to solve the QCD problem, giving in return more promises that the solution was just at hand.  After 3 or 4 iterations of this loop, Lepage said that he realized that the team had lived through something like 5 progressions of Moore’s law.  The computing power had increased by a factor of roughly 32 and yet the solution remained out-of-reach.  He then decided that the fault, dear Brutus, lay not in their stars but in themselves, in that their algorithm was all wrong (my paraphrasing of Shakespeare, not his).  At this point, he interrupted the seminar and went over to the laptop.  Examining it, he declared that during the time he was speaking, a small program on his machine had solved a QCD problem and had ‘discovered’ the J/Psi particle and a quark ball.  He went on to explain that the performance gain he was able to create came not from better computing or larger, more powerful computers, but from reworking the algorithm so that the physics (in this case renormalization) was properly considered.

In the second category, are the general optimization techniques that are taught in computer science.  These involve a careful examination of how to do common things like multiplying out a polynomial, sorting a list, convolving data and so on.  Counting operations and thinking about grouping and ordering are the usual ways of tackling these problems.

For example, a common strategy to speed up polynomial evaluation is to refactor the usual way a polynomial is presented so as to minimize the number of multiplications.  The fourth-order polynomial

\[ a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 \]

transforms into

\[ a_0 + x ( a_1 + x ( a_2 + x ( a_3 + a_4 x) ) ) \; ; \]

far less readable but far more efficient.  The former case has 4 additions and 10 multiplies (assuming no calls to ‘pow’ are made).  The latter case still has 4 additions but only 4 multiplies, resulting in a huge savings if the polynomial is evaluated frequently.
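This refactoring is known as Horner’s rule, and a short Python sketch (my own illustration) shows how little code it takes:

def horner(coeffs, x):
    """Evaluate a0 + a1*x + ... + an*x^n as a0 + x*(a1 + x*(a2 + ...)),
    costing one multiply and one add per coefficient."""
    result = 0.0
    for a in reversed(coeffs):  # start from the highest-order coefficient
        result = result * x + a
    return result

# The fourth-order case from the text, with made-up coefficients a0..a4:
print(horner([1.0, 2.0, 3.0, 4.0, 5.0], 0.5))  # 3.5625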

Another example I encountered recently centered on the binning of a large list of floating point data into a smaller array.  A colleague of mine was using one of the standard scientific packages and he was amazed at the performance difference that resulted simply from changing the order of operations.

To illustrate with a concrete case, consider a situation where you have millions of floating point numbers that fall somewhere in the range between 0 and 8.  In addition, you have 8 bins, labeled B0 through B7, into which to place them.

While it may seem natural, a clumsy way to place the numbers within the bins would be to pick a given bin, traverse the list of floats finding all those that belong in that bin, and then start over with the next bin, and so on.  This requires traversing the large list 8 times.   A better way would be to pick a number off of the list of floats and place it into the appropriate bin, requiring only one pass through the list.  These simple choices make a huge difference but it isn’t always easy to tell how to massage a commercial piece of software into doing one versus the other.
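A small NumPy sketch (the data and bin edges here are made up for illustration) contrasts the two strategies:

import numpy as np

rng = np.random.default_rng(42)
data = rng.uniform(0.0, 8.0, size=1_000_000)  # millions of floats in [0, 8)

# Clumsy: one full traversal of the list per bin, 8 passes in all.
counts_slow = np.array([np.count_nonzero((data >= b) & (data < b + 1))
                        for b in range(8)])

# Better: a single pass that drops each value directly into its bin.
counts_fast = np.bincount(data.astype(np.int64), minlength=8)

assert (counts_slow == counts_fast).all()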

At this point, you may be thinking that you understand the previous two categories.  While you may never have the insight to contribute to the first category, you can appreciate the results.  And you most likely can and will apply the techniques of the second in your day-to-day computing tasks.  So what exactly falls into the third category?

Well, the third category comprises all of those incredible ‘tricks’ that only result from equal parts inspiration (first category), standard algorithmic understanding (second category), pride, and outright desperation.

The poster child for this category is the ‘Fast Inverse Square Root’, an algorithm that was used to perform a high-performance estimate of $1/\sqrt{x}$, since this computation is the primary one needed to normalize vectors and compute scalar products for lighting in video games.

The Wikipedia article, which gives the history, context, and analysis of the algorithm, is worth reading and maybe should be mandatory for all game designers and computer scientists.  I’ll not try to cover it here except to make a few observations.

First, the following figure, adapted from the Wikipedia article, shows the short set of lines needed to implement the algorithm
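(The original is a handful of C lines; as a stand-in here, the following is my own Python transliteration, using struct to reproduce the bit-level cast.)

import struct

def fast_inverse_sqrt(x):
    """Transliteration sketch of the famous routine; the original operates
    on 32-bit floats through pointer casting."""
    i = struct.unpack('<I', struct.pack('<f', x))[0]  # reinterpret float bits
    i = 0x5f3759df - (i >> 1)                         # the magic constant
    y = struct.unpack('<f', struct.pack('<I', i))[0]  # back to a float
    y = y * (1.5 - 0.5 * x * y * y)                   # 1st iteration
    # y = y * (1.5 - 0.5 * x * y * y)                 # 2nd iteration (removed)
    return y

print(fast_inverse_sqrt(4.0))  # roughly 0.5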

It may have been fast, but hardly maintainable.  The comments (most amusing) don’t reveal much about how the algorithm works.  Why, precisely, is there a cast from float to integer, what is the origin of the very specific hex value of 0x5f3759df, and to what algorithm do the lines decorated with 1st iteration and 2nd iteration refer?  Apparently, the second question haunted the developer who put in the comment that I sanitized for this venue.  Answers to all of these questions can be found in Wikipedia but it is worth pondering the mindset that led to the lines pictured above.

The second comment, a whimsical afterthought if you will, is the astonishing resemblance between John Carmack, credited with the fast inverse square root, and Gary Cole, the actor who played Bill Lumbergh in Office Space

a movie about the great lengths to which developers will go.

Logically the Best

In keeping with the New Year’s theme, this month’s column lists the five most important accomplishments from the world of logic, mathematics, computing, or philosophy in the last 100 years.  This idea is an offshoot of the books Five Golden Rules: Great Theories of 20th Century Mathematics – and Why They Matter

and Five More Golden Rules: Knots, Codes, Chaos and Other Great Theories of 20th Century Mathematics,

both written by John Casti.  In these works, Casti devotes a chapter to each of the following:

  1. Minimax Theorem (Game Theory)
  2. Brouwer Fixed-Point Theorem (Topology)
  3. Morse’s Theorem (Singularity Theory)
  4. Halting Theorem (Theory of Computation)
  5. Simplex Method (Optimization Theory)
  6. Alexander Knot Polynomial (Knot Theory)
  7. Hopf Bifurcation (Dynamical System Theory)
  8. Kalman Filter (Control Theory)
  9. Hahn-Banach Theorem (Functional Analysis)
  10. Shannon Coding Theorem (Information Theory)

with items 1-5 being from the first book and 6-10 from the second.

The idea of ranking my own top 5 had been pinballing around the back of my mind since I first encountered these books in 2015 and when the opportunity presented itself for this month’s theme, I thought I would seize it and run.  But it is important to be aware that the list I’ve compiled here is my own, inspired by Casti’s work, but independent in both content and analysis.  In fact, I think Casti misses the two most important contributions even with a list of 10 – but that is a matter of judgement and taste.

So, without any more preamble, let’s get to the list.

  5. Kalman Filter

It is hard to overemphasize the Kalman filter’s importance to real time computing.  The algorithm allows for fitting and estimation to be performed, in the presence of noise and uncertainty, during the course of a process, such as the motion of a robotic arm or the flight of a spacecraft.  The filter takes in measured points as they become available and ‘re-fits’ the new point in the context of all previous points thereby giving an improved estimate of the state of the system.  The key point is that the system directly takes into account both the uncertainty in the measurements as well as the short-comings of any physical model of the motion.  The introduction of the Kalman Filter gave rise to a whole industry of algorithm development (extended Kalman Filters, unscented filters, Bayesian networks, etc.) that balance model-based prediction (e.g. Newton’s laws) with uncertainty in knowledge and measurement – the essential ingredient for computation in the real world.  Final note:  Casti also recognizes the Kalman filter.
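To give a flavor of the recursion, here is a bare-bones one-dimensional sketch of my own; a real filter tracks full state vectors with covariance matrices and a dynamics model:

def kalman_update(x_est, p_est, z, r, q=1e-4):
    """One Kalman step: blend the prediction (estimate x_est with variance
    p_est, grown by process noise q) against a measurement z of variance r."""
    p_pred = p_est + q                    # predict: uncertainty grows in time
    gain = p_pred / (p_pred + r)          # how much to trust the measurement
    x_new = x_est + gain * (z - x_est)    # 're-fit' the estimate toward z
    p_new = (1.0 - gain) * p_pred         # updated, smaller uncertainty
    return x_new, p_new

# Noisy measurements of a constant true value of 1.0:
x, p = 0.0, 1.0
for z in [1.1, 0.9, 1.05, 0.97, 1.02]:
    x, p = kalman_update(x, p, z, r=0.1)
print(x, p)  # the estimate converges toward 1.0 as the uncertainty shrinks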

  4. Optimization Techniques

This entry is a catch-all for a variety of algorithms and approaches that have been stirred into action with the advent of the computer.  While not particularly glamorous, optimization methods solve a host of real world problems associated with getting the best; as long as best can be mathematically encoded in something called an objective function.  There are two basic families of optimization algorithms:  indirect and direct.  In the indirect method, additional equations, typically called adjoint equations, are added to the problem to be solved and the response of the new variables in these equations indicates when controls should be applied and how.  In the direct methods, a set of controls are assumed at the onset and their values are adjusted to achieve optimality.  Optimization techniques find application in thousands of practical problems including trajectory optimization, operations research, and economic modeling.  Final note: Casti singles out the simplex method for special recognition.  While this method is of particular importance in both the history of optimization and the particular application to economics, its limitation to linear problems is a significant liability, hence the larger scope of this entry.

  3. Shannon’s Mathematical Model of Information

In 1948, Claude Shannon published A Mathematical Theory of Communication in The Bell System Technical Journal.  This paper, considered a landmark, set the stage for the modern digital age and for the advances in the field of telecommunications (wired and wireless).  Building on the ideas of Nyquist and Hartley, Shannon essentially created the field of information theory and the digital signal processing that follows in its train.  His theory defined the notion of information (the bit), set the theoretical limits on the amount of information that can be transferred, adapted the idea of entropy to communications, and provided a set of theorems that defined what could and could not be done in a communications system.  Final note: Casti recognizes the importance of Shannon’s work in his second book, which increased my own appreciation of it.

  2. Fast-Fourier Transform

Deemed one of the most important numerical algorithms ever created, the fast Fourier transform or FFT is the workhorse of the modern digital age.  There are so many applications that it is hard to list even a fraction of them but some of the most important ones are:  1) signal processing in telecommunications, 2) image construction in medical imaging, and 3) spectrum analysis of scientific data ranging from the mechanics of materials to the response of electrical components to quantum mechanics.  An FFT system resides in virtually every home in the US in high-definition televisions, cellphones, videogame systems, and so on.  Final note:  Casti makes no mention of the FFT that I’ve found.

  1. Gödel’s Theorem

Rounding out the top of this list is the far-reaching analysis on the limits of logic (mathematical and otherwise) that Kurt Gödel provided the world in 1931.  This theorem, along with the closely related theorems due to Tarski (undefinability theorem), Church (undecidability), and Turing (halting theorem), states that formal logic must be either incomplete or inconsistent.  All of these related theorems deal with how logic views or expresses statements about itself and have profound implications on human thought, philosophy, and artificial intelligence.  I also interpret them as saying something essential about man as a rational animal and, therefore, about the human condition as I’ve written in earlier posts (Gödel’s Theorem – General Notions, Turing, Godel, and the Universe, Logic and Limits, Bare Bones Halting, and Self-Reference and Paradoxes).  The full content of these results will no doubt be explored for years to come.  Final note:  Casti focuses more on the halting theorem and its implications for computing.

Throwing Darts

Riding a bike, catching a ball, driving a car, throwing a dart – these and countless other similar, everyday activities involve the very amazing human capacity to target.  This capacity has yet to be duplicated in the machine world even in its most sophisticated applications.

To expand on this concept, take dart throwing as the prototype for analysis.  In the act of hitting a bullseye, a brain, an eye, a hand, a dart, a dartboard, a throw, and a landing all come together to produce a spectacular trajectory where the missile hits exactly where the thrower intends.

How does the person learn to throw accurately?  The equations governing the flight of the dart are nonlinear and subject to a variety of errors.  Clearly practice and feedback from eye to brain to hand are required but how exactly does the training take hold?

While some believe that they know the answer, an honest weighing of the facts suggests that we don’t have more than an inkling into the mechanism by which intelligence and repetition interact to produce a specific skill.  In Aristotelian terms, the mystery lies in how a human moves from having the potentiality to score a bullseye into being able to do it dependably.

What does seem to be clear is that this human mechanism, whatever it actually is, is quite different from any of the mathematical processes that are used in machine applications.  To appreciate just how different, let’s look at one specific mathematical targeting method: differential correction.

Differential correction is the umbrella term that covers a variety of related numerical techniques, all of them based on Newton’s method.  Related terms are the secant method, Newton-Raphson, the shooting method, and forward targeting.  Generally speaking, these tools are used to teach a computer how to solve a set of nonlinear equations (e.g., the flight of a dart, the motion of a car, the trajectory of a rocket, etc.) that symbolically can be written as

\[ \Phi: \bar V \rightarrow \bar G \; ,\]

where $\bar V$ are the variables that we are free to manipulate and $\bar G$ are the goals we are trying to achieve.  The variables are expressed in an $n$-dimensional space that, at least locally, looks like $\mathbb{R}^n$.  Likewise, the goals live in an $m$-dimensional space that is also locally $\mathbb{R}^m$.  The primary situation where these spaces deviate from being purely Cartesian is when some subset of the variables and/or goals are defined on a compact interval, the most typical case being when a variable is a heading angle.

These fine details and others, such as how to solve the equations when $m \neq n$, are important in practice but are not essential for this argument.  The key point is that the mapping $\Phi$ only goes one way.  We know how to throw the dart at the board and then see where it lands.  We don’t know how to start at the desired landing point (i.e., the bullseye) and work backwards to determine how to throw.   Likewise, in the mathematical setting, the equations linking the variables to the goals can’t be easily inverted.

Nonetheless, both the natural and mathematical targeting processes allow for a local inversion in the following way.  Once a trial throw is made and the resulting landing point seen and understood, the initial way the dart was thrown can be modified to try to come closer.  This subsequent trial informs further throws and often (although not always) the bullseye can be eventually reached.

To illustrate this in the differential correction scheme, suppose the flight of the dart depends on two variables $x$ and $y$ (the two heading angles right-to-left and up-to-down) and consider just how to solve these two equations describing where the dart lands on the board (the values of $f$ and $g$):

\[  x^2 y = f \; \]

and

\[ 2 x + \cos(y) = g \; . \]

To strike the bullseye means to solve these equations for a specified set of goals $f^*$ and $g^*$, which are the values of the desired target.  There may be a clever way to solve these particular equations for $x$ and $y$ in terms of $f$ and $g$, but one can always construct so-called transcendental equations where no such analytic solution can ever be obtained (e.g. Kepler’s equation), so let’s tackle them as if no analytic form of the solution exists.

Assuming that we can find a pair of starting values $x_0$ and $y_0$ that get us close to the unknown solutions $x^*$ and $y^*$, we can ‘walk’ this first guess over to the desired initial conditions by exploiting calculus.  Take the first differential of the equations

\[ 2 x y dx + x^2 dy = df \]

and

\[ 2 dx - \sin(y) dy = dg \; .\]

Next relax the idea of a differential to a finite difference – a sort of reversal of the usual limiting process by which we arrive at the calculus in the first place – and consider the differentials of the goals as the finite deviations of the target values from those achieved on the first throw:  $df = f_0 - f^*$ and $dg = g_0 - g^*$.  Likewise, the differentials $dx = x_0 - x^*$ and $dy = y_0 - y^*$.   Since we now have two linear equations in the two unknowns $x^*$ and $y^*$, we can invert to find a solution.  Formally, this approach is made more obvious by combining the system into a single matrix equation

\[ \left[ \begin{array}{cc} 2 x y & x^2 \\ 2 & -\sin(y) \end{array} \right] \left[ \begin{array}{c} x_0 - x^* \\ y_0 - y^* \end{array} \right] = \left[ \begin{array}{c} f_0 - f^* \\ g_0 - g^* \end{array} \right] \; .\]

The solution of this equation is then

\[ \left[ \begin{array}{c} x^* \\ y^* \end{array} \right] = \left[ \begin{array}{c} x_0 \\ y_0 \end{array} \right] - \left[ \begin{array}{cc} 2 x y & x^2 \\ 2 & -\sin(y) \end{array} \right]^{-1} \left[ \begin{array}{c} f_0 - f^* \\ g_0 - g^* \end{array} \right] \; . \]

Since the original equations are non-linear, this solution is only an estimate of the actual values and, by taking these estimates as the improved guess, $x_1$ and $y_1$, an iteration scheme can be used to refine the solutions to the desired accuracy.
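A short sketch of this scheme applied to the dart equations (my own; the seed, tolerance, and iteration cap are arbitrary choices):

import numpy as np

def residual(v, f_star, g_star):
    """How far the current guess lands from the targets."""
    x, y = v
    return np.array([x**2 * y - f_star, 2.0*x + np.cos(y) - g_star])

def jacobian(v):
    """Matrix of partial derivatives from the differentials above."""
    x, y = v
    return np.array([[2.0*x*y, x**2],
                     [2.0, -np.sin(y)]])

def differential_correction(v0, f_star, g_star, tol=1e-12, max_iter=50):
    """Newton iteration: v_{n+1} = v_n - J^{-1} residual(v_n)."""
    v = np.array(v0, dtype=float)
    for _ in range(max_iter):
        r = residual(v, f_star, g_star)
        if np.max(np.abs(r)) < tol:
            break
        v = v - np.linalg.solve(jacobian(v), r)  # the local linear inversion
    return v

# Walk a first guess toward a nearby root of the two equations:
print(differential_correction((1.0, 0.5), f_star=2.5, g_star=4.0))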

Everything rests on two major assumptions:  1) a sufficiently close first guess can be obtained to seed the process and 2) the equations are smooth and can be differentiated.  By examining each of these assumptions, we will arrive at a keen distinction between machine and human thought.

First, the idea of a first guess is so natural to us that it is often hard to even notice how crucial it is.  Just how do we orient ourselves to the dartboard?  The obvious answer is that we look around to find it or we ask someone else where it is.  Once it is in sight, we lock our attention on it and finally we throw the dart.  But this description isn’t really a description at all.  It lacks the depth needed to explain the process to a blind person, let alone to a machine.  In some fashion, we integrate our senses into the process in a highly non-trivial way before even the first dart is thrown.

Returning to the mathematical model, consider how to find $x_0$ and $y_0$ when $f^* = 2.5$ and $g^* = 4.0$.  The traditional way is to make a plot of the mapping $\Phi$, which may be considered as taking the usual Cartesian grid in $x$ and $y$ and mapping it to a set of level curves.  A bit of playing with the equations leads one to the plot

where level curves of $f$ are shown in black and those for $g$ are in blue.  The trained eye can guess a starting point at or around (1.5,1.0) but, for the sake of this argument, let’s pick (1.0,0.5) (the green dot on the figure).  In just a handful of iterations, the differential correction process homes in on the precise solution of (1.57829170833, 1.00361110653) (the black dot on the plot).  But the mystery lies in the training of the eye and not in the calculus that follows after the first guess has been chosen.  What is it about the human capacity that makes looking at the plot such an intuitive thing?  Imagine trying to teach a machine to accomplish this.  The machine vision algorithms needed to look at the plot and infer the first guess still don’t exist, even though the differential correction algorithm is centuries old.

Even more astonishing, once realized, is that the human experience neither deals with symbolically specified equations, nor are the day-to-day physical processes for throwing darts, driving cars, or any of the myriad other things people do strictly differentiable.  All of these activities are subject to degrees of knowledge and control error.  Inexactness and noise pervade everything, and yet wetware is able to adapt and extend in a way that software can’t.  Somehow the human mind deals with the complexity in an emergent way that is far removed from the formal structure of calculus.

So, while it is true that the human and the machine both can target and that targeting entails trial-and-error and iteration, they do it in completely different ways.  The machine process is analytic and local and limited.  The human process is emergent, global, and heuristic.  In addition, we can teach ourselves and others how to do it and even invent the mathematics needed to teach a pale imitation of it to computers, but we really don’t understand how.

Logic and Limits

For those of us who grew up in the seventies and had an interest in things other than sex, drugs, and rock and roll, Star Trek was a mainstay.  The characters, the visuals, the storylines all inspired a generation of scientists, engineers, and explorers.  And no other character in Star Trek epitomized a bright new future full of exciting technology more than did Mr. Spock.  As a Vulcan, his strong embrace of logic influenced many of us to value reason and clear thinking ever since.

But a curious aspect of Vulcans, in general, and Mr. Spock, in particular, is their illogical devotion to logic.  Although the original show frequently emphasized that logic wasn’t enough (often through the intuition or hunches of Kirk, McCoy, or Scott) and that the human experience transcends logic, there is no portrayal that the Vulcans ever examine the basis of logic.  As a race, they never seem to engage in meta-logic, the logical look at the underpinnings of logic and an exploration of what it can do well and where it fails.

Clearly this point is not the central concern for the show nor should it be.  Nonetheless, it is fun to speculate what Spock might have done had the writers known about the limits of logic.  I’m not speaking of something as sophisticated as the implications of Gödel’s theorem.  Rather, I am talking about the simple and obvious issues associated with logic:  the need for reasoning to be based on the finite knowledge of beings, the need for first principles or axioms, and the boot-strapping needed to go from a guess to a hypothesis to a theory with strong evidentiary support.  In short, inductive reasoning.

Induction is the key pillar of scientific advances; it is the method by which all knowledge of the physical world is explored and gathered.  Induction is inherently probabilistic in the way it tries to link cause and effect (think of the number of times Kirk asked Spock for the odds that such-and-such a thing would happen) and has been remarkably successful in helping mankind grope its way to a better understanding of the physical world.  But despite its success, its logical foundations are weak.

A nice discussion about the issues associated with inductive reasoning is found in Chapter 5 of Introduction to Logic, by Harry J. Gensler and I’ve cited Gensler and his discussions in several posts in the past.

What struck me about the arguments found towards the end of Chapter 5 is the remarkably fragile way that the whole logical structure holds together when examined closely.  In other words, logic examining logic finds itself disappointed in two areas – the proper formulation of induction principles and the justification for believing an inductive argument.

The element of probability inherent in induction can lead to some absurd and laughable conclusions.  Gensler offers this statistical syllogism by way of illustration.

60 percent of All Cleveland voters are Democrats
This non-Democrat is a Cleveland voter
This is all we know about the matter
Therefore, it’s 60 percent probable that this non-Democrat is a Democrat

Of course, this is a nonsense argument; but before being quick to dismiss it consider these three areas where arguments like this are not easy to spot and can do damage.  First is the student (at any age) who is learning a subject for the first time.  Unless the entire curriculum is memorization, somewhere along the line the student will need to infer something based on the knowledge in hand.  This step is usually where beginners make their mistakes – jumping to an unwarranted conclusion precisely because their induction skills (their intuition) haven’t been honed within the broader context.  This leads to frustration and, far too often, abandonment of the subject.  Extending this idea to machine-based inference engines, ones whose intuition can’t be improved through learning (at least not in the way humans learn), brings to the forefront the fact that the need for rigorous laws of inductive inference is far more acute.  Even though it makes for engaging science fiction, we don’t want a machine in our lives coming up with an absurd conclusion just because of sloppy rules.  And finally, large components of society are constantly being grifted by arguments of the Cleveland-voter sort.  We see it in politics, advertising, and all the other sophistical interactions surrounding us.  To quote P.T. Barnum, “There’s a sucker born every minute.”  Unfortunately, as Gensler goes to great pains to point out, there doesn’t seem to be a way to make the rules of inductive inference tighter.  Even principles, like Occam’s Razor, are merely guidelines that must be applied with judgement.  Or, as I like to phrase it, there doesn’t seem to be a way to make inductive logic ‘sucker-proof’.

Even more interesting is the fact that justification of inductive arguments seems to be self-referential and circular.  Gensler discusses the criticism leveled by David Hume against induction and the five responses that have arrived over the intervening years.  I won’t go over all five in detail but will confine the discussion to the three most interesting ones.

  1. Justification for induction can be done by presuming nature is uniform, meaning that it works in regular patterns. I happen to favor this justification even though Gensler is correct in insisting that it is hopelessly vague.  Curiously, his most interesting observation is that the idea that nature is uniform is based on arguing from prior experience.  This argument is inductive and thus we have induction justifying induction.
  2. Closely related is the idea that inductive methods have been so profoundly successfully in every aspect of life that they must be correct. Of course, this is another circular argument in which an inductive syllogism is used to justify the inductive method.
  3. The final response Gensler cites is the most interesting – that we approach the justification of inductive logic in the same way we approach the justification for deductive logic. He singles out modus ponens as that particular part of deductive logic worth examining.  Why does modus ponens work and why should we believe it?  This last question is most germane when considering the truth table for modus ponens where the premise is false but the consequent is true ($A \rightarrow B$ is true if $A$ is false but $B$ is true).  In some sense this is an arbitrary convention (although a reasonable one – if I can use a very vague word).  So why should we expect modus ponens to work?  How do we justify it?  He first says that we might try using this syllogism
    If the truth table for modus ponens never gives true premises and a false conclusion, then modus ponens is valid.
    The truth table for modus ponens never gives true premises and a false conclusions.
    Therefore, modus ponens is valid.

    But here we have a problem: we are using modus ponens to justify modus ponens.  The justification, even on the firm ground of deductive logic, leaves something to be desired.

In the end, it seems that we have to simply accept that logic has its limits and that, as a race, we’ve bootstrapped our way to our present level of knowledge.  This holds for inductive and deductive logic, with the former providing the basis for the latter (i.e., experience has taught us that deductive logic axioms are the ones we need to assume to get the whole thing to work).  Gensler concludes with:

Inductive reasoning has been very useful.  Instructively, we assume that it will continue to be useful.  In our lives, we can’t do without it.  But the intellectual basis for inductive reasoning is shaky.

It would have been interesting to hear Mr. Spock discuss these points, to argue these a bit with the more intuitive Kirk.  But perhaps a serious examination would have undermined the entire purpose that he served on the show.  Perhaps it would have undermined the very reason for Vulcan to be, and maybe that would have been an episode worth seeing.

Logic and Politics

There is a widespread belief about politicians and used car salesmen that both species can often speak with a lot of gusto, using sizeable portions of the English language, without actually saying much or anything at all.

Sometimes there is a certain charm in circumlocution. For example, the legendary speech by Noah ‘Soggy’ Sweat Jr. certainly can be whittled down to ‘it all depends’ but at the loss of a great deal of fun and verve.

On other occasions, ambiguity and double-speak can be misleading or even downright deadly. An excellent example of such a situation is in the following vivid exchange taken from Isaac Asimov’s Foundation.

For those unfamiliar with this recognized science fiction classic, the story is set far in the future of mankind in which a vast galactic empire is beginning its decline. The Foundation is a large colony of scientists and researchers whom the empire has exiled to the galactic fringe in order that they may compile an encyclopedia that catalogs the knowledge of the race in advance of the collapse. In actuality, the empire has been tricked into forcing their exile so that the colony can become the nucleus around which the next empire coalesces.

The out-of-the-way planet that the Foundation calls home sits within a set of smaller star systems that are breaking away from the empire as the latter’s influence and dominion recedes. The following excerpt comes from a strategy meeting where the Foundation’s leaders are trying to determine how to respond to the ultimatum they’ve just received from Anacreon, the largest of the breakaway states, in light of recent diplomatic visits of the delegations from Anacreon and the Galactic Empire.

The exchange starts with Salvor Hardin, the Foundation’s politically-savvy mayor, trying to convince the Board of Trustees, who oversee the completion of the encyclopedia, just how precarious their situation is. Hardin’s position is that the Board’s appeal to the Empire for protection was the cause of the threat from Anacreon.

Said Yate Fulham: “And just how do you arrive at that remarkable conclusion, Mr. Mayor?”

“In a rather simple way. It merely required the use of that much-neglected commodity – common sense. You see, there is a branch of human knowledge known as symbolic logic, which can be used to prune away all sorts of clogging deadwood that clutters up human language.”

“What about it?” said Fulham.

“I applied it. Among other things, I applied it to this document here. I didn’t really need to for myself because I knew what it was all about, but I think I can explain it more easily to five physical scientists by symbols rather than by words.”

“The message from Anacreon was a simple problem, naturally, for the men who wrote it were men of action rather than men of words. It boils down easily and straightforwardly to the unqualified statement,…,which in words, roughly translated, is, ‘You give us what we want in a week, or we take it by force.'”

“All right.” Hardin replaced the sheets. “Before you now you see a copy of the treaty between the Empire and Anacreon – a treaty, incidentally, which is signed on the Emperor’s behalf by the same Lord Dorwin who was here last week – and with it a symbolic analysis.”

“As you see, gentlemen, something like ninety percent of the treaty boiled right out of the analysis as being meaningless, and what we end up with can be described in the following interesting manner:

“Obligations of Anacreon to the Empire: None!

“Powers of the Empire over Anacreon: None!”

Later on the group discusses a similar analysis of the visits from the empire’s representative Lord Dorwin

“You know, that’s the most interesting part of the whole business. I’ll admit I had thought his Lordship a most consummate donkey when I first met him – but it turned out that he was actually an accomplished diplomat and a most clever man. I took the liberty of recording all his statements.”

“… The analysis was the most difficult of the three by all odds. When Holk, after two days of steady work, succeeded in eliminating meaningless statements, vague gibberish, useless qualifications – in short, all the goo and dribble – he found he had nothing left. Everything canceled out.”

“Lord Dorwin, gentlemen, in five days of discussion didn’t say one damned thing, and said it so you never noticed.”

Not all applications of symbolic logic are as dramatic and interesting as the one Asimov depicts. Nonetheless, even though there may not be any cosmic significance, symbolic logic can be a tool that makes life easier and reasoning and comprehension more clear.

Suppose, for example, that you have a friend who says

If it is raining and either it is not raining or it is snowing then it is snowing.

What do you make of that statement? What does it mean? Does it ever make sense? Trying to parse the sentence is nearly impossible – at least in its fully decorated language form. Symbolic logic lets us, much like Mayor Hardin, strip away all the nonsense and come to some supported conclusion about your friend’s ability to communicate.

To apply it, first take the basic pieces and represent them with simple symbols. For this example, let $p$ mean ‘it is not raining’ and $q$ ‘it is snowing’. Your friend’s cryptic statement is symbolically represented as:

\[ \left[ \neg p \wedge (p \vee q) \right] \rightarrow q \; ,\]

where $\neg$ means not (i.e. $\neg p$ means it is raining), $\wedge$ is the logical and, $\vee$ is logical or, and $\rightarrow$ is the usual if-then relation.

Having translated the cryptic statement into symbols, we can now manipulate it using the standard rules of propositional logic.

The first step is to rewrite the if-then implication in its ‘or’ form

\[ \neg \left[ \neg p \wedge (p \vee q) \right] \vee q \; . \]

Then use de Morgan’s rule to bring the negation inside

\[ \left[ \neg \neg p \vee \neg (p \vee q) \right] \vee q \; . \]

Next use the double negation to simplify the first term

\[ \left[ p \vee \neg(p \vee q) \right] \vee q \; \]

and then use de Morgan’s rule again

\[ \left[ p \vee (\neg p \wedge \neg q) \right] \vee q \; . \]

The law of distribution is the next step

\[ \left[ (p \vee \neg p) \wedge (p \vee \neg q) \right] \vee q \; . \]

From the basic laws of logic $p \vee \neg p \equiv T$, since a proposition is either true or false, so the disjunction of a proposition with its negation is always true (a tautology). This observation yields

\[ \left[ T \wedge (p \vee \neg q) \right] \vee q \; . \]

Next apply $T \wedge p \equiv p$, and our original statement is now

\[ (p \vee \neg q) \vee q \; , \]

which in English reads something like ‘either it is not raining or it is not snowing or it is snowing’. Still a bit confusing to parse but certainly much simpler. Fortunately, we can continue analyzing the statement further.

Using distribution again gives

\[ (p \vee q) \vee (\neg q \vee q ) \; , \]

which becomes

\[ (p \vee q ) \vee T \]

when $\neg q \vee q \equiv T$ is used.

Finally, as noted earlier, $anything \vee T \equiv T$ and the original statement boils down to always being true. Your friend has uttered a tautology and, for one brief moment, shown himself to be worthy of being called ‘an accomplished diplomat and a most clever man’ who is able to avoid saying ‘one damned thing’ and of saying ‘it so you never noticed’.
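The whole derivation can be double-checked by brute force; this tiny Python sketch (mine) enumerates the truth table:

from itertools import product

def implies(a, b):
    """Material implication: a -> b is false only when a is true and b false."""
    return (not a) or b

# Here p stands for 'it is not raining' and q for 'it is snowing'.
for p, q in product([True, False], repeat=2):
    statement = implies((not p) and (p or q), q)
    print(p, q, statement)  # True on every row: a tautology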

Interfaces and Thoughts

This month’s column is, on the face of it, a whimsical undertaking.  I’ll be examining the man-machine interface on two different cars – a 2004 Saturn Ion and a 2000 Infiniti I30.  Yes, you are correct!  I own some old cars and, just in case your thoughts wander into the notion that at least there is a luxury car in the mix, please note that the Infiniti was purchased used, with over 100,000 miles on it.  Anyway, the central theme of this post is not what kind of cars I own and drive but rather what the design of the car tells about the designer.

The idea that the form and function of a work reveals a portion of the mind of the designer is old, very old, and is one of the central arguments from the medieval synthesis for the existence of God.  I’m not claiming any deep metaphysical insights from the discussion here but I do think there are interesting reflections on the human mind in this very tongue-in-cheek analysis.

To start, let me say that the Infiniti is the better designed car from the point of view of acceleration, handling, and space.  It has a V6 compared to the Ion’s 4-cylinder inline engine.  It has leather seats versus the Saturn’s cloth ones.  And the list could go on.  Nonetheless, the man-machine interface in the Saturn is hands down better.

Let’s start with the power locks on the doors.  Here is a picture of the locking control for the I30:

and the corresponding look at the Ion:

While they are very close in design they don’t work the same way at all – leading to a great deal of confusion whenever I switch.  The icon of the black key on the white background means unlock on the Saturn and lock on the Infiniti.  And that difference speaks to how the people who designed these controls think.

I am quite aware that there are cultural differences between the Japanese and US minds, but these shouldn’t have come into play.  That isn’t to say that such differences aren’t important or valid, nor that the resulting man-machine interface isn’t influenced by them, but rather that on this point there should be universal agreement across both designs.

The primary purpose of any key, regardless of the time and place in which it is used, is to unlock a locked door.  In certain older circumstances keys are also used to lock the door again, but this lesser purpose should not be the one that the designers seized upon in their design.  This is because, for reasons of safety and security, a car is something that is generally in a locked state unless being entered or exited.  Once in a car, especially one that has power locks, the notion of key as the locking agent becomes almost meaningless.  In addition, cars are imported and exported around the world and international standardizations are commonplace.  Thus, the only possible conclusion is that the Ion gets it right and the Infiniti gets it wrong.  This conclusion also suggests that the Infiniti designers were perhaps focused on the Japanese interpretation and not on how their product would be used in the global market.

Of course, there are those of you who still object that this is simply a matter of convention; that the functional argument is not persuasive.  I must admit that when I first encountered this difference I wasn’t swayed either.  The interface that really drew out the difference and pushed my thinking from equivocation to certainty is the windshield wiper control.  Here the problem isn’t embracing a difference between conventions but a matter of internal consistency.  And again, the Saturn is the clear winner.

To be concrete, here is a photo of the interface to the wiper system from the Ion

and the I30

Again, the two designs look very similar – too similar, in fact, to keep me from getting confused.  Both designs sport the traditional iconography of a dashed line (- -) for intermittent wipers, a solid line (—) for continuous low, and a heavy solid line for continuous high.  A careful examination reveals that the direction in which one must articulate the control differs; on the Ion, the wipers go from intermittent to low to high by pushing the lever upwards while on the I30 the same sequence results by pushing down.  Again, the difference seems to be one of convention but we haven’t discussed the intermittency setting and it is here that the I30 shows itself to be inconsistent.

Before getting to the inconsistency, there is one more matter of convention that differs between the two.  Both controls sport a dial bearing graduated lines (the white lines on the right that are wider at the top and taper to almost nothing at the bottom) that set the speed of the intermittent wipers.  For the I30, the larger the line the larger the time gap between successive swipes by the wipers.  For the Ion, the larger the line the smaller the time gap between successive swipes.  So their conventions are dual to each other, with the I30 working in terms of time and the Ion in terms of frequency.

The inconsistency rears its head when the lever and dial are used in tandem (i.e. when using intermittent wipers).  In the I30, higher frequency is obtained by pushing the lever down but by turning the dial up.  On the Ion, up means the same thing for both lever and dial.  And that, in a nutshell, is why the man-machine interface of the Ion is better than that of the I30, despite the I30 being a better car overall.

So, what do these design choices reveal about the minds of the designers?  In the case of the Ion, it seems to show that there was one single guiding mind or principle.  Whether or not you prefer the design choices of the Ion over the I30, there is no arguing that the Ion design is self-consistent.  White icons always mean changing the state to something more active from the previously passive state.  For the doors, from locked (passive) to unlocked (active); for the wipers, from off (passive) to low frequency (active) to high frequency (more active).  In the case of the I30, the design is a hodge-podge of concepts with differing motifs and little consistency.  This suggests that no single guiding principle knitted together the design of the man-machine interface.  Part of this is, no doubt, cultural, but part of it seems to be indicative of a company that puts pride in the subsystems but fails to put as much emphasis on knitting the systems together into a seamless whole.

K-Means++ Data Clustering: Improved Seeding

This month’s column is a follow on to last month’s look at the K-means data clustering.  Again, I’ll be looking at James McCaffrey’s Test Run article K-Means++ Data Clustering, published in MSDN in August 2015.   

In the last post, I confined my analysis to the K-means data clustering algorithm using McCaffrey’s hand-picked data points that showed that the heights and weights of a sample of some population aggregated into 3 distinct groups of clusters.  Once the data were normalized, the human eye easily picks out these clusters but the machine, not so much.  The purpose of the K-means clustering was to provide the machine with a mechanism where it might mimic, at least in the most primitive of senses, the ability of the human eye to correlate a 2-dimensional image.

Of course, McCaffrey’s sample was designed specifically to illustrate the techniques with no gray areas that might confuse the person, let alone the computer.  Nonetheless, the algorithm completely failed to find the clusters around 20-30% of the time; an estimate of my own based on a semi-empirical method of restarting the algorithm a number of times and simply counting the number of times that the clustering gave a color-coded result that looked like the following figure (each color denoting a cluster).

The problem with the algorithm is one of seeding the process.   For each of the $k$ clusters chosen as a guess by the user, the algorithm needs to start with a guess as to where the centers of those clusters lie.  The K-means algorithm seeds the process by selecting samples from the distribution at random to give the initial guess as to the cluster centers.  If this random guess selects two nearby points as two distinct centers then a situation like the one shown above arises.

The ‘++’ in the K-means++ algorithm denotes an improvement to the seeding process to avoid this undesirable outcome.  The technique is called roulette wheel selection and it works as follows (a sketch in code appears after the list).

  1. The first center is chosen at random
  2. The distance $d_{i,1}$ of each remaining point from this center is computed, typically using the Euclidean norm
  3. An array is formed with the cumulative sum of the $d_{i,1}$ normalized by the total $D = \sum_i d_{i,1}$, and the ith element of the array is associated with the corresponding point
  4. A random number is drawn from a uniform distribution and the first point in the array whose cumulative probability is greater than the drawn random number is chosen as the second center
  5. Steps 2-4 are repeated, measuring each point’s distance to its nearest already-chosen center, until all $k$ centers are chosen
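
Here is a minimal numpy sketch of the five steps above (the function name roulette_seed and the data layout, an N-by-2 array of points, are my own; note also that the canonical K-means++ weights points by the squared distance, whereas this sketch follows the plain distances used in the list):

import numpy as np

def roulette_seed(data, k):
    # step 1: pick the first center at random
    centers = [data[np.random.randint(len(data))]]
    while len(centers) < k:
        # step 2: each point's distance to its nearest already-chosen center
        d = np.min([np.linalg.norm(data - c, axis=1) for c in centers], axis=0)
        # step 3: cumulative sum of the distances, normalized by the total D
        cum = np.cumsum(d) / np.sum(d)
        # step 4: first point whose cumulative probability exceeds a uniform draw
        centers.append(data[np.searchsorted(cum, np.random.rand())])
    # step 5: the loop repeats steps 2-4 until all k centers are chosen
    return np.array(centers)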

The roulette wheel selection implementation is not particularly hard, but there are questions about performance; rather than worrying about those details, I chose to use the roulette wheel seeding built into scikit-learn’s KMeans, since it is widely accepted and I eventually threw much larger datasets at the algorithm than the 20-element height/weight data.  Beyond the obvious import of the package (from sklearn.cluster import KMeans), the key lines for the implementation were:

kmeans_cluster = KMeans(n_clusters=<k>, n_init=<n>).fit(<data>)
point_labels = kmeans_cluster.labels_

The parameters are:

  • n_clusters = <k> sets the number of clusters (aka centers or centroids) the data are believed to possess to the user-supplied value of <k>
  • n_init = <n> sets the number of completely separate tries that the algorithm gets at the clustering to the user-supplied value of <n>
  • <data> are the set of values in N-dimensions that are clustered

Before discussing the results, it is worth saying a few words about the n_init setting.  The default value in KMeans is 10, meaning that scikit-learn’s implementation starts with a fresh seeding of the clusters on 10 separate runs through the data and keeps the best one.  Since I was interested in how well the initial seeding using the K-means++ approach (roulette wheel selection) performed versus the traditional K-means approach (random selection), I set n_init = 1 for the initial testing.  Finally, I wrapped the whole process of seeding, clustering, and producing a figure into a loop, with everything reset at the beginning of each pass.  In 100 trials, with n_clusters = 3, n_init = 1, and <data> being the height/weight data as before, the algorithm correctly picked out the 3 distinct clusters each time.  The randomness of the seeding was verified by the cluster color coding changing at each iteration.
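
For reference, a loop along those lines might look like the following sketch (the plotting calls are illustrative and the variable data stands for the normalized height/weight array):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

for trial in range(100):
    # one fresh K-means++ seeding per pass (n_init = 1)
    kmeans_cluster = KMeans(n_clusters=3, n_init=1).fit(data)
    plt.scatter(data[:, 0], data[:, 1], c=kmeans_cluster.labels_)
    plt.show()  # each figure is then inspected for a correct 3-cluster result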

While this test clearly demonstrated the improvement that the roulette wheel selection provides, it is worth noting that the seeding can’t be perfect, since the algorithm can’t determine the number of clusters on its own (the user must supply <k>), nor can the seeding process be guaranteed to always work (hence the default of n_init = 10).

Indeed, Kmeans++ can often be fooled even when the number of clusters is obvious and they are quite distinct from each other.  Consider the following set of points

that form two distinct clusters of random realizations in two dimensions (the red from two different-sized Gaussians and the green from a tight Gaussian along the radial direction and a uniform distribution in the azimuthal direction).
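
A set of points along these lines can be generated as follows (all the counts, means, and widths here are made-up values for illustration, not the ones behind the post’s figure):

import numpy as np

np.random.seed(42)  # illustrative seed
# red cluster: two Gaussians of different sizes about the origin
red = np.vstack([np.random.normal(0.0, 0.5, size=(100, 2)),
                 np.random.normal(0.0, 1.5, size=(100, 2))])
# green cluster: a tight Gaussian in radius, uniform in azimuth (i.e. a ring)
r     = np.random.normal(6.0, 0.3, size=200)
theta = np.random.uniform(0.0, 2.0*np.pi, size=200)
green = np.column_stack([r*np.cos(theta), r*np.sin(theta)])
points = np.vstack([red, green])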

Sadly, scikit-learn’s Kmeans implementation (with n_init = 10) has no apparent ability to find these clusters, as seen in the following comical figures

This example was derived from one of the discussion points raised in the Stack Overflow post machine learning – How to understand the drawbacks of K-means, which has lots of other examples of Kmeans coming up short.

Just for fun, I also asked Kmeans to cluster these points using n_clusters = 3 and got

which goes to show that the algorithm is really intelligent. 

To close, consider the following two points.  First, as has been pointed out by many people in many forums, no algorithm, including Kmeans++, is bulletproof.  The aphorism

When averaged across all possible situations, every algorithm performs equally well. – attributed to Wolpert and Macready

best summarizes the fact that Kmeans should be only one of many tools in an ever-growing toolbox.

Second, in a humanistic vein, consider the marvelous attributes of the human being that allow a person to glance at the above figure and not only identify the number of clusters but also find their edges, or conclude that the task is hopeless and go off for a snack.  That’s true intelligence.

K-Means Data Clustering

This month’s column is inspired by James McCaffrey’s Test Run article K-Means++ Data Clustering, published in MSDN in August 2015.   McCaffrey’s piece is really two articles rolled into one and is interesting not only for the data clustering algorithm that it presents but also for the fascinating light it sheds on the human facility to visually analyze data (not that McCaffrey comments on this point – it’s just there for the careful observer to note).  But before getting to the more philosophical aspects, it’s necessary to discuss what K-means data clustering is all about.

To explain what data clustering entails, I will use the very same data that McCaffrey presents in his article as the test set for mine.  He gives a set of ordered pairs of height (in inches) and weight (in pounds) for 20 people in some sampled population.  The table listing these points

Height (in)   Weight (lbs)
65.0          220.0
73.0          160.0
59.0          110.0
61.0          120.0
75.0          150.0
67.0          240.0
68.0          230.0
70.0          220.0
62.0          130.0
66.0          210.0
77.0          190.0
75.0          180.0
74.0          170.0
70.0          210.0
61.0          110.0
58.0          100.0
66.0          230.0
59.0          120.0
68.0          210.0
61.0          130.0

doesn’t reveal anything special at a glance; nor should it be expected to.  As every good data scientist knows, the best (human) approach is to plot the data so that the eye can pick out patterns that the mind may fail to perceive in the tabulated points.  The resulting plot

reveals that there are three distinct clusters in the data.  McCaffrey doesn’t show a plot of these data, but it seems that he omitted it because he wanted to focus on how to teach the computer to find these three groups using K-means clustering without the benefit of eyes and a brain, which it obviously doesn’t have.

The principle behind K-means clustering is quite simple:  a group of points forms a cluster if all the points are closer to the center of that group than they are to the centers of the other groups.

The implementation is a bit more difficult and tricky.  Distance plays a key role in this algorithm, but whereas the human eye can take in the entire plot at one go and can distinguish the center of each cluster intuitively, the computer must work through the points one at a time.  This leads to two problems:  1) how to numerically calculate distances between points and 2) how to find the centers of the clusters from which these distances are measured.  Let’s deal with each of these.

The fact that the units for the x-axis are different from those on the y-axis can lead to some problems identifying the clusters – for both the human and the machine.  Consider the following plot of the same data but with the x- and y-axes given the same extent from 50 to 200 in their respective units:

It isn’t at all clear that there are three distinct clusters as opposed to two.  A better approach is to normalize the data in some fashion.  I prefer using the Z-score where the data are normalized according to

\[ Z_x = \frac{x - \langle x \rangle}{\sigma_x} \; , \]

where $\langle x \rangle$ is the mean over all the values and $\sigma_x$ is the corresponding standard deviation.  Applying the normalization to the height and the weight gives

Once the data are normalized, the distance between them is determined by the usual Euclidean norm.
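
In numpy, the Z-score normalization is a one-liner (the array name data is illustrative, holding the 20 height/weight pairs):

import numpy as np

# z-score each column: subtract the column mean, divide by the column standard deviation
z_data = (data - data.mean(axis=0)) / data.std(axis=0)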

The second step is to determine the centers of each cluster, which is essentially a chicken-and-egg problem.  On one hand, one needs the centers to determine which points belong to which cluster.  On the other hand, one needs to have the points clustered in order to find their centers.  The K-means approach is to guess the centers and then iterate to find a self-consistent solution.

I implemented the K-means clustering code in a Jupyter notebook using Python 2.7 and numpy.  I defined 2 helper functions.

The first one determines the cluster into which each point is classified based on the current guess of the centers.  It loops over all points and compares each point’s distance from the k centers.  The point belongs to the closest center, and the id number of that center is recorded in a list, which is returned.

import numpy as np

def find_cluster(curr_means,data):
    num_pts      = len(data)
    k            = len(curr_means)
    cluster_list = np.zeros(num_pts,dtype=int)
    dist         = np.zeros(k)
    for i in range(num_pts):
        for m in range(k):
            # squared Euclidean distance from point i to center m
            # (squaring doesn't change which center is closest)
            l       = data[i] - curr_means[m]
            dist[m] = l.dot(l)
        # assign the point the id of its nearest center
        cluster_list[i] = np.argmin(dist)
    return cluster_list

The second helper function updates the position of the centers by selecting, from the list of all points, those that belong to the current cluster.  This function depends heavily on the slicing and numerical functions built into numpy.

def update_means(k,cluster_list,curr_data):
    updated_means = np.zeros((k,2))
    for i in range(k):
        # pull out just the points assigned to cluster i
        clustered_data     = curr_data[np.where(cluster_list == i)]
        # the new center is the mean of those points, component by component
        updated_means[i,:] = np.array([np.mean(clustered_data[:,0]),np.mean(clustered_data[:,1])])
    return updated_means

For this implementation, I chose 3 clusters (k = 3) and seeded the guess for the centers by randomly selecting from the list of the initial points.  Subsequent updates moved the centers off the tabulated points.  The update process was iterated until no changes were observed in the positions of the cluster centers.  The final cluster assignments were then used to color code the points based on the cluster to which they belong.
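
The driver itself isn’t shown here, but a minimal loop consistent with that description would look something like this (the variable names are my own; z_data is the normalized array from earlier):

np.random.seed(0)  # illustrative seed
k = 3
# seed the centers with k distinct points chosen at random from the data
means = z_data[np.random.choice(len(z_data), k, replace=False)]
while True:
    cluster_list = find_cluster(means, z_data)    # assign each point to its nearest center
    new_means    = update_means(k, cluster_list, z_data)
    if np.allclose(new_means, means):             # stop when the centers no longer move
        break
    means = new_means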

The first run of the algorithm was successful, with the resulting plot indicating that the process had done a good job of identifying the clusters, much as the human eye would.

Subsequent runs showed that about 70-80% of the time, the algorithm correctly identifies the clusters.  The other 20-30% of the time, the algorithm gets ‘stuck’ and misidentifies the points in a way that tends to split the small cluster in the bottom left into two pieces (although not always in the same way).

The ‘++’ nomenclature featured in McCaffrey’s article (K-means++ clustering) refers to a more sophisticated way of selecting the initial cluster centers that minimizes the chances that the algorithm will get confused.  This new type of seeding, and some numerical experiments with a larger data set (in both number of points and dimensionality), will be the subject of next month’s column.