
The Power of Imagination

A couple of weeks ago, I wrote about the subtle difficulties surrounding the mathematics and programming of vectors.  The representation of a generic vector by a column array made the situation particularly confusing, as one type of vector was being used to represent another type.  The central idea in that post was that the representation of an object can be very seductive; it can cloud how you think about the object, or use it, or program it.

Well, this idea about representations has, itself, proven to be seductive, and has led me to think about the human capacity that allows imagination to imbue representations of things with a life of their own.

To set the stage for this exploration, consider the well-known painting entitled The Treachery of Images by the Belgian painter René Magritte.

Magritte_pipe

The translation of the text in French at the bottom of the painting reads “This is not a pipe.”  Magritte’s point is that the image of the pipe is a representation of the idea of a pipe but is not a pipe itself; hence his choice of the word ‘treachery’ in the title of his painting.

Of course, this is exactly the point I was making in my earlier post, but a complication in my thinking arose that sheds a great deal of light on the human condition and has implications for true machine sentience.

I was reading Scott McCloud’s book Understanding Comics, when he presented a section on what makes sequential art so compelling.  In that section, McCloud talks about the inherent mystery that allows a human, virtually any human old enough to read, to imagine many things while reading a comic.  Some of the things that the reader imagines include:

  • Action takes place in the gutters between the panels
  • Written dialog is actually being spoken
  • Strokes of pencil, pen, and color are actually things.

You, dear reader, are also engaging in this kind of imagining.  The words you are reading – words that I once typed – are not even pen and pencil strokes on a page.  The whole concept of page and stroke is, of course, virtual: tracings of different flows of current and voltage humming through micro-circuitry in your computer.

Not only is that painting of Magritte’s shown above not a pipe, it’s not a painting.  It is simply a play of electronic signals on a computer monitor and a physiological response in the eye.  And yet, how is it that it is so compelling?

What is the innate capacity of the human mind and the human soul to be moved by arrangements of ink on a page, by the juxtaposition of glyphs next to each other, by movement of light and color on a movie screen, by the modulated frequencies that come out of speakers and headphones?  In other words, what is the human capacity that breathes life into the signs and signals that surround us?

Surely someone will rejoin “it’s a by-product of evolution” or “it’s just the way we are made”.  But these types of responses, as reasonable as they may be, do nothing to address the root faculty of imagination.  They do nothing to address the creativity and the connectivity of the human mind.

As a whimsical example, consider this take on Magritte’s famous painting, inspired by the world of videogames.

Mario_pipe

Humans have an amazing ability to connect different ideas by some tenuous association and find a marvelous (or at least funny) new thing.  The connections that lead from the ‘pipe’ you smoke to the virtual ‘pipe’ in Mario Brothers are obvious to anyone who has been exposed to both of them in context.  And yet, how do you explain them to someone who hasn’t?  Even more interesting:  how do you enable a machine to make the same connection, to find the imagery funny?  In short, how can we come to understand imagination and, perhaps, imitate it?

Maybe we really don’t want machines that actually emulate human creativity, but we won’t know or understand the limitations of machine intelligence without more fully exploring our own.  And surely one vital component of human intelligence is the ability to flow through the treachery of images into the power of imagination.

Balance and Duality

There is a commonly used device in literature in which big, important events start small.  I don’t know if that’s true.  I don’t know if small things are heralds of momentous things, but I do know that I received a fairly big shock from a small, almost ignorable footnote in a book.

I was reading through Theory and Problems in Logic, by John Nolt and Dennis Rohatyn, when I discovered the deadly aside.  But before I explain what surprised me so, let me say a few words about the work itself.  This book, for those who don’t know, is a Schaum’s Outline.  Despite that, it is actually a well-constructed outline on Logic.  The explanations and examples are quite useful and the material is quite comprehensive.  I think that the study of logic lends itself quite nicely to the whole approach of Schaum’s, since examples seem to be the heart of learning logic and the central place where logicians tangle is over some controversial argument or curious sentence like ‘this sentence is false’.

As I was skimming Nolt and Rohatyn’s discussion about how to evaluate arguments, I came across this simple exercise:

Is the argument below deductive?

Tommy T. reads The Wall Street Journal
$\therefore$ Tommy T. is over 3 months old.

– Nolt and Rohatyn, Theory and Problems in Logic

Their answer (which is the correct one) is that the argument above is not deductive.  At the heart of their explanation for why it isn’t deductive is the fact that while it is highly unlikely that anyone 3 months old or younger could read The Wall Street Journal, nothing completely rules it out.  Since the concept of probability enters into the argument, it cannot be deductive.

So far so good.  Of course, this is an elementary argument so I didn’t expect any surprises.

Nolt and Rohatyn go on to say that this example can be made to be deductive by the inclusion of an additional premise.  This is the standard fig-leaf of logicians, mathematicians, and, to a lesser extent, scientists the world over.  If at first your argument doesn’t succeed, redefine success by axiomatically ruling out all the stuff you don’t like.  Not that that approach is necessarily bad; it is a standard way of making problems more manageable but usually causes confusion in those not schooled in the art.

For their particular logical legerdemain, they amend the argument to read

All readers of The Wall Street Journal are over 3 months old.
Tommy T. reads The Wall Street Journal
$\therefore$ Tommy T. is over 3 months old.

– Nolt and Rohatyn, Theory and Problems in Logic

This argument is now deductive because they refuse to allow the possibility (no matter how low in probability) that anyone amongst us who is 3 months old or younger can read The Wall Street Journal.  By simple pronouncement, they elevate to metaphysical certitude the idea that such youngsters can’t.

Again there are really no surprises here, and this technique is a time-honored one.  It works pretty well when groping one’s way through a physical theory where one may make a pronouncement that nature forbids or allows such and such, and then one looks for the logical consequences of such a pronouncement.  But a caveat is in order.  This approach is most applicable when a few variables have been identified and/or isolated as being the major cause of the phenomenon that is being studied.  Thus it works better the simpler the system under examination is.  It is more applicable to the study of the electron than it is to the study of a molecule.  It is more applicable to the study of the molecule than to an ensemble of molecules and so on.  By the time we are attempting to apply it to really complex systems (like a 3-month-old) its applicability is seriously in doubt.

Imagine, then, my surprise at the innocent little footnote associated with this exercise, which reads

There is, in fact, a school of thought known as deductivism which holds that all of what we are here calling “inductive arguments” are mere fragments which must be “completed” in this way before analysis, so there are no genuine inductive arguments

– Nolt and Rohatyn, Theory and Problems in Logic

Note the language used by the pair of logicians.  Not that the deductivism school of thought wants to minimize the use of inductive arguments or maximize the use of deductive ones.  Not that its adherents want to limit the abuses that occur in inductive arguments.  Nothing so cautious as that.  Rather the blanket statement that “there are no genuine inductive arguments.”

A few minutes of exploring on the internet led me to a slightly deeper understanding of the school of deductivism, but only marginally so.  What could be meant by ‘no genuine inductive arguments’?  A bit more searching led me to some arguments due to Karl Popper (see the earlier column on Black Swan Science).

These arguments, as excerpted from Popper’s The Logic of Scientific Discovery, roughly summarized, center on his uneasiness with inductive methods as applied to the empirical sciences.  In his view, an inference is called inductive if it proceeds from singular statements to universal statements.  As his example, we again see the black-swan/white-swan discussion gliding to the front.  His concern is for the ‘problem of induction’ defined as

[t]he question whether inductive inferences are justified, or under what conditions…

-Karl Popper, The Logic of Scientific Discovery

Under his analysis, Popper finds that any ‘principle of induction’ that would solve the problem of induction is doomed to failure since it would necessarily be a synthetic statement, not an analytic one.  From this observation, one would then need a ‘meta principle of induction’ to justify the principle of induction and a ‘meta-meta principle of induction’ to justify that one and so on, to an infinite regress.

Having established this initial work, Popper jumps into his argument for deductivism with the very definite statement

My own view is that the various difficulties of inductive logic here sketched are insurmountable.

-Karl Popper, The Logic of Scientific Discovery

And off he goes. By the end, he has constructed an argument that banishes inductive logic from the scientific landscape, using what, in my opinion, amounts to a massive redefinition of terms.

I’ll not try to present any more of his argument.  The interested reader can follow the link above and read the excerpt in its entirety.  I would like to try to ask a related but, in my view, more human question.  To what end is all this work leading?  I recognize that it is important to understand how well a scientific theory is supported.  It is also important to understand the limits of knowledge and logic.  But surely, human understanding and knowledge are not limited by our scientific theories nor are they adequately described by formal logic.  Somehow, human understanding is a balance between intuition and logic, between deduction and induction.

Popper’s critiques sound too much like someone obsessing over getting the thinking just so without stopping to ask whether such a task is worth it.  Scientific discovery happens without the practitioners knowing exactly how it happens and what to call each step.  Should that be enough?

Of course, objectors to my point-of-view will be quick to point out all the missteps that logicians can see in the workings of science – all the black swans that fly in the face of a white-swan belief.  My retort is simply “so what?”

Human existence is not governed solely by logic nor should it be.  If it were, a part of the population would be frozen in indecision because terms were not defined properly, another part would be stuck in an infinite loop, and the last part would be angrily arguing with itself over the proper structure.  There is a duality between induction and deduction that works for the human race – a time to generalize from the specific to the universal and a time to deduce from the universal to the specific.

Perhaps someday, someone will perfect deductivism in such a way so that scientific discovery can happen efficiently without all the drama and controversy and uncertainty.  Maybe… but I doubt it.  After all, we know that we humans aren’t perfect – why should we expect one of our enterprises to be perfectible?

Images, Representations, and Programming

It’s an old idea.  Someone you know holds up a photograph depicting something familiar, say a beautiful car, maybe a Corvette, and asks you “what is this?”.  You answer, “it’s a Corvette,” and are greeted with the cheeky response, “No, silly, it’s a picture.”

This simple joke, while annoying, makes an important philosophical point about keeping a clear distinction between the image of a thing and the thing itself.  As important as this distinction is for basic reasoning and logic, it is much more important to keep it straight in the practice of mathematics and computing – particularly in the study of vectors.

Formally, a vector is any kind of object that belongs to a class of like objects that all ‘obey’ a set of rules that define how they combine to form new objects also in the same class.  For simplicity, a vector will be denoted in underlined, bold face but, as will be discussed below, there are other common ways to denote vectors, all of which suffer from the ‘Corvette problem’ above.  The set of combination rules are:

  1. There is a combination rule ‘+’ such that U + V is a vector if U, V are vectors
  2. U + V = V + U (order doesn’t matter)
  3. U + (V + W) = (U + V) + W (the combination rule is associative)
  4. 0 + U = U + 0 = U (there is a zero vector)
  5. U + (-U) = 0 (there is a way to add up vectors to get a zero one)
  6. There is a scaling rule such that the product kU is a vector, (k is an ordinary complex number)
  7. k(U + V) = kU + kV
  8. (k+l)U = kU + lU (where k & l are ordinary complex numbers)
  9. k(lU) = (kl)U
  10. 1U = U

Some purist out there may object and point out that only occasionally does an author actually enumerate all 10 items above separately (even though such a purist will concede that all 10 must be there in some form or another).  The purist may also go on to say that some authors prefer 0U = 0 to rule #5.  But none of these details are particularly important.

What is important is that the rules are abstract and simple.  They apply equally well to vectors defined as directed arrows as they do to vectors defined as column arrays of numbers as they do to vectors defined in terms of partial derivatives.  They apply equally well to vectors that we can observe and touch, for example pulls and pushes on an object, as they do to those that live in an abstract space like column arrays or partial derivatives, whose sole existence is built from ideas in the mind and symbols on the page.
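To make the abstraction a bit more tangible, here is a minimal sketch (in Python, using numpy) that spot-checks a few of the 10 rules for vectors represented as column arrays; the particular arrays and scalars are, of course, arbitrary choices.

```python
# A minimal sketch spot-checking several of the ten vector-space rules
# for vectors represented as numpy arrays (sample values are arbitrary).
import numpy as np

U = np.array([1.0, 2.0, 3.0])
V = np.array([4.0, -1.0, 0.5])
W = np.array([0.0, 2.5, -2.0])
k, l = 2.0 + 1.0j, -0.5j          # scalars may be complex (rule 6)

assert np.allclose(U + V, V + U)              # rule 2: order doesn't matter
assert np.allclose(U + (V + W), (U + V) + W)  # rule 3: associativity
assert np.allclose(U + np.zeros(3), U)        # rule 4: zero vector
assert np.allclose(U + (-U), np.zeros(3))     # rule 5: additive inverse
assert np.allclose(k * (U + V), k * U + k * V)    # rule 7
assert np.allclose((k + l) * U, k * U + l * U)    # rule 8
assert np.allclose(k * (l * U), (k * l) * U)      # rule 9
assert np.allclose(1 * U, U)                      # rule 10
print("all sampled rules hold for these arrays")
```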

vector_representations

As the study of vectors deepened, several clarifying points made computation with them very simple.  The most powerful, and hence most dangerous, realization is the point that an arbitrary vector can be decomposed in terms of primitive vectors, usually referred to as basis vectors.  This realization, which arguably finds its crystallization in the work of Descartes, reduces the infinity of possibilities into a manageable number of chunks and is the driving force behind the 10 rules listed above.

The manageable chunks consist of a set of basis vectors whose number equals the number of dimensions in the space (1 for a line, 2 for a plane, 3 for a volume, and so on) and a list of numbers whose length also equals the number of dimensions.

And here is the first of the traps.  Once the basis vectors are agreed upon and understood, they can be pushed to the back and the list (called a list of components) can be manipulated without much additional thought.  The list becomes a stand-in for the original object, in analogy to the way that the image of the car becomes a stand-in for the car itself.  The list is now a representation of the original object.

This blurring between the original object and its representation becomes even more fuzzy with some additional reflection.  A list is also a valid choice as an original object in the vector space since it also obeys the 10 rules (with the appropriate definition of ‘+’ and ‘x’).  To show how strange this is in the physical world, consider the possibility of getting into the picture of the car, kicking over its motor, and taking it for a spin.

It’s no wonder that otherwise well-trained and intelligent people get hung up over vectors and their manipulations each and every day.  Functionally, every object shown in the figure above is equivalent to a list of numbers.

Now suppose that one wanted to represent these abstract objects in a computer language.  Well, as long as one was careful, one could actually exploit these ambiguities and simply say that the list will always be the representation.  This is actually what most, if not all, languages do, though they differ in the terminology, with many choosing array, some choosing vector, and others staying with list.

Of course, most users aren’t careful about maintaining that distinction and, I suppose, most aren’t even really conscious of it.  But one hopes that at least the language creators do.

In most cases, this hope is realized.  Many languages make the distinction between a heterogeneous list (not a vector) and a homogeneous list (which is, or at least can be, a vector).  Some languages, like those underlying the computer algebra system Maple, use the word vector to connote a special kind of list.  However, sadly, sometimes a language gets befuddled and either loses these distinctions or creates ones where none exist.
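As a small illustration of that distinction, consider how Python itself treats a plain (potentially heterogeneous) list versus how numpy treats a homogeneous array; the sketch below simply contrasts their behavior under ‘+’ and scaling.

```python
# A plain Python list is a general container, so '+' means concatenation;
# a homogeneous numpy array behaves like a vector, so '+' means
# component-wise addition and '*' by a scalar means scaling.
import numpy as np

a_list = [1, 2, 3]
b_list = [4, 5, 6]
print(a_list + b_list)   # [1, 2, 3, 4, 5, 6]  -- concatenation, not vector addition
print(2 * a_list)        # [1, 2, 3, 1, 2, 3]  -- repetition, not scaling

a_vec = np.array([1, 2, 3])
b_vec = np.array([4, 5, 6])
print(a_vec + b_vec)     # [5 7 9]             -- vector addition
print(2 * a_vec)         # [2 4 6]             -- scaling rule
```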

An example of the latter problem comes from the numpy/scipy family of packages used in the Python programming language.  To properly discuss this minor defect in what is really a great set of packages, I need to add one more ingredient to the vector space, turning it into a metric space.

In a metric space, there is added to the original 10 rules an additional notion of the length of a vector.  A new combination rule, usually denoted with a dot ‘.’, allows two vectors to be combined to produce not another vector but a number, specifying how much of the length of one of the two lies along the other.  This combination is defined such that A.B = B.A and that A.A is the square of the length of A.  This combination rule is variously called a dot product, an inner product, or a scalar product.

Once defined, another operation can be derived from these 11 rules.  This operation, called the cross product, mixes the components from various places in the list to get new components.  It depends on the dot product to bring meaning to the idea of having the component from one dimension multiplying the component from another dimension and, like the dot product, actually results in an object that doesn’t (properly) belong in the vector space.  In other words, both the dot and the cross products take two vectors and produce something different.

In addition, both rules belong to the space itself since they both apply to any pair of vectors.  Unfortunately, the numpy/scipy team missed this concept entirely.

In numpy, the vector space as a whole can be thought of as being represented by the family of functions that make up numpy proper.  These functions include the function ‘array’ for making a new array and the function ‘cross’ for taking the cross product.  Strangely, while ‘cross’ is found only in that collection of numpy functions, ‘dot’ is also exposed as a member function of the ‘array’ object itself, with no corresponding member function for the cross product.  A minor flaw in a really fine set of packages, but solid proof that it isn’t always easy to tell the image from the thing.
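For the curious, here is a short sketch of those calling conventions as they appear in current releases of numpy (the specific arrays are arbitrary):

```python
import numpy as np

a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])

print(np.cross(a, b))   # module-level function only: [0. 0. 1.]
print(np.dot(a, b))     # dot exists at the module level ...
print(a.dot(b))         # ... and also as a method of the array object;
                        # there is no corresponding a.cross(b) method.
```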

Bayes and Drugs

One of the most curious features of Bayesian inference is the non-intuitive conclusions that can result from innocent looking observations.  A case in point is the well-known issue with mandatory drug tests being administered in a population that is mostly clean.

For the sake of this post, let’s assume that there is a drug, called Drugg, that is the new addiction on the block and that we know from arrest records and related materials that about 7 percent of the population uses it.  We want to develop a test that will detect the residue in a person’s bloodstream, thus indicating that the subject has used Drugg within some period of time (e.g. two weeks) prior to the administration of the test.  The test will return a binary result with a ‘+’ indicating that the subject has used Drugg and a ‘-‘ indicating that the subject is clean.

Of course, since no test will be infallible, one of our requirements is that the test will provide an acceptably low percentage of cases that are either missed detections or false alarms.  A missed detection occurs when the subject uses Drugg but the test fails to return a ‘+’.  Likewise, a false alarm occurs when the test returns a ‘+’ even though the subject is clean.  Both situations present substantial risk and potentially high costs, so the lower both percentages can be made the better.

In order to develop the test, we gather 200 subjects for clinical trials; 100 of them are known Drugg users (e.g. they were caught in the act or are seeking help with their addiction) and the remaining 100 of them are known to be clean.  After some experimentation, we have reached the stage where, 99 percent of the time, the test correctly returns a ‘+’ when administered to a Drugg user and, 95 percent of the time, it correctly returns a ‘-‘ when administered to someone who is clean.  What are the false alarm and missed detection rates?

This is where Bayes theorem allows us to make a statistically based inference, and one that is usually surprising.  To apply the theorem, we need to be a bit careful, so let’s first define some additional notation.  A person who belongs to the population that uses Drugg will be denoted by ‘D’.  A person who belongs to the population that is clean will be denoted by ‘C’.  Let’s summarize what we know in the following table.

Description                                        Symbol    Value
Probability of a ‘+’ given that the person is C    P(+|C)    0.05
Probability of a ‘-’ given that the person is C    P(-|C)    0.95
Probability of a ‘+’ given that the person is D    P(+|D)    0.99
Probability of a ‘-’ given that the person is D    P(-|D)    0.01
Probability that a person is C                     P(C)      0.93
Probability that a person is D                     P(D)      0.07

There are two things to note.  First, the results of our clinical trials are all expressed as conditional probabilities.  Second, the conditional probabilities for disjoint events sum to 1 (e.g. P(+|D) + P(-|D) = 1 since a member of D, when tested, must result in either a ‘+’ or a ‘-‘).

In the population as a whole, we won’t know to which group the subject belongs.  Instead, we will administer the test and get back either a ‘+’ or a ‘-‘ and from that observation we need to infer to what group the subject is most likely to belong.

For example, let’s use Bayes theorem to infer the missed detection probability, P(D|-) (note the role-reversal between ‘D’ and ‘-‘).  Applying the theorem, we get

\[ P(D|-) = \frac{ P(-|D) P(D) }{ P(-) } \; . \]

Values for P(-|D) and P(D) are already listed above, so all we need is to get P(-) and we are in business.  This probability is obtained from the formula

\[ P(-) = P(-|C) P(C) + P(-|D) P(D) \; . \]

Note that this relationship can be derived from $P(-) = P(- \cap C ) + P(- \cap D)$ and $P(A \cap B) = P(A|B) P(B)$.  The first formula says, in words, that the probability of getting a negative from the test is the probability of either getting a negative and the subject is clean or getting a negative and the subject uses Drugg.  The second formula is essentially the definition of conditional probability.

Since we’ll be needing P(+) as well, let’s compute them both now and note their values.

Description                                       Formula                              Symbol   Value
Probability of a ‘+’ (person in either C or D)    P(+) = P(+|C) P(C) + P(+|D) P(D)     P(+)     0.1158
Probability of a ‘-’ (person in either C or D)    P(-) = P(-|C) P(C) + P(-|D) P(D)     P(-)     0.8842

The missed detection probability is

\[ P(D|-) = \frac{ P(-|D) P(D) }{ P(-) } = \frac{ 0.01 \cdot 0.07 }{ 0.8842 } = 0.0008 \;  . \]

So things are looking good and we are happy.  But our joy soon turns to perplexity when we compute the false alarm probability

\[ P(C|+) = \frac{ P(+|C) P(C) }{ P(+) } = \frac{ 0.05 \cdot 0.93 }{ 0.1158 } = 0.4016 \; . \]

This result says that around 40 percent of the time, our test is going to incorrectly point a finger at a clean person.
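For readers who want to check the arithmetic, the following sketch simply re-derives the numbers above from the clinical-trial probabilities; nothing in it goes beyond the formulas already presented.

```python
# Reproduce the missed detection and false alarm probabilities via Bayes theorem.
p_plus_given_C, p_minus_given_C = 0.05, 0.95
p_plus_given_D, p_minus_given_D = 0.99, 0.01
p_C, p_D = 0.93, 0.07

p_plus  = p_plus_given_C  * p_C + p_plus_given_D  * p_D   # total probability of '+'
p_minus = p_minus_given_C * p_C + p_minus_given_D * p_D   # total probability of '-'

p_D_given_minus = p_minus_given_D * p_D / p_minus   # missed detection, P(D|-)
p_C_given_plus  = p_plus_given_C  * p_C / p_plus    # false alarm, P(C|+)

print(f"P(+)   = {p_plus:.4f}")           # 0.1158
print(f"P(-)   = {p_minus:.4f}")          # 0.8842
print(f"P(D|-) = {p_D_given_minus:.4f}")  # ~0.0008
print(f"P(C|+) = {p_C_given_plus:.4f}")   # ~0.4016
```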

Suppose we went back to our clinical trials and came out with a second version of the test in which nothing had changed except that P(-|C) had risen from 0.95 to 0.99.  As the figure below shows, the false alarm rate does decrease but still remains surprisingly high when the percentage of the population using Drugg is low.

Drugg Testing

The reason for this is that, when the percentage of users in the population is small, driving the missed detection rate down comes at the expense of a greater percentage of false alarms.  In other words, our diligence in finding Drugg users has made us overly suspicious.
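The trend shown in the figure above is easy to reproduce.  The sketch below sweeps the fraction of the population using Drugg (the particular prevalence values chosen are purely illustrative) and computes the false alarm probability P(C|+) for both versions of the test.

```python
# False alarm probability P(C|+) as a function of Drugg prevalence P(D),
# for the original test (P(-|C) = 0.95) and the improved one (P(-|C) = 0.99).
def false_alarm(p_D, p_minus_given_C, p_plus_given_D=0.99):
    p_C = 1.0 - p_D
    p_plus_given_C = 1.0 - p_minus_given_C
    p_plus = p_plus_given_C * p_C + p_plus_given_D * p_D   # total probability of '+'
    return p_plus_given_C * p_C / p_plus                   # Bayes theorem

for p_D in (0.01, 0.07, 0.25, 0.50):   # illustrative prevalence values
    print(f"P(D) = {p_D:.2f}:  "
          f"v1 -> {false_alarm(p_D, 0.95):.3f},  "
          f"v2 -> {false_alarm(p_D, 0.99):.3f}")
```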

Bayesian Inference – Cause and Effect

In the last column, the basic inner workings of Bayes theorem were demonstrated in the case where two different random variable realizations (the attributes of the Christmas tree bulbs) occurred together in a joint probability function.  The theorem holds whether the probability functions for the two events are independent or are correlated.  In addition, it can be generalized in an obvious way to cases where there are more than two variables and where some or all of them are continuous rather than discrete random variables.

If that were all there was to it – a mechanical relation between conditional and joint probabilities – Bayes theorem would make a curious footnote in probability and statistics textbooks and would hold little practical interest and no controversy.   However, the real power of Bayes theorem comes in its ability to link one statistical event with another and to allow inferences to be made about cause and effect.

Before looking at how inferences (sometimes very subtle and non-intuitive) can be drawn, let’s take a moment to step back and consider why Bayes theorem works.

The key insight comes from examining the meaning contained in the joint probability that two events, $A$ and $B$, will both occur.  This probability is written as

\[ P( A \cap B ) \; , \]

where the operator $\cap$ is the logical ‘and’ requiring both $A$ and $B$ to be true.  It is at this point that the philosophically interesting implications can be made.

Suppose that we believe that $A$ is a cause of $B$.  This causal link could take the form of something like: $A$ = ‘it was raining’ and $B$ = ‘the ground is wet’.  Then it is obvious that the joint probability takes the form

\[ P( A \cap B ) = P(B|A) P(A) \; , \]

which in words says that the probability that ‘it was raining and the ground is wet’ = the probability that ‘the ground is wet given that it was raining’ times the probability that ‘it was raining’.

Sometimes, the link between cause and effect is obvious and no probabilistic reasoning is required.  For example, if the event is changed from ‘it was raining’ to ‘it is raining’, it becomes clear that ‘the ground is wet’ due to the rain.  (Of course even in this case, another factor may also be contributing to how wet the ground is but that complication is naturally handled with the conditional probability).

Often, however, we don’t observe the direct connection between the cause and the effect.  Maybe we woke up after the rain had stopped and the clouds had moved on, and all we observe is that the ground is wet.  What can we then infer?  If we lived somewhere without running water (natural or man-made), then the conditional probability ‘that it was raining given that the ground is wet’ would be 1 and we would infer that ‘it was raining’.  There would be no way for the ground to be wet other than to have had rain fall from the sky.  In general, such a clear indication between cause and effect doesn’t happen, and the conditional probability describes the likelihood that some other cause has led to the same event.  In the case of the ‘ground is wet’ event, perhaps a water main had burst or a neighbor had watered their lawn.

In order to infer anything about the cause from the observed effect, we want to reverse the roles of $A$ and $B$ and argue backwards, as it were.  The joint probability can be written with the mathematical roles of $A$ and $B$ reversed to yield

\[ P( A \cap B ) = P(A|B) P(B) \; . \]

Equating the two expressions for the joint probability gives Bayes theorem and also a way of statistically inferring the likelihood that a particular cause $A$ gave the observed effect $B$.
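A toy computation makes the backwards argument concrete.  The numbers below – how often it rains, how often rain leaves the ground wet, and how often something else does – are invented purely for illustration and are not part of the discussion above.

```python
# A toy illustration of arguing backwards from effect (wet ground) to
# cause (rain).  All numerical values are made up for the example.
p_rain = 0.3                    # P(A): it was raining last night
p_wet_given_rain = 0.95         # P(B|A): ground is wet given rain
p_wet_given_no_rain = 0.10      # P(B|not A): wet from sprinklers, burst main, ...

# total probability of observing wet ground
p_wet = p_wet_given_rain * p_rain + p_wet_given_no_rain * (1 - p_rain)

# Bayes theorem: likelihood that rain was the cause, given the wet ground
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(f"P(rain | wet ground) = {p_rain_given_wet:.2f}")   # ~0.80
```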

Of course any inference obtained in this fashion is open to a great deal of doubt and scrutiny due to the fact that the link backwards from observation to proposed or inferred origin is one built on probabilities.  Without some overriding philosophical principle (e.g. a conservation law) it is easy to confuse coincidence or correlation with causation.  Inductive reasoning can then lead to probabilistically supported but untrue conclusions, like ‘all swans are white’, so we have to be on our guard.

Next week’s column will showcase one such trap within the context of mandatory drug testing.

Bayesian Inference – The Basics

In last week’s article, I discussed some of the interesting contributions to the scientific method made by the pair of English Bacons, Roger and Francis.  A common and central theme to both of their approaches is the emphasis they placed on performing experiments and then inferring from those experiments what the logical underpinning was.  Put another way, both of these philosophers advocated inductive reasoning as a powerful tool for understanding nature.

One of the problems with the inductive approach is that in generalizing from a few observations to a proposed universal law one may overreach.  It is true that, in the physical sciences, great generalizations have been made (e.g., Newton’s universal law of gravity or the conservation of energy) but these have ultimately rested on some well-supported philosophical principles.

For example, the conservation of momentum rests on a fundamental principle that is hard to refute in any reasonable way; that space has no preferred origin.  This is a point that we would be loath to give up because it would imply that there was some special place in the universe.  But since all places are connected (otherwise they can’t be places) how would nature know to make one of them the preferred spot and how would it keep such a spot inviolate?

But in other matters, where no appeal can be made to an over-arching principle as a guide, the inductive approach can be quite problematic.  The classic and often used example of the black swan is a case in point.  Usually the best that can be done in these cases is to make a probabilistic generalization.  We infer that such and such is the most likely explanation but by no means necessarily the correct one.

The probabilistic approach is time honored.  William of Occam’s dictum that the simplest explanation that fits all the available facts is usually the correct one is, at its heart, a statement about probabilities.  Furthermore, general laws of nature started out as merely suppositions until enough evidence and corresponding development of theory and concepts led to the principles upon which our confidence rests.

So the only thorny questions are what are meant by ‘fact’ and ‘simplest’.  On these points, opinions vary and much argument ensues.  In this post, I’ll be exploring one of the more favored approaches for inductive inference known as the Bayesian method.

The entire method is based on the theorem attributed to Thomas Bayes, a Presbyterian minister and statistician, whose result was published posthumously in the latter half of the 1700s.  It was later refined by Pierre Simon Laplace in 1812.

The theorem is very easy to write down, and that perhaps is what hides its power and charm.  We start by assuming that two random events, $A$ and $B$, can occur, each according to some probability distribution.  The random events can be anything at all and don’t have to be causally connected or correlated.  Each event has some possible set of outcomes $a_1, a_2, \ldots$ and $b_1, b_2, \ldots$.  Mathematically, the theorem is written as

\[ P(a_i|b_j) = \frac{P(b_j|a_i) P(a_i)}{P(b_j)} \; , \]

where $a_i$ and $b_j$ are some specific outcomes of the events $A$ and $B$ and $P(a_i|b_j)$ ($P(b_j|a_i)$) is called the conditional probability that $a_i$ ($b_j$) results given that we know that $b_j$ ($a_i$) happened.  As advertised, it is nice and simple to write down and yet amazingly rich and complex in its applications.  To understand the theorem, let’s consider a practical case where the events $A$ and $B$ take on some easy-to-understand meaning.

Suppose that we are getting ready for Christmas and want to decorate our tree with the classic strings of different-colored lights.  We decide to purchase a big box of bulbs of assorted colors from the Christmas light manufacturer, Brighty-Lite, who provides bulbs in red, blue, green, and yellow.  Allow the set $A$ to represent the colors

\[ A = \left\{\text{red}, \text{blue}, \text{green}, \text{yellow} \right\} = \left\{r,b,g,y\right\} \; . \]

On its website, Brighty-Lite proudly tells us that they have tweaked their color distribution in the variety pack to best match their customers’ desires.  They list their distribution as consisting of 30 percent each of red and blue, 25 percent green, and 15 percent yellow.  So the probabilities associated with reaching into the box and pulling out a bulb of a particular color are

\[ P(A) = \left\{ P(r), P(b), P(g), P(y) \right\} = \left\{0.30, 0.30, 0.25, 0.15 \right\} \; . \]

The price for bulbs from Brighty-Lite is very attractive, but being cautious people, we are curious how long the bulbs will last before burning out.   We find a local university that put its undergraduates to good use testing the lifetimes of these bulbs.  For ease of use, they categorized their results into three bins: short, medium, and long lifetimes. Allowing the set $B$ to represent the lifetimes

\[ B = \left\{\text{short}, \text{medium}, \text{long} \right\} = \left\{s,m,l\right\} \]

the student results are reported as

\[ P(B) = \left\{ P(s), P(m), P(l) \right\} = \left\{0.40, 0.35, 0.25 \right\} \; , \]

which confirmed our suspicions that Brighty-Lite doesn’t make its bulbs to last.  However, since we don’t plan on having the lights on all the time, we decide to buy a box.

After receiving the box and buying the tree, we set aside a weekend for decorating.  Come Friday night we start by putting up the lights and, as we work, we start wondering whether all colors have the same lifetime distribution or whether some colors are more prone to be short-lived compared with the others. The probability distribution that describes the color of the bulb and its lifetime is known as the joint probability distribution.

If the bulb color doesn’t have any effect on the lifetime of the filament, then the events are independent, and the joint probability of, say, a red bulb with a medium lifetime is given by the product of the probability that the bulb is red and the probability that it has a medium lifespan (symbolically $P(r,m) = P(r) P(m)$).

The full joint probability distribution is thus

            red      blue     green     yellow    P(lifetime)
short       0.12     0.12     0.10      0.06      0.40
medium      0.105    0.105    0.0875    0.0525    0.35
long        0.075    0.075    0.0625    0.0375    0.25
P(color)    0.30     0.30     0.25      0.15
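Under the independence assumption, this table is nothing more than the outer product of the two marginal distributions, as the following short sketch (using numpy) confirms.

```python
# Build the independent joint distribution as the outer product of the marginals.
import numpy as np

p_color    = np.array([0.30, 0.30, 0.25, 0.15])   # red, blue, green, yellow
p_lifetime = np.array([0.40, 0.35, 0.25])         # short, medium, long

joint = np.outer(p_lifetime, p_color)   # rows: lifetime, columns: color
print(joint)
print(joint.sum(axis=0))   # column sums recover P(color)
print(joint.sum(axis=1))   # row sums recover P(lifetime)
```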

Now we are in a position to see Bayes theorem in action.  Suppose that we pull out a green bulb from the box.  The conditional probability that the lifetime is short $P(s|g)$ is the relative proportion that the green and short entry $P(g,s)$ has compared to the sum of the probabilities $P(g)$ found in the column labeled green.  Numerically,

\[ P(s|g) = \frac{P(g,s)}{P(g)} = \frac{0.1}{0.25} = 0.4 \; . \]

Another way to write this is as

\[ P(s|g) = \frac{P(g,s)}{P(g,s) + P(g,m) + P(g,l)} \; , \]

which better shows that the conditional probability is the relative proportion within the column headed by the label green.

Likewise, the conditional probability that the bulb is green given that its lifetime is short is

\[ P(g|s) = \frac{ P(g,s) }{P(r,s) + P(b,s) + P(g,s) + P(y,s)} \; . \]

Notice that this time the relative proportion is measured against joint probabilities across the colors (i.e., across the row labeled short). Numerically, $P(g|s) = 0.1/0.4 = 0.25$.

Bayes theorem links these two probabilities through

\[ P(s|g) = \frac{ P(g|s) P(s) }{ P(g) } = \frac{0.25 \cdot 0.4}{0.25} = 0.4 \; , \]

which is happily the value we got from working directly with the joint probabilities.

The next day, we did some more cyber-digging and found that a group of graduate students at the same university extended the undergraduate results (were they perhaps the same people?) and reported the following joint probability distribution:

 

            red      blue     green     yellow    P(lifetime)
short       0.15     0.10     0.05      0.10      0.40
medium      0.05     0.12     0.15      0.03      0.35
long        0.10     0.08     0.05      0.02      0.25
P(color)    0.30     0.30     0.25      0.15

Sadly, we noticed that our assumption of independence between the lifetime and color was not borne out by experiment since $P(A,B) \neq P(A) \cdot P(B)$ or, in more explicit terms, $P(\text{color},\text{lifetime}) \neq P(\text{color}) P(\text{lifetime})$.  However, we were not completely disheartened since Bayes theorem relates relative proportions and, therefore, it might still work.

Trying it out, we computed

\[ P(s|g) = \frac{P(g,s)}{P(g,s) + P(g,m) + P(g,l)} = \frac{0.05}{0.05 + 0.15 + 0.05} = 0.2 \]

and

\[ P(g|s) = \frac{ P(g,s) }{P(r,s) + P(b,s) + P(g,s) + P(y,s)} \\ = \frac{0.05}{0.15 + 0.10 + 0.05 + 0.10} = 0.125 \; . \]

Checking Bayes theorem, we found

\[ P(s|g) = \frac{ P(g|s) P(s) }{ P(g) } = \frac{0.125 \cdot 0.4}{0.25} = 0.2 \]

guaranteeing a happy and merry Christmas for all.
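For completeness, here is a small sketch that rebuilds the graduate students’ joint table and verifies the same chain of calculations; the row and column ordering follows the table above.

```python
# Check the correlated case directly from the joint table
# (rows: short, medium, long; columns: red, blue, green, yellow).
import numpy as np

joint = np.array([[0.15, 0.10, 0.05, 0.10],
                  [0.05, 0.12, 0.15, 0.03],
                  [0.10, 0.08, 0.05, 0.02]])

p_color    = joint.sum(axis=0)     # P(r), P(b), P(g), P(y)
p_lifetime = joint.sum(axis=1)     # P(s), P(m), P(l)

g, s = 2, 0                        # indices of 'green' and 'short'
p_s_given_g = joint[s, g] / p_color[g]                   # 0.05 / 0.25 = 0.2
p_g_given_s = joint[s, g] / p_lifetime[s]                # 0.05 / 0.40 = 0.125
bayes       = p_g_given_s * p_lifetime[s] / p_color[g]   # recovers 0.2

print(p_s_given_g, p_g_given_s, bayes)
assert np.isclose(p_s_given_g, bayes)   # Bayes theorem checks out
```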

Next time, I’ll show how this innocent looking computation can be put to subtle use in inferring cause and effect.

Bringing Home the Bacon

Don’t worry; this week’s entry is not about America’s favorite pork-related product (seriously there exists bacon-flavored candy).  It’s about the scientific method.  Not the whole thing, of course, as that would take volumes and volumes of text and would be outdated and maybe obsolete by the time it was finished.  No, this column is about two men who are considered by science historians to have contributed substantially to the scientific method and the philosophy of science.  And it just so happens that both of them bore the last name of Bacon.

Roger Bacon was born somewhere around 1214 (give or take – time and record keeping then, as now, was hard to do) in England.  Roger became both an English philosopher of note and a Franciscan friar.  Most of the best scholastic philosophers of the Middle Ages were monks, and in taking Holy Orders, Bacon falls amongst the ranks of other prominent religious thinkers, including Robert Grosseteste, Albertus Magnus, Thomas Aquinas, John Duns Scotus, and William of Ockham.

It seems that the cultural milieu of that time was planting the intellectual seeds for the scientific and artistic renaissance that followed.  Roger Bacon cultivated modes of thought that would be needed for the advances to come.  Basing his philosophy on Aristotle, he advocated for the following ‘modern’ ideas:

  • Experimental testing for all inductively derived conclusions
  • Rejection of blind following of prior authorities
  • Repeating pattern of observation, hypothesis, and testing
  • Independent corroboration and verification

In addition, he wrote extensively on science, both on its general structure and on specific applications.  Among his particular fields of interest was optics, where his diagrams have the look and feel of the modern experimental lab notebook.

Roger_Bacon_optics01

He also criticized the Julian calendar and argued for dropping a day every 125 years.  This system would not be adopted until about 300 years after his death with the creation of the Gregorian calendar in 1582.  He was also an outspoken supporter of experimental science, saying that it had three great prerogatives over other sciences and arts in that:

  • It verifies all of its conclusions by direct experiment
  • It discovers truths which can’t be reached without observation
  • It reveals the secrets of nature

Francis Bacon was born in 1561 in England.  He was a government official (Attorney General and Lord Chancellor) and a well-known philosopher.  His writings on science and philosophy established a firm footing for inductive methods used for scientific inquiry.  The details of the method are collectively known as the Baconian Method or the scientific method.

In his work Novum Organum (literally ‘the new Organon’, referring to Aristotle’s collected treatises on logic), Francis has this to say about induction:

Our only hope, then is in genuine Induction… There is the same degree of licentiousness and error in forming Axioms, as in abstracting Notions: and that in the first principles, which depend in common induction. Still more is this the case in Axioms and inferior propositions derived from Syllogisms.

By induction, he meant the careful gathering of data and then refinement of a theory from those observations.

Curiously, both Bacons talk about four errors that interfere with the acquisition of knowledge:  Roger does so in his Opus Majus; Francis in his Novum Organum.  The following table makes an attempt to match up their respective lists.

Roger Bacon’s Four Causes of Error                        Francis Bacon’s Four Idols of the Mind
Authority (reliance on prior authority)                   Idols of the Theater (following academic dogma)
Custom                                                    Idols of the Tribe (tendency of humans to see order where it isn’t)
Opinion of the unskilled many                             Idols of the Marketplace (confusion in the use of language)
Concealment of ignorance behind the mask of knowledge     Idols of the Cave (interference from personal beliefs, likes, and dislikes)

While not an exact match, the two Baconian lists of errors match up fairly well, which is puzzling if the historical assumption that Francis Bacon had no access to the works of Roger Bacon is true.  Perhaps the most logical explanation is that both of them saw the same patterns of error; that humankind doesn’t change its fundamental nature with the passage of time or space.

Or perhaps Francis is simply the reincarnation of Roger, an explanation that I am positively sure William of Occam would endorse if he were alive today…

Ideal Forms and Error

A central concept of Socratic and Platonic thought is the idea of an ideal form.  It sits at the base of all discussions about knowledge and epistemology.  Any rectangle that we draw on paper or in a drawing software package, that we construct using rulers and scissors, or manufacture with computer-controlled fabrication is a shadow or reflection of the ideal rectangle.  This ideal rectangle exists in the space of forms, which may lie entirely within the human capacity to understand and distinguish the world, or may actually have an independent existence outside the human mind, reflecting a higher power.  All of these notions about the ideal forms are familiar from the philosophy of antiquity.

What isn’t so clear is what Plato’s reaction would be if he were suddenly transported forward in time and plunked down in a classroom discussion about the propagation of error.  The intriguing question is: would he modify his philosophical thought to expand the concept of an ideal form to include an ideal form of error?

Let’s see if I can make this question concrete by the use of an example.  Consider a diagram representing an ideal rectangle of length $L$ and height $H$.

true_rectangle

Euclidean geometry tells us that the area of such a rectangle is given by the product

\[ A = L \cdot H \; . \]

Of course, the rectangle represented in the diagram doesn’t really exist since there are always imperfections and physical limitations.  The usual strategy is to not take the world as we would like it to be but to take it as it is and cope with these departures from the ideal.

The departures from the ideal can be classified into two broad categories.

The first category, called knowledge error, contains all of the errors in our ability to know.  For example, we do not know exactly what numerical value to give the length $L$.  There are fundamental limitations on our ability to measure or represent the numerical value of $L$ and so we know the ‘true’ value of $L$ only to within some fuzzy approximation.

The second category doesn’t seem to have a universally agreed-upon name, reflecting the fact that, as a society, we are still coming to grips with the implications of this idea.  This departure from the ideal describes the fact that at some level there may not even be one definable concept of ‘true’.  Essentially, the idea of the length of an object is context-dependent and may have no absolutely clear meaning at the atomic level due to the inherent uncertainty in quantum mechanics.  This type of ‘error’ is sometimes called aleatory error (in contrast to epistemic error, synonymous with knowledge error).

Taken together, the knowledge and aleatory errors contribute to an uncertainty in length of the rectangle of $dL$ and an uncertainty in its height of $dH$.

error_rectangle

As part of their training to deal with uncertainty and error, scientists and engineers are commonly exposed to a model for determining the error in the area of such a rectangle, using a formula sometimes called the propagation of error (or uncertainty).  For the case of this error-bound rectangle, the true area, $A'$, is also determined in Euclidean fashion, yielding

\[ A' = (L+dL) \cdot (H+dH) = L \cdot H + dL \cdot H + L \cdot dH + dL \cdot dH \; . \]

So the error in the area, denoted as $dA$, has a more complicated form than the area itself

\[ dA = dL \cdot H + L \cdot dH + dL \cdot dH \; . \]
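As a quick numerical illustration of this formula, here is a minimal sketch; the particular values of $L$, $H$, $dL$, and $dH$ are made up for the example.

```python
# Propagation of error for the area of a rectangle, A = L*H.
def area_error(L, H, dL, dH):
    """Error in the area given uncertainties dL and dH."""
    return dL * H + L * dH + dL * dH

L, H   = 3.0, 2.0      # nominal length and height (arbitrary units)
dL, dH = 0.05, 0.02    # illustrative uncertainties

A  = L * H
dA = area_error(L, H, dL, dH)
print(f"A  = {A:.3f}")    # 6.000
print(f"dA = {dA:.3f}")   # 0.05*2 + 3*0.02 + 0.05*0.02 = 0.161
```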

Now suppose that Plato were in the classroom when this lesson was taught.  What would his reaction be?  I bring this up because although the treatment above is meant to handle error it is still an idealization.  There is still a notion of an ideal rectangle sitting underneath.

The curious question that follows in its train is this:  is there an ideal form for this error idealization?  In other words, is there a perfect or ideal error in the space of forms of which our particular error discussion is a shadow or reflection?

It may sound like this question is predicated on a contradiction, but my contention is that it only seems so, on the surface.  In understanding the propagation of error in the calculation of the rectangle I’ve had to assume a particular functional relationship.

It is a profound assumption that the object drawn above (not what it represents but that object itself), which is called a rectangle but which is embodied in the real world as made up of atomic parts (be they physical atoms or pixels), can be characterized by two numbers ($L$ and $H$) even if I don’t know what values $L$ and $H$ take on.  In some sense, this idealization should sit in the space of forms.

But if that is true, what stops us there?  Suppose we had a more complex functional relationship, something, say, that tries to model the boundaries of the object as a set of curves that deviate only slightly from linearity, just enough to capture a shaky hand when the object was drawn or a manufacturing process with deviations when machined.  Is this model not also an idealization and therefore a reflection of something within the space of forms?

And why stop there?  It seems to me that the boundary line between what is and is not in the space of forms is arbitrary (and perhaps self-referential – is the boundary between what is and is not in the space of forms itself in the space of forms?).  Just as the levels of abstraction in a computer model depend on the context, could not the space of forms depend on the questions that are being asked?

Perhaps the space of forms is as infinite or as finite as we need it to be.  Perhaps it’s forms all the way down.

Why Do We Teach the Earth Is Round?

You’re no doubt asking yourself “Why the provocative title?  It’s obvious why we should teach that the Earth is round!” In some sense, this was my initial reaction when this exact question was posed in a round table discussion that I participated in recently.  The person who posed the question was undaunted by the initial pushback and persisted.  Her point was simply a genuinely honest question driven by a certain pragmatism.

Her basic premise is this.  For the vast majority of people on the Earth, a flat Earth model best fits their daily experiences.  None of us plan our day-to-day trips using the geometry of Gauss.  Many of us fly, but far fewer of us fly distances long enough that the pilot or navigator consciously lays in a great circle path.  And even if all of us were to fly, say from New York to Rome, and the path the plane follows is a ‘geodesic on the sphere’, very few of us are aware or care.  After all, that is someone else’s job to do.  And certainly gone are the days when we sit at the seashore and watch the masts of ships disappear last over the horizon – cell phones and the internet are far more interesting.

I listened to the argument carefully, mulled it over for a few days, and realized that there was a lot of truth in it.  The points here weren’t that we shouldn’t teach that the Earth is round but rather that we should know, with a firm and articulable conviction, why we should teach it, and that the criteria for inclusion should be open to debate when schools draw up their curricula.

So what criteria should be used to construct a firm and articulable conviction? It seems that at the core of this question was a dividing line between types of knowledge and why we would care to know one over the other.

The first distinction in our round-Earth epistemological exploration is one between what I will call tangible and intangible knowledge.  Tangible knowledge consists of all those facts that have an immediate impact on a person’s everyday existence.  For example, knowing that a particular road bogs down in the afternoon is a slice of tangible knowledge because acting on it can prevent me from arriving home late for dinner (or perhaps having no dinner at all).  Knowing that the rainbow is formed by light entering a water droplet in the atmosphere in a particular way so that it is subjected to a single total internal reflection before exiting the drop with the visible light substantially dispersed is an intangible fact, since I am neither a farmer nor a meteorologist.  Many are the people who have said “don’t tell me how a rainbow is formed – it ruins all the beauty and poetry!”

An immediate corollary of this distinction is that what counts as tangible and intangible knowledge is governed by what impacts a person’s life.  It differs both from person to person and over time.  A person who doesn’t drive the particular stretch of road that I do would find the knowledge that my route home bogs down at certain times intangible, while the meteorologist would find the physical mechanism of the rainbow a tangible bit of knowledge, even if it kills the poet in him.

The second distinction is between what I will call private and common knowledge.  The particular PIN I use to access my phone is knowledge that is, and should remain, private to me.  In the hands of others it is either useless (for the vast majority who are either honest, or don’t know me, or both) or it is dangerous (for those who do know me and are up to no good).  Common knowledge describes those facts that can be shared with no harm between all people.  Knowing how electromagnetic waves propagate is an example of common knowledge, but knowing a particular frequency on which to intercept enemy communications is private.

With these distinctions in hand, it is now easy to see what was meant by the original, provocative question.  As it is taught in schools, knowledge that the Earth is round is, for most people, a common, intangible slice of human knowledge.  In this context, it is reasonable to ask why we even teach it to the students.

A far better course of action is to try to transform this discovery into a common but tangible slice of knowledge that affects each student on a core level.  The particular ways that this can be done are numerous, but let me suggest one that I regard as particularly important.

Fancy earth

Teaching that the Earth is round should be done within a broader context of how we know anything about the world around us, how certain we are, and where the corners of doubt and uncertainty lie.  A common misconception is that the knowledge that the Earth is round was lost during the Dark and early Middle Ages.  The ancient Greeks knew with a great deal of certainty that the Earth was round, and books from antiquity tell the story of how Eratosthenes determined the radius of the Earth to an astounding accuracy considering the technology of his day.  This discovery persisted into the Dark and Middle Ages and was finally put to some practical use only when the collective technology of the world progressed to the point that the voyages of Columbus and Magellan were possible.  Framing the lesson of the Earth’s roundness in this way provides a historical context that elevates it from mere geometry into a societally shaping event.  Science, technology, sociology, geography, and human affairs are all intertwined and should be taught as such.

Along the way, numerous departure points are afforded to discuss other facets of what society knows and how it knows it.  Modern discoveries that the Earth is not particularly spherical (the equatorial bulge) now take on a life outside of geodesy, and the concepts of approximations, models, and contexts by which ‘facts’ are known and consumed become tools for honing critical thinking about the host of policy decisions each and every one of us has to make.

By articulating the philosophical underpinnings for choosing a particular curriculum, society can be sure that arbitrary decisions about what topics are taught can be held in check. Different segments can openly debate what material should be included and what can be safely omitted in an above board manner.  Emotional and aesthetic points can be addressed side-by-side with practical points without confusion.  And all the while we can be sure that development of critical thinking is center stage.

Failure to do this leaves two dangerous scenarios.  The first is that the student is filled with a lot of unconnected facts that improve neither his civic participation in practical matters nor his general appreciation for the beauty of the world.  The second, and more important, is that the student is left with the impression that science delivers to us unassailable facts.  This is a dangerous position since it leads to modern interpretations of science as a new type of religion whose dogma has replaced the older dogma of the spiritual simply by virtue of the fact that its magic (microwaves, TVs, cell-phones, rockets, nuclear power, and so on) is more powerful and apparent.

Self-Reference and Paradoxes

The essence of the Gödel idea is to encode not just the facts but also the ‘facts about the facts’ of the formal system being examined within the framework of the system being examined.  This meta-mathematics technique allowed Gödel to prove simple facts like ‘2 + 2 = 4’ and hard facts like ‘not all true statements are axioms or are theorems – some are simply out of reach of the formal system to prove’ within the context of the system itself.  The hard facts come from the system talking about or referring to itself with its own language.

As astonishing as Gödel’s theorem is, the concept of paradoxes within self-referential systems is actually a very common experience in natural language.  All of us have played at one time or another with odd sentences like ‘This sentence is false!’.  Examined from a strictly mechanical and logical vantage, how should that sentence be parsed?  If the sentence is true then it is lying to us.  If it is false, then it is sweetly and innocently telling us the truth.  This example of the liar’s paradox has been known since antiquity and variations of it have appeared throughout the ages in stories of all sorts.

Perhaps the most famous example comes from the original Star Trek television series in an episode entitled ‘I, Mudd’. In this installment of the ongoing adventures of the starship Enterprise, an impish Captain Kirk defeats a colony of androids that hold him and his crew hostage by exploiting their inability to be meta.

There is actually a host of paradoxes (or antinomies, in the technical speak) that some dwerping around on the internet can uncover in just a handful of clicks.  They all arise when a formal system talks about itself in its own language, and often their paradoxical nature arises when they talk about something of a negative nature.  The sentence ‘This sentence is true.’ is fine, while ‘This sentence is false.’ is not.

Not all of the examples show up as either interesting but useless tricks of the spoken language or as formal encodings in mathematical logic.  One of the most interesting cases deals with libraries, whether of the brick-and-mortar variety or those existing solely on hard drives, in RAM, and in FTP packets.

Consider for a moment that you’ve been given charge of a library.  Properly speaking, a library has two basic components: the books to read and a system to catalog and locate the books so that they can be read.  Now thinking about the books is no problem.  They are the atoms of the system and so can be examined separately or in groups or classes.  It is reasonable and natural to talk about a single book like ‘Moby Dick’ and to catalog this book along with all the other separate works that the library contains.  It is also reasonable and natural to talk about all books written by Herman Melville and to catalog them within a new list, perhaps titled ‘List of works by H. Melville’.  A similar list can be made whose grouping criterion selects books about the books by Melville.  This list would have a title like ‘List of critiques and reviews of the works by H. Melville’.

An obvious extension would be to construct something like the following list.

List of Author Critiques and Reviews:

  • List of critiques and reviews of H. Melville
  • List of critiques and reviews of J. R. R. Tolkien
  • List of critiques and reviews of U. Eco
  • List of critiques and reviews of R. Stout
  • List of critiques and reviews of G. K. Chesterton
  • List of critiques and reviews of A. Christie
  • ….

Since the lists are themselves written works, what status do they have in the cataloging system?  Should there also be lists of lists?  If so, how deep should their construction go?   At some point won’t we arrive at lists that have to refer to themselves, and what do we do when we reach that point?  Should the library catalog have a reference to itself as a written work?

Bertrand Russell wrestled with these questions in the context of set theory around the turn of the 20th century.  To continue on with the library example, Russell would label the ‘List of Author Critiques and Reviews’ as a normal set since it is a collection of things that doesn’t include itself.  He would also label as an abnormal set any list that would have itself as a member – in this case, a catalog (i.e. list) of all lists pertaining to the library.  General feeling suggests that the normal sets are well behaved but the abnormal sets are likely to cause problems.  So let’s just focus on the normal sets.  Russell asks the following question about the normal sets:  Is the set, R, of all normal sets, itself normal or abnormal?  If R is normal, then it must appear as a member in its own listing, thus making R abnormal.  Alternatively, if R is abnormal then, since R lists only normal sets, it cannot appear as a member within itself and, therefore, it must be normal.  No matter which way you start, you are led to a contradiction.

The natural tendency is, at this point, to cry foul and to suggest that the whole thing is being drawn out to an absurd length.  Short and simple answers to each of the questions posed in the earlier paragraph come to mind with the application of a little common sense.  Lists should only be themselves cataloged if they are independent works that are distinct parts of the library.  The overall library catalog need not list itself because its primary function is to help the patron find all the other books, publications, and related works in the library.  If the patron can find the catalog, then there is no need to have it listed within itself.  On the other hand, if the patron cannot find the catalog, having it listed within itself serves no purpose – the patron will need something else to point him towards the catalog.

And as far as Russell and perfidious paradox is concerned, who cares?  This might be a matter to worry about if one is a stuffy logician who can’t get a date on a Saturday night but normal people (does this mean Russell and his kind are abnormal?) have better things to do with their lives than worry about such ridiculous ideas.

Despite these responses, or maybe because of them, we should care.  Application of common sense is actually quite sophisticated even if we are quite unaware of the subtleties involved.  In all of these common-sensical responses there is an implicit assumption about something above or outside.  If the patron can’t find the library catalog, well then that is what a librarian is for – to point the way to the catalog.  The librarian doesn’t need to be referred to or listed in the catalog.  He sits outside the system and can act as an entry point into the system.  If there is a paradox in set theory, not to worry, there are more important things than complete consistency in formal systems.

This concept of sitting outside the system is at the heart of the current differences between human intelligence and machine intelligence.  The latter, codified by the formal rules of logic, can’t resolve these kinds of paradoxes precisely because machines can’t step outside themselves like people can.  And maybe they never will.