Author Archive: Conrad Schiff

Is True Of – Part 2 Linguistic Analog

This month’s post is part 2 of 2 of the exploration of the YouTube video entitled Russell’s Paradox – a simple explanation of a profound problem by Jeffery Kaplan.  Last month we summarized Russell’s paradox, in which we found that the set of all ordinary sets, defined as

\[ {\mathcal Q} = \{ x | x \mathrm{\;is\;a\;set\;that\;does\;not\;contain\;itself} \} \; \]

creates a paradox.  If we assume ${\mathcal Q}$ does not contain itself (i.e., it is ordinary) then the membership comprehension ‘is a set that does not contain itself’ instructs us that ${\mathcal Q}$ in fact does contain itself.  Alternatively, if we assume that ${\mathcal Q}$ does contain itself (i.e., it is extraordinary) then membership comprehension instructs us that it doesn’t. 

This type of self-referential paradox mirrors other well-known paradoxes that arise in linguistics such as the liar’s paradox or the concept of the self-refuting idea.  What makes Kaplan’s analysis interesting (whether or not it originates with him) is the very strong formal analogy that he draws between the common act of predication that we all engage in nearly continuously, and the more abstruse structure of Russell’s paradox that few among us know or care about.

The heart of Kaplan’s analogy is the explicit mapping of the ‘contains’ idea from set theory – that sets contain members or elements, some of which are sets, including themselves – onto the ‘is true of’ idea of predication. 

To understand what Kaplan means by the ‘is true of’, we will again follow the structure of his examples with some minor verbal modifications to better suit my own taste.

Predication is the act of saying something about the subject of a sentence by giving a condition or attribute belonging to the subject.  In the following sentence

Frodo is a brave hobbit.

The subject of the sentence is “Frodo” and the predicate is “is a brave hobbit”.  The predicate “is a brave hobbit” is true of Frodo, as anyone who’s read Lord of the Rings can attest.  Kaplan then points out that the first basic rule of naïve set theory, which he states as

Rule #1 of sets: there is a set for any imaginable collection of a thing or of things

has, as its formal analogy in predication, the following:

Rule #1 of predication: there is a predicate for any imaginable characteristic of a thing.

The two key rules of set theory that lead to Russell’s paradox have their analogs in predication as well.   

Rule #10 of sets, which allows us to have sets of sets, is mirrored by Rule #10 of predication that tells us we can predicate things about predicates.  As an example of this, consider the following sentence:

“Is a Nazgul” is a terrifying thing to hear said of someone.

The predicate “Is a Nazgul” is the subject of that sentence and “is a terrifying thing to hear said of someone” is the predicate.

Rule #11 of sets, which allows sets to contain themselves (i.e., self-reference), finds its analog in Rule #11 of predication that tells us that predicates can be true of themselves.

Here we must proceed a bit more carefully.  Let’s start with a simple counterexample:

“Is a hobbit” is a hobbit.

This sentence is clearly false as the subject, the predicate “Is a hobbit”, is not a hobbit itself; it is a predicate.  But now consider the following sentence, which Kaplan offers:

“Is a predicate” is a predicate.

This sentence is clearly true as the subject, the predicate “Is a predicate”, is clearly a predicate.  And, so, Rule #11 of predication works.

Kaplan then constructs a table similar to the following (again, only minor verbal tweaks are made to the predicates that are not true of themselves, to suit my own taste):

Predicates not true of themselves | Predicates true of themselves
“is a brave hobbit” | “is a predicate”
“is a Nazgul” | “is a string of words”
“keeps his oaths” | “typically comes at the end of a sentence”

Note that the predicate “is true of itself” is a predicate that is true of all the predicates that are true of themselves, that is to say, of all the predicates that can be placed in the right column of the table above. The next step is then to ask what predicate is true of all the predicates that can be placed in the left column of the above table.  A little reflection should convince us that the predicate “is not true of itself” fits the bill. 

The final step is to ask in which of the two columns “is not true of itself” falls, or, in other words,

is “is not true of itself” true of itself?

If we assume that it is true of itself then the content found between the quotes tells us that it is not true of itself.  Equally vexing, if we assume that it is not true of itself, then that assumption matches the content found between the quotes, which tells us that it is true of itself.  In summary: if it is then it isn’t and if it isn’t then it is.  And we’ve generated the predicate analog of Russell’s paradox.

Of course, this is just a form of the well-known Liar’s Paradox, so we might be willing to just shrug it off as a quirk of language, but I think Kaplan is making a deeper point that is worth considering carefully.  At the root of his analysis is the realization that there are objective rules (or truths), that these rules generate self-referential paradoxes, and, so, one is forced to recognize that paradoxes are an essential ingredient not just of language but of thought itself.  And no amount of patching, such as was done to naïve set theory, can rescue us from this situation.  This observation, in turn, has the profound philosophical implication that there is only so far that logic can take us.

Is True Of – Part 1: Russell Redux

This month’s post and the next one are based on the YouTube video entitled Russell’s Paradox – a simple explanation of a profound problem by Jeffery Kaplan.  In that video, Kaplan links the well-known mathematical paradox found by Bertrand Russell with familiar linguistic paradoxes in what a friend characterized as a “stimulating alternative perspective”. 

Russell’s paradox shook the foundations of mathematics by casting doubt on whether it is possible to logically establish mathematics as an objective discipline with rules that reflect something beyond human convention or construction.  And while, in the aftermath, a variety of patches were proposed that sidestep the issue by eliminating certain constructions, Kaplan’s essential point is that the same mental processes that lead to Russell’s paradox lead to logical paradoxes in natural language.  These natural language paradoxes, in turn, reflect something deeper in how we think and, as a result, we can’t sidestep these processes in everyday life the way they are currently and narrowly sidestepped in mathematics.  Paradoxes seem to be built into the fabric of human existence.

To be clear, his essential point is not novel, and the connections that exist in human thought between formal logic, mathematical logic, and linguistics have been covered within a variety of contexts, including the liar’s paradox.  What is intriguing about Kaplan’s analysis is the method he employs to make this point using the basic function of predication within natural language.  I don’t know if his argument is truly one of his own making or whether it originates elsewhere, but it is a clever and particularly clear way of seeing that logic can only carry one so far in the world.

Following Kaplan, we’ll start with a review of Russell’s paradox.  The paradox arises in naive set theory as a function of three basic ideas.  First is the notion of a set as a collection of any objects of our perception or of our thought.  Second is the idea that set composition is unlimited in scope: members can be concrete or intangible, real or imagined, localized or spread over time and space.  Kaplan calls this idea Rule Number 1 – Unrestricted Composition.  Third is the idea that what matters for a set is what elements it contains, not how those elements are labeled.  Set membership can be determined by a specific listing or by specifying some comprehension rule that describes the members globally.  Kaplan calls this idea Rule Number 2 – Set Identity is Determined by Membership.

Using Rules 1 and 2, Kaplan then outlines a set of rules 3-11, each springing in some way from Rules 1 or 2 as a parent, which is indicated inside the parentheses:

  • Rule 3 – Order Doesn’t Matter (Rule 2)
  • Rule 4 – Repeats Don’t Change Anything (Rule 2)
  • Rule 5 – Description Doesn’t Matter (Rule 2)
  • Rule 6 – The Union of Any Sets is a Set (Rule 1)
  • Rule 7 – Any Subset is a Set (Rule 1)
  • Rule 8 – A Set Can Have Just One Member (Rule 1)
  • Rule 9 – A Set Can Have No Members (Rules 1 & 2)
  • Rule 10 – You Can Have Sets of Sets (Rule 1)
  • Rule 11 – Sets Can Contain Themselves (Rule 1)

Kaplan walks the viewer, in an amusing way, through increasingly complex set construction examples using these 11 rules, although I’ll modify the elements used in the examples to be more to my liking.  His construction starts with an example of a finite, listed set:

\[ A = \{ \mathrm{Frodo}, \mathrm{Sam}, \mathrm{Merry}, \mathrm{Pippin} \} \;  .\]

 Employing Rule 2 allows us to rewrite set $A$ in an equivalent way as

\[ A = \{ x | x \mathrm{\;is\;a\;hobbit\;in\;the\;Fellowship\;of\;the\;Ring} \} \; .\]

Kaplan then gives the famous example of a set, used by Gottlob Frege and Bertrand Russell, as a candidate for the fundamental definition of what the number 1 is:

\[ \{ x | x \mathrm{\;is\;a\;singleton\;set} \} \; .\]

Here we have a set that, if listed explicitly, might start as

\[ \{ \{\mathrm{\scriptsize Frodo}\}, \{\mathrm{\scriptsize Sam}\}, \{\mathrm{\scriptsize Merry}\}, \{\mathrm{\scriptsize Pippin}\}, \{\mathrm{\scriptsize Chrysler\;Building}\} \ldots \} \; .\]

All of this seems plausible if not particularly motivated, but the wheels fall off when we look at Rule 11.  That rule is deceptively simple: it is easy to say the words ‘sets can contain themselves’ but ultimately difficult, perhaps impossible, to understand what those words mean, as there exists no constructive way to actually build a set that contains itself.  Membership is assigned by a comprehension rule expressed as a sentence; any sentence will do and, quoting Kaplan, “If you can think of it, you can throw it in a set.”  We’ll return to that sentiment next month when we talk about Kaplan’s language-based analog to Russell’s paradox.  We’ll call such a Rule-11 set an extraordinary set and any set not containing itself ordinary.  The next step is then to define the set of all extraordinary sets as

\[ \{ x | x \mathrm{\;is\;a\;set\;that\;contains\;itself} \} \; .\]

This set, while having no constructive way of being listed, is still considered a valid set by naive set theory.  As preposterous as this set’s existence may be, it generates no paradoxes.  The real heartbreaker is the set defined as

\[ \{ x | x \mathrm{\;is\;a\;set\;that\;does\;not\;contain\;itself} \} \; .\]

The statement that this set of all ordinary sets contains itself has no well-defined truth value.  If we assume it does contain itself then it must obey the comprehension rule ‘does not contain itself’, so it doesn’t contain itself.  Alternatively, if we assume it does not contain itself then it satisfies the comprehension rule ‘does not contain itself’ and so must contain itself.  And, thus, the paradox is born.

Mathematicians have patched set theory by ‘changing the rules’ (as Kaplan puts it).  They developed different systems wherein Rule 11 or some equivalent is expressly forbidden (e.g., Zermelo-Fraenkel with the Axiom of Choice). 

But, Kaplan objects, the rules of set theory are not made up but are objective rules that reflect the objective, common rules of thought and language that we all use.  He gives a linguistic argument based on the act of predication in natural language to make this point.  That analysis is the subject of next month’s posting.

Neural Networks 3 – Fooling a Neural Network

Having looked at some of the positive aspects of the neural net, its ability to classify images and its relative simplicity of implementation, it is now time to look at some of the vulnerabilities and shortcomings of the neural net – and why fears of a future filled with Skynet or the machines from the Matrix are still many centuries off.  Along the way, we may also gain some insight into how neural networks are, contrary to some claims, not at all how people and their brains work.

An excellent place to start our journey is with the succinct observation made by Dave Gershgorn in his article Fooling the machine for Popular Science in which he said

[a]n accelerating field of research suggests that most of the artificial intelligence we’ve created so far has learned enough to give a correct answer, but without truly understanding the information. And that means it’s easy to deceive.

Since the article dates from 2016, Gershgorn could only present the then state-of-the-art examples of adversarial attacks on neural nets.  He draws heavily from the work of Christian Szegedy and others with two of the most relevant papers being Explaining and Harnessing Adversarial Examples (Goodfellow, Shlens, and Szegedy) and Intriguing properties of neural networks (Szegedy et al).

Both papers deal with attempts to trick a trained neural net through the use of what looks to be speckle noise but which is really a small but deterministic perturbation of the image feature vector in a direction in image space that turns the net’s classification from waveringly correct to dogmatically wrong.

Perhaps the most famous example of this is the panda-to-gibbon example shown below (taken from the first paper)

To understand what is happening and how the addition of a ‘speckle noise’ image could persuade the net that the image it was 58% sure was a panda should now be interpreted, with nearly 100% certainty, as a gibbon, we need to remember just how the neural net does its classifications.  Each of the above images, which we perceive as 2-dimensional (a bit more on this later), is actually serialized into a one-dimensional feature vector.  In the case above, judging by the perturbation (i.e., a count of the colored pixels along an edge), each image is about 240×240 pixels.  The relationship displayed in the figure above should really be interpreted as the linear algebra relationship

\[ {\vec P} + {\vec N} = {\vec G} \; , \]

where ${\vec P}$ is the displacement vector from the origin to the location of the original image in the panda-classification region, ${\vec G}$ is the location of the perturbed image, which is now somewhere in the gibbon-classification region, and ${\vec N} = 0.007 \nabla J$ is the displacement vector that connects the two.     

Since each of these vectors lives in a $240^2 = 57,600$ dimensional space, our intuitions about concepts such as ‘near’ and ‘far’ apply poorly, and the gibbon-region may be far away in some directions and quite close in others.  The structure of ${\vec N}$ (the middle image) was determined by experimenting with the net in order to find the “fastest” direction that nudges ${\vec P}$ out of the panda-classification region and into the gibbon-classification region.  The term ‘nudge’ is quite descriptive because the gain on the perturbation was small (0.007) and its presence in the final image is not discernible to the human eye.
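To make the idea of the ‘fastest’ direction concrete, here is a minimal sketch of this kind of nudge in numpy.  It follows the fast gradient sign recipe of Goodfellow, Shlens, and Szegedy, in which the perturbation is the sign of the loss gradient scaled by a small gain; the names below are placeholders of my own, not code from the paper.

```python
import numpy as np

# A schematic of the gradient 'nudge' described above (placeholder names, not
# the authors' code).  Given the gradient of the classification loss J with
# respect to the input pixels, step a small amount in the direction that
# increases the loss for the true label (the 'fast gradient sign' recipe).
def adversarial_nudge(image, loss_gradient, gain=0.007):
    perturbation = gain * np.sign(loss_gradient)    # the speckle-like middle image
    return np.clip(image + perturbation, 0.0, 1.0)  # keep pixel values in [0, 1]
```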

Of course, a normal person would still classify the perturbed image as a panda even if the gain on the perturbation were large enough for the human eye to perceive since we have other faculties at play.  But the neural net is, put frankly, stupid.  Gershgorn quotes one of Szegedy’s coauthors, Ian Goodfellow, as saying

these neural networks aren’t really learning certain ideas, but just how to recognize when they [find] the right idea

which is a subtle but important point.  No neural network can know that it is correct or understand why its answer is the right one; it simply responds to the metaphoric ‘atta-boys’ it has received by returning what the majority of us have told it is right.

When these results were published, the danger of this type of adversarial attack was downplayed because it depends on the attacker being able to play with the neural net to find the fastest direction via gradient descent (the name of this technique explains why Goodfellow, Shlens, and Szegedy denoted the perturbation as $\nabla J$).  If the net is public, that is to say one can give multiple inputs to the net in order to see how each is classified, then it is always possible to find the gradient direction that fools the network into misclassifying, even if the probabilities that the network assigns to the classification are hidden.  The article Attacking machine learning with adversarial examples by the staff at OpenAI discusses in great detail how this type of gradient attack works and how people try, and often fail, to defend against it.

The conventional thinking at the time was that if the network’s classification model were completely hidden from public scrutiny then a network could be trustworthy and therefore be reliably depended upon.  But even this conclusion was shown, shortly thereafter, to be on shakier ground than first believed.  In an article entitled Slight Street Sign Modifications Can Completely Fool Machine Learning Algorithms, Evan Ackerman points out that image classifying neural networks can be easily and completely fooled by the interplay of light and shadow on real-world traffic signs into making what would be disastrous conclusions should the network be behind the wheel of a car. 

Citing the paper, Robust Physical-World Attacks on Deep Learning Visual Classification, by Eykholt et al, Ackerman offers this image of 5 frames of a stop sign at different distances and orientations with respect to the camera

The stop signs aren’t real, but rather printed posters that have a specially chosen dappling of lighter and darker regions.  These posters fooled the neural network into believing that in each frame the camera was actually looking at a Speed Limit 45 sign.  The key point here is that these authors didn’t inject the network with a feature vector tuned to the original input image.  Rather, they decorated an object with a perturbation that carried out a successful attack regardless of how much of the feature vector that object consumed.  This is a possible real-world scenario and a frightening one.

This last point brings us back to an observation alluded to earlier.  Clearly, people of all mental capacities, backgrounds, and circumstances perceive the visual world quite differently than do neural networks.  We don’t serialize an image nor are we only able to receive images of a particular size; we can have our eyes wide open or half-closed, changing the ‘number of pixels’ we are receiving and not just what percentage any given object consumes.   We can also change our certainty in our classification and segmentation based on self-awareness and belief and not just on ‘atta-boys’ received from the world around us.  So, sleep well knowing that the days of Skynet are far off and that you remain safe – unless you happen to find yourself in a self-driving car.

Neural Networks 1 – Overview

This month’s column begins a new effort that is inspired by and strongly follows the video Building a neural network FROM SCRATCH (no Tensorflow/Pytorch, just numpy & math) by Samson Zhang.  Like Zhang, I agree that the best way to learn any algorithm is to build one from scratch even if, and especially if, you are going to use someone else’s code.  Usually, one can’t understand the behavior nor appreciate the effort that went into an external package without having ‘walked a mile in another developer’s shoes’.  You also don’t know where the algorithm is likely to work or fail if you don’t have some insight into the inner workings.

Zhang provides all of his code and data on his Kaggle notebook and that code base formed the starting point for this exploration.  However, the data used were from the Kaggle repository by Dariel Dato-On at MNIST in CSV | Kaggle.  In both cases the data are derived from the MNIST digitized and labeled samples of hand-written numerical digits 0-9, some of which look like:

Each image is 28×28 pixels with values ranging from 0-255.  For convenience, each image is stored in CSV format as a row consisting of 785 values.  The first column contains the label for the image (i.e., ‘0’, ‘5’, ’5’, or ‘7’ for the cases above) and the remaining 784 columns the image data.  The entire data set consists of 60,000 such rows.  For this first experiment, 59,000 were used as training data and 1,000 as test data to evaluate the success of the net in converging into some useful state.
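As a concrete illustration of this layout, a minimal sketch of reading such a CSV file with numpy and carving off the training and test rows might look like the following (the file name and the presence of a header row are assumptions about the particular download, not part of Zhang’s code):

```python
import numpy as np

# Load the MNIST-in-CSV file described above: one row per image, the label in
# the first column and the 784 pixel values in the remaining columns.
data = np.loadtxt("mnist_train.csv", delimiter=",", skiprows=1)  # skiprows=1 drops a header row, if present

labels = data[:, 0].astype(int)     # digit labels 0-9
images = data[:, 1:] / 255.0        # pixel values normalized to [0, 1]

# 59,000 rows for training, 1,000 held back for testing
train_images, train_labels = images[:59000], labels[:59000]
test_images,  test_labels  = images[59000:60000], labels[59000:60000]
```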

The basic idea of a neural network is the nonlinear transformation of input arrays to output arrays that provides a high probability of classifying the input data into the correct ‘bucket’.  In this example, the input arrays are 784-length, normalized, column vectors with real values ranging $[0,1]$ and the outputs are probabilities of belonging in one of the ten possible output buckets of $[0,1,2,3,4,5,6,7,8,9]$.

Zhang’s network consists of 3 layers: an input layer consisting of 784 slots to accommodate the 784 pixels in each image; a hidden layer consisting of 10 slots, each receiving input from each slot in the previous layer (i.e., input layer), the precise mechanics of which will be discussed below; and finally an output layer consisting of 10 slots, each, again, receiving input from each slot in the previous layer (i.e., the hidden layer).  Properly normalized, the values in the output layer can be interpreted as probabilities with the maximum value corresponding to the most likely label. 

To be concrete, below is a schematic of how the final trained network decided that a ‘0’ input image should be classified as a ‘0’:

The image of the hand-written ‘0’ was converted to a one-dimensional array with 784 components, four of which are shown being fed into the input layer.  Pre-determined weights and biases, $W$ and $b$ at both layers (of different sizes, of course) create an affine transformation of the results of the previous layer

\[ Z^{(1)} = W^{(1)} \cdot X^{(0)} + b^{(1)} \; , \]

where the ‘dot product’ between the $W$ and the $X$ stands for the appropriate matrix product (i.e., $W_{ij} X_{j}$).  This transformation is then applied to each image in the training set (Zhang used 39,000 images; the one detailed here used 59,000 – with both cases holding back 1000 images for testing).  Typically, for data sets this small, the images are stacked in a single array and fed in all at once since the average cost over the whole set is easily computed with little storage overhead.  Thus, there is actually a suppressed index in Zhang’s code.  This architecture will be revisited in a later blog for data sets too large to hold completely in memory but for now Zhang’s approach is followed as closely as possible.

The input, $X^{(1)}$, for the next layer results from passing the $Z^{(1)}$ outputs through a nonlinear transformation.  The nonlinearity represents the ‘activation’ of each neuron and the current best practice involves using the ReLU function defined as

\[ ReLU(x) = \left\{ \begin{array}{lc} x & x > 0 \\ 0 & \mathrm{otherwise} \end{array} \right. \; , \]

meaning that

\[ X^{(1)} = ReLU(Z^{(1)}) \; . \]

The dimensionality of each column of $Z^{(1)}$ is now 10, although different numbers of neurons in the hidden layer are also possible.  This is another point deferred to a later blog.

The transformation from the hidden layer to the output layer works in much the same way with

\[ Z^{(2)} = W^{(2)} \cdot X^{(1)} + b^{(2)} \;  , \]

but the nonlinear transformation to the output layer is, in this case, the SoftMax transformation that normalizes each of the nodes on the output layer against the whole set using

\[ X^{(2)} = \frac{\exp(Z^{(2)})}{\sum \exp(Z^{(2)})} \; , \]

resulting in a functional form reminiscent, up to the sign of the exponent, of the partition function from statistical mechanics.
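Putting the two layers together, a minimal numpy sketch of this forward pass (not Zhang’s code verbatim; the images are assumed to be stacked as the columns of $X^{(0)}$, and the softmax shifts by its maximum argument purely for numerical stability) might read:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))  # shift by the max for numerical stability
    return e / e.sum(axis=0, keepdims=True)       # columns now sum to 1

def forward(X0, W1, b1, W2, b2):
    Z1 = W1 @ X0 + b1     # affine map into the 10-slot hidden layer
    X1 = relu(Z1)         # activation
    Z2 = W2 @ X1 + b2     # affine map into the 10-slot output layer
    X2 = softmax(Z2)      # interpreted as class probabilities
    return Z1, X1, Z2, X2
```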

Of course, the classification ability of the neural network will be poor for a set of arbitrary weights and biases.  The key element in training the net is to find a combination of these values that maximizes the performance of the net across the statistically available input – in other words, an optimization problem.

In the lingo of neural nets, this optimization problem is solved via the back-propagation process, which basically depends on the differentiability of the transformations and the implicit (or inverse) function theorems to determine the proper inputs given the desired outputs.  The starting point is the formation of the ‘residuals’ $R$ (a term used locally here) defined as

\[ R = X^{(2)} - Y \; , \]

where $Y$ is the true or correct value for the labeled image.  For the ‘0’ example above, $Y$ will look like the 10-element array

\[ Y = \left[ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 \right] \; .\]

If the label had been ‘4’, then the 1 would have been put in the corresponding spot for four, which is the fifth array element in from the left.
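In code, building $Y$ from the label column is a short exercise in indexing; a small sketch, again assuming the images are stacked as columns so that $Y$ gets one column per image:

```python
import numpy as np

def one_hot(labels, num_classes=10):
    # one column per image, with a 1 in the row matching the label and 0 elsewhere
    Y = np.zeros((num_classes, labels.size))
    Y[labels, np.arange(labels.size)] = 1.0
    return Y

print(one_hot(np.array([0, 4])))   # first column encodes '0', second encodes '4'
```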

The equations Zhang cites for this back propagation are:

\[ dZ^{[2]} = A^{[2]} - Y \; , \]

\[ dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T} \; ,\]

\[ dB^{[2]} = \frac{1}{m} \Sigma {dZ^{[2]}} \; , \]

\[dZ^{[1]} = W^{[2]T} dZ^{[2]} .* g^{[1]\prime} (z^{[1]}) \; ,\]

\[ dW^{[1]} = \frac{1}{m} dZ^{[1]} A^{[0]T} \; ,\]

and

\[ dB^{[1]} = \frac{1}{m} \Sigma {dZ^{[1]}} \; .\]

These equations clearly show the backwards chaining of derivatives via the chain rule but are quite difficult to decode as he mixes notation (for example his $A$’s are also $X$’s) and intermixes some of the broadcasting rules he uses when employing numpy (hence the factor of $m$, which is the number of rows in the training data set, and the presence of the transpose $T$ notation) with those representing the connecting edges of the net.  Nonetheless, his code for the net works reasonably well with an achieved accuracy of about 85% to 89% on the test set and it forms an excellent starting point for building a neural network from the ground up.
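To make the chaining a little more transparent, here is a rough numpy rendering of those equations under the same column-stacking convention used above (the names are illustrative, with his $A$’s playing the role of the $X$’s used earlier; this is a sketch, not Zhang’s code):

```python
import numpy as np

def backward(X0, Y, Z1, A1, A2, W2):
    m = X0.shape[1]                                   # number of training images
    dZ2 = A2 - Y                                      # dZ[2] = A[2] - Y
    dW2 = (1.0 / m) * dZ2 @ A1.T                      # dW[2] = (1/m) dZ[2] A[1]^T
    db2 = (1.0 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)                     # ReLU'(Z1) is 1 where Z1 > 0, else 0
    dW1 = (1.0 / m) * dZ1 @ X0.T                      # dW[1] = (1/m) dZ[1] A[0]^T
    db1 = (1.0 / m) * np.sum(dZ1, axis=1, keepdims=True)
    return dW1, db1, dW2, db2
```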

Next month’s blog will focus on coding the nuts and bolts of a similar network from scratch, along with a proper derivation of the back propagation.

Multivariate Gaussian – Isserlis’ theorem

Last month’s post established how to compute arbitrary multivariate moments of a Gaussian distribution by coupling the desired distribution with a ‘source’ term

\[ I = \int d^n r \, e^{-r^T A r/2 + b^T r} = e^{b^T A^{-1} b/2} \, \frac{(2 \pi)^{n/2}}{\sqrt{\det{A}}} \; , \]

for an arbitrary positive definite matrix $A$ and a multivariate quadratic form $r^T A r$.  The coupling term $b$ allows for the computation of the moments by taking derivatives with respect to $b$ and then setting $b=0$ afterwards.  For example, from the last post, the multivariate moment

 \[ E[x^4y^2] = \left. \frac{\partial^6}{\partial b_x^4 \partial b_y^2 } e^{b^T B b/2} \right|_{b=0} = 3 B_{xx}^2 B_{yy} + 12 B_{xx} B_{xy}^2 \; , \]

where $B = A^{-1}$.

Of course, these derivatives are easy to evaluate with modern computer algebra systems, but calculating them by hand is prohibitively expensive in many cases, and a variety of techniques, worth knowing in their own right, were developed for short-cutting much of the labor.  One such technique is Isserlis’ theorem, which is a way of organizing the algebra so that the computation depends only on determining $A^{-1}$ and performing some combinatorics.  It is that last activity that will be the focus of this blog post.

We will be considering a general multivariate moment of the form

\[ E[x^{n_1} y^{n_2} z^{n_3} \cdots] \; .\]

The first thing to note is that the total power of the moment $n = n_1 + n_2 + \cdots$ must be even in order to have a non-zero result, otherwise there will always be a residual $b$ in the result that will become null when the requirement $b=0$ is enforced.  As a result we can express the total power of the moment as $n = 2m$.  Second, the only surviving term in the expansion of the exponential will be the $m^{\textrm{th}}$ term for similar reasons.  The resulting moments are always taken pairwise because of the quadratic form of the exponential, as seen above.  The only unknown is the numerical coefficient in front of the various terms. 

Since we are interested in the unique pairings, it is useful to give each variable in the statistical moment a unique label in addition to its name.  With this approach, $x^4 y^2 = x_1 x_2 x_3 x_4 y_1 y_2$.  A total power $n$ corresponds to $n$ such unique terms and $n!$ possible permutations of a list of them.  Since $n$ is even, there are $m = n/2$ pairs of consecutive terms in each list to form the $m$ individual pairs.  Since the order of the pairs doesn’t matter, we divide by $m!$ to get the unique combinations of pairs and then again by a factor of $2$ for each pair to account for the fact that a pair is symmetric as to ordering; $x_1 x_2$ is equivalent to $x_2 x_1$.  Racking all these pieces up gives the number of terms as

\[ N = \frac{(2m)!}{m! 2^m} \; .\]

Regrouping the denominator as to associate with each term in $m!$ a value of $2$ gives

\[ N = \frac{2m (2m-1) (2m-2) \cdots }{ 2m (2m-2) (2m-4) \cdots} = (2m-1) (2m-3) \cdots \equiv (2m-1)!! \; . \]

For our test case of $E[x^4 y^2]$, $n=6$, $m = 3$ and $(2m-1)!! = 15$.  Since $15 = 3 + 12$, that checks.

To get the specific breakdown (i.e., $15 = 3 + 12$ instead of, say, $15 = 11 + 4$) requires a bit more combinatoric work that is best explained in terms of diagrams.  For convenience, we will use angle brackets instead of $E[]$ to represent expectation since the result will be easier to read, even if each term is not as expressive.

In the first diagram

we group the $y$’s together and then form an appropriate tree of the possible combinations of $x$’s.  Note that care is taken to keep the numeric labels always increasing from left to right, eliminating double counting.  Once the enumeration is complete, we can remove the numeric labels and count the number of occurrences of $\langle x^2 \rangle$ with $\langle y^2 \rangle$.  The answer $3 \langle x^2 \rangle^2 \langle y^2 \rangle$ matches the algebraic factor $3 B_{xx}^2 B_{yy}$.

In the second diagram

we group $y_1$ with all of the possible $x$’s (four sub-diagrams) and then proceed as above.  The answer $ 12 \langle x y \rangle ^2 \langle x^2 \rangle$ matches the algebraic factor $12 B_{xx} B_{xy}^2$.  It might be tempting to think that there should be an equivalent set of four sub-diagrams with $y_2$ at the top of each tree but a bit of reflection should convince us that the presence of $y_2$ in every subterm serves to arrive at the same thing without double counting.  In other words, a simple reshuffling of the 12 terms can be made to have $y_2$ as the head of each sub-diagram instead of $y_1$.
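A quick, brute-force way to check this bookkeeping is to enumerate the pairings directly; the short Python sketch below (a sanity check, not part of the diagrammatic method itself) reproduces both the total of 15 and the 3 + 12 split:

```python
from collections import Counter
from itertools import permutations

# Enumerate the distinct ways of splitting the labeled factors x1 x2 x3 x4 y1 y2
# into unordered pairs, then tally the pairings by the variable types they pair up.
labels = ['x1', 'x2', 'x3', 'x4', 'y1', 'y2']

pairings = set()
for perm in permutations(labels):
    # consecutive elements form the pairs; sorting removes the order dependence
    pairs = tuple(sorted(tuple(sorted(perm[i:i + 2])) for i in range(0, len(labels), 2)))
    pairings.add(pairs)

kinds = Counter(tuple(sorted(a[0] + b[0] for a, b in pairing)) for pairing in pairings)

print(len(pairings))   # 15, i.e. (2m - 1)!! with m = 3
print(kinds)           # Counter({('xx', 'xy', 'xy'): 12, ('xx', 'xx', 'yy'): 3})
```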

As a final example, we consider a 3-dimensional case with the aim of calculating $E[x^2 y z^3]$.  This expression spawns 3 diagrams:

and

and

which tells us that

\[ E[x^2 y z^3] = 3 B_{xx} B_{yz} B_{zz} + 6 B_{xy} B_{xz} B_{zz} + 6 B_{xz}^2 B_{yz} \; , \]

which is confirmed by sympy computer algebra, whose output for $E[x^2 y z^3]$ reads 3*Bxx*Byz*Bzz + 6*Bxy*Bxz*Bzz + 6*Bxz**2*Byz.
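For completeness, that sympy check can be set up in a few lines by differentiating the generating function $e^{b^T B b/2}$ from last month’s post directly; a sketch:

```python
import sympy as sp

# Entries of B = A^{-1} (symmetric) and the components of the source b
Bxx, Bxy, Bxz, Byy, Byz, Bzz = sp.symbols('Bxx Bxy Bxz Byy Byz Bzz')
bx, by, bz = sp.symbols('bx by bz')

b = sp.Matrix([bx, by, bz])
B = sp.Matrix([[Bxx, Bxy, Bxz],
               [Bxy, Byy, Byz],
               [Bxz, Byz, Bzz]])

# Generating function exp(b^T B b / 2); E[x^2 y z^3] is its 6th derivative at b = 0
G = sp.exp((b.T * B * b)[0, 0] / 2)
moment = sp.diff(G, bx, 2, by, 1, bz, 3).subs({bx: 0, by: 0, bz: 0})

print(sp.simplify(moment))   # 3*Bxx*Byz*Bzz + 6*Bxy*Bxz*Bzz + 6*Bxz**2*Byz
```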

Multivariate Gaussian – Part 2

This month’s blog continues the exploration begun in the last installment, which covered how to evaluate and normalize multivariate Gaussian distributions.  The focus here will be on evaluating moments of these distributions.  As a reminder, the prototype example is from Section 1.2 of Brezin’s text Introduction to Statistical Field Theory, where he posed the following toy problem for the reader:

\[ E[x^4 y^2] = \frac{1}{N} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^4 y^2 e^{-(x^2 + xy + 2 y^2)} \; , \]

where the normalization constant $N$, the value of the integral with only the exponential term kept, is found, as described in the last post, by re-expressing the exponent of the integrand as a quadratic form $r^T A r/2$ (where $r$ presents the independent variables as a column array) and then expressing the value of the integral in terms of the determinant of the matrix $A$: $N = (2 \pi)^{(n/2)}/\sqrt{\det{A}}$.  Note: for reasons that will become apparent below, the scaling of the quadratic form was changed with the introduction of a one-half in the exponent.  Under this scaling $A \rightarrow A/2$ and $\det{A} \rightarrow \det{A}/2^n$, where $n$ is the dimensionality of the space.

To evaluate the moments, we first extend the exponent by ‘coupling’ it to an $n$-dimensional ‘source’ $b$ via the introduction of the term $b^T r$ in the exponent.  The integrand becomes:

\[ e^{-r^T A r/2 + b^T r} . \]

The strategy for manipulating this term into one we can deal with is akin to usual ‘complete the square’ approach in one dimension but generalized.  We start by defining a new variable

\[ s = r - A^{-1} b \; .\]

Regardless of what $b$ is (real positive, real negative, complex, etc.) the original limits on the $r_i \in (-\infty,\infty)$ carry over to $s$.  Likewise, the measures of the two variables are related by $d^ns =d^nr$.  In terms of the integral, this definition simply moves the origin of the $s$ relative to the $r$ by an amount $A^{-1}b$, which results in no change since the range is infinite.

Let’s manipulate the first term in the exponent, $r^T A r$, step-by-step.  Substituting $r = s + A^{-1} b$ into the second factor gives

\[ r^T A (s + A^{-1} b) = r^T A s + r^T b \; .\]

Since $A$ is symmetric, its inverse is too, $\left( A^{-1} \right)^T = A^{-1}$, and this observation allows us to expand the first factor in the same way, giving

\[ (s^T + b^T A^{-1} ) A s + r^T b = s^T A s + b^T s + r^T b \; . \]

Next note that $b^T r = b_i r_i = r^T b$ (summation implied) and then substitute these pieces into the original expression to get

\[ -r^T A r/2 + b^T r = -( s^T A s + b^T s + b^T r)/2 + b^T r \\ = -s^T A s/2 + b^T ( r - s )/2  \; . \]

Finally, using the definition of $s$ the last term can be simplified to eliminate both $r$ and $s$, leaving

\[ -r^T A r/2 + b^T r = -s^T A s/2 + b^T A^{-1} b /2 \; .\]

The original integral $I = \int d^n r \exp{(-r^T A r/2 + b^T r)}$ now becomes

\[ I = e^{b^T A^{-1} b/2} \int d^n s \, e^{-s^T A s/2} = e^{b^T A^{-1} b/2} \, \frac{(2 \pi)^{n/2}}{\sqrt{\det{A}}} \; . \]

This result provides us with the mechanism to evaluate any multivariate moment of the distribution by differentiating with respect to $b$ and then setting $b=0$ afterwards. 

For instance, if $r^T = [x,y]$, then the moment of $x$ with respect to the original distribution of the Brezin problem is 

\[ E[x] = \frac{1}{N} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x e^{-(x^2 + xy + 2 y^2)} \; , \]

where $N = 2\pi/\sqrt{\det{A}}$.

Our expectation is, by symmetry, that this integral is zero.  We can confirm this by first noting that the expectation value is given by

\[ E[x;b] = \frac{\partial}{\partial b_x } \frac{1}{N} \int dx \, dy \, e^{-r^T A r/2 + b_x x + b_y y} \\ = \frac{1}{N} \int dx \, dy \, x e^{-r^T A r/2 + b_x x + b_y y} \; ,\]

where the notation $E[x;b]$ is adopted to remind us that we have yet to set $b=0$.

Substituting the result from above gives

 \[ E[x] = \left. \left( \frac{\partial}{\partial b_x } e^{b^T A^{-1} b/2} \right) \right|_{b=0} \\ = \left. \frac{\partial}{\partial b_x} (b^T A^{-1} b)/2 \right|_{b=0} \; , \]

where, in the last step, we used $\left. e^{b^T A^{-1} b/2} \right|_{b=0} = 1$.

At this point it is easier to switch to index notation when evaluating the derivative to get

\[ E[x] = \left. \frac{\partial}{\partial b_x} (b_i (A^{-1})_{ij} b_j/2) \right|_{b=0} = \left. (A^{-1})_{xj} b_j \right|_{b=0} = 0 \; ,\]

where the rationale for the scaling by $1/2$ now becomes obvious.

At this point, we are ready to tackle the original problem.  For now, we will use a symbolic algebra system to calculate the required derivatives, deferring until the next post some of the techniques needed to simplify and generalize these steps.  In addition, for notational simplicity, we assign $B = A^{-1}$. 

Brezin’s problem requires that

\[ A = \left[ \begin{array}{cc} 2 & 1 \\ 1 & 4 \end{array} \right] \; \]

and that

\[ B = A^{-1} = \frac{1}{7} \left[  \begin{array}{cc} 4 & -1 \\ -1 & 2 \end{array} \right] \; .\]

The required moment is related to the derivatives of $b^T B b$ as

\[ E[x^4y^2] = \left. \frac{\partial^6}{\partial b_x^4 \partial b_y^2 } e^{b^T B b/2} \right|_{b=0} = 3 B_{xx}^2 B_{yy} + 12 B_{xx} B_{xy}^2 \; . \]

Substituting from above yields

\[ E[x^4 y^2] = 3 \left( \frac{4}{7} \right)^2 \frac{2}{7} + 12 \frac{4}{7} \left( \frac{-1}{7} \right)^2 \; , \]

which evaluates to

\[ E[x^4 y^2] = \frac{96}{343} + \frac{48}{343} = \frac{144}{343} \; . \]

 A brute force Monte Carlo evaluation gives the estimate $E[x^4 y^2] = 0.4208196$ compared with the exact decimal representation of $0.4198251$, a $0.24\%$ error, which is in excellent agreement.
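For the curious, that sort of brute force check takes only a few lines of numpy; in the sketch below the box limits, seed, and sample count are arbitrary choices (the limits cancel in the ratio so long as the box comfortably contains the distribution):

```python
import numpy as np

rng = np.random.default_rng(20240401)

# Uniform samples over a box that comfortably contains the weight exp(-(x^2 + x*y + 2*y^2))
n_samples = 10_000_000
x = rng.uniform(-6.0, 6.0, n_samples)
y = rng.uniform(-12.0, 12.0, n_samples)

w = np.exp(-(x**2 + x*y + 2*y**2))               # unnormalized Gaussian weight
estimate = np.sum(x**4 * y**2 * w) / np.sum(w)   # ratio of the moment integral to N

print(estimate, 144/343)   # the estimate should land near 0.41983
```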

Multivariate Gaussian – Part 1

In this month’s blog, we explore some of the statistical results from multivariate Gaussian distributions.  Typically, many of these results were developed in support of particular problems in science and engineering, but they stand on their own in that the results flow from the logic of statistical thinking that can be cleanly separated from the application that birthed them. 

The Monte Carlo method, which was the subject in a previous run of blogs (Monte Carlo Integration, Part 1 to Part 5), will be used here as an additional check on the techniques developed, but there will be no emphasis on the mechanics of these techniques.

For a typical, motivating example, consider the expectation value presented in Section 1.2 of Brezin’s text Introduction to Statistical Field Theory, which asks the reader to compute 

\[ E[x^4 y^2] = \frac{1}{N} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^4 y^2 e^{-(x^2 + xy + 2 y^2)} \; , \]

where the normalization constant $N$ is given by

\[ N = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + xy + 2 y^2)} \; . \]

Examples like these arise in any discipline where the model describing a system depends on several (perhaps a huge number of) parameters that manifest themselves as inter-correlated variables that we can measure.  The cross-correlation between the variables $x$ and $y$ is displayed by the $xy$ term in the exponential.

The first task in determining the expectation value is to evaluate the normalization constant.  The general method for solving this task will be the focus of this installment. 

The method for integrating Gaussian integrals in one dimension is well known (see, e.g., here) but the approach for multivariate distributions is not as common.  The first point is to look at the quadratic form in the argument of the exponential as being built as the inner product between an array presenting the variables $x$ and $y$ and some matrix $A$.  In the present case this is done by recognizing

\[ x^2 + xy + 2 y^2 = \left[ \begin{array}{cc} x & y \end{array} \right] \left[ \begin{array}{cc} 1 & 1/2 \\ 1/2 & 2 \end{array} \right] \left[ \begin{array}{c} x \\ y \end{array} \right]  \; . \]

Looking ahead to higher dimensional cases, we can see that this expression can be written more generally and compactly as $r^T A r \equiv r_i A_{ij} r_j$, where $r$ is a column array presenting the independent variable of the distribution (now imagined to be $n$ in number) and $A$ is a real $n \times n$ matrix; the superscript $T$ indicates that the transpose of the original array is being used. 
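As a quick sanity check of the $2 \times 2$ case above, a couple of lines of sympy confirm that the matrix reproduces the exponent (a sketch, not part of the derivation):

```python
import sympy as sp

x, y = sp.symbols('x y')
r = sp.Matrix([x, y])
A = sp.Matrix([[1, sp.Rational(1, 2)],
               [sp.Rational(1, 2), 2]])

print(sp.expand((r.T * A * r)[0, 0]))   # x**2 + x*y + 2*y**2, matching the exponent above
```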

With this new notation, any $n$-dimensional Gaussian distribution can now be written as

\[ e^{-(r^T A r)}  \; . \]

Since this expression is symmetric in $r$, $A$ can be taken to be symmetric without loss of generality.  This observation guarantees that $A$ can be diagonalized by an orthogonal matrix $M$ such that $M^T A M = D$, $M^T M = M M^T = 1$, and $D$ is a diagonal matrix.  This, in turn, means that the quadratic form can be rewritten as

\[ r^T A r = r^T (M M^T) A (M M^T) r = \\ (r^T M) (M^T A M) (M^T r) = (r^T M) D (M^T r) \; . \]

The next step is to define a new set of variables $s = M^T r$, which decouples the multivariate distribution into a product of $n$ independent one-dimensional Gaussians:

\[ e^{ - (r^T A r) } = e^{ - (s^T D s) } = \prod_{i=1}^{n} e^{-d_i s_i^2} \; , \]

where the $\{d_i\}$ are the individual, on-diagonal elements of $D$ and the $\{s_i\}$ are the components of $s$.  For the integral to exist (i.e., converge), we need one additional stipulation: that all of the $d_i > 0$.  Said differently, we require that $A$ have positive eigenvalues (i.e., be positive definite).

The final step is to recognize that the measure of the integral changes in a convenient way when we make the substitution $r = M s$, since

\[ d^n r = |det(M)| \, d^n s = d^n s  \; , \]

where we’ve exploited the fact that, since $M$ is orthogonal, the absolute value of its determinant is 1.

Putting all of these pieces together we now have

\[ \int_{R^n} d^n r \, e^{-(r^T A r)} = \int_{R^n} d^n s \prod_{i=1}^n e^{-d_i s_i^2} = \prod_{i=1}^{n} \int_{-\infty}^\infty ds_i e^{-d_i s_i^2} \; .\]

Since the right-hand side is a product of independent Gaussian integrals, it simplifies to

\[\prod_{i=1}^{n} \int_{-\infty}^\infty ds_i e^{-d_i s_i^2} = \prod_{i=1}^{n} \sqrt{\frac{\pi}{d_i}} = \frac{\pi^{n/2}}{\sqrt{d_1 \cdots d_n}} \; . \]

The product under the square root in the denominator of the last expression is the determinant of $D$, which is also the determinant of $A$.  So, the final formula is

\[ \int_{R^n} d^n r e^{-(r^T A r)} = \frac{\pi^{n/2}}{\sqrt{det(A)}} \; .\]

Returning to the original example, the formula tells us that, since $det(A) = 7/4$, the normalization constant is $N = 2\pi/\sqrt{7} \approx 2.37482$.  We can confirm this using brute force Monte Carlo by uniformly sampling across the ranges $x \in [-6,6]$ and $y \in [-12,12]$.  The limits of these ranges were selected by going out roughly to 6 times the respective eigenvalue ($d_x = 0.79$, $d_y = 2.21$), but the estimated integral is insensitive to fine details as long as the computational domain is sufficiently large.  That said, there is an art to picking the computational domain, and sophisticated pieces of software are designed to adjust to the integrand.  Implementing the brute force Monte Carlo in python gave an estimate of $N = 2.37661$ for 10 million trials, an error of less than $0.1\%$.
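A minimal numpy sketch of that brute force check (the random seed and the trial count are arbitrary choices) is:

```python
import numpy as np

rng = np.random.default_rng(7)

# Uniform sampling over the box x in [-6, 6], y in [-12, 12] described above
n_trials = 10_000_000
x = rng.uniform(-6.0, 6.0, n_trials)
y = rng.uniform(-12.0, 12.0, n_trials)

box_area = 12.0 * 24.0
N_estimate = box_area * np.mean(np.exp(-(x**2 + x*y + 2*y**2)))

print(N_estimate, 2*np.pi/np.sqrt(7))   # estimate vs the exact 2*pi/sqrt(7) ≈ 2.37482
```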

As a last sample, suppose we provide a bigger matrix given by:

\[ A = \left[ \begin{array}{cccc} 3 & 1 & 5 & 3\\ 1 & 5 & 2 & 0\\ 5 &  2 & 10 & 3 \\ 3 & 0 & 3 & 14 \end{array} \right] \; .\]

The Gaussian integral is

\[ N = \int_{R^n} d^n r \,  e^{-(3w^2 + 2wx + 10wy + 6wz + 5x^2 + 4xy + 10y^2 + 6yz + 14z^2)} \; .\]

The eigenvalues of $A$ are: $\{17.63,0.26,9.85,4.26\}$.  Application of the formula gives $N = 0.70497$ compared with a brute force Monte Carlo estimate of $0.71061$.
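Both the eigenvalues and the value quoted here are easy to reproduce with numpy; a short sketch:

```python
import numpy as np

A = np.array([[3, 1, 5, 3],
              [1, 5, 2, 0],
              [5, 2, 10, 3],
              [3, 0, 3, 14]], dtype=float)

n = A.shape[0]
print(np.sort(np.linalg.eigvalsh(A)))              # approximately [0.26, 4.26, 9.85, 17.63]
print(np.pi**(n / 2) / np.sqrt(np.linalg.det(A)))  # approximately 0.70497 (det(A) = 196)
```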

With this operation well in hand, next month’s post will focus on calculating various moments of these distributions.

Hiding in Second Sight

One of the persistent mysteries surrounding that art form that is called the mystery is exactly how to write a good one. Of course, there’s no denying that a command of the English language is essential.  So too is the ability to create such engaging characters that we don’t want the mystery to end.  Indeed, the masters, such as Arthur Conan Doyle, G.K. Chesterton, Rex Stout, and Agatha Christie, created such enduring characters that we enjoy visiting them again and again regardless of the crime underfoot.  And having a victim, either totally sympathetic or entirely loathsome, is important so that we can either enjoy righteous justice being meted out or secretly delight in someone getting what’s coming to him.  But foremost is the requirement, agreed to by tacit contract between the writer and the reader, that the detective’s solution be a satisfying explanation with surprises and ‘aha’ moments that nonetheless ties up all the loose ends.  As Chesterton puts it, the writer should provide a simple explanation of something that only appears complex; an explanation that doesn’t, itself, need an explanation.

There are several classic ways that authors employ to add complexity to a mystery, to obscure what’s going on.  These include the red herring, in which an event or a set of clues leads nowhere and only serves as dust to cover the real trail.  Another classic approach is to provide so many extraneous details, typically by indulging in sumptuous descriptions of surroundings or detailed appearance and behaviors of the characters, that a vital clue is lost amid noise.  And, of course, there are accusations leveled at authors who withhold vital clues and omit important details, as is hilariously summarized in the final act of the movie Murder by Death. 

Omitting vital clues or important information is commonly considered unfair, a bad show that breaks the rules of the implicit contract discussed above.  And yet, there is a way to omit something from a mystery while still playing fair with the reader.  This fair-play omission occurs when the writer leaves out the second of the Three Acts of the Mind, thereby leaving out a judgement or predication of the facts already revealed, which any reader with common sense can make for himself but usually doesn’t.

For those unfamiliar with what philosophers mean by the three acts of the mind, a terse summary must suffice. The first act of the mind is that mental faculty that allows a person to grasp what is being observed.  For example, it is the ability we all possess to look at a room and distinguish a chair from the pistol resting on the seat.  In the second act of the mind, a predication is made about some object observed in the first act, such as: the pistol is smoking.  The third act of the mind ties the various observations and their corresponding predications together with connective reasoning to conclude something like: since the pistol is still smoking and the cause of death is a bullet wound, the murderer must have committed the crime only a short time before.

Omitting the second act of the mind doesn’t violate the contract since it neither steals a simple clue from the reader’s notice nor does it hide a clue in plain sight by requiring the reader to have narrow, specialized knowledge to understand it.  Rather, the writer shares information freely but doesn’t spoon-feed the proper interpretation, relying, instead, on the reader to provide it for himself.

An excellent example of how to omit the second act of the mind and thereby write an engaging mystery is Edgar Allan Poe’s The Murders in the Rue Morgue.

This mystery, told by an unnamed narrator, centers on the horrific deaths of Madame L’Espanaye and her daughter Mademoiselle Camille L’Espanaye.  Both victims were “frightfully mutilated” with the mother being essentially decapitated by the murderer using nothing more than an ordinary straight razor, while the daughter’s battered and strangled corpse was shoved, head-down, up the chimney so far that the body “could not be got down until four or five of the party united their strength.”

Poe makes the mystery even more interesting by providing scores of witnesses who heard the screams of the women as the crime was unfolding.  A subset of these followed a gendarme into the house after the latter had forced entry and arrived on the scene, an upper-floor bedroom, in time to find the daughter’s corpse still warm.  The door from the hall into the bedroom was locked from the inside, the room itself was in shambles with the furnishings thrown about, and all its windows were closed and locked.  Each of the witnesses who ascended the stairs during the murder testified to hearing two voices coming from above in “angry contention.”  The first voice was universally agreed to be the gruff voice of a French-speaking male.  The second voice, variously described as shrill or harsh, defied classification by the witnesses.  A Frenchman thought the voice to be speaking Italian; an Englishman thought it to be German; a Spaniard thought it English; and so on.

Having laid out the facts, Poe then allows the reader some space before having his detective, C. Auguste Dupin, fill in the gaps.  Dupin provides the judgement (i.e., predication) of the facts that serve as the connective tissue between the simple apprehension of the facts (e.g., two voices in angry contention) and the solution of the crime.

Several of the interpretations that Dupin provides become clear after their revelation, but none is so arresting as his insight into the inability of the witnesses to agree on the nationality of the shrill or harsh voice.  Poe puts it marvelously in this exchange between Dupin and the narrator:

“Let me now advert – not to the whole testimony respecting these voices – but to what was peculiar in that testimony. Did you observe any thing peculiar about it?”

I remarked that, while all the witnesses agreed in supposing the gruff voice to be that of a Frenchman, there was much disagreement in regard to the shrill, or, as one individual termed it, the harsh voice.

“That was the evidence itself,” said Dupin, “but it was not the peculiarity of the evidence. You have observed nothing distinctive. Yet there was something to be observed. The witnesses, as you remark, agreed about the gruff voice; they were here unanimous. But in regard to the shrill voice, the peculiarity is – not that they disagreed – but that, while an Italian, an Englishman, a Spaniard, a Hollander, and a Frenchman attempted to describe it, each one spoke of it as that of a foreigner.

 
From this “peculiarity” in the testimony, Dupin supports the conclusion, arrived at deductively through the exercise of the third act of the mind, that the murderer was an orangutan.

This technique is not limited to Poe alone.  Consider the notable example found in Some Buried Caesar by Rex Stout.  The reader is given all the necessary facts about a bull accused of goring a man to death.  Indeed, the reader is just about spoon-fed them just after the murder, in a manner as plain as they are presented in the final explanation.  Nonetheless, the solution remains surprising for many of us precisely because we don’t provide the correct predication.

While it is likely that a host of authors use this technique, whether knowingly or instinctively, it is curious that the explicit connection between what makes a satisfying solution to a mystery and the omission of the second act of the mind to ‘hide’ the solution in plain sight seems never to have been made.  And that is an omission we hope to correct.

Irish Humor

As discussed in other posts, humor is a delicate thing even for rational (at least sometimes) human beings, let alone for a machine intelligence to grasp. Case in point: a little joke that sneakily crept up on me while I was in a pub on a beach holiday.  It was the kind of joke that at first you likely don’t get, and then the light bulb goes on after a few seconds, and then the subtlety of the humor hits you, much like that ‘ah-ha’ moment when you finally got the Pythagorean theorem or riding a bike or some other experiential thing.


The name of the place is Shenanigans.

Given that it is an Irish Pub, it’s hardly a surprise that the dining area would stimulate its public with various pieces of trivia and witticisms. Some of these witticisms extended beyond the usual humorous ones one would expect in a conventional bar setting and bordered on wisdom.  One in particular opened vistas of thought about language, meaning, ambiguity, and humor.  The witticism in question, which really caught my attention, asked the following question:

Why does a dish towel get wet when it dries?

The obvious answer, of course, depends on an equivocation in the verb to dry.  In English, the action of drying can be interpreted as two distinct and opposite sequences of events. In the first, we have the passion – that is to say, the passive event – of an object as it dries out.  In this case, the object loses any water or other such liquid that it has been carrying or holding.  An excellent example of this is the ground drying after the sun has come out following a rainfall.  The object is simply a bystander as it is subjected to the drying action performed by an external entity.  In the second, we have the action – that is to say, the active performance – of the entity that performs the drying action.  In the wet ground example mentioned earlier, the sun would perform the drying.  In older language that, sadly, has fallen out of favor, the sun is the drier and the ground the ‘driee’.

Getting back to the witticism, the humor obviously turns on the fact that the verb ‘dry’ is ambiguous in the present tense.  Consider the two sentences:

The towel dries in the sun.

and

The towel dries the dish.

The first three words of each are identical with the same subject and verb.  The only clue that the first is the passive sense and the second the active one is the inclusion of the prepositional phrase ‘in the sun’.

At first glance, one might think that indicator is sufficient: a bit of parsing here and a bit of coding with a lookup table there and one has taught a machine to tell the difference.  Then one nips out to the pub and, while drinking a pint of Guinness, realizes that the following sentence can be said with proper meaning and with equal probability of being either passive or active.

The towel dries fast.

What’s a machine to do, you may ask?  Well… if it is to pass the Turing test it might randomly make an assumption and move forward or ask for clarification.  If the machine were being really ‘clever’ it might do both.