
To Winograd or not to Winograd

Since its inception, a common theme that has appeared frequently in this column is the nuances and ambiguities of natural language.  There are several reasons for this focus, but the two most important are that handling linguistic gray areas is a real test for machine intelligence and that watching computer systems struggle with natural language processing gives us a much better appreciation of how remarkable the human capacity to speak really is.

Past columns have focused mostly on equivocation in various forms, with an emphasis on humor (Irish Humor, Humorous Hedging, Yogi Berra Logic, and Nuances of Language) and context-specific inference (Teaching a Machine to Ghoti and Aristotle on Whiskey).  But the ‘kissing-cousin’ field of the Winograd Schema remained untouched because it had remained unknown.  A debt of gratitude is thus owed to Tom Scott for his engaging video The Sentences Computers Can’t Understand, But Humans Can, which introduced me to this line of research into natural language processing by machine.

When Hector Levesque proposed the Winograd Schema Challenge (WSC) in 2011, he designed it to address perceived deficiencies in the Turing test by presenting a challenge requiring ‘real intelligence’ rather than the application of trickery and brute force.  The accusation of trickery came particularly into view after the Eugene Goostman chatbot, a system that portrayed itself as a 13-year-old boy from Odesa, Ukraine, fooled roughly 30% of the human judges in a large Turing Test competition in 2014.  To achieve this, the Wikipedia article maintains that the bot used ‘personality quirks and humor to misdirect’, which basically means that the judges were conned by the creators.  The idea of a confidence man pulling the wool over someone’s eyes probably never occurred to Alan Turing nor to the vast majority of computer scientists, but anyone who’s seen a phishing attempt is all too familiar with that style of chicanery.

The essence of the WSC is to ask a computer intelligence to resolve a linguistic ambiguity that requires more than just a grammatical understanding of how the language works (syntax) and the meaning of the individual words and phrases (semantics).  Sadly the WSC doesn’t focus on equivocation (alas the example below will not be anything more than incidentally humorous) but rather on what linguists call an anaphora, the formal term for an expression whose meaning must be inferred from an earlier part of the sentence, paragraph, and so on.

A typical example of an anaphora involves the use of a pronoun in a sentence such as

John arrived at work on time and entered the building promptly but nobody claimed to have seen him enter.

Here the pronoun ‘him’ is the anaphora and is understood to have John as its antecedent. 

The fun of the WSC is in creating sentences in which the antecedent can only be understood contextually because the construction is ambiguous.  One of the typical examples used in explaining the challenge reads something like

The councilmen refused to grant the protestors a permit to gather in the park because they feared violence.

The question, when posed to a machine, would be to ask if the ‘they’ referred to the councilmen or the protestors.  Almost all people would find no ambiguity in that sentence because they would argue that the protestors would be the ones making the ruckus and that the councilmen, either genuinely worried about their constituents or cynically worried about their reputations (or a mix of both), would be the ones to fear what might happen.

Note that the sentence easily adapts itself to other interpretations with only the change of one word.  Consider the new sentence

The councilmen refused to grant the protestors a permit to gather in the park because they threatened violence.

Once again, a vast majority of people would now say that the ‘they’ referred to the protestors because the councilmen would not be in the position to threaten violence (although things may be changing on this front).

The idea here is that the machine would need not just the ability to analyze syntax with a parser and the ability to look up words with a dictionary; it would also need to reason, and that reasoning would have to be broad rather than narrowly focused.  The relationships between concepts would be varied and far-ranging, as in the sentence

The trophy couldn’t fit into the suitcase because it was too large.

Here the ontology would center on spatial reasoning, the ideas of ‘big’ and ‘little’, and the notion that suitcases usually contain other objects. 

These types of ambiguous sentences seem to be part and parcel of day-to-day interactions.  For example, the following comes from Book 6 of The Lord of the Rings

The big orc, spear in hand, leapt after him. But the tracker, springing behind a stone, put an arrow in his eye as he ran up, and he fell with a crash. The other ran off across the valley and disappeared.

This scene, which takes place after Frodo and Sam have managed to escape the tower of Cirith Ungol, is between a large fighter orc and a smaller tracker.  Simple rules of syntax might lead the machine to believe that the ‘he’ in the second sentence would have ‘the tracker’ as its antecedent.  I doubt any human reader was fooled.

The complexity of the WSC is not limited to only two objects.  Consider the following example taken from the Commonsense Reasoning ~ Pronoun Disambiguation Problems database:

I asked Dave to get me my sweater from the other side of the yacht. While he was gone, I rested my arm on the rail over there and suddenly it gave way.

The machine is to pick from the multiple choices (a) sweater (b) yacht (c) arm (d) rail.  Again, I doubt that any person of sufficient age would be confused by this example, and so I wonder why.  We are literally surrounded by ambiguous expressions every day.  Casual speech thrives on these corner-cutting measures.  Even our formal writings are not immune; there are numerous examples of these types of anaphoras in all sorts of literature, with the epistles of St. Paul ranking up there for complexity and frequency.  Humans manage quite nicely to deal with them – a tribute to the innate intelligence of the species as a whole.

But knowing humans to be intelligent and observing them deal with these types of ambiguity does not mean the converse is true.  Being able to pass the WSC does not mean the agent is necessarily smart.  The argument for this conclusion comes, disappointingly, from the fact that the challenge was overcome by various algorithms within roughly 7 years of its proposal.  Sampling the associated papers, the reader will soon find that much of the magic comes from a different flavor of statistical association, indicating that the real intellect resides in the algorithm designer.  This is a point raised in The Defeat of the Winograd Schema Challenge by Vid Kocijan, Ernest Davis, Thomas Lukasiewicz, Gary Marcus, and Leora Morgenstern.  To quote from section 4 of their paper:

The Winograd Schema Challenge as originally formulated has largely been overcome. However, this accomplishment may in part reflect flaws in its formulation and execution. Indeed, Elazar et al. (2021) argue that the success of existing models at solving WSC may be largely artifactual. They write:

We provide three explanations for the perceived progress on the WS task: (1) lax evaluation criteria, (2) artifacts in the datasets that remain despite efforts to remove them, and (3) knowledge and reasoning leakage from large training data.

In their experiments, they determined that, when the form of the task, the training regime, the training set, and the evaluation measure were modified to correct for these, the performance of existing language models dropped significantly.

At the end of the day, I think it is still safe to say that researchers are clever in finding ways to mimic intelligence in small slices of experience but that nothing still approaches the versatility and adaptability of the human mind.

Visions of Clustering

In a column from a few months ago (Fooling a Neural Network), I discussed the fact that human beings, regardless of mental acuity, background, or education, see the world in a fundamentally different way than neural networks do.  Somehow, we are able to perceive the entire two-dimensional structure of an image (ignoring for the time being the stereoscopic nature of our vision that leads to depth perception) mostly independent of the size, orientation, coloring, and lighting.  This month’s column adds another example of the remarkable capabilities of human vision when compared to a computer ‘vision’ algorithm in the form of an old algorithm with a new classification.

The Hoshen-Kopelman algorithm was introduced by Joseph Hoshen and Raoul Kopelman in their 1976 paper Percolation and Cluster Distribution. I. Cluster Multiple Labeling Technique and Critical Concentration Algorithm.  The original intent of the algorithm was to identify and label linked clusters of cells on a computational grid as a way of finding spanning clusters in the percolation problem.  A spanning cluster is a set of contiguous cells, linked through orthogonal nearest neighbors, that reaches from one boundary to the opposite side.  The following image shows a 7×7 computational grid where the blue squares represent occupied sites and the white squares vacant ones.  Within this grid is a spanning cluster with 24 linked cells that reaches from the bottom of the grid to the top and which also spans the left side of the grid to the right.

The original intention of percolation models was to study physical systems such as fluid flow through irregular porous materials like rock or the electrical conduction of a random arrangement of metals in an insulating matrix.  But clearly the subject has broader applicability to various problems where the connection topology of a system of things is of interest, such as a computer network or board games.  In fact, the configuration shown above would be a desirable one for the charming two-player game The Legend of Landlock.

Now the human eye, beholding the configuration above, has no problem finding, with a few moments of study, that there are 4 distinct clusters: the large 24-cell one just discussed plus two 3-cell, ‘L’-shaped ones in the upper- and lower-left corners and a 2-cell straight one along the top row.  This is a simple task to accomplish for almost anyone older than a toddler.

Perhaps surprisingly, ‘teaching a machine’ (yes this old algorithm is now classified – at least according to Wikipedia – as a machine learning algorithm) to identify clusters like these is a relatively complicated undertaking requiring rigorous analysis and tedious bookkeeping.  The reasons for this speak volumes for how human visual perception differs sharply from machine vision.  However, once the machine is taught to find clusters and label them, it tirelessly and rapidly tackles problems humans would find boring and painful.

To understand how the algorithm works and to more clearly see how it differs essentially from how humans perform the same task, we follow the presentation of Harvey Gould and Jan Tobochnik from their book An Introduction to Computer Simulation Methods: Applications to Physical Systems – Part 2.  The grid shown above is taken from their example in Section 12.3, Cluster Labeling (with some minor modifications to suit more modern approaches using Python and Jupyter).

Assuming that the grid is held in a common 2-dimensional data structure, such as a numpy array, we can decorate the lattice with row and column labels that facilitate finding a grid cell.
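For concreteness, a minimal setup might look like the following sketch, which builds a random 7×7 occupancy grid as a boolean numpy array; the occupation probability and seed are arbitrary illustrative choices, not the configuration shown above.

```python
import numpy as np

rng = np.random.default_rng(seed=42)     # arbitrary seed for reproducibility

# A 7x7 occupancy grid: True (blue) marks an occupied site, False (white) a vacant one.
p_occupied = 0.6                         # illustrative occupation probability
grid = rng.random((7, 7)) < p_occupied

# Matching array of cluster labels; 0 means 'no label assigned yet'.
labels = np.zeros(grid.shape, dtype=int)
print(grid.astype(int))
```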

We start our cluster labeling at 1 and use a cluster-label variable to keep track of the next available label once the current one is assigned.  The algorithm starts in the cell in the upper left corner and works its way, row by row, across and down, looking at the cells immediately up and to the left to see if the current one is linked to cells in the previous row or column.  This local scanning may seem simple enough, but generically there will be linkages later on that coalesce two distinct clusters into a new, larger, single cluster, and this possibility is where the subtlety arises.  This will become much clearer as we work through the example.

Since the cell at grid (0,0) is occupied, it is given the label ‘1’ and the cluster label variable is incremented to ‘2’.  The algorithm then moves to cell (0,1).  Looking left, it finds that the current cell is occupied and has a neighbor to the left, and so it too is labeled with cluster label ‘1’.  The scan progresses across and down the grid until we get to cell (1,5).

Note that at this point the human eye clearly sees that the cells at (1,4) and (1,5) are linked neighbors and that the assignment of the cluster label ‘4’ to cell (1,4) was superfluous, since that cell really is a member of the cluster with label ‘3’, but the computer isn’t able to see globally.

The best we can do with the machine is to locally detect when this happens and keep a ‘second set of books’ in the form of an array that lets the algorithm know that cluster label ‘4’ is really a ‘3’.  Gould and Tobochnik call the cluster label ‘4’ improper and the label ‘3’ proper, but raw and proper are better terms.

When the algorithm has finished scanning the entire grid, the raw cluster labels stand as shown in the following image.

Note that the largest cluster label assigned is ‘9’ even though there are only 4 clusters.  The association between the raw labels and the proper ones is given by:

{1: 1, 2: 2, 3: 3, 4: 3, 5: 3, 6: 3, 7: 7, 8: 3, 9: 3}

Any raw labels that are identical with their proper labels are distinct and proper with no additional work.  Those raw labels that are larger in value than their proper labels are part of the cluster with the smaller label.  The final step for this example is cosmetic and involves relabeling the raw labels ‘4’, ‘5’, ‘6’, ‘8’, and ‘9’ as ‘3’ and then renumbering the ‘7’ label as ‘4’.  All of this is done with some appropriate lookup tables.
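Pulling these pieces together, a bare-bones sketch of the scan plus the lookup-table cleanup might look like the following; it is an illustrative outline, not the implementation in the Kaggle notebook linked at the end of this section.

```python
import numpy as np

def hoshen_kopelman(grid):
    """Label orthogonally connected clusters of occupied (True) cells."""
    labels = np.zeros(grid.shape, dtype=int)
    proper = {}                 # raw label -> proper label (the 'second set of books')
    next_label = 1

    def resolve(lab):
        # Follow the alias chain until the label is its own proper label.
        while proper[lab] != lab:
            lab = proper[lab]
        return lab

    rows, cols = grid.shape
    for i in range(rows):
        for j in range(cols):
            if not grid[i, j]:
                continue
            up = resolve(labels[i - 1, j]) if i > 0 and labels[i - 1, j] else 0
            left = resolve(labels[i, j - 1]) if j > 0 and labels[i, j - 1] else 0
            if not up and not left:
                labels[i, j] = next_label            # start a brand-new cluster
                proper[next_label] = next_label
                next_label += 1
            elif up and left and up != left:
                keep, merge = min(up, left), max(up, left)
                proper[merge] = keep                 # two clusters coalesce
                labels[i, j] = keep
            else:
                labels[i, j] = up or left            # extend the existing cluster
    # Cosmetic pass: map each raw label to a consecutive proper label.
    final = {}
    for i in range(rows):
        for j in range(cols):
            if labels[i, j]:
                root = resolve(labels[i, j])
                labels[i, j] = final.setdefault(root, len(final) + 1)
    return labels
```

Calling `hoshen_kopelman(grid)` on the array from the earlier sketch returns a labeled grid whose distinct nonzero values count the clusters.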

One thing to note.  Because ‘8’ linked to ‘3’ in row 5, ‘8’ had 3 as a proper label before ‘9’ linked to ‘8’ and thus, when that linkage was discovered, the algorithm code immediately assigned a proper label of ‘3’ to ‘9’.  Sometimes the linkages can be multiple layers deep.

Consider the following example.

By the time the algorithm has reached cell (5,8) it has assigned cluster label ‘10’ to cell (5,6) but has associated with it a proper label of ‘9’ based on its subsequent discovery in cell (5,7).  Only when the machine visits cell (5,8) does it discover that cluster label ‘9’ is no longer proper and should now be associated with the proper label ‘6’.  Of course, the algorithm ‘knows’ this indirectly in that it has the raw label ‘10’ associated with the label ‘9’, which is now associated with the proper label ‘6’.  Thus the association between raw and proper labels is now a two-deep tree given by (the key entries being 10: 9 and 9: 6):

{1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 1, 9: 6, 10: 9, 11: 11, 12: 12}

Again, a human being simply doesn’t need to go through these machinations (at least not consciously) further emphasizing the gap between artificial intelligence and the real one we all possess.

(Code for my implementation of the Hoshen-Kopelman algorithm in python can be found in the Kaggle notebook here.)

Is True Of – Part 2: Linguistic Analog

This month’s post is part 2 of 2 of the exploration of the YouTube video entitled Russell’s Paradox – a simple explanation of a profound problem by Jeffery Kaplan.  What we discussed last month was the summary of Russell’s paradox in which we found that the set of all ordinary sets defined as

\[ {\mathcal Q} = \{ x | x \mathrm{\;is\;a\;set\;that\;does\;not\;contain\;itself} \} \; \]

creates a paradox.  If we assume ${\mathcal Q}$ does not contain itself (i.e., it is ordinary) then the membership comprehension ‘is a set that does not contain itself’ instructs us that ${\mathcal Q}$ in fact does contain itself.  Alternatively, if we assume that ${\mathcal Q}$ does contain itself (i.e., it is extraordinary) then membership comprehension instructs us that it doesn’t. 

This type of self-referential paradox mirrors other well-known paradoxes that arise in linguistics such as the liar’s paradox or the concept of the self-refuting idea.  What makes Kaplan’s analysis interesting (whether or not it originates with him) is the very strong formal analogy that he draws between the common act of predication that we all engage in nearly continuously, and the more abstruse structure of Russell’s paradox that few among us know or care about.

The heart of Kaplan’s analogy is the explicit mapping of the ‘contains’ idea from set theory – that sets contain members or elements, some of which are sets, possibly including themselves – onto the ‘is true of’ idea of predication.

To understand what Kaplan means by the ‘is true of’, we will again follow the structure of his examples with some minor verbal modifications to better suit my own taste.

Predication is the act of saying something about the subject of a sentence by giving a condition or attribute belonging to the subject.  In the following sentence

Frodo is a brave hobbit.

The subject of the sentence is “Frodo” and the predicate is “is a brave hobbit”.  The predicate “is a brave hobbit” is true of Frodo, as anyone who’s read Lord of the Rings can attest.  Kaplan then points out that the first basic rule of naïve set theory, which he states as

Rule #1 of sets: there is a set for any imaginable collection of a thing or of things

has, as its formal analogy in predication, the following:

Rule #1 of predication: there is a predicate for any imaginable characteristic of a thing.

The two key rules of set theory that lead to Russell’s paradox have their analogs in predication as well.   

Rule #10 of sets, which allows us to have sets of sets, is mirrored by Rule #10 of predication that tells us we can predicate things about predicates.  As an example of this, consider the following sentence:

“Is a Nazgul” is a terrifying thing to hear said of someone.

The predicate “Is a Nazgul” is the subject of that sentence and “is a terrifying thing to hear said of someone” is the predicate.

Rule #11 of sets, which allows sets to contain themselves (i.e., self-reference), finds its analog in Rule #11 of predication that tells us that predicates can be true of themselves.

Here we must proceed a bit more carefully.  Let’s start with a simple counterexample:

“Is a hobbit” is a hobbit.

This sentence is clearly false as the subject, the predicate “Is a hobbit”, is clearly not a hobbit itself, it is a predicate.  But now consider the following sentence, which Kaplan offers:

“Is a predicate” is a predicate

This sentence is clearly true as the subject, the predicate “Is a predicate”, is clearly a predicate.  And, so, Rule #11 of predication works.

Kaplan then constructs a table similar to the following (again only minor verbal tweaks are done for the predicates that are not true of themselves to suit my own taste)

Predicates not true of themselves          Predicates true of themselves
“is a brave hobbit”                        “is a predicate”
“is a Nazgul”                              “is a string of words”
“keeps his oaths”                          “typically comes at the end of a sentence”

Note that the predicate “is true of itself” is a predicate that is true of all the predicates that are true of themselves, that is to say, of all the predicates that can be placed in the right column of the table above. The next step is then to ask what is the predicate of all the predicates that can be placed in the left column of the above table.  A little reflection should satisfy oneself that the predicate “is not true of itself” fits the bill. 

The final step is to ask in which of the two columns “is not true of itself” falls, or, in other words,

is “is not true of itself” true of itself?

If we assume that it is true of itself then the content found between the quotes tells us that it is not true of itself.  Equally vexing, if we assume that it is not true of itself, that assumption matching the content found between the quotes tells us that it is true of itself.  In summary: if it is then it isn’t and if it isn’t then it is.  And we’ve generated the predicate analogy to Russell’s paradox.

Of course, this is just a form of the well-known Liar’s Paradox, so we might be willing to shrug it off as a quirk of language, but I think Kaplan is making a deeper point that is worth considering carefully.  At the root of his analysis is the realization that there are objective rules (or truths), that these rules generate self-referential paradoxes, and, so, one is forced to recognize that paradoxes are an essential ingredient not just of language but of thought itself.  And no amount of patching, such as was done to naïve set theory, can rescue us from this situation.  This observation, in turn, has the profound philosophical implication that there is only so far that logic can take us.

Is True Of – Part 1: Russell Redux

This month’s post and the next one are based on the YouTube video entitled Russell’s Paradox – a simple explanation of a profound problem by Jeffery Kaplan.  In that video, Kaplan links the well-known mathematical paradox found by Bertrand Russell with familiar linguistic paradoxes in what a friend characterized as a “stimulating alternative perspective”.

Russell’s paradox shook the foundations of mathematics by casting doubt on whether it were possible to logically establish mathematics as an objective discipline with rules that reflect something beyond human convention or construction.  And while, in the aftermath, a variety of patches were proposed that sidestep the issue by eliminating certain constructions, Kaplan’s essential point is that the same mental processes that lead to Russell’s paradox lead to logical paradoxes in natural language.  These natural language paradoxes, in turn, reflect something deeper in how we think and, as a result, we can’t sidestep these processes in everyday life the way they are currently and narrowly sidestepped in mathematics.  Paradoxes seem to be built into the fabric of human existence.

To be clear, his essential point is not novel, and the connections that exist in human thought between formal logic, mathematical logic, and linguistics have been covered within a variety of contexts, including the liar’s paradox.  What is intriguing about Kaplan’s analysis is the method he employs to make this point using the basic function of predication within natural language.  I don’t know if his argument is truly of his own making or if it originates elsewhere, but it is a clever and particularly clear way of seeing that logic can only carry one so far in the world.

Following Kaplan, we’ll start with a review of Russell’s paradox.  The paradox arises in naive set theory as a function of three basic ideas.  First is the notion of a set as a collection of any objects of our perception or of our thought.  Second is the idea that set composition is unlimited in scope: concrete or intangible, real or imagined, localized or spread over time and space.  Kaplan calls this idea Rule Number 1 – Unrestricted Composition.  Third is the idea that what matters for a set is what elements it contains, not how those elements are labeled.  Set membership can be determined by a specific listing or by specifying some comprehension rule that describes the members globally.  Kaplan calls this idea Rule Number 2 – Set Identity is Determined by Membership.

Using Rules 1 and 2, Kaplan then outlines a set of rules 3-11, each springing in some way from Rules 1 or 2 as a parent, which is indicated inside the parentheses:

  • Rule 3 – Order Doesn’t Matter (Rule 2)
  • Rule 4 – Repeats Don’t Change Anything (Rule 2)
  • Rule 5 – Description Doesn’t Matter (Rule 2)
  • Rule 6 – The Union of Any Sets is a Set (Rule 1)
  • Rule 7 – Any Subset is a Set (Rule 1)
  • Rule 8 – A Set Can Have Just One Member (Rule 1)
  • Rule 9 – A Set Can Have No Members (Rules 1 & 2)
  • Rule 10 – You Can Have Sets of Sets (Rule 1)
  • Rule 11 – Sets Can Contain Themselves (Rule 1)

Kaplan walks the viewer, in an amusing way, through increasingly complex set construction examples using these 11 rules, although I’ll modify the elements used in the examples to be more to my liking.  His construction starts with an example of a finite, listed set:

\[ A = \{ \mathrm{Frodo}, \mathrm{Sam}, \mathrm{Merry}, \mathrm{Pippin} \} \;  .\]

 Employing Rule 2 allows us to rewrite set $A$ in an equivalent way as

\[ A = \{ x | x \mathrm{\;is\;a\;hobbit\;in\;the\;Fellowship\;of\;the\;Ring} \} \; .\]

Kaplan then gives the famous example of a set thought by Gottlob Frege and Bertrand Russell to be a candidate for the fundamental definition of what the number 1 is:

\[ \{ x | x \mathrm{\;is\;a\;singleton\;set} \} \; .\]

Here we have a set, that if listed explicitly, might start as

\[ \{ \{\mathrm{\scriptsize Frodo}\}, \{\mathrm{\scriptsize Sam}\}, \{\mathrm{\scriptsize Merry}\}, \{\mathrm{\scriptsize Pippin}\}, \{\mathrm{\scriptsize Chrysler\;Building}\} \ldots \} \; .\]

All of this seems plausible, if not particularly motivated, but the wheels fall off when we look at Rule 11.  That rule is deceptively simple in that it is easy to say the words ‘sets can contain themselves’ but ultimately difficult, perhaps impossible, to understand what those words mean, as there exists no constructive way to actually build a set that contains itself.  Membership is specified by a comprehension rule expressed as a sentence; any sentence will do, and quoting Kaplan: “If you can think of it, you can throw it in a set.”  We’ll return to that sentiment next month when we talk about Kaplan’s language-based analog to Russell’s paradox.  We’ll call such a Rule-11 set an extraordinary set and any set not containing itself ordinary.  The next step is then to define the set of all extraordinary sets as

\[ \{ x | x \mathrm{\;is\;a\;set\;that\;contains\;itself} \} \; .\]

This set, while having no constructive way of being listed, is still considered a valid set by naive set theory.  As preposterous as this set’s existence may be it generates no paradoxes.  The real heartbreaker is the set defined as

\[ \{ x | x \mathrm{\;is\;a\;set\;that\;does\;not\;contain\;itself} \} \; .\]

The claim that this set of all ordinary sets contains itself has no well-defined truth value.  If we assume the set does contain itself then it must obey the comprehension rule ‘does not contain itself’, so it doesn’t contain itself.  Alternatively, if we assume it does not contain itself then it satisfies that very comprehension rule and so must be a member of itself.  And, thus, the paradox is born.
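Stated compactly in symbols, with $x \notin y$ meaning that $x$ is not a member of $y$ and ${\mathcal Q}$ naming this set of all ordinary sets, the whole argument compresses to

\[ {\mathcal Q} = \{ x | x \notin x \} \quad \Longrightarrow \quad \left( {\mathcal Q} \in {\mathcal Q} \iff {\mathcal Q} \notin {\mathcal Q} \right) \; . \]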

Mathematicians have patched set theory by ‘changing the rules’ (as Kaplan puts it).  They developed different systems wherein Rule 11 or some equivalent is expressly forbidden (e.g., Zermelo-Fraenkel with the Axiom of Choice). 

But, Kaplan objects, the rules of set theory are not made up; they are objective rules that reflect the common rules of thought and language that we all use.  He gives a linguistic argument based on the act of predication in natural language to make this point.  That analysis is the subject of next month’s posting.

Neural Networks 3 – Fooling a Neural Network

Having looked at some of the positive aspects of the neural net, its ability to classify images and its relative simplicity of implementation, it is now time to look at some of the vulnerabilities and shortcomings of the neural net – and why fears of a future filled with Skynet or the machines from the Matrix are still many centuries off.  Along the way, we may also gain some insight into how neural networks are, contrary to some claims, not at all how people and their brains work.

An excellent place to start our journey is with the succinct observation made by Dave Gershgorn in his article Fooling the machine for Popular Science in which he said

[a]n accelerating field of research suggests that most of the artificial intelligence we’ve created so far has learned enough to give a correct answer, but without truly understanding the information. And that means it’s easy to deceive.

Since the article dates from 2016, Gershgorn could only present the then state-of-the-art examples of adversarial attacks on neural nets.  He draws heavily from the work of Christian Szegedy and others with two of the most relevant papers being Explaining and Harnessing Adversarial Examples (Goodfellow, Shlens, and Szegedy) and Intriguing properties of neural networks (Szegedy et al).

Both papers deal with the attempts to trick a trained neural net through the use of what looks to be speckle noise but which is really a small but deterministic perturbation to the image feature vector in a direction in the image space that turns the net’s classification from being waveringly correct to dogmatically wrong.

Perhaps the most famous example of this is the panda-to-gibbon example shown below (taken from the first paper)

To understand what is happening and how the addition of a ‘speckle noise’ image could persuade the net that the image it was 58% sure was a panda should now be interpreted, with nearly 100% certainty, as a gibbon, we need to remember just how the neural net does its classifications.  Each of the above images, which we perceive as 2-dimensional (a bit more on this later), is actually serialized into a one-dimensional feature vector.  In the case above, judging by the perturbation (i.e., a count of the colored pixels along an edge), each image is about 240×240 pixels.  The relationship displayed in the figure above should really be interpreted as the linear algebra relationship

\[ {\vec P} + {\vec N} = {\vec G} \; , \]

where ${\vec P}$ is the displacement vector from the origin to the location of the original image in the panda-classification region, ${\vec G}$ is the location of the perturbed image, which is now somewhere in the gibbon-classification region, and ${\vec N} = 0.007 \nabla J$ is the displacement vector that connects the two.     

Since each of these vectors is in a $240^2 = 57,600$ dimensional space, our intuitions about concepts such as ‘near’ and ‘far’ apply poorly, and the gibbon-classification region may be distinct and far away in some directions and quite close in others.  The structure of ${\vec N}$ (the middle image) was determined by experimenting with the net in order to find the “fastest” direction that nudges ${\vec P}$ out of the panda-classification region and into the gibbon-classification region.  The term ‘nudge’ is quite descriptive because the gain on the image was small (0.007) and its presence in the final image is not discernible by the human eye.
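The recipe Goodfellow, Shlens, and Szegedy describe, the fast gradient sign method, amounts to adding a small multiple of the sign of the loss gradient to the pixels.  Below is a minimal numpy sketch of that single step; the callable `loss_gradient` is a hypothetical stand-in for whatever machinery computes $\nabla J$ with respect to the input image.

```python
import numpy as np

def fgsm_perturb(image, loss_gradient, eps=0.007):
    """One fast-gradient-sign step: nudge the image in the direction that most
    rapidly increases the classification loss J."""
    grad = loss_gradient(image)            # hypothetical callable returning dJ/d(pixels), same shape as image
    adversarial = image + eps * np.sign(grad)
    return np.clip(adversarial, 0.0, 1.0)  # keep pixel values in a valid range
```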

Of course, a normal person would still classify the perturbed image as a panda even if the gain on the perturbation were large enough for the human eye to perceive since we have other faculties at play.  But the neural net is, put frankly, stupid.  Gershgorn quotes one of Szegedy’s coauthors, Ian Goodfellow, as saying

these neural networks aren’t really learning certain ideas, but just how to recognize when they [find] the right idea

which is a subtle but important point.  No neural network can know that it is correct or understand why its answer is the right one; it simply responds to the metaphorical ‘atta-boys’ it has received, returning whatever the majority of us have told it is right.

When these results were published, the danger of this type of adversarial attack was downplayed because it depends on the attacker being able to play with the neural net to find the fastest direction via gradient descent (the name of this technique explains why Goodfellow, Shlens, and Szegedy denoted the perturbation as $\nabla J$).  If the net is public, that is to say one can give multiple inputs to the net in order to see how each is classified, then it is always possible to find the gradient direction that fools the network into misclassifying, even if the probabilities that the network assigns to the classification are hidden.  The article Attacking machine learning with adversarial examples by the staff at OpenAI discusses in great detail how this type of gradient attack works and how people try, and often fail, to defend against it.

The conventional thinking at the time was that if the network’s classification model were completely hidden from public scrutiny then a network could be trustworthy and therefore be reliably depended upon.  But even this conclusion was shown, shortly thereafter, to be on shakier ground than first believed.  In an article entitled Slight Street Sign Modifications Can Completely Fool Machine Learning Algorithms, Evan Ackerman points out that image classifying neural networks can be easily and completely fooled by the interplays of light and shadow on real-world traffic signs into making what would be disastrous conclusions should the network be behind the wheel of a car. 

Citing the paper, Robust Physical-World Attacks on Deep Learning Visual Classification, by Eykholt et al, Ackerman offers this image of 5 frames of a stop sign at different distances and orientations with respect to the camera

The stop signs aren’t real, but rather printed posters that have a specially chosen dappling of lighter and darker regions.  These posters fooled the neural network into believing that in each frame the camera was actually looking at a Speed Limit 45 sign.  The key point here is that these authors didn’t inject the network with a feature vector tuned to the original input image.  Rather, they decorated an object with a perturbation that carried out a successful attack regardless of how much of the feature vector that object consumed.  This is a possible real-world scenario and a frightening one.

This last point brings us back to an observation alluded to earlier.  Clearly, people of all mental capacities, backgrounds, and circumstances perceive the visual world quite differently than do neural networks.  We don’t serialize an image nor are we only able to receive images of a particular size; we can have our eyes wide open or half-closed changing the ‘number of pixels’ we are receiving and not just what percentage any given object consumes.   We also can change our certainty in our classification and segmentation based on self-awareness and belief and not just on ‘atta-boys’ received from the world around us.  So, sleep well knowing that the days of Skynet are far off and that you remain safe – unless you happen to find yourself in a self-driving car.

Neural Networks 1 – Overview

This month’s column begins a new effort that is inspired by and strongly follows the video Building a neural network FROM SCRATCH (no Tensorflow/Pytorch, just numpy & math) by Samson Zhang.  Like Zhang, I agree that the best way to learn any algorithm is to build one from scratch, even if, and especially if, you are going to use someone else’s code.  Usually, one can’t understand the behavior nor appreciate the effort that went into an external package without having ‘walked a mile in another developer’s shoes’.  You also don’t know where the algorithm is likely to work or fail if you don’t have some insight into the inner workings.

Zhang provides all of his code and data in his Kaggle notebook, and that code base formed the starting point for this exploration.  However, the data used were from the Kaggle repository by Dariel Dato-On at MNIST in CSV | Kaggle.  In both cases the data are derived from the MNIST digitized and labeled samples of hand-written numerical digits 0-9, some of which look like:

Each image is 28×28 pixels with values ranging from 0-255.  For convenience, each image is stored in CSV format as a row consisting of 785 values.  The first column contains the label for the image (i.e., ‘0’, ‘5’, ’5’, or ‘7’ for the cases above) and the remaining 784 columns the image data.  The entire data set consists of 60,000 such rows.  For this first experiment, 59,000 were used as training data and 1,000 as test data to evaluate the success of the net in converging into some useful state.

The basic idea of a neural network is the nonlinear transformation of input arrays to output arrays that provides a high probability of classifying the input data into the correct ‘bucket’.  In this example, the input arrays are 784-length, normalized, column vectors with real values ranging $[0,1]$ and the outputs are probabilities of belonging in one of the ten possible output buckets of $[0,1,2,3,4,5,6,7,8,9]$.
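As a concrete illustration, that preparation step might be sketched as follows, assuming the CSV layout just described and a hypothetical file name; this is not Zhang's notebook code.

```python
import numpy as np

# Hypothetical path; skiprows=1 assumes a header row (drop it if the file has none).
data = np.loadtxt("mnist_train.csv", delimiter=",", skiprows=1)

labels = data[:, 0].astype(int)        # first column: digit label 0-9
images = data[:, 1:] / 255.0           # remaining 784 columns, normalized to [0, 1]

# Hold back 1,000 rows for testing; train on the remaining 59,000.
# Images are stacked as columns so that each feature vector is a 784-length column.
X_test,  Y_test  = images[:1000].T, labels[:1000]
X_train, Y_train = images[1000:].T, labels[1000:]
```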

Zhang’s network consists of 3-layers: an input layer consisting of 784 slots to accommodate the 784 pixels in each image; a hidden layer consisting of 10 slots, each receiving input from each slot in the previous layer (i.e., input layer), the precise mechanics of which will be discussed below; and finally an output layer consisting of 10 slots, each, again, receiving input from each slot in the previous layer (i.e., the hidden layer).  Properly normalized, the values in the output layer can be interpreted as probabilities with the maximum value corresponding to the most likely label. 

To be concrete, below is a schematic of how the final trained network decided that a ‘0’ input image should be classified as a ‘0’:

The image of the hand-written ‘0’ was converted to a one-dimensional array with 784 components, four of which are shown being fed into the input layer.  Pre-determined weights and biases, $W$ and $b$ at both layers (of different sizes, of course) create an affine transformation of the results of the previous layer

\[ Z^{(1)} = W^{(1)} \cdot X^{(0)} + b^{(1)} \; , \]

where the ‘dot product’ between the $W$ and the $X$ stands for the appropriate matrix product (i.e., $W_{ij} X_{j}$).  This transformation is then applied to each image in the training set (Zhang used 39,000 images; the one detailed here used 59,000 – with both cases holding back 1,000 images for testing).  Typically, for data sets this small, the images are stacked in a single array and fed in all at once since the average cost over the whole set is easily computed with little storage overhead.  Thus, there is actually a suppressed index in Zhang’s code.  This architecture will be revisited in a later blog for data sets too large to hold completely in memory, but for now Zhang’s approach is followed as closely as possible.

The input, $X^{(1)}$, for the next layer results from passing the $Z^{(1)}$ outputs through a nonlinear transformation.  The nonlinearity represents the ‘activation’ of each neuron, and the current best practice involves using the ReLU function defined as

\[ ReLU(x) = \left\{ \begin{array}{lc} x & x > 0 \\ 0 & {\mathrm otherwise} \end{array} \right. \; , \]

meaning that

\[ X^{(1)} = ReLU(Z^{(1)}) \; . \]

The dimensionality of each column of $Z^{(1)}$ is now 10, although different numbers of neurons in the hidden layer are also possible.  This is another point deferred to a later blog.

The transformation from the hidden layer to the output layer works in much the same way with

\[ Z^{(2)} = W^{(2)} \cdot X^{(1)} + b^{(2)} \;  , \]

but the nonlinear transformation to the output layer is, in this case, the SoftMax transformation that normalizes each of the nodes on the output layer against the whole set using

\[ X^{(2)} = \frac{\exp(Z^{(2)})}{\sum \exp(Z^{(2)})} \; , \]

resulting in a functional form reminiscent, up to the sign of the exponent, of the partition function from statistical mechanics.
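Assembled into code, the forward pass just described might look like the following minimal numpy sketch; the shapes in the comments assume the 784-10-10 architecture above, and the max-subtraction in the softmax is a standard numerical-stability trick rather than part of the formula.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))   # subtract the column max for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def forward(X0, W1, b1, W2, b2):
    """One forward pass; X0 is 784 x m with each column a serialized image."""
    Z1 = W1 @ X0 + b1        # W1: 10 x 784, b1: 10 x 1  (affine map to the hidden layer)
    X1 = relu(Z1)            # hidden-layer activations
    Z2 = W2 @ X1 + b2        # W2: 10 x 10,  b2: 10 x 1  (affine map to the output layer)
    X2 = softmax(Z2)         # each column is a probability distribution over the digits 0-9
    return Z1, X1, Z2, X2
```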

Of course, the classification ability of the neural network will be poor for a set of arbitrary weights and biases.  The key element in training the net is to find a combination of these values that maximizes the performance of the net across the statistically available input – in other words, an optimization problem.

In the lingo of neural nets, this optimization problem is solved via the back-propagation process, which basically depends on the differentiability of the transformations and the implicit (or inverse) function theorems to determine the proper inputs given the desired outputs.  The starting point is the formation of the ‘residuals’ $R$ (a term used locally here) defined as

\[ R = X^{(2)} - Y \; , \]

where $Y$ is the true or correct value for the labeled image.  For the ‘0’ example above, $Y$ will look like the 10-element array

\[ Y = \left[ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 \right] \; .\]

If the label had been ‘4’, then the 1 would have been put in the corresponding spot for four, which is the fifth array element in from the left.

The equations Zhang cites for this back propagation are:

\[ dZ^{[2]} = A^{[2]} - Y \; , \]

\[ dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T} \; ,\]

\[ dB^{[2]} = \frac{1}{m} \Sigma {dZ^{[2]}} \; , \]

\[dZ^{[1]} = W^{[2]T} dZ^{[2]} .* g^{[1]\prime} (z^{[1]}) \; ,\]

\[ dW^{[1]} = \frac{1}{m} dZ^{[1]} A^{[0]T} \; ,\]

and

\[ dB^{[1]} = \frac{1}{m} \Sigma {dZ^{[1]}} \; .\]

These equations clearly show the backwards chaining of derivatives via the chain rule, but they are quite difficult to decode as he mixes notation (for example, his $A$’s are also $X$’s) and intermixes some of the broadcasting rules he uses when employing numpy (hence the factor of $m$, which is the number of rows in the training data set, and the presence of the transposed $T$ notation) with those representing the connecting edges of the net.  Nonetheless, his code for the net works reasonably well, with an achieved accuracy of about 85% to 89% on the test set, and it forms an excellent starting point for building a neural network from the ground up.
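For what it is worth, here is one plausible numpy reading of those six equations, under the assumptions that the $A$’s are the $X$’s of the forward pass, that $g^{[1]\prime}$ is the derivative of ReLU, and that $m$ is the number of training images; it is a sketch of my interpretation, not Zhang’s code.

```python
def backward(X0, Z1, X1, X2, W2, Y_onehot, m):
    """Gradients of the loss with respect to the weights and biases,
    following the six equations above with the A's read as X's."""
    dZ2 = X2 - Y_onehot                            # residuals at the output layer
    dW2 = dZ2 @ X1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)                  # ReLU derivative: 1 where Z1 > 0, else 0
    dW1 = dZ1 @ X0.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

# A plain gradient-descent update with a hypothetical learning rate alpha would then read
#   W1 -= alpha * dW1;  b1 -= alpha * db1;  W2 -= alpha * dW2;  b2 -= alpha * db2
```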

Next month’s blog will focus on coding the nuts and bolts of a similar network from scratch, along with a proper derivation of the back propagation.

Multivariate Gaussian – Isserlis’ theorem

Last month’s post established how to compute arbitrary multivariate moments of a Gaussian distribution by coupling the desired distribution with a ‘source’ term

\[ I = \int d^n r \, e^{-r^T A r/2 + b^T r} = e^{b^T A^{-1} b/2} \, \frac{(2 \pi)^{n/2}}{\sqrt{\det{A}}} \; , \]

 for an arbitrary positive definite matrix $A$ and a multivariate quadratic form $r^T A r$.  The coupling term $b$ allows for the computation of the moments by taking derivatives with respect to $b$ and then setting $b=0$, afterwards.  For example, from the last post, the multivariate moment

 \[ E[x^4y^2] = \left. \frac{\partial^6}{\partial b_x^4 \partial b_y^2 } e^{b^T B b/2} \right|_{b=0} = 3 B_{xx}^2 B_{yy} + 12 B_{xx} B_{xy}^2 \; , \]

where $B = A^{-1}$.

Of course, these derivatives are easy to evaluate with modern computer algebra systems but calculating them by hand is prohibitively expensive in many cases and a variety of techniques were developed for short-cutting much of the labor that are worth knowing in their own right.  One such technique is Isserlis’ Theorem, which is a way of organizing the algebra so that the computation depends only on determining $A^{-1}$ and performing some combinatorics.  It is that last activity that will be the focus of this blog post.

We will be considering a general multivariate moment of the form

\[ E[x^{n_1} y^{n_2}\cdots a^{n_4} b^{n_5} \cdots] \; .\]

The first thing to note is that the total power of the moment $n = n_1 + n_2 + \cdots$ must be even in order to have a non-zero result; otherwise every term in the result will retain a residual factor of $b$ that vanishes when the requirement $b=0$ is enforced.  As a result we can express the total power of the moment as $n = 2m$.  Second, the only surviving term in the expansion of the exponential will be the $m^{\textrm{th}}$ term, for similar reasons.  The resulting moments are always taken pairwise because of the quadratic form of the exponential, as seen above.  The only unknown is the numerical coefficient in front of the various terms.

Since we are interested in the unique pairings, it is useful to give each variable in the statistical moment a unique label in addition to its name.  With this approach, $x^4 y^2 = x_1 x_2 x_3 x_4 y_1 y_2$.  A total power $n$ corresponds to $n$ such unique terms and $n!$ possible permutations of a list of them.  Since $n$ is even, there are $m = n/2$ pairs of consecutive terms in each list to form the $m$ individual pairs.  Since the order of the pairs doesn’t matter, we divide by $m!$ to get the unique combinations of pairs and then again by a factor of $2$ for each pair to account for the fact that a pair is symmetric as to ordering; $x_1 x_2$ is equivalent to $x_2 x_1$.  Adding all these pieces up gives the number of terms as

\[ N = \frac{(2m)!}{m! 2^m} \; .\]

Regrouping the denominator as to associate with each term in $m!$ a value of $2$ gives

\[ N = \frac{2m (2m-1) (2m-2) \cdots }{ 2m (2m – 2) (2m – 4) \cdots} = (2m – 1) (2m – 3) \cdots \equiv (2m-1)!! \; . \]

For our test case of $E[x^4 y^2]$, $n=6$, $m = 3$ and $(2m – 1)!! = 15$.  Since $15 = 3 + 12$, that checks.

To get the specific breakdown (i.e., $15 = 3 + 12$ instead or, say, $15 = 11 + 4$) requires a bit more combinatoric work that is best explained in terms of diagrams.  For convenience, we will use angle brackets instead of $E[]$ to represent expectation since the result will be easier to read, even if each term is not as expressive.

In the first diagram

we group the $y$’s together and then form an appropriate tree of the possible combinations of $x$’s.  Note that care is taken to keep the numeric labels always increasing from left to right, eliminating double counting.  Once the enumeration is complete, we can remove the numeric labels and count the number of occurrences of $\langle x^2 \rangle$ with $\langle y^2 \rangle$.  The answer $3 \langle x^2 \rangle^2 \langle y^2 \rangle$ matches the algebraic factor $3 B_{xx}^2 B_{yy}$.

In the second diagram

we group $y_1$ with all of the possible $x$’s (four sub-diagrams) and then proceed as above.  The answer $ 12 \langle x y \rangle ^2 \langle x^2 \rangle$ matches the algebraic factor $12 B_{xx} B_{xy}^2$.  It might be tempting to think that there should be an equivalent set of four sub-diagrams with $y_2$ at the top of each tree but a bit of reflection should convince us that the presence of $y_2$ in every subterm serves to arrive at the same thing without double counting.  In other words, a simple reshuffling of the 12 terms can be made to have $y_2$ as the head of each sub-diagram instead of $y_1$.

As a final example, we consider a 3-dimensional case with the aim of calculating $E[x^2 y z^3]$.  This expression spawns 3 diagrams:

and

and

which tells us that

\[ E[x^2 y z^3] = 3 B_{xx} B_{yz} B_{zz} + 6 B_{xy} B_{xz} B_{zz} + 6 B_{xz}^2 B_{yz} \; , \]

a result confirmed by the sympy computer algebra system, which gave $E[x^2yz^3]: 3*Bxx*Byz*Bzz + 6*Bxy*Bxz*Bzz + 6*Bxz**2*Byz$.
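The diagrammatic bookkeeping can also be automated by brute force.  The following sketch, with function names of my own choosing, enumerates every pairing of the labeled variables and tallies how many pairings land on each product of covariances; it is purely illustrative.

```python
from collections import Counter

def pairings(items):
    """Recursively enumerate every way of splitting a list into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail

def isserlis_terms(variables):
    """Tally how many pairings produce each product of covariances B_ab."""
    counts = Counter()
    for p in pairings(list(variables)):
        term = tuple(sorted('B' + ''.join(sorted(pair)) for pair in p))
        counts[term] += 1
    return counts

print(isserlis_terms('xxxxyy'))   # 3 x (Bxx, Bxx, Byy) and 12 x (Bxx, Bxy, Bxy)
print(isserlis_terms('xxyzzz'))   # 3 x (Bxx, Byz, Bzz), 6 x (Bxy, Bxz, Bzz), 6 x (Bxz, Bxz, Byz)
```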

Multivariate Gaussian – Part 2

This month’s blog continues the exploration begun in the last installment, which covered how to evaluate and normalize multivariate Gaussian distributions.  The focus here will be on evaluating moments of these distributions.  As a reminder, the prototype example is from Section 1.2 of Brezin’s text Introduction to Statistical Field Theory, where he posed the following toy problem for the reader:

\[ E[x^4 y^2] = \frac{1}{N} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^4 y^2 e^{-(x^2 + xy + 2 y^2)} \; , \]

where the normalization constant $N$, the value of the integral with only the exponential term kept, is found, as described in the last post, by re-expressing the exponent of the integrand as a quadratic form $r^T A r/2$ (where $r$ presents the independent variables as a column array) and then expressing the value of the integral in terms of the determinant of the matrix $A$: $N = (2 \pi)^{(n/2)}/\sqrt{\det{A}}$.  Note: for reasons that will become apparent below, the scaling of the quadratic form was changed with the introduction of a one-half in the exponent.  Under this scaling $A \rightarrow A/2$ and $\det{A} \rightarrow \det{A}/2^n$, where $n$ is the dimensionality of the space.

To evaluate the moments, we first extend the exponent by ‘coupling’ it to an $n$-dimensional ‘source’ $b$ via the introduction of the term $b^T r$ in the exponent.  The integrand becomes:

\[ e^{-r^T A r/2 + b^T r} . \]

The strategy for manipulating this term into one we can deal with is akin to usual ‘complete the square’ approach in one dimension but generalized.  We start by defining a new variable

\[ s = r \, - A^{-1} b \; .\]

Regardless of what $b$ is (real positive, real negative, complex, etc.) the original limits on the $r_i \in (-\infty,\infty)$ carry over to $s$.  Likewise, the measures of the two variables are related by $d^ns =d^nr$.  In terms of the integral, this definition simply moves the origin of the $s$ relative to the $r$ by an amount $A^{-1}b$, which results in no change since the range is infinite.

Let’s manipulate the first term in the exponent step by step.  Substituting $r = s + A^{-1} b$ into the rightmost factor of $r^T A r$ and expanding gives

\[ r^T A (s + A^{-1} b) = r^T A s + r^T b \; .\]

Since $A$ is symmetric, its inverse is too, $\left( A^{-1} \right)^T = A^{-1}$, and this observation allows us to now expand the leftmost factor, giving

\[ (s^T + b^T A^{-1} ) A s + r^T b = s^T A s + b^T s + r^T b \; . \]

Next note that $b^T r = b_i r_i = r^T b$ (summation implied) and then substitute these pieces into the original expression to get

\[ -r^T A r/2 + b^T r = -( s^T A s + b^T s + r^T b)/2 + b^T r \\ = -s^T A s/2 + b^T ( r - s )/2  \; . \]

Finally, using the definition of $s$ the last term can be simplified to eliminate both $r$ and $s$, leaving

\[ -r^T A r/2 + b^T r = -s^T A s/2 + b^T A^{-1} b /2 \; .\]

The original integral $I = \int d^n r \exp{(-r^T A r/2 + b^T r)}$ now becomes

\[ I = e^{b^T A^{-1} b/2} \int d^n s \, e^{-s^T A s/2} = e^{b^T A^{-1} b/2} \, \frac{(2 \pi)^{n/2}}{\sqrt{\det{A}}} \; . \]

This result provides us with the mechanism to evaluate any multivariate moment of the distribution by differentiating with respect to $b$ and then setting $b=0$ afterwards. 

For instance, if $r^T = [x,y]$, then the moment of $x$ with respect to the original distribution of the Brezin problem is 

\[ E[x] = \frac{1}{N} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x e^{-(x^2 + xy + 2 y^2)} \; , \]

where $N = 2\pi/\sqrt{\det{A}}$.

Our expectation is, by symmetry, that this integral is zero.  We can confirm that by noting that first that the expectation value is given by

\[ E[x;b] = \frac{\partial}{\partial b_x } \frac{1}{N} \int dx \, dy \, e^{-r^T A r/2 + b_x x + b_y y} \\ = \frac{1}{N} \int dx \, dy \, x \, e^{-r^T A r/2 + b_x x + b_y y} \; ,\]

where the notation $E[x;b]$ is adopted to remind us that we have yet to set $b=0$.

Substituting the result from above gives

 \[ E[x] = \left. \left( \frac{\partial}{\partial b_x } e^{b^T A^{-1} b/2} \right) \right|_{b=0} \\ = \left. \frac{\partial}{\partial b_x} (b^T A^{-1} b)/2 \right|_{b=0} \; , \]

where, in the last step, we used $\left. e^{b^T A^{-1} b/2} \right|_{b=0} = 1$.

At this point it is easier to switch to index notation when evaluating the derivative to get

\[ E[x] = \left. \frac{\partial}{\partial b_x} (b_i (A^{-1})_{ij} b_j/2) \right|_{b=0} = \left. (A^{-1})_{xj} b_j \right|_{b=0} = 0 \; ,\]

where the rationale for the scaling by $1/2$ now becomes obvious: the factor of two generated by differentiating the symmetric quadratic form is exactly cancelled.

At this point, we are ready to tackle the original problem.  For now, we will use a symbolic algebra system to calculate the required derivatives, deferring until the next post some of the techniques needed to simplify and generalize these steps.  In addition, for notational simplicity, we assign $B = A^{-1}$. 

Brezin’s problem requires that

\[ A = \left[ \begin{array}{cc} 2 & 1 \\ 1 & 4 \end{array} \right] \; \]

and that

\[ B = A^{-1} = \frac{1}{7} \left[  \begin{array}{cc} 4 & -1 \\ -1 & 2 \end{array} \right] \; .\]

The required moment is related to the derivatives of $b^T B b$ as

\[ E[x^4y^2] = \left. \frac{\partial^6}{\partial b_x^4 \partial b_y^2 } e^{b^T B b/2} \right|_{b=0} = 3 B_{xx}^2 B_{yy} + 12 B_{xx} B_{xy}^2 \; . \]

Substituting from above yields

\[ E[x^4 y^2] = 3 \left( \frac{4}{7} \right)^2 \frac{2}{7} + 12 \frac{4}{7} \left( \frac{-1}{7} \right)^2 \; , \]

which evaluates to

\[ E[x^4 y^2] = \frac{96}{343} + \frac{48}{343} = \frac{144}{343} \; . \]
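For the record, a minimal sympy sketch of this derivative computation (an illustrative script, not a prescribed routine) is:

```python
import sympy as sp

bx, by = sp.symbols('b_x b_y')
A = sp.Matrix([[2, 1], [1, 4]])
B = A.inv()                                   # [[4/7, -1/7], [-1/7, 2/7]]
b = sp.Matrix([bx, by])

generating = sp.exp((b.T * B * b)[0, 0] / 2)  # the generating exponential e^{b^T B b / 2}
moment = sp.diff(generating, bx, 4, by, 2).subs({bx: 0, by: 0})
print(sp.simplify(moment))                    # 144/343
```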

 A brute force Monte Carlo evaluation gives the estimate $E[x^4 y^2] = 0.4208196$ compared with the exact decimal representation of $0.4198251$, a $0.24\%$ error, which is in excellent agreement.

Multivariate Gaussian – Part 1

In this month’s blog, we explore some of the statistical results from multivariate Gaussian distributions.  Typically, many of these results were developed in support of particular problems in science and engineering, but they stand on their own in that the results flow from the logic of statistical thinking that can be cleanly separated from the application that birthed them. 

The Monte Carlo method, which was the subject in a previous run of blogs (Monte Carlo Integration, Part 1 to Part 5), will be used here as an additional check on the techniques developed, but there will be no emphasis on the mechanics of these techniques.

For a typical, motivating example, consider the expectation value presented in Section 1.2 of Brezin’s text Introduction to Statistical Field Theory, which asks the reader to compute 

\[ E[x^4 y^2] = \frac{1}{N} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^4 y^2 e^{-(x^2 + xy + 2 y^2)} \; , \]

where the normalization constant $N$ is given by

\[ N = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + xy + 2 y^2)} \; . \]

Examples like these arise in any discipline where the model describing a system depends on several (perhaps a huge number of) parameters that manifest themselves as inter-correlated variables that we can measure.  The cross-correlation between the variables $x$ and $y$ is displayed by the $xy$ term in the exponential.

The first task in determining the expectation value is to evaluate the normalization constant.  The general method for solving this task will be the focus of this installment. 

The method for integrating Gaussian integrals in one dimension is well known (see, e.g., here) but the approach for multivariate distributions is not as common.  The first point is to look at the quadratic form in the argument of the exponential as being built as the inner product between an array presenting the variables $x$ and $y$ and some matrix $A$.  In the present case this is done by recognizing

\[ x^2 + xy + 2 y^2 = \left[ \begin{array}{cc} x & y \end{array} \right] \left[ \begin{array}{cc} 1 & 1/2 \\ 1/2 & 2 \end{array} \right] \left[ \begin{array}{c} x \\ y \end{array} \right]  \; . \]

Looking ahead to higher dimensional cases, we can see that this expression can be written more generally and compactly as $r^T A r \equiv r_i A_{ij} r_j$, where $r$ is a column array presenting the independent variable of the distribution (now imagined to be $n$ in number) and $A$ is a real $n \times n$ matrix; the superscript $T$ indicates that the transpose of the original array is being used. 

With this new notation, any $n$-dimensional Gaussian distribution can now be written as

\[ e^{-(r^T A r)}  \; . \]

Since this expression is symmetric in $r$, $A$ can be taken to be symmetric without loss of generality.  This observation guarantees that $A$ can be diagonalized by an orthogonal matrix $M$ such that $M^T A M = D$, $M^T M = M M^T = 1$, and $D$ is a diagonal matrix.  This, in turn, means that the quadratic form can be rewritten as

\[ r^T A r = r^T (M M^T) A (M M^T) r = \\ (r^T M) (M^T A M) (M^T r) = (r^T M) D (M^T r) \; . \]

The next step is to define a new set of variables $s = M^T r$, which then decouples the multivariate distribution into a product of $n$ independent one-dimensional Gaussians:

\[ e^{ – (r^T A r) } = e^{ – (s^T D s) } = \prod_{i=1}^{n} e^{-d_i s_i^2} \; , \]

where the $\{d_i\}$ are the individual, on-diagonal elements of $D$ and the $\{s_i\}$ are the components of $s$.  For the integral to exist (i.e., converge), we need one additional stipulation: that all of the $d_i > 0$.  Said differently, we require that $A$ have positive eigenvalues (i.e., be positive definite).

The final step is to recognize that the measure of the integral changes in a convenient way when we make the substitution $r = M s$, since

\[ d^n r = det(M^T) d^n s = d^n s  \; , \]

where we’ve exploited the fact that, since $M$ is orthogonal, its determinant is $\pm 1$ and the Jacobian factor is unity.

Putting all of these pieces together we now have

\[ \int_{R^n} d^n r \, e^{-(r^T A r)} = \int_{R^n} d^n s \prod_{i=1}^n e^{-d_i s_i^2} = \prod_{i=1}^{n} \int_{-\infty}^\infty ds_i e^{-d_i s_i^2} \; .\]

Since the right-hand side is a product of independent Gaussian integrals, it simplifies to

\[\prod_{i=1}^{n} \int_{-\infty}^\infty ds_i e^{-d_i s_i^2} = \prod_{i=1}^{n} \sqrt{\frac{\pi}{d_i}} = \frac{\pi^{n/2}}{\sqrt{d_1 \cdots d_n}} \; . \]

The product under the square root in the denominator of the last expression is the determinant of $D$, which is also the determinant of $A$.  So, the final formula is

\[ \int_{R^n} d^n r e^{-(r^T A r)} = \frac{\pi^{n/2}}{\sqrt{det(A)}} \; .\]

Returning to the original example, the formula tells us that, since $det(A) = 7/4$, the normalization constant is $N = 2\pi/\sqrt{7} \approx 2.37482$.  We can confirm this using brute force Monte Carlo by uniformly sampling across the ranges $x \in [-6,6]$ and $y \in [-12,12]$.  The limits of these ranges were selected by going out roughly to 6 times the respective eigenvalue ($d_x = 0.79$, $d_y = 2.21$), but the estimated integral is insensitive to fine details as long as the computational domain is sufficiently large.  That said, there is an art to picking the computational domain, and sophisticated pieces of software are designed to adjust to the integrand.  Implementing the brute force Monte Carlo in python gave an estimate of $N = 2.37661$ for 10 million trials, an error of less than $0.1\%$.
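A minimal sketch of that brute-force check might read as follows; the seed is arbitrary and the trial count is reduced from the 10 million used for the quoted estimate.

```python
import numpy as np

rng = np.random.default_rng(1)        # arbitrary seed
n_trials = 1_000_000                  # the quoted estimate used 10 million

# Uniform sampling box as described above: x in [-6, 6], y in [-12, 12].
x = rng.uniform(-6.0, 6.0, n_trials)
y = rng.uniform(-12.0, 12.0, n_trials)
box_area = 12.0 * 24.0

integrand = np.exp(-(x**2 + x * y + 2 * y**2))
N_estimate = box_area * integrand.mean()

print(N_estimate, 2 * np.pi / np.sqrt(7))   # estimate vs the exact 2*pi/sqrt(7) = 2.37482...
```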

As a last sample, suppose we provide a bigger matrix given by:

\[ A = \left[ \begin{array}{cccc} 3 & 1 & 5 & 3\\ 1 & 5 & 2 & 0\\ 5 &  2 & 10 & 3 \\ 3 & 0 & 3 & 14 \end{array} \right] \; .\]

The Gaussian integral is

\[ N = \int_{R^n} d^n r \,  e^{-(3w^2 + 2wx + 10wy + 6wz + 5x^2 + 4xy + 10y^2 + 6yz + 14z^2)} \; .\]

The eigenvalues of $A$ are: $\{17.63,0.26,9.85,4.26\}$.  Application of the formula gives $N = 0.70497$ compared with a brute force Monte Carlo estimate of $0.71061$.
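A quick numpy cross-check of these numbers (purely illustrative):

```python
import numpy as np

A = np.array([[3, 1, 5, 3],
              [1, 5, 2, 0],
              [5, 2, 10, 3],
              [3, 0, 3, 14]], dtype=float)

print(np.linalg.eigvalsh(A))                  # approx [0.26, 4.26, 9.85, 17.63]
print(np.pi**2 / np.sqrt(np.linalg.det(A)))   # det(A) = 196, so N = pi^2/14 = 0.70497...
```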

With this operation well in hand, next month’s post will focus on calculating various moments of these distributions.