Latest Posts

A Symbolic Experiment – Part 4: Tree Structures

Up to this point, this series of blog posts on computer algebra within SymPy has only scratched the surface, looking at some of the basic problems that can arise and the primitive or atomic methods that can be used to manipulate certain pieces of an expression.  Hopefully, along the way, a new appreciation has also been gained for the talents and faculties that even ordinary people have in recognizing patterns within the symbols of algebra.

Starting in this blog, we begin to dig much more deeply into the SymPy tree-structure and the syntactical rules that can be used to manipulate it.  These manipulations will be more structural and not mathematically semantic.  As an example of the distinction, we will be looking at ways to modify and rearrange the tree and its contents without ever asking if it is mathematically legitimate to do so.

There are three basic areas to develop a facility with:

  1. Being able to detect if any two tree structures are the same
  2. Being able to detect that a specified pattern exists within a larger tree
  3. Being able to rewrite some portion of the tree.

When used together, we should be able to direct or guide the computer to rewrite expressions of arbitrary size into other expressions based on some prescribed specifications.  This type of direction or guidance should not be confused with automatic term rewriting such as encountered in a rule-based system.  In this latter case, deep questions about whether such automatic rewriting converges to a normal form (defined as the final, unchanging result) from repeated application of the rules arise and are quite difficult to settle.  Our aims are much more modest; we simply want a set of tools that can allow us to dictate what should be done in the rewriting without having to perform the individual steps by ourselves. 

In this blog post, we’ll explore the first basic step.  We’ll defer the other two steps to future blogs.

The first step in understanding the tree structures that underlie an expression is to recognize that mathematically equivalent expressions need not result in the same tree structures.  This is a point quite distinct from the ‘special cases’ that arise when trying to work with exponentials or square roots.

The second step in understanding the tree structures is to also recognize that SymPy will sometimes mask the first rule by carrying out certain simplifications before constructing the final tree representation.

To illustrate these two points, let’s look at six sets of mathematically equivalent expressions: 1) Pure Addition, 2) Pure Multiplication, 3) Square of a Pure Expression, 4) Mixed Addition, 5) Mixed Multiplication, and 6) Square of a Mixed Expression.

Pure Addition

The first set of mathematically equivalent expressions deal with how addition of multiple terms are processed and represented:

\[ a + b + c + d = d + b + c + a = b + d + a + c = \ldots \; . \]

No matter which of the 24 possible permutations is entered, SymPy automatically sorts the inputs and produces the same tree structure.  For instance, $\text{expr1} = a + b + c + d$ gives the tree

and $\text{expr2} = d + b + c + a$ gives the same tree

In addition, we can ask whether the expressions associated with each are recognized by SymPy as being the same.  SymPy provides several ways to do this, with the most convenient being the ‘==’ operator.  Thus

\[ \text{expr1} == \text{expr2} \; \]

yields True.
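As a quick check, the whole exchange can be sketched in a few lines (the variable and expression names here are my own, not from any earlier listing):

```python
import sympy as sym

a, b, c, d = sym.symbols('a b c d')

expr1 = a + b + c + d
expr2 = d + b + c + a

# SymPy sorts the arguments when the Add node is constructed,
# so both permutations yield the identical internal tree
same_tree = sym.srepr(expr1) == sym.srepr(expr2)

# '==' compares the trees structurally
structurally_equal = (expr1 == expr2)
```

Here srepr prints the full internal representation of the tree, so identical strings mean identical trees.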

Pure Multiplication

In a similar way we can see that $\text{expr1} = a \cdot b \cdot c \cdot d$ and $\text{expr2} = d \cdot c \cdot a \cdot b$ yield the same tree

and that $\text{expr1} == \text{expr2}$ yields True.

Square of a Pure Expression

Our next example involves composing two separate functions.  The expressions $\text{expr1} = (x+a)^2$ and $\text{expr2} = (a+x)^2$ both give the same tree,

and $\text{expr1} == \text{expr2}$ yields True.  Note that the presence of the Add results in another tier in the expression tree.
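The same check in code (names again my own):

```python
import sympy as sym

a, x = sym.symbols('a x')

expr1 = (x + a)**2
expr2 = (a + x)**2

# The inner Add is sorted on construction, so the two Pow trees
# match and '==' reports structural equality
equal = (expr1 == expr2)
```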

In all of these cases, SymPy produces the same expression tree because it sorts the inputs when the expressions are defined and because the expressions are relatively simple.  Thus it is easily able to determine that these mathematically equivalent expressions are also structurally equivalent using ‘==’.

In the next set of mathematically equivalent expressions, we’ll find that SymPy’s ‘==’ operator will return True in some cases and False in others.  This wrinkle arises because the expression trees get more complicated, all because of the inclusion of a ‘-1’.

Mixed Addition

The first example involves changing the sign of one of the Symbols used in the Pure Addition case:

\[ a + b + c - d = -d + b + c + a = b - d + a + c = \ldots \; . \]

Of the 24 possibilities, let’s let $\text{expr1} = a + b + c - d$ and $\text{expr2} = -d + b + c + a$.  Evaluating each of these gives the same tree

While SymPy does sort the symbols, note that the ‘d’ now sits at a lower level, with the SymPy function Mul occupying the level that ‘d’ had in the previous version.  The presence of a minus sign automatically creates a mixed tree that has more functions than just Add.

Nonetheless, $\text{expr1} == \text{expr2}$ yields True since the two expressions are both mathematically and structurally equivalent.
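The extra tier can be seen directly; in this sketch (variable names mine), the ‘-d’ term is stored internally as a Mul:

```python
import sympy as sym

a, b, c, d = sym.symbols('a b c d')

expr1 = a + b + c - d
expr2 = -d + b + c + a

# '-d' is represented as Mul(-1, d), so a Mul node now occupies
# the level the bare Symbol 'd' held in the pure-addition case
minus_d_is_mul = isinstance(-d, sym.Mul)

# both inputs still sort to the same tree
structurally_equal = (expr1 == expr2)
```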

Mixed Multiplication

In this next example, we let $\text{expr1} = -a \cdot b \cdot c \cdot d$ and $\text{expr2} = b \cdot a \cdot c \cdot (-d)$.  SymPy once again sorts the inputs and produces the same expression tree for each one:

Note that here, because $-a = -1 \cdot a$, the resulting tree differs in only a single term: the addition of the NegativeOne symbol.  And, as expected, ‘==’ yields True when operating on both expressions.

Square of a Mixed Expression

The final set of mathematically equivalent expressions is

\[ (x-a)^2 = (a-x)^2 \; \; .\]

Regardless of the class of objects that $x$ and $a$ are (i.e., whole numbers, integers, real numbers, complex numbers, matrices, etc.) this identity holds since one can factor out a $-1$ from either side which squares to the identity:

\[ (a-x)^2 = ((-1)(x-a))^2 = (-1)^2 (x-a)^2 = (x-a)^2 \; .\]

However, the tree structure associated with each term is different.  The expression $\text{expr1} = (a-x)^2$ has the tree structure on the left while $\text{expr2} = (x-a)^2$ has the structure on the right.

Since the tree structures are different, $\text{expr1} == \text{expr2}$ yields False.  One way to demonstrate that they are nonetheless mathematically equivalent is to evaluate $\text{simplify(expr1 - expr2)}$, which returns zero.
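A minimal sketch of both checks (expression names mine):

```python
import sympy as sym

a, x = sym.symbols('a x')

expr1 = (a - x)**2
expr2 = (x - a)**2

# Different trees, so structural comparison fails...
structurally_equal = (expr1 == expr2)

# ...but the difference simplifies to zero, showing mathematical equality
difference = sym.simplify(expr1 - expr2)
```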

Unfortunately, simplify isn’t a silver bullet.  First, since it throws a complex set of primitive simplification techniques at an expression according to a heuristic, it doesn’t let us find small differences in a tree structure (like the one above), so it doesn’t help when we want to make a ‘surgical’ change; SymPy’s simplification is a sledgehammer by comparison.  Second, it isn’t guaranteed to work.  So we need to have ‘==’ in the toolbox.

A Symbolic Experiment – Part 3: Simplifying

This month, we will take a first look at ‘simplifying’ expressions and some of the primary differences between how humans do it by hand and how it is done when humans coax a machine to do it by manipulating a machine representation.

I’ve had occasion, when starting this series of blog posts, to reference Peter Norvig’s Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp, and I’ll do so again:

According to the Mathematics Dictionary (James and James 1949), the word “simplified” is “probably the most indefinite term used seriously in mathematics.” The problem is that “simplified” is relative to what you want to use the expression for next. Which is simpler, $x^2 + 3x + 2$ or $(x+1)(x+2)$? The first makes it easier to integrate or differentiate, the second easier to find roots. We will be content to limit ourselves to “obvious” simplifications. For example, $x$ is almost always preferable to $1x+0$.

Last month’s examination of the complexities associated with traversing the tree to factor a subexpression illustrated some of these difficulties, even with a fairly direct approach.  In fact, those difficulties are the tip of the iceberg for what is a subdiscipline of computer science called term rewriting.

Term rewriting is a mathematically sophisticated and subtle subject, and, for this installment at least, we will confine ourselves to the built-in functions SymPy provides for term rewriting that happens largely without direct involvement from the user and can be said to fall under the heading of simplification.  For the purposes of this discussion, simplification is defined as a method that generally makes the number of terms in an expression go down; it should be held in contrast to expansion, which generally makes the number of terms go up.

As an example, consider the tree expressions for the example Norvig provides.  The tree for $x^2 + 3x + 2$ looks like

with seven nodes.  The corresponding tree for $(x+1)(x+2)$ looks like

with six nodes.  So there is a savings in terms of the tree representation even though more typographical symbols are needed to render $(x + 1)(x + 2)$ than $x^2 + 3x + 2$.  The point of this somewhat whimsical discussion is that simplification is not only a matter of utility (what Norvig describes as ‘relative to what you want to use the expression for next’) but also of aesthetics.
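The structural contrast is visible from the root nodes alone.  A sketch (since the node count one gets depends on how repeated symbols are drawn, I only inspect the roots here):

```python
import sympy as sym

x = sym.symbols('x')

expanded = x**2 + 3*x + 2
factored = (x + 1)*(x + 2)

# Mathematically the same polynomial...
difference = sym.expand(expanded - factored)

# ...but one tree is rooted at Add while the other is rooted at Mul
roots = (expanded.func, factored.func)
```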

In any event, there are at least a dozen specialized simplification methods in SymPy.  The following table lists them (with much of the verbiage taken from the help pages), their nominal purpose, and an example of each.

Method Purpose Example
cancel Takes any rational function and puts it into the standard canonical form, $p/q$, where $p$ and $q$ are expanded polynomials with no common factors. $\text{cancel}\left( \frac{x^2 + 2x + 1}{x^2 + x} \right)$ $= \frac{x+1}{x}.$
collect Collects common powers of a term in an expression; takes the expression and the term as arguments. $\text{collect} \left(x^3 - z x^2 + 2 x^2 + x y + x - 3, \; x \right)$ $= x^3 + x^2(2-z) + x(y+1) - 3$
combsimp Simplifies combinatorial expressions involving factorials and binomial coefficients; the symbols need to be integers (the generalization to real numbers is found under gammasimp below). Used in conjunction with the factorial and binomial expressions $\text{combsimp} \left( \frac{n!}{(n-3)!} \right) = n(n-1)(n-2)$ $\text{combsimp} \left( \frac{\binom{n+1}{k+1}}{\binom{n}{k}} \right) = \frac{n+1}{k+1}$
factor Takes a polynomial and factors it into irreducible factors over the rational numbers $\text{factor}\left(x^2 z + 4 x y z + 4 y^2 z \right)$ $=z(x + 2y)^2$
factor_list Returns a more structured output of the factors $\text{factor\_list}\left(x^2 z + 4 x y z + 4 y^2 z \right)$ $= (1, [(z, 1), (x + 2y, 2)])$
gammasimp Simplifies expressions with gamma functions (using SymPy’s gamma) or combinatorial functions with non-integer argument $\text{gammasimp} \left( \Gamma(x) \Gamma(1 - x) \right)$ $= \frac{\pi}{\sin(\pi x)}$
logcombine Applies the logarithm identities $\log(xy)$ $= \log(x)$ $+ \log(y)$ and $\log(x^n)$ $= n \log(x)$ subject to certain assumptions on $x$ and $y$ (true, for instance, when $x$ and $y$ are positive and generally false when $x$ and/or $y$ are complex). logcombine has a way to force the application to ignore assumptions using the ‘force=True’ optional argument.

$\text{logcombine} \left( \log (x) + \log (y) \right)$ $= \log (x y)$

and

$\text{logcombine}( n \log(x) )$ $= \log \left( x^n \right)$

powdenest Applies the exponential identity $(x^a)^b = x^{ab}$ subject to the assumptions on $b$ (true if $b$ is an integer). powdenest has a way to force the application to ignore assumptions using the ‘force=True’ optional argument. $\text{powdenest} \left( \left(z^a\right)^b \right) = z^{ab}$
powsimp Applies the exponential identities $x^a x^b$ $= x^{a+b}$ (always true) and $x^a y^a$ = $(x y)^a$ (true if $x, y \geq 0$ and $a$ is real) $\text{powsimp} \left( x^a x^b \right)$ $= x^{a+b}$ $\text{powsimp} \left( x^a y^a \right) = (x y) ^ a$
radsimp Rationalizes the denominator by removing square roots. $\text{radsimp} \left( \frac{(2 + 2 \sqrt{2})x + (2 + \sqrt{8})y}{2 + \sqrt{2}} \right)$ $= \sqrt{2} (x + y)$
ratsimp Puts an expression over a common denominator, then cancels and reduces. $\text{ratsimp} \left( \frac{1}{x} + \frac{1}{y} \right) = \frac{x+y}{xy}$
together Takes an expression or a container of expressions and puts it (them) together by denesting and combining rational subexpressions – often looks similar to ratsimp but it seems to be more general. $\text{together} \left( \frac{1}{1+1/x} + \frac{1}{1+1/y} \right)$ $ = \frac{ x (y + 1) + y (x + 1)}{(x + 1)(y + 1)}$
trigsimp Simplifies expressions using trigonometric identities; works with hyperbolic trig functions; uses heuristics to find the “best” one $\text{trigsimp} \left( 2 \sin^2(x) + 2\cos^2(x) \right) = 2$
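Several of the table’s entries can be verified directly.  In this sketch the symbols are declared positive so that the assumption-sensitive rules (logcombine in particular) fire without ‘force=True’:

```python
import sympy as sym

x, a, b = sym.symbols('x a b', positive=True)

# cancel: canonical p/q form
cancelled = sym.cancel((x**2 + 2*x + 1)/(x**2 + x))     # (x + 1)/x

# trigsimp: the Pythagorean identity
trig = sym.trigsimp(2*sym.sin(x)**2 + 2*sym.cos(x)**2)  # 2

# powsimp: combine exponents over a common base
pows = sym.powsimp(x**a * x**b)                         # x**(a + b)

# logcombine: valid here because x and a are positive
logs = sym.logcombine(sym.log(x) + sym.log(a))          # log(a*x)
```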

Each of the functions above is a building block in working through the steps of term rewriting.  They represent common processes for specific steps.  However, they are incapable of performing all the steps needed to simplify a generic expression.  The generic function simplify “tries to apply intelligent heuristics to make the input expression ‘simpler’.”  It is interesting to note, once again, that the machine is far more brittle and unintelligent than an average human.

In subsequent blog posts, we’ll explore the subs, replace, and xreplace methods.  These are the three ‘canonical’ methods within SymPy for rewriting terms and can be used in conjunction with the methods above to cover broader aspects of term rewriting.

A Symbolic Experiment – Part 2: Factoring

Last month we started on a journey to explore and experiment with the computer algebra system SymPy that is freely available in the Python ecosystem.  The aim is to create, within this package, a rule system that implements the basic transformations and identities of the Fourier Transform.  But the goal is very loose and a great deal of emphasis is placed on the journey more so than on the final product.  To this end, there are three focus areas:  1) working out the steps needed to manipulate symbolic expressions, 2) looking at what an intelligent agent would need to do as a way of exploring more artificial intelligence, and 3) discovering how the human does these steps differently and, in the process, gaining some newfound appreciation for the subtleties and brilliance of the human mind.

To start, we look at a classic algebraic manipulation that comes up often in the study of all sorts of disciplines ranging from computer graphics, to gravity and electromagnetism, to geometry and trigonometry – namely the application of the Pythagorean theorem to find the distance or magnitude of a vector by computing the square root of the sum of the squares.

To keep things notationally simple, we’ll consider the very simple expression:

\[ D = \sqrt{ (x-a)^2 + y^2 } \; \]

made up of the symbols $\{a,x,y\}$.

In a variety of settings, society ‘expects’ competent students of algebra to either recognize or, at a minimum, be able to verify that

\[ D’ = \sqrt{ x^2 - 2 a x + a^2 + y^2} \; \]

is ‘equal’ to $D$. 

Of course, the word ‘equal’ is very elastic and, as a result, it isn’t precise enough either for a deep exploration of the human mind or for the shallow, do-as-I-am-told workings of a computer.  Let’s try to nail it down with some better definitions.

First, let’s define the term mathematically equal to contain the meaning that a teacher wants to convey when he says that $D=D’$.  Mathematical equality means that for every choice of values for $\{a,x,y\}$ the numerical result obtained by substitution from $D$ is exactly the same as the numerical result obtained from $D’$ by the same process.
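This definition can be tested mechanically.  The following sketch (the substituted values are chosen arbitrarily by me) plugs one set of numbers into both forms:

```python
import sympy as sym

a, x, y = sym.symbols('a x y')

D      = sym.sqrt((x - a)**2 + y**2)
Dprime = sym.sqrt(x**2 - 2*a*x + a**2 + y**2)

# mathematical equality: substituting the same values into both
# expressions must give the same number
values = {a: 1.5, x: -2.0, y: 3.0}
v1 = float(D.subs(values))
v2 = float(Dprime.subs(values))
```

Of course, one substitution does not prove equality for every choice of values; it only fails to disprove it.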

Now let’s define the term structurally equal to mean that the formal way the symbols are written in the expression is the same even if the identities of the symbols are not.  For example,

\[ D’’ = \sqrt{ (q-q_0)^2 + p^2 } \; \]

is structurally equal to the expression for $D$ since we recognize that the symbol substitutions

\[ x \rightarrow q \; ,\]

\[ a \rightarrow q_0 \; , \]

and

\[ y \rightarrow p \; \]

make $D$ look the same on paper as $D’’$.  Note that two expressions that are structurally equal need not be mathematically equal if the assumptions about the different symbols aren’t the same.  For example, if $x \in (-\infty,\infty)$ but we restrict $q \in [0,\infty)$, then, despite their structural equality, $D$ is not mathematically equal to $D’’$ when $x < 0$.

We will use the term exactly equal to mean that two expressions are both mathematically and structurally equal and have the same symbols.

These three definitions have holes and limitations.  The holes are a by-product of the limitations of human logic, and we won’t try to patch them so much as work around them when the time comes.  Regarding the limitations, we can give a general notion of where they will show up and then revisit them in the future.  The primary limitation is that the notion of equivalency is left out.  To give a flavor of this, consider the two expressions

\[ \frac{d}{dx} \left( x^2 - 3 a x + 9 \right) \; \]

and

\[  2x - 3a \; .\]

These two expressions are neither mathematically equal (one can’t simply substitute in a value for $x$ before taking the derivative) nor structurally equal (the symbol structure isn’t the same).  But they are equivalent in the sense that applying the derivative in the first leads one to the second.  And there is another wrinkle when considering moving from the second expression to the first, in that $2x - 3a$ is equivalent to an infinite number of expressions of the form

\[ \frac{d}{dx} \left( x^2 - 3 a x + \text{constant} \right) \; .\]

Since we will have our hands full just dealing with how to teach an agent to determine if $D$ is structurally or mathematically equal to $D’$, we will defer these deeper matters and look at a simple example from basic physics.

It is typical for a professor, when teaching say electromagnetism, to look at $D’$, simply highlight the first three terms under the radical, and say something to the effect that they form a perfect square which can be ‘reduced’ or ‘simplified’ to the other form.

However, there is no cognitive mind behind a computer (no matter how much training data it may have ingested), and so it can’t fill in the gaps and move (albeit not always effortlessly) between the various ambiguities and elastic meanings the way a human can.

To understand this point better, consider that to represent the expression above requires nesting $x^2 – 2 a x + a^2 + y^2$ under a square root symbol.  That’s four terms ‘owned’ by the square root, which we wish to ‘factor’ into two terms $(x-a)^2 + y^2$.  In addition, each of these terms is complicated as none are ‘atomic’.  A term is atomic if it consists of a symbol and nothing else.

Driving this point home is easier done with a visual. Using the graphviz application and the corresponding Python API, we can visually display how these various expressions are represented internally.  SymPy uses a tree structure that, for the expression $D’$, looks like

Every node in a SymPy tree is either a function or a symbol.  Functions (almost always) own children nodes, reflecting their composite nature.  Symbols are terminal nodes, reflecting their atomic nature.  At the top of the tree is the Pow function (for power) with two main branches:  Add and Half.  Add is the function that owns the four terms that algebraically add together, while Half is a special symbol meaning 1/2.  SymPy reserves a special symbol for this since division by 2 is so common.  Of the four main branches of Add, three are Pow and one is Mul (for multiply).  Like Add, Mul can own an arbitrary number of branches.  In this case there are three, each terminating with the symbols $-2$, $a$, and $x$.

In order to manipulate only some of the contents under the square root, we must be able to find the portion of the tree that corresponds to $x^2 - 2 a x + a^2$, remove it, manipulate it, and then return the new structure to the tree so that it looks like:

Getting the contents of the square root is relatively simple:  we simply ask for the arguments of the expression, and we get a tuple containing the Add branch and the Half symbol.
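In code this looks like the following sketch (the expression name is my own):

```python
import sympy as sym

a, x, y = sym.symbols('a x y')

Dprime = sym.sqrt(x**2 - 2*a*x + a**2 + y**2)

# The root of the tree is Pow; its two arguments are the Add branch
# (the radicand) and the exponent Half
base, exponent = Dprime.args
```

Note that `Dprime.func` reports the root node’s type, here `Pow`.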

The Add branch is now a polynomial expression that we might be tempted to try SymPy’s factor on.  However, factor doesn’t know what to do with the portion involving $y^2$.  If we instead isolate the portion of the expression involving just $x$ and $a$ by subtracting off the $y^2$ piece, factoring, and then adding $y^2$ back, we get a reasonable result.  Both of these approaches are shown in the notebook snippet below:

This behavior is unique neither to this situation nor to SymPy.  Asking Wolfram Alpha to factor $x^2 - 2 a x + a^2$ works fine, but asking it to perform the same operation on $x^2 - 2 a x + a^2 + y^2$ doesn’t give an acceptable answer (although its answer differs from SymPy’s default, it coincides with SymPy being directed to factor over the field of the reals).
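The two behaviors can be reproduced in a short sketch:

```python
import sympy as sym

a, x, y = sym.symbols('a x y')

# factor handles the pure a-x quadratic, returning a perfect square...
quad = sym.factor(x**2 - 2*a*x + a**2)

# ...but leaves the full radicand untouched, since it is irreducible
# over the rationals once y**2 is included
full = sym.factor(x**2 - 2*a*x + a**2 + y**2)
```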

Two final points.  First, there is an algorithmic way of separating the polynomial into an $a$-$x$ part and a remainder that can be run without as much hand-holding, as this snippet shows:

import sympy as sym

a, x, y = sym.symbols('a x y')

poly = x**2 - 2*a*x + a**2 + y**2

# gather the terms that involve a and/or x but not y
ax_part = sum(
    term for term in poly.as_ordered_terms()
    if term.has(a, x) and not term.has(y)
)

rest = poly - ax_part

sym.factor(ax_part) + rest

Second, and far more important:  just for fun, I asked ChatGPT to factor $x^2 - 2 a x + a^2 + y^2$, both on its own and under the square root, and it delivered the ‘professorial’ answer $(x-a)^2 + y^2$ in either case.  It was also able to factor a more difficult SymPy example, $2x^5 + 2x^4y + 4x^3 + 4x^2y + 2x + 2y + a$, into $2(x+y)(x^2 + 1)^2 + a$, even though both Wolfram Alpha and SymPy could not out of the box.  I suspect the reasons for these successes are either that these are well-known examples that reside somewhere within its system or that it knows how to make these systems work better than I do.  The next logical question is then why SymPy and Mathematica are not out of business.  I think the only answer is that these successes are superficial:  real mathematical creativity is still beyond the capabilities of the machine.  But, I suppose, time will tell.

A Symbolic Experiment – Part 1:  First Expression

One of the most active areas of artificial intelligence in its earlier days (say, from the 1960s through the 1990s) was the development of symbolic AI to handle mathematical manipulations.  In his book, Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp, Peter Norvig covers numerous applications that were solved as the ‘hot’ topics of the day; a time before the advent of deep learning came along and convinced most people that LLMs and agentic systems would solve all the world’s problems.  However, these former hot topics are still relevant today, and I thought it might be interesting to explore a bit to see what can be done with ‘old fashioned’ symbolic manipulation.

There are three reasons for embarking on this exploration. 

First, understanding the nuts and bolts of symbolic manipulation makes one a better user of the ‘professional’ computer algebra systems (CASs) on the market today, such as wxMaxima, Maple, Mathematica, and SymPy.  Each of these is polished and abstracted (to varying degrees) and seeks to help the user, typically a mathematician, scientist, or engineer, in making a model and subsequently applying it to some specific problem.  However, each of these systems (especially the for-profit ones) wants the user to be able to solve problems immediately without, necessarily, knowing how the CAS works behind the scenes.  While these systems are quite powerful (again to lesser or greater degrees), experience has shown that most users don’t/can’t take full advantage of a system unless they can program in it.  And to program in it requires a firm understanding of at least the rudiments of what is happening under the hood.  This is a sentiment Norvig emphasizes repeatedly in his book, most notably by providing the following quote:

You think you know when you learn, are more sure when you can write, even more when you can teach, but certain when you can program.

Alan Perlis, Yale University computer scientist

Second, there are always new mathematics being invented with brand new rules.  Being able to manipulate/simplify and then mine an expression for meaning transcends blindly applying a built-in trig rule like $\cos^2 \theta + \sin^2 \theta = 1$, a generic ‘simplify’ method, or some such thing.  It often requires understanding how an expression is formed and what techniques can be used to take an expression apart and reassemble it as something more useful.

Third, and finally, it’s fun – plain and simple – and most likely will be a gateway activity for even more fun to come.

Now that we’ve resolved to explore, the next choice is which part of terra incognita to visit.  The undiscovered country that I’ve chosen is the implementation of a simple set of symbolic rules, via pattern matching, that implement some of the basic properties of the Fourier Transform.

For this first experiment, I chose SymPy as the tool of choice.  It seemed a natural one, since it provides the basic machinery of symbolic manipulation for free in Python, but it doesn’t have a particularly sophisticated set of rules associated with the Fourier Transform.  For example, SymPy version 1.14 knows that the transform is linear, but it doesn’t know how to simplify the Fourier Transform based on the well-known rule:

\[ {\mathcal F}\left[ \frac{\partial f(t)}{\partial t} \right] = i \omega {\mathcal F}\left[ f(t) \right] \; , \]

which is one of the most important relationships for the Fourier Transform as it allows us to convert differential equations into algebraic ones, which is one of the primary motivations for employing it (or any other integral transform for that matter).

Before beginning, let’s set some notation.  Functions in the time domain will be denoted by lower case Latin letters (e.g., $f$, $g$, etc.) and will depend on the variable $t$.  The corresponding Fourier Transform will be denoted by upper case Latin letters (e.g., $F$, $G$, etc.) and will depend on the variable $\omega$.  Arbitrary constants will also be denoted by lower case Latin letters but usually from the beginning of the alphabet (e.g., $a$, $b$, etc.). The pairing between the two will be symbolically written as:

\[ F(\omega) = {\mathcal F}(f(t)) \; ,\]

with an analogous expression for the inverse.  The sign convention will be chosen so that the relationship for the transform of the derivative is as given above.

For the first experiment, I chose the following four rules/properties of the Fourier Transform:

  1. Linearity:  ${\mathcal F}(a \cdot f(t) + b \cdot g(t) ) = a F(\omega) + b G(\omega)$
  2. Differentiation: $\mathcal{F}\left[\frac{d^n f(t)}{dt^n}\right] = (i \omega)^n F(\omega)$
  3. Time Shifting:  $\mathcal{F}\left[f(t - t_0)\right] = e^{-i \omega t_0} F(\omega)$
  4. Frequency Shifting: $\mathcal{F}\left[e^{i \omega_0 t} f(t)\right] = F(\omega - \omega_0)$

We expect freshmen or, at worst, sophomores in college to absorb and apply the rules above, and so we may be inclined to think that they are ‘straightforward’ or ‘boring’ or ‘dull’.  But finding a way to make a computer do algorithmically what we expect teenagers to pick up from a brief lecture and some context is surprisingly involved.

Our hypothetical college student’s capacity to learn, even when not explicitly taught, is an amazing and often overlooked miracle of the human animal.  Unfortunately, teaching any computer to parse ‘glyphs on a page’ in a way that mimics what a person does requires a lot of introspection into just how the symbols are being interpreted.  Mathematical expressions are slightly easier, as they tend to have less ‘analogical elasticity’; that is to say, by their very nature, mathematical terms and symbols tend to be more rigid than everyday speech.  More rigid doesn’t mean completely free of ambiguities, as the reader may reflect on when thinking about what the derivatives in the Euler-Lagrange equations actually mean:

\[ \frac{d}{dt} \left( \frac{\partial L}{\partial {\dot q}} \right) - \frac{\partial L}{\partial q} = 0 \; . \]

Let’s start with a simple mathematical expression that is needed for the Linearity property

\[ a \cdot f(t) +  b \cdot g(t) \; . \]

When composing this expression, a person dips into the standard glyphs of the English alphabet (10 of them, to be precise) and simply writes them down.  If pressed, the person may go on to say “where $a$ and $b$ are constants and $f$ and $g$ are, as yet, undefined functions of the time variable $t$”.  But the glyphs on the page remain all the same.  For a computer, are the glyphs ‘a’ and ‘b’ of the same type as the glyphs ‘f’ and ‘g’?  And where does ‘(t)’ figure in?

SymPy’s answer to this (which matches most of the other CASs) is that ‘a’, ‘b’, and ‘t’ are Symbols and ‘f’ and ‘g’ are undefined Functions.  The bold-face terms are syntactically how SymPy creates these symbolic objects.  These objects can be used as ordinary Python objects, with all the normal Python syntax (e.g., print(), len(), type(), etc.) coming along, because SymPy tries to strictly adhere to the Python data model.  As a result, we can form new expressions by using ‘+’ and ‘*’ without worrying about how SymPy chooses to add and multiply its various objects together.

The following figure shows the output of a Jupyter notebook that implements these simple steps.

As expected, querying type(a) gives back sympy.core.symbol.Symbol and type(f) returns sympy.core.function.UndefinedFunction.  And, as expected, we can form our first expression

my_expr = a*f(t) + b*g(t)

from using ‘+’ and ‘*’ as effortlessly as we wrote the analogous expression above out of basic glyphs, without having to worry about details.  But what to make of type(my_expr) returning sympy.core.add.Add?

The answer to this question gets at the heart of how most (if not all) CASs work.  What we humans take for granted as a simple expression $a \cdot f(t) + b \cdot g(t)$ is, for SymPy (and all the other CASs that I’ve seen), a tree.  The top node in this case is Add, meaning that ‘+’ is the root, and each of the two subexpressions ($a \cdot f(t)$ and $b \cdot g(t)$) is a child node, each of type sympy.core.mul.Mul.  SymPy symbols are atomic nodes that can serve as root or child nodes but never as parent nodes.  And the distinction between the Symbol and Function objects is now much clearer: giving a glyph the designation of Function means it can serve as a parent node, even if all of its behavior is, as yet, undefined.
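A short sketch makes the tree explicit (this simply repeats the earlier construction):

```python
import sympy as sym

a, b, t = sym.symbols('a b t')
f, g = sym.symbols('f g', cls=sym.Function)

my_expr = a*f(t) + b*g(t)

root = my_expr.func        # Add: '+' sits at the root of the tree
children = my_expr.args    # the two Mul subexpressions a*f(t) and b*g(t)
```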

The innocent-looking expression from above has the tree-form

Next month, we’ll continue on to show how to examine, traverse, and modify this tree structure so that we can mimic how a human ‘plays with’ and ‘mine meaning from’ an expression.

Time Series 6 – Recursive Least Squares

The last post introduced the notion of the recursive least squares estimation approach as a natural precursor to the Kalman filter and derived the Kalman-like gain that governed the state estimation update equation.  Before moving on to a simple example of the Kalman filter, it seemed prudent to show the recursive least squares filter in action with a numerical example.

The example involves a mass falling in an unknown uniform gravitational field with unknown initial conditions.  At regular time intervals, a measurement device measures the height of the mass above the ground, but only to within a certain precision due to inherent noise in its measuring process.  The challenge is then to deliver the best estimate of the unknowns (initial height, initial speed, and the uniform gravitational acceleration).  We will assume that, beyond a measurement of the initial height, the other initial conditions are inaccessible to the measurement device since it only measures range.  This example is adapted from the excellent article Introduction to Kalman Filter: Derivation of the Recursive Least Squares Method by Aleksandar Haber.

To prove that the recursive least squares method does generate a real least squares estimation, let’s first look at how a batch least squares approach to this problem would be performed.

The dynamics of the falling mass are given by

\[ d(t) = d_0 + v_0 t + \frac{1}{2} a t^2 \; , \]

where $d(t)$ is the height above the ground at a given instant.

We assume that the above equation is exact; in other words, there are no unknown forces in the problem.  In the language of Kalman-filtering and related optimal estimation, there is said to be no process noise, with it understood that the word ‘process’ is a generic umbrella for the dynamics by which the state evolves in time, of which Newton’s laws are an important subset.

By design, measurements come in every $\Delta t$ seconds and so the time evolution of the fall of the mass is sampled at times $t_k = k \Delta t$.  At measurement time $t_k$ the amount the mass has dropped is

\[ x_k = x_0 + v_0 k \Delta t + \frac{1}{2} k^2 \Delta t^2  a \; \]

and the corresponding measurement is

\[ z_k = x_k + \eta_k \; , \]

where $\eta_k$ is the noise the measuring device introduces in performing its job.

The batch least squares estimation begins by writing the evolution equation in a matrix form

\[ z_k = \left[ \begin{array}{ccc} 1 & k \Delta t & \frac{1}{2} k^2 \Delta t^2 \end{array} \right] \left[ \begin{array}{c} x_0 \\ v_0 \\ a \end{array} \right]  + \eta_k \equiv H_{k,p} x_p + \eta_k \; .  \]

The array $x_p$ contains the constant initial conditions and unknown acceleration that is being estimated.

For this numerical experiment, $x_0 = -4 \, m$, $v_0 = 2 \, m/s$ (positive being downward), and $a = 9.8 \, m/s^2$, the usual value for the surface of the Earth (give or take regional variations).  Measurements are taken every $\Delta t = 1/150 \, s$ over a time span of 14.2 seconds.  To illustrate the meaning of $H_{k,p}$, the value of the matrix at $k=0$ is

\[ H_{k=0,p} = \left[ \begin{array}{ccc} 1 & 0 & 0 \end{array} \right] \; \]

corresponding to an elapsed time of $t_k = 0$ and at $k = 75$

\[ H_{k=75,p} = \left[ \begin{array}{ccc} 1 & 0.5 & 0.125 \end{array} \right] \; \]

corresponding to an elapsed time of $t_k = 0.5$. 

The merit of this matrix approach is that by putting every measurement into a batch we can write the entire measured dynamics in a single equation as:

\[ z = H x + \eta \; . \]

The array $z$ has $M$ elements (one per measurement), where $M$ is the total number of time samples (for this specific example, $M = 2131$).  The matrix $H$ is $M \times n$, where $n$ is the number of estimated parameters ($n = 3$, in this case), and $x$ is the array of parameters to be estimated.  The array $\eta$ represents the $M$ values of random noise (assumed to be stationary) that were present in the measurements.  The values of $\eta$ used in this numerical example were drawn from a normal distribution with zero mean and a standard deviation of 1; the latter value determines the value of $R_k = 1$.

Since, in the matrix equation above, the matrix $H$ isn’t square, it may not be obvious how to solve for $x$.  However, there is a well-known technique using what is known as the Moore-Penrose pseudoinverse.

In short, to solve for $x$ in a least squares sense, we set $\eta$ to zero (i.e., we ignore the noise assuming the classical argument that with enough measurements the influence of the noise will average to zero).  We then left-multiply by the transpose of $H$ to get

\[ H^T z = (H^T H) x \; .\]

The quantity $H^T H$ is a $(n \times M) \cdot (M \times n) = n \times n $ matrix (in this case $3 \times 3$), which is invertible.  We then arrive at the compact form of the estimate for $x$ as

\[ x_{est} = (H^T H)^{-1} H^T z \; .\]

For this numerical experiment, the resulting estimate is

\[ x_{est} = \left[ \begin{array}{c} -3.95187658 \\ 1.99300377 \\ 9.80040125  \end{array}    \right] \; .\]

This is the same value one gets by using a package like numpy’s linalg.lstsq.
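For the curious, the batch computation takes only a few lines of numpy.  The sketch below regenerates the experiment from the quantities already described; the random seed is an assumption of mine, so the recovered parameters will be close to, but not exactly, the values quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)   # assumed seed, for reproducibility

dt = 1.0 / 150.0
M = 2131                          # number of measurements
t = np.arange(M) * dt

x_true = np.array([-4.0, 2.0, 9.8])   # x0 [m], v0 [m/s], a [m/s^2]

# Build the M x 3 matrix H, one row [1, t_k, t_k^2/2] per sample
H = np.column_stack([np.ones(M), t, 0.5 * t**2])

# Noisy measurements z = H x + eta, with unit-variance Gaussian noise (R = 1)
z = H @ x_true + rng.normal(0.0, 1.0, size=M)

# Normal-equations solution x_est = (H^T H)^{-1} H^T z
x_est = np.linalg.inv(H.T @ H) @ H.T @ z

# The same answer via numpy's least squares solver
x_lstsq, *_ = np.linalg.lstsq(H, z, rcond=None)
print(x_est, x_lstsq)
```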

In the recursive method, the data are not all stored and handled at once; rather, each point is processed as it comes in. 

The algorithm developed in the last post is summarized as:

  • Start each step with the current time $t_k$, measurement $z_k$, and the previous estimates of the state $x_{k-1}$ and the covariance $P_{k-1}$
  • Perform the $k$-th calculation of the gain: $K_k = P_{k-1} \cdot H_k^T \cdot S_k^{-1}$, where $S_k = H_k \cdot P_{k-1} \cdot H_k^T + R_k$
  • Make the $k$-th estimate of the unknowns: ${\hat x}_k = {\hat x}_{k-1} + K_k \left( z_k - H_k {\hat x}_{k-1} \right)$
  • Make the $k$-th update of the covariance matrix: $P_k = \left(I - K_k H_k \right)P_{k-1} $
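Translated into Python, one pass of the algorithm is a direct transcription of these four bullets.  The sketch below re-runs the falling-mass experiment recursively; the zero initial guess, the large initial covariance, and the random seed are assumptions of mine rather than prescriptions from the algorithm itself:

```python
import numpy as np

def rls_step(x_prev, P_prev, H_k, z_k, R_k):
    """One recursive least squares update: gain, state estimate, covariance."""
    S_k = H_k @ P_prev @ H_k.T + R_k             # innovation covariance
    K_k = P_prev @ H_k.T @ np.linalg.inv(S_k)    # gain
    x_k = x_prev + K_k @ (z_k - H_k @ x_prev)    # state update
    P_k = (np.eye(len(x_prev)) - K_k @ H_k) @ P_prev  # covariance update
    return x_k, P_k

rng = np.random.default_rng(0)
dt = 1.0 / 150.0
x_true = np.array([-4.0, 2.0, 9.8])

x_est = np.zeros(3)        # assumed initial guess
P = 1e6 * np.eye(3)        # large initial covariance: little prior knowledge
R = np.array([[1.0]])      # measurement noise covariance

for k in range(2131):
    t = k * dt
    H_k = np.array([[1.0, t, 0.5 * t**2]])
    z_k = H_k @ x_true + rng.normal(0.0, 1.0, size=1)
    x_est, P = rls_step(x_est, P, H_k, z_k, R)

print(x_est)   # converges toward [-4, 2, 9.8]
```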

Using this algorithm, the time evolution of the estimated initial position is

with a final value of -3.95170034.

Likewise, the time evolution of the estimated initial speed is

with a final value of 1.99295453.

And, finally, the time evolution of the estimated acceleration is

with a final value of 9.80040698.

These estimated values, up to the numerical noise introduced by a different set of floating point operations, are identical with the batch least squares numbers.

Time Series 5 – Deriving a Kalman-like Gain

Last month’s blog presented one of the most common approaches to handling time series data in real time, the Kalman filter.  As noted, the form of the state update equation is formally

\[ x_{new} = x_{old} + K ( obs_{new} - obs_{pred} ) \; , \]

where $x$ is the state being estimated and $obs$ is some observation made of the state.  The subscripts on the observation terms are ‘new’ for the latest actual observation (made by some instrument) and ‘pred’ for what the observation was predicted to be before it was made. 

For the sequential or running average discussed in the first post of this series, this general form specializes to

\[ {\bar x} = {\bar x}^{-} + \frac{1}{N} \left( x_N - {\bar x}^{-} \right) \; , \]

where ${\bar x}$ is the estimated average.  For exponential smoothing (and, by extension, Holt-Winter), discussed in the third post of this series, the general form becomes

\[ s_k = s_{k-1} + \alpha \left( x_k - s_{k-1} \right) \; , \]

where $s_k$ is the smoothed form of some time-varying quantity $x_k$.  For Kalman filtering, discussed in the fourth post of this series, the general form takes on the appearance of

\[ {\hat x}_k = {\hat x}_k^{-} + K_k \left( z_k - H {\hat x}_k^{-} \right) \; , \]

where ${\hat x}_k$ is the revised estimated state at time $t_k$ based on the difference between the current measurement $z_k$ and the a priori estimated state ${\hat x}_k^{-}$ at the same time. 

The ‘magic’ in all these cases lies firmly in determining the ‘gain’ that sits outside the difference between the new observation datum and the expected value of the same based on the previous estimate. 

In running average and Holt-Winter cases, it is fairly easy to deduce the form of the gain without any formal mathematical manipulations.  A few rearrangements and some intuition serve.  In the case of the Kalman filter a lot more effort is involved.

A stepping-stone exercise that has almost all of the ingredients found in the Kalman filter, but is both easier to understand and slightly easier to manipulate, is the recursive least squares estimator.  The presentation below is strongly influenced by the excellent article Introduction to Kalman Filter: Derivation of the Recursive Least Squares Method by Aleksander Haber, with mathematical simplifications gleaned from the Wikipedia entry on the Kalman Filter.

Haber’s system of equations models a process in which a constant set of parameters are unknown but can be estimated from a time series of measurements of a time-varying state whose evolution is solely determined by the set of parameters.  The measurement equation is given by

\[ z_k = H_k x + v_k \; . \]

The time varying matrix $H_k$ serves to map the constant but unknown state $x$ forward in time.  The term $v_k$ represents the measurement noise, which is assumed to be zero-mean Gaussian distributed with a covariance given by

\[ E[v_k v_k^T ] = R \; .\]

The concrete example that Haber provides is an object being subjected to constant acceleration starting from an unknown initial position and speed.  The object’s motion obeys

\[ x(t) = x_0 + v_0 t + \frac{1}{2} a t^2 =  \left[ \begin{array}{ccc} 1 & t & \frac{1}{2} t^2 \end{array} \right] \left[ \begin{array}{c} x_0 \\ v_0 \\ a \end{array} \right]  \equiv H(t) x \; .\]

The conversion from the continuous process matrix $H(t)$ to the discrete time-sampled $H_k$ is trivially done by noting that the time $t_k = k \Delta t$. 

Haber claims that a least squares estimation of $x$ can be determined recursively by the process

\[ x_k = x_{k-1} + K_k ( z_k - H_k x_{k-1} ) \; , \]

where $x_k$ is the current estimate of $x$, $z_k$ is the current measurement, and in which the gain $K_k$ will be given by the optimality condition that we want to minimize the error.

To find the explicit form of the gain, we first define the error in state estimation as

\[ \epsilon_k = x - {\hat x}_k \; . \]

Since the parameter set is generically multi-dimensional, we will want to minimize the total error

\[ W = E[(\epsilon_k)_1^2] + E[(\epsilon_k)_2^2]+ \ldots + E[(\epsilon_k)_n^2] \; , \]

where the expectation is over all possible realizations of the random error in the measurement.  If we form the covariance matrix of the error

\[ P_k = E[ \epsilon_k \epsilon_k^T] \; \]

then the total error is the trace of the covariance matrix: $W = Tr(P_k)$.  Taking the derivative of this trace and setting it equal to zero provides the explicit form for the gain as follows.

First, re-express the current estimation error in terms of the estimate at the previous time via the update equation

\[ \epsilon_k = x - x_{k-1} - K_k(z_k - H_k x_{k-1}) \;  .\]

Using the measurement equation, the current estimation error becomes

\[ \epsilon_k = x - x_{k-1} - K_k( H_k x + v_k - H_k x_{k-1} ) \; .\]

This last form re-emphasizes that, in the absence of the measurement noise $v_k$, the determination of the unknown parameter set $x$ would be exact since the quantity in the parentheses would be identically zero.

Regrouping the terms so that the state variables are gathered separately from the noise yields

\[ \epsilon_k = (1 - K_k H_k) (x - x_{k-1}) - K_k v_k \equiv T_k (x - x_{k-1}) - K_k v_k \; . \]

Now it is relatively easy to right-multiply the above expression by its transpose

\[ \epsilon_k \epsilon_k^T = T_k (x - x_{k-1} ) (x - x_{k-1} )^T T_k^T + K_k v_k v_k^T K_k^T \\ - K_k v_k (x - x_{k-1} )^T T_k^T - T_k (x - x_{k-1} ) v_k^T K_k^T \; \]

and take an expectation of it to arrive at the covariance matrix

\[P_k = E[ T_k (x - x_{k-1}) (x - x_{k-1})^T T_k^T ] + E[K_k v_k v_k ^T K_k^T] \; . \]

The cross-terms between the state and the measurement noise vanish under the assumption that the noise is zero-mean distributed (i.e., unbiased). 

Since $K_k$ and $H_k$ are deterministic (they do not depend on the particular realization of the noise, only on the time-independent noise covariance $R$ and earlier covariances), they can be factored out of the expectations and we arrive at

\[ P_k = (1 - K_k H_k) P_{k-1} (1 - K_k H_k)^T + K_k R K_k^T \; , \]

which is the Joseph form of covariance propagation relating the covariance matrix at time $t_k$ to one at an earlier time $t_{k-1}$.

The last step is to take the trace and minimize it.  While there are a number of formulae for doing this step that engineers and mathematicians favor, it is easier to use index notation since it frees one from the bother of finding the correct formula and then applying it.

To proceed, expand the Joseph form term-by-term and define $S_k = H_k P_{k-1} H_k^T + R$, to get

\[ P_k = P_{k-1} - K_k H_k P_{k-1} - P_{k-1} H_k^T K_k^T + K_k S_k K_k^T \; . \]

Now, we can suppress the $k$ indices on the right since we know that every matrix bears a $k$ subscript except the covariance matrix $P$, which bears a $k-1$ subscript.

The trace is then given by

\[ Tr(P_k) = P_{ii} - K_{ij} H_{j\ell} P_{\ell i} - P_{ij} H_{j\ell}^T K_{\ell i}^T + K_{ij} S_{j \ell} K_{\ell i}^T \; .\]

The least squares minimization is expressed as

\[ \frac{\partial Tr(P_k)}{\partial K_{st}} = 0 \; . \]

The derivative on the left can be expressed as

\[ \frac{\partial Tr(P_k)}{\partial K_{st}} = - H_{t \ell} P_{\ell s} - P_{sj}H_{jt}^T + S_{t \ell} K_{\ell s}^T + K_{sj} S_{jt} \; .\]

Re-arranging the terms and indices (subject to the transposes) to return to a matrix-multiplication form, yields

\[ \frac{\partial Tr(P_k)}{\partial K} = - 2 P_{k-1} H_k^T + K_k (S_k + S_k^T) \; , \]

where the $k$ index has been restored once the matrix indices have been suppressed.

The next step is to recognize that

\[ S^T = (H P H^T + R)^T = (H^T)^T P^T H^T + R^T = H P H^T + R \; \]

since both the state and noise covariance matrices are symmetric.

The final step is to set the derivative equal to zero and solve for $K_k$.  Doing so gives

\[ K_k S_k = P_{k-1} H_k^T \; , \]

which immediately solves to

\[ K_k = P_{k-1} H_k^T S_k^{-1} = P_{k-1} H_k^T \left[ H_k P_{k-1} H_k^T + R \right]^{-1} \; .\]

Next blog, we’ll explore a numerical example of this method and establish that it does deliver the least squares estimation of the parameter set $x$.

Time Series 4 – Introduction to Kalman Filtering

This month we turn from the exponential smoothing and moving average techniques for analyzing time series and start a sustained look at the Kalman filter.  The Kalman filter has been hailed as one of the most significant mathematical algorithms of the 20th century.  The primary reason for this praise is that the Kalman filter enables a real time agent to make mathematically precise and supported estimates of some quantity as each data point comes in, rather than having to wait until all the data have been collected.  This enabling behavior garners Kalman filtering a lot of attention, and Russell and Norvig devote part of the chapter on probabilistic reasoning in their textbook Artificial Intelligence: A Modern Approach to it.

Before diving into the mathematical details, it is worth noting that the thing that sets the Kalman filter apart from the Holt-Winters algorithm is that in the latter, we needed to recognize and characterize a recurring pattern in our data.  For example, in the RedFin housing sales, the data are conveniently regularly spaced in monthly intervals and seasonal patterns are readily apparent if not always predictable.  The Kalman filter imposes no such requirements, hence its usefulness.  It is often used in navigation activities within the aerospace field, where it is applied as readily to the powered flight of a missile as to a low-Earth orbiting spacecraft like Starlink or to a libration-point orbiting spacecraft like JWST.

There are an enormous number of introductions to the Kalman filter, each with its own specific strengths and weaknesses.  I tend to draw on two works: An Introduction to the Kalman Filter by Greg Welch and Gary Bishop and Poor Man’s Explanation of Kalman Filtering or How I stopped Worrying and Learned to Love Matrix Inversion by Roger M. du Plessis.  The latter work is particularly difficult to come by, but I’ve had it around for decades.  That said, I also draw on a variety of other references which will be noted when used.

I suppose there are four core mathematical concepts in the Kalman filter:  1) the state of the system can be represented by a list of physically meaningful quantities, 2) these quantities vary or evolve in time in some mathematically describable way, 3) some combination (often nonlinear) of the state variables may be measured, 4) the state variables and the measurements of them are ‘noisy’.

Welch and Bishop describe these four mathematical concepts as follows.  The state, which is described by an $n$-dimensional, real array of variables

\[ x \in \mathbb{R}^n \; \]

evolves according to the linear stochastic differential equation

\[ x_k = A x_{k-1} + B u_{k-1} + w_{k-1} \; , \]

where $A$, which is known as the $n \times n$ real-valued process matrix, provides the natural time evolution of the state at the earlier time $t_{k-1}$ to the state at time $t_k$.  The quantity $u \in \mathbb{R}^{\ell} $ is a control or forcing term that is absent in the natural dynamics but can be imposed externally on the system.  The $n \times \ell$, real-valued matrix $B$ maps the controls into the state evolution.  The term $w \in \mathbb{R}^n$ is the noise in the dynamics; it typically represents those parts of the natural evolution or of the control that are either not easily modeled or are unknown.

Measurement of the state is represented by an $m$-dimensional, real-valued vector $z \in \mathbb{R}^m$ and is related to the state by the equation

\[ z_k = H x_k + v_k \; , \]

where $H$ is a $m \times n$ real-valued matrix and $v \in \mathbb{R}^m$ is the noise in the measurement.

In addition, both noise terms are assumed to be normally distributed with their probabilities described by

\[ p(w_k)  = N(0,Q_k) \; \]

and

\[ p(v_k) = N(0,R_k) \; , \]

where $Q_k \in \mathbb{R}^{n \times n}$ is the process noise covariance matrix and $R_k \in \mathbb{R}^{m \times m}$ is the measurement noise covariance matrix.  We further assume that they are uncorrelated with each other.  Generally, these noise covariances are time-varying since their values typically depend on the state.

Note that both the state and measurement equations are linear.  More often than not, Kalman filtering is applied to nonlinear systems but in those cases, the estimation process is linearized and the above equations are used.  Whether that is a good idea or not is a discussion for another day. 

The role of the Kalman filter is to produce the best estimate of the state at a given time given the above equations.  To do so, the filter algorithm draws heavily on concepts from Bayesian statistics to produce a maximum likelihood estimate. 

In terms of Bayesian statistics, the algorithm assumes both an a priori state estimate ${\hat x}^{-}_{k}$ and an a posteriori one ${\hat x}_{k}$.  The transition from the a priori to the a posteriori is made with

\[ {\hat x}_k = {\hat x}^{-}_{k} + K_k (z_k - H {\hat x}^{-}_k ) \; , \]

where the ‘magic’ all happens due to the Kalman gain $K_k$ defined as

\[ K_k = P^{-}_k H^T \left( H P^{-}_k H^T + R \right) ^{-1} \; , \]

where ‘$T$’ indicates a matrix transpose.

Note that this equation is functionally similar to the running average form calculated as

\[ {\bar x} = {\bar x}^{-} + \frac{1}{N} \left( x_N - {\bar x}^{-} \right) \; . \]

We’ll return to that similarity in a future post. 

For now, we will round out the menagerie of mathematical beasts by noting that $P_k$ is the state covariance, an $n \times n$, real-valued matrix that summarizes the current statistical uncertainty in the state estimation.  Formally, it is defined as

\[ P_k = E \left[ (x_k - {\hat x}_k ) (x_k - {\hat x}_k )^T \right] \; , \]

where the ‘$E$’ stands for expectation over the state distribution. 

The state covariance can be associated with the a priori estimate, where it is denoted as $P^{-}_k$, or with the a posteriori estimate, where it is denoted by $P_k$.  The a priori covariance at time $t_k$ follows from the a posteriori covariance at the previous time via

\[ P^{-}_k = A P_{k-1} A^T + Q \; . \]

The Kalman algorithm is best summarized using the following flow diagram adapted from the one in Introduction to Random Signals and Applied Kalman Filtering 4th Edition by Brown and Hwang.
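In code, one trip around that loop is a predict step (propagate the state and covariance forward) followed by an update step (blend in the measurement through the gain).  A minimal sketch, with the control term $B u$ omitted and a hypothetical one-dimensional constant-velocity tracker, of my own invention, as the example system:

```python
import numpy as np

def kalman_step(x_post, P_post, z, A, H, Q, R):
    """One predict/update cycle of the linear Kalman filter (no control term)."""
    # Predict: propagate the a posteriori estimate and covariance forward
    x_prior = A @ x_post
    P_prior = A @ P_post @ A.T + Q
    # Update: compute the gain and blend in the new measurement
    S = H @ P_prior @ H.T + R
    K = P_prior @ H.T @ np.linalg.inv(S)
    x_post = x_prior + K @ (z - H @ x_prior)
    P_post = (np.eye(len(x_post)) - K @ H) @ P_prior
    return x_post, P_post

# Hypothetical example: 1-D constant-velocity tracker, state = [position, velocity]
dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])        # only position is measured
Q = 1e-4 * np.eye(2)              # small assumed process noise
R = np.array([[0.25]])            # measurement noise variance

rng = np.random.default_rng(1)
x_est, P = np.zeros(2), 10.0 * np.eye(2)
for k in range(200):
    z = np.array([2.0 * k * dt]) + rng.normal(0.0, 0.5, size=1)
    x_est, P = kalman_step(x_est, P, z, A, H, Q, R)

print(x_est)   # velocity estimate converges toward the true value of 2
```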

In future blogs, we’ll explore how to derive the magical Kalman gain, some applications of it to agent-based time series estimation, and a variety of related topics.  It is worth noting that the notation is fluid, differing from author to author and application to application.

Time Series 3 – Exponential Smoothing and Holt-Winter

In the last post, we examined the Holt-Winter scheme for tracking the level, trend, and seasonal variations in a time series in a sequential fashion with some synthetic data designed to illustrate the algorithm in as clean a way as possible.  In this post, we’ll try the Holt-Winter method against real world data for US housing sales and will set some of the context for why the method works by comparing it to a related technique called the moving average.

The data analyzed here were obtained from RedFin (https://www.redfin.com/news/data-center/) but it isn’t clear for how long RedFin will continue to make these data public as they list the data as being ‘temporarily released’.  As a result, I’ve linked the data file I’ve used here.

We’re going to approach these data in two ways.  The first is by taking a historical look at the patterns in the data from the vantage point of hindsight on the entire span of home sales having been collected.  In the second approach, we imagine what an agent working in the past thinks as the data come in one record at a time. 

The historical look starts with an overview of the number of homes sold in the time period starting in Feb 2012 and ending at May 2023.

These data show both seasonal and overall trend variations, and so our expectation might be that Holt-Winter would do a good job, but note two things.  First, with the exception of the first pandemic year of 2020, each of the years shows the same pattern: sales are low in the winter months and strong in the summer ones.  Second, the trend (most easily seen by focusing on the summer peak) shows four distinct regions: a) from 2012-2017 there is an overall upward trend, b) from 2017-2020 the trend is now downward with a much shallower slope, c) the start of the pandemic lockdowns in 2020 breaks the smoothness of the trend and then the trend again has a positive slope over 2020-2021, and d) the trend is strongly downward afterwards.  These data exhibit a real-world richness that the contrived data used in the last post did not, and they should prove a solid test for a time series analysis agent/algorithm.

Depending on how ‘intelligent’ we want our analysis agent to be we could look at a variety of other factors to explain or inform these features.  For our purposes, we’ll content ourselves with looking at one other parameter, the median home sales price, mostly to satisfy our human curiosity.

These data look much more orderly in their trend and seasonal variation over the time span from 2012-2020.  Afterwards, there isn’t a clear pattern in terms of trend and season. 

Our final historical analysis will be to try to understand the overall pattern of the data using a moving average defined as:

\[ {\bar x}_{k,n} = \frac{1}{n} \sum_{i =k-n/2}^{k+n/2} x_i \; . \]

The index $k$ specifies the point of the underlying series to which the average is attached, and $n$ the number of points to be used in the moving average.  Despite the notation, $n$ is best when odd so that there are as many points before the $k$th one as there are after; this prevents the moving average from introducing a bias which shifts a peak in the average off of the place in the data where it occurs.  In addition, there is an art in the selection of the value of $n$: too small, and it fails to smooth out unwanted fluctuations; too large, and it smears out the desired patterns.  For these data, $n = 5$.  The resulting moving average (in solid black overlaying the original data in the red dashed line) is:

Any agent using this technique would clearly be able to describe the data as having a period of one year with a peak in the middle and perhaps an overall upward trend from 2012 to 2022 but then a sharp decline afterwards.  But two caveats are in order.  First, and most important, the agent employing this technique to estimate a smoothed value on the $k$th time step must wait until at least $n/2$ additional future points have come in.  This requirement usually precludes being able to perform predictions in real time.  The second is that the moving average is computationally burdensome when $n$ is large.
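The centered moving average itself is only a few lines of code; the sketch below also makes the first caveat concrete, since no smoothed value exists until half a window of future points has arrived:

```python
import numpy as np

def centered_moving_average(x, n=5):
    """Centered n-point moving average; n should be odd to avoid a phase bias."""
    half = n // 2
    x = np.asarray(x, dtype=float)
    # Only interior points have a full window of n samples around them
    return np.array([x[k - half:k + half + 1].mean()
                     for k in range(half, len(x) - half)])

# The first and last n//2 points have no smoothed value at all
data = [1, 2, 3, 4, 5, 6, 7]
print(centered_moving_average(data, n=5))   # [3. 4. 5.]
```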

By contrast, the Holt-Winter method can be used by an agent needing to analyze in real time and it is computationally clean.  At the heart of the Holt-Winter algorithm is the notion of exponential smoothing where the smoothed value at the $k$th step, $s_k$, is determined by the previous smoothed value $s_{k-1}$ and the current raw value $x_k$ according to

\[ s_k = \alpha x_k + (1-\alpha) s_{k-1} \; . \]

Since $s_{k-1}$ was determined from a similar expression at time point $k-1$, one can back-substitute to eliminate all the smoothed values $s$ on the right-hand side in favor of the raw ones $x$ to get

\[ s_k  = \alpha x_k + \alpha(1-\alpha)x_{k-1} + \alpha(1-\alpha)^2 x_{k-2} + \cdots + (1-\alpha)^k x_0 \; . \]

This expression shows that the smoothed value $s_k$ is a weighted average of all the previous points making it analogous to the sequential averaging discussed in a previous post but the exponential weighting by $(1-\alpha)^n$ makes the resulting sequence $s_k$ look more like the moving average.  In some sense, the exponential smoothing straddles the sequential and moving averages giving the computational convenience of the former while providing the latter’s ability to follow variations and trends.

How closely the exponentially smoothed sequence matches a given $n$-point moving average depends on the selection of the value of $\alpha$.  For example, with $\alpha = 0.2$ the exponentially smoothed curve gives

whereas $\alpha = 0.4$ gives

Of the two of these, the one with $\alpha=0.4$ much more closely matches the 5-point moving average used above. 

The Holt-Winter approach uses three separate applications of exponential smoothing, hence the need for the three specified parameters $\alpha$, $\beta$, and $\gamma$.  Leslie Major presents a method for optimizing the selection of these three parameters in her video How to Holts Winters Method in Excel & optimize Alpha, Beta & Gamma.

We’ll skip this step and simply use some values informed by the best practices that Major (and other YouTubers) note.

The long-term predictions given by our real time agent are pretty good in the time span 2013-2018.  For example, a 24-month prediction made in February 2013 looks like

Likewise, a 24-month prediction in June 2017 looks like

Both have good agreement with a few areas of over- or under-estimation.  The most egregious error is the significant overshoot in 2019, which is absent in the 12-month prediction made a year later. 

All told, the real time agent does an excellent job of predicting in the moment but it isn’t perfect as is seen by how the one-month predictions falter when the pandemic hit.

Time Series 2 – Introduction to Holt-Winter

In the last blog, I presented a simple sequential way of analyzing a time series as data are obtained.  In that post, the average of any moment $x^n$ was obtained in real time by simply tracking the appropriate sums and number of points seen.  Of course, in a real world application, there would have to be a bit more intelligence built into the algorithm to allow an agent employing it to recognize when a datum is corrupted, bad, or missing (all real world problems) and to exclude these points both from the running sums and from the number of points processed. 

This month, we look at a more sophisticated algorithm for analyzing trends and patterns in a time series and for sequentially projecting that analysis into the future using these patterns.  The algorithm is a favorite in the business community because, once an initial ‘training’ set has been digested, the agent can update trends and patterns with each new datum and then forecast into the future.  The algorithm is called the Holt-Winter triple exponential smoothing and it has been used in the realm of business analytics for forecasting the number of home purchases, the revenue from soft drink sales, ridership on Amtrak, and so on, based on a historical time series of data.

Being totally unfamiliar with this algorithm until recently, I decided to follow and expand upon the fine video by Leslie Major entitled How to Holts Winters Method in Excel & optimize Alpha, Beta & Gamma.  In hindsight, this was a very wise thing to do because there are quite a few subtle choices for initializing the sequential process, and the business community focuses predominantly on using the algorithm rather than explaining the rationale for the choices being followed.

For this first foray, I am using the ‘toy’ data set that Major constructs for this tutorial.  The data set is clean and well-behaved but, unfortunately, is not available from any discernible link associated with the video, so I have reconstructed it (with a lot of pauses) and make it available here.

The data are sales $W$ of an imaginary item, which, in deference to decades of tradition, I call a widget.  The time period is quarterly, and a plot of the sales $W$ from the first quarter of 2003 (January 1, 2003) through the fourth quarter of 2018 (October 1, 2018) shows both seasonal variations (with a monotonic ordering from the first quarter as the lowest to the fourth quarter as the highest)

as well as a definite upward trend for each quarter.

The Holt-Winter algorithm starts by assuming that the first year of data are initially available and that a first guess for a set of three parameters $\alpha$, $\beta$, and $\gamma$ (associated with the level $L$, the trend $T$, and the seasonal $S$ values for the number of widget sales, respectively) are known.  We’ll talk about how to revise these assumptions after the fundamentals of the method are presented.

The algorithm consists of 7 steps.  The prescribed way to initialize the data structures tracking the level, trend, and seasonal values is found in steps 1-4; step 5 is a bootstrap initialization needed to start the sequential algorithm proper once the first new datum is obtained; and steps 6-7 comprise the iterated loop used for all subsequent data, wherein an update to the current interval is made (step 6) and then a forecast into the future is made (step 7).

In detail these steps require an agent to:

  1. Determine the number $P$ of intervals in the period; in this case $P = 4$ since the data are collected quarterly. 
  2. Gather the first period of data (here a year) from which the algorithm can ‘learn’ how to statistically characterize it.
  3. Calculate the average $A$ of the data in the first period.
  4. Calculate the ratio of each interval $i$ in the first period to the average to get the seasonal scalings $S_i = \frac{V_i}{A}$.
  5. Once the first new datum $V_i$ ($i=5$) (for the first interval in the second period) comes in, the agent then bootstraps by estimating:
    1. the level in the first interval by making a seasonal adjustment.  Since the seasonal ratio for this interval is not yet known, the agent uses the seasonal value from the corresponding interval in the previous period: $L_i = \frac{V_i}{S_{i-P}}$;
    2. the trend of the first interval in the second period using $T_i = \frac{V_i}{S_{i-P}} - \frac{V_{i-1}}{S_{i-1}}$.  This odd-looking formula is basically the finite difference between the first interval of the second period and the last interval of the first period, each seasonally adjusted.  Again, since the seasonal ratio is not yet known, the agent uses the value from the corresponding earlier interval;
    3. the seasonal ratio of the first interval in the second period using $S_i = \gamma \frac{V_i}{L_i} + (1-\gamma) S_{i-P}$.
  6. Now, the agent can begin sequential updating in earnest by using all of the weighted or blended averages of the data in current and past intervals to update:
    1. the level using $L_i = \alpha \frac{V_i}{S_{i-P}} + (1-\alpha)(L_{i-1} + T_{i-1})$
    2. the trend using $T_i = \beta ( L_i - L_{i-1} ) + (1-\beta) T_{i-1}$
    3. the seasonal ratio (again) using $S_i = \gamma \frac{V_i}{L_i} + (1-\gamma) S_{i-P}$
  7. Finally, the agent can forecast as far forward as desired using $F_{i+k} = (L_i + k \cdot T_i) S_{i - P + k}$, where $k$ is an integer representing the number of ‘intervals’ ahead to be forecast.  There are some subtleties associated with the finite number of historical seasonal ratios available to the agent, so, depending on application, specific approximations for $S_{i - P + k}$ must be made.
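The seven steps can be condensed into a short sequential routine.  The sketch below follows the multiplicative-seasonal form described above; as noted, authors differ on initialization choices, so treat this as one reasonable reading of the steps rather than a definitive implementation (the synthetic test data are my own):

```python
def holt_winters(values, P, alpha, beta, gamma):
    """Sequential Holt-Winter (multiplicative seasonality) over one series.

    values: raw data, at least 2*P points; P: intervals per period.
    Returns the level, trend, and seasonal-ratio histories.
    """
    # Steps 1-4: learn the seasonal ratios from the first period
    A = sum(values[:P]) / P
    S = [v / A for v in values[:P]]

    # Step 5: bootstrap level and trend from the first datum of period two
    L = [values[P] / S[0]]
    T = [values[P] / S[0] - values[P - 1] / S[P - 1]]
    S.append(gamma * values[P] / L[0] + (1 - gamma) * S[0])

    # Step 6: sequential updates for every subsequent datum
    for i in range(P + 1, len(values)):
        v = values[i]
        L_new = alpha * v / S[i - P] + (1 - alpha) * (L[-1] + T[-1])
        T_new = beta * (L_new - L[-1]) + (1 - beta) * T[-1]
        L.append(L_new)
        T.append(T_new)
        S.append(gamma * v / L_new + (1 - gamma) * S[i - P])
    return L, T, S

def forecast(L, T, S, P, k):
    """Step 7: forecast k intervals beyond the last update (k <= P)."""
    return (L[-1] + k * T[-1]) * S[len(S) - P + k - 1]
```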

Using the widget sales data, the first 4-quarter forecast compares favorably to the actuals, as seen in the following plot.

When forecasting only a quarter ahead, as Major herself does, the results are also qualitatively quite good.

Finally, a note about selecting the values of $\alpha$, $\beta$, and $\gamma$.  There is a general rule of thumb for initial guesses, but the way to nail down the best values is to use an optimizer to minimize the RMS error between the forecast and the actuals.  Majors discusses all of these points in her video and shows how Excel can be used to get even better agreement.

For next month, we’ll talk about the roots of the Holt-Winters algorithm in the exponential smoother (because the smoothing is applied to three quantities, namely level, trend, and seasonal ratio, the algorithm is often called triple exponential smoothing).

Time Series 1 – Sequential Averaging

A hallmark of intelligence is the ability to anticipate and plan as events unfold in time.  Evaluating and judging each event sequentially is how each of us lives but, perhaps ironically, this is not how most of us are trained to evaluate a time series of data.  Typically, we collect quantitative and/or qualitative data for some time span, time-tagging the value of each event, and then we analyze the entire data set as one large batch.  This technique works well for careful, albeit leisurely, exploration of the world around us.

Having an intelligent agent (organic or artificial) working in the world at large necessitates using both techniques – real time evaluation in some circumstances and batch evaluation in others – along with the ability to distinguish when to use one versus the other.

Given the ubiquity of time series in our day-to-day interactions with the world (seasonal temperatures, stock market motion, radio and TV signals) it seems worthwhile to spend some space investigating a variety of techniques for analyzing these data.

As a warmup exercise that demonstrates the nuances distinguishing real-time from batch processing, consider random numbers pulled from a normal distribution with mean $\mu = 11$ and standard deviation $\sigma = 1.5$.

Suppose that we pull $N=1000$ samples from this distribution.  We would expect the sample mean, ${\bar x}$, and sample standard deviation, $S_x$, to match the population mean and standard deviation to within about 3% (given that $1/\sqrt{1000} \approx 0.03$).  Using numpy, we can put that expectation to the test, finding that for one particular set of realizations the batch estimation over the entire 1000-sample set is ${\bar x} = 11.08$ and $S_x = 1.505$.
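A quick numpy experiment along these lines might look as follows.  The seed is an arbitrary choice for reproducibility, so the particular batch values it produces will differ from the realization quoted above but should land within the same few-percent band.

```python
import numpy as np

# Draw N samples from a normal distribution with mu = 11, sigma = 1.5
# and compare the batch (whole-data-set) estimates to the population values.
rng = np.random.default_rng(seed=42)  # arbitrary seed for reproducibility
mu, sigma, N = 11.0, 1.5, 1000
x = rng.normal(mu, sigma, size=N)

x_bar = x.mean()        # batch sample mean
S_x = x.std(ddof=1)     # batch sample standard deviation
print(x_bar, S_x)       # both should sit within a few percent of 11 and 1.5
```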

But how would we estimate the sample mean and standard deviation as points come in?  With the first point, we would be forced to conclude that ${\bar x} = x_0$ and $S_x = 0$.  For subsequent points, we don’t need to hold onto the values of the individual measurements (unless we want to).  We can develop an iterative computation starting with the definition of an average

\[ {\bar x} = \frac{1}{N} \sum_{i=1}^N x_i \; .\]

Separating out the most recent point $x_N$ and multiplying the sum over the first $N-1$ points by $1 = (N-1)/(N-1)$ gives

\[ {\bar x} = \frac{N-1}{N-1} \frac{1}{N} \left( \sum_{i=1}^{N-1} x_i \right) + \frac{1}{N} x_N \; .\]

Recognizing that the first term contains the average ${\bar x}^{(-)}$ over the first $N-1$ points, this expression can be rewritten as

\[ {\bar x} = \frac{N-1}{N} {\bar x}^{(-)} + \frac{1}{N} x_N \; .\]

Some authors then expand the first term and collect factors proportional to $1/N$ to get

\[ {\bar x} = {\bar x}^{(-)} + \frac{1}{N} \left( x_N - {\bar x}^{(-)} \right) \; .\]

Either of these iterative forms can be used to keep a running tally of the mean.  But since the number of points in the estimation successively increases as a function of time, we should expect the difference between the running average and the final batch estimation to also be a function of time.
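The second iterative form translates directly into a few lines of Python; this is a minimal sketch rather than anything numerically sophisticated.

```python
# Running (sequential) mean using the iterative form
#   x_bar <- x_bar_prev + (x_N - x_bar_prev) / N,
# so no individual measurements need to be retained.
def running_mean(samples):
    """Yield the running average after each new sample arrives."""
    x_bar = 0.0
    for n, x in enumerate(samples, start=1):
        x_bar += (x - x_bar) / n
        yield x_bar

# After the final sample, the running tally agrees exactly with the
# batch average, since the rearrangement above is algebraically exact.
vals = [2.0, 4.0, 6.0, 8.0]
tallies = list(running_mean(vals))  # running averages: 2.0, 3.0, 4.0, 5.0
```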

While there is definitive theory that tells us how the running tally should differ from the population mean, there isn’t any theory characterizing how it should rank relative to the batch estimation, other than the generic expectation that the two should converge as the number of points used in the running tally approaches the number in the batch estimation.  Of course, the final values must agree exactly, as there were no approximations made in the algebraic manipulations above.

To characterize the performance of the running tally, we look at a variety of experiments.

In the first experiment, the running tally fluctuates about the batch mean before settling in and falling effectively on top of it.

But in the second experiment, the running tally starts far from the batch mean and (with only the exception of the first point) stays above it until quite late in the evolution.

Looking at the distribution of the samples in the left-hand plot shows an overall bias relative to the batch mean along with a downward trend, illustrating how tricky real data can be.

One additional note: the same scheme can be used to keep a running tally of the average of $x^2$, allowing for a real-time update of the standard deviation from the relation $\sigma^2 = {\bar {x^2}} - {\bar x}^2$.
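Extending the earlier sketch, keeping two tallies at once gives a running standard deviation; again, a minimal illustration rather than a robust implementation.

```python
# Running standard deviation by keeping running means of both x and x^2,
# then applying sigma^2 = mean(x^2) - mean(x)^2.
def running_std(samples):
    """Return the running (population) standard deviation after each sample."""
    m1 = m2 = 0.0  # running means of x and x^2
    out = []
    for n, x in enumerate(samples, start=1):
        m1 += (x - m1) / n
        m2 += (x * x - m2) / n
        var = max(m2 - m1 * m1, 0.0)  # guard against tiny negative round-off
        out.append(var ** 0.5)
    return out
```

Note that this textbook formula can lose precision through cancellation when the mean is large compared to the spread; Welford’s algorithm is the numerically stable variant usually recommended in practice.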

As the second case demonstrates, our real time agent may have a significantly different understanding of the statistics than the agent who can wait and reflect over a whole run of samples (although there is a philosophical point as to what ‘whole run’ means).  In the months to come we’ll explore some of the techniques in both batch and sequential processing.