Time Series 5 – Deriving a Kalman-like Gain

Last month’s blog presented one of the most common approaches to handling time series data in real time, known as the Kalman gain.  As noted there, the state update equations take the general form

\[ x_{new} = x_{old} + K ( obs_{new} - obs_{pred} ) \; , \]

where $x$ is the state being estimated and $obs$ is some observation made of the state.  The subscripts on the observation terms are ‘new’ for the latest actual observation (made by some instrument) and ‘pred’ for the predicted observation we expect before making it. 

For the sequential or running average discussed in the first post of this series, this general form specializes to

\[ {\bar x} = {\bar x}^{-} + \frac{1}{N} \left( x_N - {\bar x}^{-} \right) \; , \]

where ${\bar x}$ is the estimated average.  For exponential smoothing (and, by extension, Holt-Winters), discussed in the third post of this series, the general form becomes

\[ s_k = s_{k-1} + \alpha \left( x_k - s_{k-1} \right) \; , \]

where $s_k$ is the smoothed form of some time-varying quantity $x_k$.  For Kalman filtering, discussed in the fourth post of this series, the general form takes on the appearance of

\[ {\hat x}_k = {\hat x}_k^{-} + K_k \left( z_k - H {\hat x}_k^{-} \right) \; , \]

where ${\hat x}_k$ is the revised estimated state at time $t_k$ based on the difference between the current measurement $z_k$ and the predicted measurement $H {\hat x}_k^{-}$ built from the a priori estimated state ${\hat x}_k^{-}$ at the same time. 
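All three updates instantiate the same predictor-corrector pattern: new estimate equals old estimate plus a gain times the innovation.  A minimal Python sketch (function and variable names are illustrative, not from any particular library) makes the family resemblance explicit:

```python
# Each rule has the form: new = old + gain * (observed - predicted).

def running_average_update(xbar_prev, x_new, n):
    """Sequential average: the gain is 1/N."""
    return xbar_prev + (1.0 / n) * (x_new - xbar_prev)

def exponential_smoothing_update(s_prev, x_new, alpha):
    """Exponential smoothing: the gain is the constant alpha."""
    return s_prev + alpha * (x_new - s_prev)

def kalman_update(x_prior, z, H, K):
    """Kalman measurement update (1-D case, so H and K are scalars)."""
    return x_prior + K * (z - H * x_prior)
```

Only the gain differs from case to case, which is exactly the quantity derived below.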

The ‘magic’ in all these cases lies firmly in determining the ‘gain’ that sits outside the difference between the new observation datum and the expected value of the same based on the previous estimate. 

In the running-average and Holt-Winters cases, it is fairly easy to deduce the form of the gain without any formal mathematical manipulations; a few rearrangements and some intuition serve.  In the case of the Kalman filter, a lot more effort is involved.

A stepping-stone exercise that has almost all of the ingredients found in the Kalman filter, but is both easier to understand and slightly easier to manipulate, is the recursive least squares estimator.  The presentation below is strongly influenced by the excellent article Introduction to Kalman Filter: Derivation of the Recursive Least Squares Method by Aleksander Haber, with mathematical simplifications gleaned from the Wikipedia entry on the Kalman filter.

Haber’s system of equations models a process in which a constant set of parameters is unknown but can be estimated from a time series of measurements of a time-varying state whose evolution is determined solely by those parameters.  The measurement equation is given by

\[ z_k = H_k x + v_k \; . \]

The time-varying matrix $H_k$ serves to map the constant but unknown state $x$ forward in time.  The term $v_k$ represents the measurement noise, which is assumed to be zero-mean Gaussian distributed with a covariance given by

\[ E[v_k v_k^T ] = R \; .\]

The concrete example that Haber provides is an object being subjected to constant acceleration starting from an unknown initial position and speed.  The object’s motion obeys

\[ x(t) = x_0 + v_0 t + \frac{1}{2} a t^2 =  \left[ \begin{array}{ccc} 1 & t & \frac{1}{2} t^2 \end{array} \right] \left[ \begin{array}{c} x_0 \\ v_0 \\ a \end{array} \right]  \equiv H(t) x \; .\]

The conversion from the continuous process matrix $H(t)$ to the discrete time-sampled $H_k$ is trivially done by noting that the time $t_k = k \Delta t$. 
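A quick sketch of that discretization in Python (the sampling interval and parameter values below are assumed purely for illustration):

```python
import numpy as np

def H_row(t):
    """Measurement row mapping the parameters [x0, v0, a] to position at time t."""
    return np.array([1.0, t, 0.5 * t**2])

dt = 0.1                          # assumed sampling interval
H_3 = H_row(3 * dt)               # H_k at k = 3, i.e. t_3 = 0.3
x = np.array([1.0, 2.0, 0.5])     # illustrative parameters [x0, v0, a]
position = H_3 @ x                # x(t) = x0 + v0*t + (1/2)*a*t^2
```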

Haber claims that a least squares estimation of $x$ can be determined recursively by the process

\[ x_k = x_{k-1} + K_k ( z_k - H_k x_{k-1} ) \; , \]

where $x_k$ is the current estimate of $x$, $z_k$ is the current measurement, and the gain $K_k$ is determined by the optimality condition that the estimation error be minimized.

To find the explicit form of the gain, we first define the error in state estimation as

\[ \epsilon_k = x - x_k \; . \]

Since the parameter set is generically multi-dimensional (say of dimension $n$), we will want to minimize the total error

\[ W = E[(\epsilon_k)_1^2] + E[(\epsilon_k)_2^2]+ \ldots + E[(\epsilon_k)_n^2] \; , \]

where the expectation is over all possible realizations of the random error in the measurement.  If we form the covariance matrix of the error

\[ P_k = E[ \epsilon_k \epsilon_k^T] \; \]

then the total error is the trace of the covariance matrix: $W = Tr(P_k)$.  Taking the derivative of this trace with respect to the gain and setting it equal to zero provides the explicit form for the gain, as follows.
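The identity $W = Tr(P_k)$ is easy to confirm numerically; a small Monte Carlo sketch (sample size and dimension chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.normal(size=(10_000, 3))       # realizations of a 3-component error
P_sample = (eps.T @ eps) / len(eps)      # sample estimate of E[eps eps^T]
W = np.sum(np.mean(eps**2, axis=0))      # sum of per-component mean-square errors
```

The sum of the per-component mean-square errors and the trace of the sample covariance agree exactly, since the diagonal of $E[\epsilon \epsilon^T]$ is precisely the set of per-component mean squares.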

First, re-express the current estimation error in terms of the previous estimate via the update equation

\[ \epsilon_k = x - x_{k-1} - K_k(z_k - H_k x_{k-1}) \;  .\]

Using the measurement equation, the current estimation error becomes

\[ \epsilon_k = x - x_{k-1} - K_k( H_k x + v_k - H_k x_{k-1} ) \; .\]

This last form re-emphasizes that, in the absence of the measurement noise $v_k$, the determination of the unknown parameter set $x$ would be exact, since the quantity in the parentheses would then vanish as the estimate converges to the true parameters.

Regrouping the terms so that the state variables are gathered separately from the noise yields

\[ \epsilon_k = (1 - K_k H_k) (x - x_{k-1}) - K_k v_k \equiv T_k (x - x_{k-1}) - K_k v_k \; . \]

Now it is relatively easy to right-multiply the above expression by its transpose

\[ \epsilon_k \epsilon_k^T = T_k (x - x_{k-1} ) (x - x_{k-1} )^T T_k^T + K_k v_k v_k^T K_k^T \\ - K_k v_k (x - x_{k-1} )^T T_k^T - T_k (x - x_{k-1} ) v_k^T K_k^T \; \]

and take an expectation of it to arrive at the covariance matrix

\[P_k = E[ T_k (x - x_{k-1}) (x - x_{k-1})^T T_k^T ]  + E[K_k v_k v_k ^T K_k^T] \; . \]

The cross-terms between the state and the measurement noise vanish under the assumption that the noise is zero-mean distributed (i.e., unbiased). 

Since $K_k$ and $H_k$ (and hence $T_k$) are deterministic quantities — the only randomness lies in the measurement noise — they can be factored out of the expectations.  Identifying $E[(x - x_{k-1})(x - x_{k-1})^T]$ with the prior covariance $P_{k-1}$ and recalling $E[v_k v_k^T] = R$, we arrive at

\[ P_k = (1 - K_k H_k) P_{k-1} (1 - K_k H_k)^T + K_k R K_k^T \; , \]

which is the Joseph form of covariance propagation relating the covariance matrix at time $t_k$ to one at an earlier time $t_{k-1}$.

The last step is to take the trace and minimize it.  While there are a number of matrix-calculus formulae that engineers and mathematicians favor for this step, it is easier to use index notation, since it frees one from the bother of finding the correct formula and then applying it.

To proceed, expand the Joseph form term by term and define $S_k = H_k P_{k-1} H_k^T + R$, to get

\[ P_k = P_{k-1} - K_k H_k P_{k-1} - P_{k-1} H_k^T K_k^T + K_k S_k K_k^T \; . \]
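As a sanity check, the Joseph form and its term-by-term expansion agree for any gain; a quick numerical comparison with random, purely illustrative matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 2                                   # illustrative dimensions
A = rng.normal(size=(n, n))
P_prev = A @ A.T                              # symmetric positive semidefinite P_{k-1}
R = 0.1 * np.eye(m)                           # measurement noise covariance
H = rng.normal(size=(m, n))                   # measurement matrix H_k
K = rng.normal(size=(n, m))                   # an arbitrary gain K_k
I = np.eye(n)

S = H @ P_prev @ H.T + R                      # S_k = H P H^T + R
joseph = (I - K @ H) @ P_prev @ (I - K @ H).T + K @ R @ K.T
expanded = P_prev - K @ H @ P_prev - P_prev @ H.T @ K.T + K @ S @ K.T
```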

Now, we can suppress the $k$ indices on the right, since we know that every matrix bears a $k$ subscript except the state covariance matrix, which bears a $k-1$ subscript.

The trace is then given by

\[ Tr(P_k) = P_{ii} - K_{ij} H_{j\ell} P_{\ell i} - P_{ij} H_{j\ell}^T K_{\ell i}^T + K_{ij} S_{j \ell} K_{\ell i}^T \; .\]

The least squares minimization is expressed as

\[ \frac{\partial Tr(P_k)}{\partial K_{st}} = 0 \; . \]

The derivative on the left can be expressed as

\[ \frac{\partial Tr(P_k)}{\partial K_{st}} = - H_{t \ell} P_{\ell s} - P_{sj}H_{jt}^T + S_{t \ell} K_{\ell s}^T + K_{sj} S_{jt} \; .\]

Re-arranging the terms and indices (taking account of the transposes) to return to matrix-multiplication form yields

\[ \frac{\partial Tr(P_k)}{\partial K} = - 2 P_{k-1} H_k^T + K_k (S_k + S_k^T) \; , \]

where the $k$ index has been restored once the matrix indices have been suppressed.
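This matrix gradient can be checked against a central finite difference of the trace itself; the dimensions and matrices below are arbitrary illustrations (and since the trace is quadratic in $K$, the central difference is exact up to rounding):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 2                                   # illustrative dimensions
A = rng.normal(size=(n, n))
P = A @ A.T + np.eye(n)                       # symmetric P_{k-1}
R = 0.1 * np.eye(m)
H = rng.normal(size=(m, n))
K = rng.normal(size=(n, m))
S = H @ P @ H.T + R
I = np.eye(n)

def trace_P(Kmat):
    """Tr(P_k) as a function of the gain, computed via the Joseph form."""
    return np.trace((I - Kmat @ H) @ P @ (I - Kmat @ H).T + Kmat @ R @ Kmat.T)

analytic = -2 * P @ H.T + K @ (S + S.T)       # the matrix gradient derived above

step = 1e-6
numeric = np.zeros_like(K)
for s in range(n):
    for t in range(m):
        dK = np.zeros_like(K)
        dK[s, t] = step
        numeric[s, t] = (trace_P(K + dK) - trace_P(K - dK)) / (2 * step)
```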

The last step is to recognize that

\[ S^T = (H P H^T + R)^T = (H^T)^T P^T H^T + R^T = H P H^T + R \; \]

since both the state and noise covariance matrices are symmetric.

Finally, set the derivative equal to zero and solve for $K_k$.  Doing so gives

\[ K_k S_k = P_{k-1} H_k^T \; , \]

which immediately solves to

\[ K_k = P_{k-1} H_k^T S_k^{-1} = P_{k-1} H_k^T \left[ H_k P_{k-1} H_k^T + R \right]^{-1} \; .\]
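Putting the derived gain to work, here is a minimal Python sketch of the full recursion applied to the constant-acceleration example; the parameter values are assumed for illustration, and the measurement stream is kept noiseless so the estimate can be checked cleanly:

```python
import numpy as np

def rls_step(x_est, P, z, h, R):
    """One recursive least squares update with the derived gain
    K_k = P_{k-1} H_k^T [H_k P_{k-1} H_k^T + R]^{-1}."""
    H = h.reshape(1, -1)                     # measurement row as a (1, n) matrix
    S = H @ P @ H.T + R                      # innovation covariance S_k
    K = P @ H.T @ np.linalg.inv(S)           # the gain derived above
    x_new = x_est + K @ (z - H @ x_est)      # state update
    I = np.eye(len(x_est))
    P_new = (I - K @ H) @ P @ (I - K @ H).T + K @ R @ K.T  # Joseph form
    return x_new, P_new

# Recover [x0, v0, a] from position samples of constant-acceleration motion.
true_x = np.array([1.0, 2.0, 0.5])           # assumed true parameters
R = np.array([[0.01]])                       # assumed measurement variance
x_est, P = np.zeros(3), 1e6 * np.eye(3)      # vague prior on the parameters
dt = 0.1
for k in range(1, 51):
    t = k * dt
    h = np.array([1.0, t, 0.5 * t**2])       # H_k for this sample time
    z = h @ true_x                           # noiseless measurement for a clean check
    x_est, P = rls_step(x_est, P, z, h, R)
```

With noise added to $z$, the same loop still converges, just more slowly and to within the statistical uncertainty encoded in $P$.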

Next blog, we’ll explore a numerical example of this method and establish that it does deliver the least squares estimation of the parameter set $x$.