Computing Applications Review articles

An Elementary Introduction to Kalman Filtering

Demystifying the uses of a powerful tool for uncertain information.

By Yan Pei, Swarnendu Biswas, Donald S. Fussell, and Keshav Pingali

Posted Nov 1 2019

Introduction
Key Insights
Formalizing Estimates
Fusing Scalar Estimates
Fusing Vector Estimates
Best Linear Unbiased Estimator (BLUE)
Kalman Filters for Linear Systems
Extension to Nonlinear Systems
Conclusion
Acknowledgments
References
Authors
Footnotes

Kalman filtering is a state estimation technique used in many application areas such as spacecraft navigation, motion planning in robotics, signal processing, and wireless sensor networks because of its ability to extract useful information from noisy data and its small computational and memory requirements.^{12,20,27,28,29} Recent work has used Kalman filtering in controllers for computer systems.^5,13,14,23

Key Insights

This article presents an elementary derivation of Kalman filtering, a classic state estimation technique.
Understanding Kalman filtering is useful for more principled control of computer systems.
Kalman filtering is used as a black box by rnany computer scientists.

Although many introductions to Kalman filtering are available in the literature,^{1,2,3,4,6,7,8,9,10,11,17,21,25,29} they are usually focused on particular applications such as robot motion or state estimation in linear systems, making it difficult to see how to apply Kalman filtering to other problems. Other presentations derive Kalman filtering as an application of Bayesian inference, assuming that noise is Gaussian. This leads to the common misconception that Kalman filtering can be applied only if noise is Gaussian.¹⁵

Abstractly, Kalman filtering can be seen as a particular approach to combining approximations of an unknown value to produce a better approximation. Suppose we use two devices of different designs to measure the temperature of a CPU core. Because devices are usually noisy, the measurements are likely to differ from the actual temperature of the core. As the devices are of different designs, let us assume that noise affects the two devices in unrelated ways (this is formalized here using the notion of correlation). Therefore, the measurements x₁ and x₂ are likely to be different from each other and from the actual core temperature x_c. A natural question is the following: is there a way to combine the information in the noisy measurements x₁ and x₂ to obtain a good approximation of the actual temperature x_c?

One ad hoc solution is to use the formula 0.5*x₁+0.5*x₂ to take the average of the two measurements, giving them equal weight. Formulas of this sort are called linear estimators because they use a weighted sum to fuse values; for our temperature problem, their general form is β*x₁+α*x₂. In this presentation, we use the term estimate to refer to both a noisy measurement and a value computed by an estimator, as both are approximations of unknown values of interest.

Suppose we have additional information about the two devices, say the second one uses more advanced temperature sensing. Because we would have more confidence in the second measurement, it seems reasonable that we should discard the first one, which is equivalent to using the linear estimator 0.0*x₁ + 1.0*x₂. Kalman filtering tells us that in general, this intuitively reasonable linear estimator is not “optimal;” paradoxically, there is useful information even in the measurement from the lower quality device, and the optimal estimator is one in which the weight given to each measurement is proportional to the confidence we have in the device producing that measurement. Only if we have no confidence whatever in the first device should we discard its measurement.

The goal of this article^a is to present the abstract concepts behind Kalman filtering in a way that is accessible to most computer scientists while clarifying the key assumptions, and then show how the problem of state estimation in linear systems can be solved as an application of these general concepts. First, the informal ideas discussed here are formalized using the notions of distributions and random samples from distributions. Confidence in estimates is quantified using the variances and covariances of these distributions.^b Two algorithms are described next. The first one shows how to fuse estimates (such as core temperature measurements) optimally, given a reasonable definition of optimality. The second algorithm addresses a problem that arises frequently in practice: estimates are vectors (for example, the position and velocity of a robot), but only a part of the vector can be measured directly; in such a situation, how can an estimate of the entire vector be obtained from an estimate of just a part of that vector? The best linear unbiased estimator (BLUE) is used to solve this problem.^16,19,26 It is shown that the Kalman filter can be derived in a straightforward way by using these two algorithms to solve the problem of state estimation in linear systems. The extended Kalman filter and unscented Kalman filter, which extended Kalman filtering to nonlinear systems, are described briefly at the end of the article.

Formalizing Estimates

Scalar estimates. To model the behavior of devices producing noisy temperature measurements, we associate each device i with a random variable that has a probability density function (pdf) p_i(x) such as the ones shown in Figure 1 (the x-axis in this figure represents temperature). Random variables need not be Gaussian.^c Obtaining a measurement from device i corresponds to drawing a random sample from the distribution for that device. We write to denote that x_i is a random variable with pdf p_i whose mean and variance are μ_i and , respectively; following convention, we use x_i to represent a random sample from this distribution as well.

Figure 1. Using pdfs to model devices with systematic and random errors. Ground truth is 60°C. Dashed lines are means of pdfs.

Means and variances of distributions model different kinds of inaccuracies in measurements. Device i is said to have a systematic error or bias in its measurements if the mean μ_i of its distribution is not equal to the actual temperature x_c (in general, to the value being estimated, which is known as ground truth); otherwise, the instrument is unbiased. Figure 1 shows pdfs for two devices that have different amounts of systematic error. The variance on the other hand is a measure of the random error in the measurements. The impact of random errors can be mitigated by taking many measurements with a given device and averaging their values, but this approach will not reduce systematic error.

In the formulation of Kalman filtering, it is assumed that measuring devices do not have systematic errors. However, we do not have the luxury of taking many measurements of a given state, so we must take into account the impact of random error on a single measurement. Therefore, confidence in a device is modeled formally by the variance of the distribution associated with that device; the smaller the variance, the higher our confidence in the measurements made by the device. In Figure 1, the fact we have less confidence in the first device has been illustrated by making p₁ more spread out than p₂, giving it a larger variance.

The informal notion that noise should affect the two devices in “unrelated ways” is formalized by requiring that the corresponding random variables be uncorrelated. This is a weaker condition than requiring them to be independent, as explained in our online appendix (http://dl.acm.org/citation.cfm?doid=3363294&picked=formats). Suppose we are given the measurement made by one of the devices (say x₁) and we have to guess what the other measurement (x₂) might be. If knowing x₁ does not give us any new information about what x₂ might be, the random variables are independent. This is expressed formally by the equation p(x₂|x₁) = p(x₂); intuitively, knowing the value of x₁ does not change the pdf for the possible values of x₂. If the random variables are only uncor-related, knowing x₁ might give us new information about x₂ such as restricting its possible values but the mean of x₂|x₁ will still be μ₂. Using expectations, this can be written as E[x₂|x₁] = E[x₂], which is equivalent to requiring that E[(x₁–μ₁)(x₂–μ₂)], the covariance between the two variables, be equal to zero. This is obviously a weaker condition than independence.

Although the discussion in this section has focused on measurements, the same formalization can be used for estimates produced by an estimator. Lemma shows how the mean and variance of a linear combination of pairwise uncorrelated random variables can be computed from the means and variances of the random variables.¹⁸ The mean and variance can be used to quantify bias and random errors for the estimator as in the case of measurements.

An unbiased estimator is one whose mean is equal to the unknown value being estimated and it is preferable to a biased estimator with the same variance. Only unbiased estimators are considered in this article. Furthermore, an unbiased estimator with a smaller variance is preferable to one with a larger variance as we would have more confidence in the estimates it produces. As a step toward generalizing this discussion to estimators that produce vector estimates, we refer to the variance of an unbiased scalar estimator as the mean square error of that estimator or MSE for short.

Lemma 1(ii) asserts that if a random variable is pairwise uncorrelated with a set of random variables, it is uncor-related with any linear combination of those variables.

Lemma 1. Let be a set of pairwise uncorrelated random variables. Let be a random variable that is a linear combination of the x_i‘s.

The mean and variance of y are:

ii. If random variable x_n+1 is pair-wise uncorrelated with x₁,…,x_n, it is uncorrelated with y.

Vector estimates. In some applications, estimates are vectors. For example, the state of a mobile robot might be represented by a vector containing its position and velocity. Similarly, the vital signs of a person might be represented by a vector containing his temperature, pulse rate, and blood pressure. Here, we denote a vector by a boldfaced lowercase letter, and a matrix by an uppercase letter.

The covariance matrix Σ_xx of a random variable x is the matrix E[(x – μ_x) (x – μ_x)^T], where μ_x is the mean of x. Intuitively, entry (i,j) of this matrix is the covariance between the i and j components of vector x; in particular, entry (i,i) is the variance of the i^th component of x. A random variable x with a pdf p whose mean is μ_x and covariance matrix is Σ_xx is written as x~p(μ_x, Σ_xx). The inverse of the covariance matrix is called the precision or information matrix.

Uncorrelated random variables. The cross-covariance matrix Σ_vw of two random variables v and w is the matrix E[(v–μ_v)(w–μ_w)^T]. Intuitively, element (i,j) of this matrix is the covariance between elements v(i) and w(j). If the random variables are uncorrelated, all entries in this matrix are zero, which is equivalent to saying that every component of v is uncorrelated with every component of w. Lemma 2 generalizes Lemma 1.

Lemma 2. Let x₁~p₁(μ₁, Σ₁), …, x_n~p_n(μ_n, Σ_n) be a set of pairwise uncorrelated random variables of length m. Let .

The mean and covariance matrix of y are the following:

ii. If random variable x_n+1 is pairwise uncorrelated with x₁, .., x_n, it is uncorrelated with y.

The MSE of an unbiased estimator y is E[(y–μ_y)^T(y–μ_y)], which is the sum of the variances of the components of y; if y has length 1, this reduces to variance as expected. The MSE is also the sum of the diagonal elements of Σ_yy (this is called the trace of Σ_yy).

Fusing Scalar Estimates

We now consider the problem of choosing the optimal values of the parameters α and β in the linear estimator β*x₁ + α*x₂ for fusing two estimates x₁ and x₂ from uncorrelated scalar-valued random variables.

An unbiased estimator is one whose mean is equal to the unknown value being estimated and it is preferable to a biased estimator with the same variance.

The first reasonable requirement is that if the two estimates x₁ and x₂ are equal, fusing them should produce the same value. This implies that α+β =1. Therefore, the linear estimators of interest are of the form

If x₁ and x₂ in Equation 5 are considered to be unbiased estimators of some quantity of interest, then y_α is an unbiased estimator for any value of α. How should optimality of such an estimator be defined? One reasonable definition is that the optimal value of α minimizes the variance of y_α as this will produce the highest-confidence fused estimates.

Theorem 1. Let and be uncorrelated random variables. Consider the linear estimator y_α(x₁,x₂)=(1-α)*x₁+α*x₂. The variance of the estimator is minimized for .

The proof is straightforward and is given in the online appendix. The variance (MSE) of y_α can be determined from Lemma 1:

Setting the derivative of with respect to α to zero and solving the resulting equation yield the required result.

In the literature, the optimal value of α is called the Kalman gain K. Substituting K into the linear fusion model, we get the optimal linear estimator y(x₁, x₂):

As a step toward fusion of n>2 estimates, it is useful to rewrite this as follows:

Substituting the optimal value of α into Equation 6, we get

The expressions for y and are complicated because they contain the reciprocals of variances. If we let v₁ and v₂ denote the precisions of the two distributions, the expressions for y and v_y can be written more simply as follows:

These results say the weight we should give to an estimate is proportional to the confidence we have in that estimate, and that we have more confidence in the fused estimate than in the individual estimates, which is intuitively reasonable. To use these results, we need only the variances of the distributions. In particular, the pdfs p_i, which are usually not available in applications, are not needed, and the proof of Theorem 1 does not require these pdfs to have the same mean.

Kalman filtering can be seen as a particular approach to combining approximations of an unknown value to produce a better approximation.

Fusing multiple scalar estimates. These results can be generalized to optimally fuse multiple pairwise uncorrelated estimates x₁, x₂, …, x_n. Let y_n,α(x₁, .., x_n) denote the linear estimator for fusing the n estimates given parameters α₁, .., α_n, which we denote by α (the notation y_α(x₁, x₂) introduced previously can be considered to be an abbreviation of y_2,α(x₁, x₂)).

Theorem 2. Let for (1≤i≤n) be a set of pairwise uncorrelated random variables. Consider the linear estimator where . The variance of the estimator is minimized for

The minimal variance is given by the following expression:

As before, these expressions are more intuitive if the variance is replaced with precision: the contribution of x_i to the value of y_n(x₁, .., x_n) is proportional to x_i‘s confidence.

Equations 13 and 14 generalize Equations 10 and 11.

Incremental fusing is optimal. In many applications, the estimates x₁, x₂, …, x_n become available successively over a period of time. Although it is possible to store all the estimates and use Equations 13 and 14 to fuse all the estimates from scratch whenever a new estimate becomes available, it is possible to save both time and storage if one can do this fusion incrementally. We show that just as a sequence of numbers can be added by keeping a running sum and adding the numbers to this running sum one at a time, a sequence of n>2 estimates can be fused by keeping a “running estimate” and fusing estimates from the sequence one at a time into this running estimate without any loss in the quality of the final estimate. In short, we want to show that y_n(x₁,..,x_n)=y₂(y₂(..y₂(x₁,x₂)…),x_n). A little bit of algebra shows that if n>2, Equations 13 and 14 for the optimal linear estimator and its precision can be expressed as shown in Equations 15 and 16.

This shows that y_n(x₁, ..,x_n) = y₂(y_n-1 (x₁, .., x_n-1), x_n). Using this argument recursively gives the required result.^d

To make the connection to Kalman filtering, it is useful to derive the same result using a pictorial argument. Figure 2 shows the process of incrementally fusing the n estimates. In this picture, time progresses from left to right, the precision of each estimate is shown in parentheses next to it, and the weights on the edges are the weights from Equation 10. The contribution of each x_i to the final value y₂(y₂(…), x_n) is given by the product of the weights on the path from x_i to the final value, and this product is obviously equal to the weight of x_i in Equation 13, showing that incremental fusion is optimal.

Figure 2. Dataflow graph for incremental fusion.

Summary. The results in this section can be summarized informally as follows. When using a linear estimator to fuse uncertain scalar estimates, the weight given to each estimate should be inversely proportional to the variance of the random variable from which that estimate is obtained. Furthermore, when fusing n>2 estimates, estimates can be fused incrementally without any loss in the quality of the final result. These results are often expressed formally in terms of the Kalman gain K, as shown in Figure 3; the equations can be applied recursively to fuse multiple estimates. Note that if v₁≫v₂, K≈0 and y(x₁,x₂)≈x₁; conversely if v₁≪v₂, K≈1 and y(x₁,x₂)≈x₂.

Figure 3. Optimal fusion of scalar estimates.

Fusing Vector Estimates

The results for fusing scalar estimates can be extended to vectors by replacing variances with covariance matrices.

For vectors, the linear estimator is where . Here A stands for the matrix parameters (A₁, …, A_n). All the vectors (x_i) are assumed to be of the same length. To simplify notation, we omit the subscript n in y_n,A in the discussion here as it is obvious from the context.

Optimality. The parameters A₁, …, A_n in the linear data fusion model are chosen to minimize MSE(y_A) which is E[(y_A – μ_yA)^T(y_A – μ_yA)].

Theorem 3 generalizes Theorem 2 to the vector case. The proof of this theorem is given in the appendix. Comparing Theorems 2 and 3, we see that the expressions are similar, the main difference being that the role of variance in the scalar case is played by the covariance matrix in the vector case.

Theorem 3. Let x_i∼p_i(μ_i, ∑_i) for (1≤i≤n) be a set of pairwise uncorrelated random variables. Consider the linear estimator where . The value of MSE(y_A) is minimized for

Therefore the optimal estimator is

The covariance matrix of y can be computed by using Lemma 2.

In the vector case, precision is the inverse of a covariance matrix, denoted by N. Equations 26–27 use precision to express the optimal estimator and its variance and generalize Equations 13–14 to the vector case.

As in the scalar case, fusion of n>2 vector estimates can be done incrementally without loss of precision. The proof is similar to the scalar case and is omitted.

There are several equivalent expressions for the Kalman gain for the fusion of two estimates. The following one, which is easily derived from Equation 23, is the vector analog of Equation 17:

The covariance matrix of the optimal estimator y(x₁, x₂) is the following.

Summary. The results in this section can be summarized in terms of the Kalman gain K as shown in Figure 4.

Figure 4. Optimal fusion of vector estimates.

Best Linear Unbiased Estimator (BLUE)

In some applications, the state of the system is represented by a vector but only part of the state can be measured directly, so it is necessary to estimate the hidden portion of the state corresponding to a measured value of the visible state. This section describes an estimator called the best linear unbiased estimator (BLUE)^16,19,26 for doing this.

Consider the general problem of determining a value for vector y given a value for a vector x. If there is a functional relationship between x and y (say y=F(x) and F is given), it is easy to compute y given a value for x (say x₁).

In our context, however, x and y are random variables, so such a precise functional relationship will not hold. Figure 5 shows an example in which x and y are scalar-valued random variables. The gray ellipse in this figure, called a confidence ellipse, is a projection of the joint distribution of x and y onto the (x, y) plane that shows where some large proportion of the (x, y) values are likely to be. Suppose x takes the value x₁. Even within the confidence ellipse, there are many points (x₁, y), so we cannot associate a single value of y with x₁. One possibility is to compute the mean of the y values associated with x₁ (that is, the expectation E[y|x=x₁]) and return this as the estimate for y if x=x₁. This requires knowing the joint distribution of x and y, which may not always be available.

Figure 5. BLUE line corresponding to Equation (31).

In some problems, we can assume that there is an unknown linear relationship between x and y and that uncertainty comes from noise. Therefore, we can use a technique similar to the ordinary least squares (OLS) method to estimate this linear relationship, and use it to return the best estimate of y for any given value of x. In Figure 5, we see that although there are many points (x₁, y), the y values are clustered around the line as shown in the figure so the value ŷ₁ is a reasonable estimate for the value of y corresponding to x₁. This line, called the best linear unbiased estimator (BLUE), is the analog of ordinary least squares (OLS) for distributions.

Computing BLUE. Consider the estimator ŷ_A,b(x) = Ax + b. We choose A and b so that this is an unbiased estimator with minimal MSE. The “∧” over the y is notation that indicates that we are computing an estimate for y.

Theorem 4. Let

The estimator ŷ_A,b(x) = Ax + b. for estimating the value of y for a given value of x is an unbiased estimator with minimal MSE if

The proof of Theorem 4 is straight forward. For an unbiased estimator, E[ŷ] = E[y]. This implies that b=μ_y–A(μ_x) so an unbiased estimator is of the form ŷ_A(x) = μ_y + A[x – μ_x). Note this is equivalent to asserting the BLUE line must pass through the point (μ_x, μ_y). Setting the derivative of MSE_A(ŷ_A) with respect to A to zero²² and solving for A, we find that the best linear unbiased estimator is

This equation can be understood intuitively as follows. If we have no information about x and y, the best we can do is the estimate (μ_x, μ_y), which lies on the BLUE line. However, if we know that x has a particular value x₁, we can use the correlation between y and x to estimate a better value for y from the difference (x₁ – μ_x). Note that if ∑_yx = 0 (that is, x and y are uncorrelated), the best estimate of y is just μ_y, so knowing the value of x does not give us any additional information about y as one would expect. In Figure 5, this corresponds to the case when the BLUE line is parallel to the x-axis. At the other extreme, suppose that y and x are functionally related so y = Cx. In that case, it is easy to see that Σ_yx = CΣ_xx, so ŷ = Cx as expected. In Figure 5, this corresponds to the case when the confidence ellipse shrinks down to the BLUE line.

Equation 31 is a generalization of ordinary least squares in the sense that if we compute the relevant means and variances of a set of discrete data (x_i, y_i) and substitute into Equation 31, we get the same line that is obtained by using OLS.

Kalman Filters for Linear Systems

We now apply the algorithms for optimal fusion of vector estimates (Figure 4) and the BLUE estimator (Theorem 4) to the particular problem of state estimation in linear systems, which is the classical application of Kalman filtering.

Figure 6a shows how the evolution of the state of such a system over time can be computed if the initial state x₀ and the model of the system dynamics are known precisely. Time advances in discrete steps. The state of the system at any time step is a function of the state of the system at the previous time step and the control inputs applied to the system during that interval. This is usually expressed by an equation of the form x_t = f_t(x_t–1, u_t) where u_t is the control input. The function f_t is nonlinear in the general case, and can be different for different steps. If the system is linear, the relation for state evolution over time can be written as x_t = F_tx_t-1 + B_tu_t, where F_t and B_t are time-dependent matrices that can be determined from the physics of the system. Therefore, if the initial state x₀ is known exactly and the system dynamics are modeled perfectly by the F_t and B_t matrices, the evolution of the state over time can be computed precisely as shown in Figure 6a.

Figure 6. State estimation using Kalman filtering.

In general, however, we may not know the initial state exactly, and the system dynamics and control inputs may not be known precisely. These inaccuracies may cause the state computed by the model to diverge unacceptably from the actual state over time. To avoid this, we can make measurements of the state after each time step. If these measurements were exact, there would of course be no need to model the system dynamics. However, in general, the measurements themselves are imprecise.

Kalman filtering was invented to solve the problem of state estimation in such systems. Figure 6b shows the dataflow of the computation, and we use it to introduce standard terminology. An estimate of the initial state, denoted by , is assumed to be available. At each time step t=1, 2, .., the system model is used to provide an estimate of the state at time t using information from time t–1. This step is called prediction and the estimate that it provides is called the a priori estimate and denoted by . The a priori estimate is then fused with z_t, the state estimate obtained from the measurement at time t, and the result is the a posteriori state estimate at time t, denoted by . This a posteriori estimate is used by the model to produce the a priori estimate for the next time step and so on. As described here, the a priori and a posteriori estimates are the means of certain random variables; the covariance matrices of these random variables are shown within parentheses each estimate in Figure 6b, and these are used to weight estimates when fusing them.

We first present the state evolution model and a priori state estimation. Then we discuss how state estimates are fused if an estimate of the entire state can be obtained by measurement. Finally, we discuss how to address this problem when only a portion of the state can be measured directly.

State evolution model and prediction. The evolution of the state over time is described by a series of random variables x₀, x₁, x₂,…

The random variable x₀ captures the likelihood of different initial states.
The random variables at successive time steps are related by the following linear model:

Here, u_t is the control input, which is assumed to be deterministic, and w_t is a zero-mean noise term that models all the uncertainty in the system. The covariance matrix of w_t is denoted by Q_t, and the noise terms in different time steps are assumed to be uncorrelated to each other (such as, E[w_iw_j]=0 if i≠j) and to x₀.

For estimation, we have a random variable x_0|0 that captures our belief about the likelihood of different states at time t=0, and two random variables x_t|t–1 and x_t|t at each time step t = 1, 2, … that capture our beliefs about the likelihood of different states at time t before and after fusion with the measurement, respectively. The mean and covariance matrix of a random variable x_i|j are denoted by . and Σ_i|j, respectively. We assume (no bias).

Prediction essentially uses x_t–1|t–1 as a proxy for x_t–1 in Equation 32 to determine x_t|t–1 as shown in Equation 33.

For state estimation, we need only the mean and covariance matrix of x_t|t–1. The predictor box in Figure 6 computes these values; the covariance matrix is obtained from Lemma 2 under the assumption that w_t is uncorrelated with x_t–1|t–1, which is justified here.

Fusing complete observations of the state. If the entire state can be measured at each time step, the imprecise measurement at time t is modeled as follows:

where v_t is a zero-mean noise term with covariance matrix R_t. The noise terms in different time steps are assumed to be uncorrelated with each other (such as, E[v_iv_j] is zero if i≠j) as well as with x_0|0 and all w_k. A subtle point here is that x_t in this equation is the actual state of the system at time t (that is, a particular realization of the random variable x_t), so variability in z_t comes only from v_t and its covariance matrix R_t.

Therefore, we have two imprecise estimates for the state at each time step t = 1, 2, …, the a priori estimate from the predictor and the one from the measurement (z_t). If z_t is uncorrelated to x_t|t–1, we can use Equations 20–22 to fuse the estimates as shown in Figure 6c.

The assumptions that (i) x_t–1|t–1 is uncorrelated with w_t, which is used in prediction, and (ii) x_t|t–1 is uncorrelated with z_t, which is used in fusion, are easily proved to be correct by induction on t, using Lemma 2(ii). Figure 6b gives the intuition: x_t|t–1 for example is an affine function of the random variables x_0|0, w₁, v₁, w₂, v₂, …, w_t, and is therefore uncorrelated with v_t (by assumption about v_t and Lemma 2(ii)) and hence with z_t.

Figure 7 shows the computation pictorially using confidence ellipses to illustrate uncertainty. The dotted arrows at the bottom of the figure show the evolution of the state, and the solid arrows show the computation of the a priori estimates and their fusion with measurements.

Figure 7. Illustration of Kalman filtering.

Fusing partial observations of the state. In some problems, only a portion of the state can be measured directly. The observable portion of the state is specified formally using a full row-rank matrix H_t called the observation matrix: if the state is x, what is observable is H_tx. For example, if the state vector has two components and only the first component is observable, H_t can be [1 0]. In general, the H_t matrix can specify a linear relationship between the state and the observation, and it can be time-dependent. The imprecise measurement model introduced in Equation 34 becomes:

The hidden portion of the state can be specified using a matrix C_t, which is an orthogonal complement of H_t. For example, if H_t = [1 0], one choice for C_t is [0 1].

Figure 6d shows the computation for this case. The fusion phase can be understood intuitively in terms of the following steps.

The observable part of the a priori estimate of the state is fused with the measurement (z_t), using Equations 20–22. The quantity is called the innovation. The result is the a posteriori estimate of the observable state .
The BLUE of Theorem 4 is used to obtain the a posteriori estimate of the hidden state by adding to the a priori estimate of the hidden state a value obtained from the product of the covariance between H_tx_t|t–1 and C_ix_t|t–1 and the difference between and
The a posteriori estimates of the observable and hidden portions of the state are composed to produce the a posteriori estimate of the entire state .

The actual implementation produces the final result directly without going through these steps as shown in Figure 6d, but these incremental steps are useful for understanding how all this works, and their implementation is shown in more detail in Figure 8.

Figure 8. Computation of a posteriori estimate.

Figure 6d puts all this together. In the literature, this dataflow is referred to as Kalman filtering. Unlike in Equations 18 and 21, the Kalman gain is not a dimensionless value here. If H_t = I, the computations in Figure 6d reduce to those of Figure 6c as expected.

Equation 39 shows that the a posteriori state estimate is a linear combination of the a priori state estimate and the measurement (z_t). The optimality of this linear unbiased estimator is shown in the Appendix. It was shown earlier that incremental fusion of scalar estimates is optimal. The dataflow of Figures 6(c,d) computes the a posteriori state estimate at time t by incrementally fusing measurements from the previous time steps, and this incremental fusion can be shown to be optimal using a similar argument.

Example: falling body. To demonstrate the effectiveness of the Kalman filter, we consider an example in which an object falls from the origin at time t–0 with an initial speed of 0 m/s and an expected constant acceleration of 9.8 m/s² due to gravity. Note that acceleration in reality may not be constant due to factors such as wind, and air friction.

We have shown that Kalman filtering for state estimation in linear systems can be derived from two elementary ideas: optimal linear estimators for fusing uncorrelated estimates and best linear unbiased estimators for correlated variables.

The state vector of the object contains two components, one for the distance from the origin s(t) and one for the velocity v(t). We assume that only the velocity state can be measured at each time step. If time is discretized in steps of 0.25 seconds, the difference equation for the dynamics of the system is easily shown to be the following:

where we assume and .

The gray lines in Figure 9 show the evolution of velocity and distance with time according to this model. Because of uncertainty in modeling the system dynamics, the actual evolution of the velocity and position will be different in practice. The red lines in Figure 9 show one trajectory for this evolution, corresponding to a Gaussian noise term with covariance in Equation 32 (because this noise term is random, there are many trajectories for the evolution, and we are just showing one of them). The red lines correspond to “ground truth” in our example.

Figure 9. Estimates of the object’s state over time.

The green points in Figure 9b show the noisy measurements of velocity at different time steps, assuming the noise is modeled by a Gaussian with variance 8. The blue lines show the a posteriori estimates of the velocity and position. It can be seen that the a posteriori estimates track the ground truth quite well even when the ideal system model (the gray lines) is inaccurate and the measurements are noisy. The cyan bars in the right figure show the variance of the velocity at different time steps. Although the initial variance is quite large, application of Kalman filtering is able to reduce it rapidly in few time steps.

Discussion. We have shown that Kalman filtering for state estimation in linear systems can be derived from two elementary ideas: optimal linear estimators for fusing uncorrelated estimates and best linear unbiased estimators for correlated variables. This is a different approach to the subject than the standard presentations in the literature. One standard approach is to use Bayesian inference. The other approach is to assume that the a posteriori state estimator is a linear combination of the form , and then find the values of A_t and B_t that produce an unbiased estimator with minimum MSE. We believe that the advantage of the presentation given here is that it exposes the concepts and assumptions that underlie Kalman filtering.

Most presentations in the literature also begin by assuming that the noise terms w_t in the state evolution equation and v_t in the measurement are Gaussian. Although some presentations^1,10 use properties of Gaussians to derive the results in Figure 3, these results do not depend on distributions being Gaussians. Gaussians however enter the picture in a deeper way if one considers nonlinear estimators. It can be shown that if the noise terms are not Gaussian, there may be nonlinear estimators whose MSE is lower than that of the linear estimator presented in Figure 6d. However, if the noise is Gaussian, this linear estimator is as good as any unbiased nonlinear estimator (that is, the linear estimator is a minimum variance unbiased estimator (MVUE)). This result is proved using the Cramer-Rao lower bound.²⁴

Extension to Nonlinear Systems

The extended Kalman filter (EKF) and unscented Kalman filter (UKF) are heuristic approaches to using Kalman filtering for nonlinear systems. The state evolution and measurement equations for nonlinear systems with additive noise can be written as follows; in these equations, f and h are nonlinear functions.

Intuitively, the EKF constructs linear approximations to the nonlinear functions f and h and applies the Kalman filter equations, whereas the UKF constructs approximations to probability distributions and propagates these through the nonlinear functions to construct approximations to the posterior distributions.

EKF. Examining Figure 6d, we see that the a priori state estimate in the predictor can be computed using the system model: . However, as the system dynamics and measurement equations are nonlinear, it is not clear how to compute the co-variance matrices for the a priori estimate and the measurement. In the EKF, these matrices are computed by linearizing Equations 42 and 43 using the Taylor series expansions for the nonlinear functions f and h. This requires computing the following Jacobians^e, which play the role of F_t and H_t in Figure 6d.

The EKF performs well in some applications such as navigation systems and GPS.²⁸

UKF. When the system dynamics and observation models are highly nonlinear, the unscented Kalman filter (UKF)¹⁵ can be an improvement over the EKF. The UKF is based on the unscented transformation, which is a method for computing the statistics of a random variable x that undergoes a nonlinear transformation (y = g(x)). The random variable x is sampled using a carefully chosen set of sigma points and these sample points are propagated through the nonlinear function g. The statistics of y are estimated using a weighted sample mean and covariance of the posterior sigma points. The UKF tends to be more robust and accurate than the EKF but has higher computation overhead due to the sampling process.

Conclusion

In this article, we have shown that two concepts—optimal linear estimators for fusing uncorrelated estimates and best linear unbiased estimators for correlated variables—provide the underpinnings for Kalman filtering. By combining these ideas, standard results on Kalman filtering for linear systems can be derived in an intuitive and straightforward way that is simpler than other presentations of this material in the literature. This approach makes clear the assumptions that underlie the optimality results associated with Kalman filtering and should make it easier to apply Kalman filtering to problems in computer systems.

Acknowledgments

This research was supported by NSF grants 1337281, 1406355, and 1618425, and by DARPA contracts FA8750-16-2-0004 and FA8650-15-C-7563. We are indebted to K. Mani Chandy (Caltech), Ivo Babuska (UT Austin,) and Augusto Ferrante (Padova, Italy) for their valuable feedback.

Submit an Article to CACM

CACM welcomes unsolicited submissions on topics of relevance and value to the computing community.

You Just Read

An Elementary Introduction to Kalman Filtering

View in the ACM Digital Library

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI

10.1145/3363294

November 2019 Issue

Published: November 1, 2019

Vol. 62 No. 11

Pages: 122-133

Table of Contents

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Explore More

BLOG@CACM Dec 20 2024

Strengthening Security Throughout the ML/AI Lifecycle

Alex Vakulov

Artificial Intelligence and Machine Learning

News Dec 18 2024

iBuyers, AI, and Real Estate

Gregory Goth

Architecture and Hardware

BLOG@CACM Dec 17 2024

Zero-Trust Security in Software Development

Harikrishna Kundariya

Computing Profession

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More