This answer consists of an introductory section I wrote recently for a paper describing a (modest) spatio-temporal extension of "Universal Kriging" (UK), which itself is a modest generalization of "Ordinary Kriging." It has three sub-sections: Theory gives a statistical model and assumptions; Estimation briefly reviews least-squares parameter estimation; and Prediction shows how kriging fits into the Generalized Least Squares (GLS) framework. I have made an effort to adopt notation familiar to statisticians, especially visitors to this site, and to use concepts that are well-explained here.
To summarize, kriging is the Best Linear Unbiased Prediction (BLUP) of a random field. What this means is that the predicted value at any unsampled location is obtained as a linear combination of the values and covariates observed at sampled locations. The (unknown, random) value there has an assumed correlation with the sample values (and the sample values are correlated among themselves). This correlation information is readily translated into the variance of the prediction. One chooses coefficients in the linear combination (the "kriging weights") that make this variance as small as possible, subject to a condition of zero bias in the prediction. The details follow.
Theory
UK comprises two procedures – one of estimation and the other of prediction – carried out in the context of a GLS model for a study area. The GLS model supposes that the sample data $z_i,\ (i = 1, 2, ..., n)$ are the result of random deviations around a trend and that those deviations are correlated. A trend is meant in the general sense of a value that can be determined by a linear combination of $p$ unknown coefficients (parameters) $\beta=(\beta_1,\beta_2,\ldots,\beta_p)^\prime$. (Throughout this post, the prime $^\prime$ denotes matrix transpose and all vectors are considered column vectors.)
At any location within a study area there is available a tuple of numerical attributes $\mathbf y = (y_1, y_2, \ldots, y_p)^\prime$ termed “independent variables” or “covariates.” (Typically $y_1 = 1$ is a “constant term,” $y_2$ and $y_3$ may be spatial coordinates, and the remaining components of $\mathbf y$ may represent spatial information as well as other ancillary information that is available at all locations in the study area, such as porosity of an aquifer or distance to a pumping well.) At each data location $i$, in addition to its covariates $y_i = (y_{i1}, y_{i2}, \ldots, y_{ip})^\prime$, the associated observation $z_i$ is considered to be a realization of a random variable $Z_i$. In contrast, the $y_i$ are thought of as values determined by or characterizing the points or small regions represented by the observations (the data “supports”). The $y_i$ are not considered to be realizations of random variables and are required to be unrelated to the properties of any of the $Z_i$.
The linear combination
$$
\mathbf{E}\left[ Z_i \right] = \mathbf{y}_i^\prime \beta = y_{i1} \beta_1 + y_{i2} \beta_2 + \cdots + y_{ip} \beta_p \tag{1}
$$
expresses the expected value of $Z_i$ in terms of the parameters $\beta$; this expectation is the value of the trend at location $i$. The estimation process uses the data to find estimates $\hat\beta_i$ of the unknown parameters $\beta_i$, whereas the prediction process uses the data at locations $i = 1, 2, \ldots, n$ to compute a value at an unsampled location, here indexed as $i = 0$. The targets of estimation are fixed (i.e., non-random) parameters, whereas the target of prediction is random, because the value $z_0$ includes a random fluctuation around its trend $y_0^\prime\beta$. Typically, predictions are made for multiple locations using the same data by varying location $0$. For example, predictions are often made to map out a surface along a regular grid of points suitable for contouring.
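To make the notation concrete, here is a minimal numerical sketch (not part of the paper) showing how the covariates assemble into a design matrix and determine the trend. The choice of a constant term plus the two spatial coordinates, and all numerical values, are purely illustrative assumptions:

```python
import numpy as np

# Purely illustrative: n sample locations in the plane, with covariates
# y = (1, first coordinate, second coordinate)', so p = 3.
rng = np.random.default_rng(0)
n, p = 10, 3
coords = rng.uniform(0, 100, size=(n, 2))      # the n data locations
Y = np.column_stack([np.ones(n), coords])      # n-by-p design matrix of covariates
beta = np.array([5.0, 0.02, -0.01])            # hypothetical trend coefficients
trend = Y @ beta                               # E[Z_i] = y_i' beta at each data location
```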
Estimation
Classical kriging assumes the random fluctuations of the $Z_i$ around the trend have expected values of zero and that their covariances are known. Write the covariance between $Z_i$ and $Z_j$ as $c_{ij}$. Using these covariances, estimation is performed with GLS. Its solution is the following:
$$
\hat\beta = \mathbf{H}\mathbf{z},\quad \mathbf{H} = \left( \mathbf{Y}^\prime \mathbf{C}^{-1} \mathbf{Y} \right)^{-1} \mathbf{Y}^\prime \mathbf{C}^{-1}
$$
where ${\bf z} = (z_1, z_2, \ldots, z_n)^\prime$ is the $n$-vector of the observations, ${\bf Y} = (y_{ij})$ (the “design matrix”) is the $n$ by $p$ matrix whose rows are the vectors $y_i^\prime,\ 1 \le i \le n$, and $\mathbf C = (c_{ij})$ is the $n$ by $n$ covariance matrix, which is assumed to be invertible (Draper & Smith (1981), section 2.11). The $p$ by $n$ matrix $\mathbf H$, which projects the data $\mathbf z$ onto the parameter estimates $\hat \beta$, is called the “hat matrix.” Writing $\hat\beta$ as the hat matrix applied to the data explicitly shows how the parameter estimates depend linearly on the data. The covariances $\mathbf C = (c_{ij})$ are classically computed using a variogram, which gives the covariance in terms of the data locations, although it is immaterial how the covariance is actually calculated.
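The following sketch (again illustrative only, not from the paper) carries the same synthetic setup through the GLS formula above. For the sake of having a concrete, invertible $\mathbf C$, it assumes an exponential covariance model $c(h) = \sigma^2 e^{-h/r}$; the names `sill` and `range_` and all numerical values are assumptions of the example:

```python
import numpy as np

# Synthetic setup (assumed, for illustration only)
rng = np.random.default_rng(0)
n, p = 10, 3
coords = rng.uniform(0, 100, size=(n, 2))
Y = np.column_stack([np.ones(n), coords])                  # n-by-p design matrix

# Assumed exponential covariance model: c(h) = sill * exp(-h / range_)
sill, range_ = 2.0, 30.0
h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
C = sill * np.exp(-h / range_)                             # n-by-n covariance matrix

# Fake observations consistent with the model (trend plus correlated deviations)
z = rng.multivariate_normal(Y @ np.array([5.0, 0.02, -0.01]), C)

# GLS estimation: beta_hat = H z with H = (Y' C^{-1} Y)^{-1} Y' C^{-1}
Cinv_Y = np.linalg.solve(C, Y)                             # computes C^{-1} Y
H = np.linalg.solve(Y.T @ Cinv_Y, Cinv_Y.T)                # the p-by-n hat matrix
beta_hat = H @ z                                           # linear in the data z
```

Note that $\mathbf C^{-1}$ is never formed explicitly; `np.linalg.solve` is used instead, which is the usual, numerically safer way to apply the GLS formulas.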
Prediction
UK similarly predicts $z_0$ by means of a linear combination of the data
$$
\hat z_0 = \lambda_1 z_1 + \lambda_2 z_2 + \cdots + \lambda_n z_n = \lambda^\prime \mathbf{z}.
$$
The $\lambda_i$ are called the “kriging weights” for the prediction of $z_0$. UK accomplishes this prediction of $z_0$ by meeting two criteria. First, the prediction should be unbiased, which is expressed by requiring that the linear combination of the random variables $Z_i$ equals $Z_0$ on average:
$$
0 = \mathbf{E}\left[ \hat Z_0 - Z_0 \right] = \mathbf{E}\left[ \lambda^\prime \mathbf{Z} - Z_0 \right].
$$
This expectation is taken over the joint $(n+1)$-variate distribution of $Z_0$ and $\mathbf Z = (Z_1, Z_2, \ldots, Z_n)^\prime$. Linearity of expectation together with the trend assumption (1) implies:
$$
\begin{aligned}
0 &= \mathbf{E}\left[ \lambda^\prime \mathbf{Z} - Z_0 \right] = \lambda^\prime \mathbf{E}\left[ \mathbf{Z} \right] - \mathbf{E}\left[ Z_0 \right] = \lambda^\prime \left( \mathbf{Y}\beta \right) - \mathbf{y}_0^\prime \beta = \left( \lambda^\prime \mathbf{Y} - \mathbf{y}_0^\prime \right)\beta \\
&= \beta^\prime \left( \mathbf{Y}^\prime \lambda - \mathbf{y}_0 \right)
\end{aligned}
$$
no matter what $\beta$ may be. This will be the case provided that
$$\mathbf{Y}^\prime \lambda = \mathbf{y}_0.$$
Among all the possible solutions of this underdetermined system of equations, UK chooses $\lambda$ to minimize the variance of the prediction error $\hat Z_0 - Z_0$. In this sense, UK is “best” among all unbiased linear predictors. Because this last relationship implies the prediction error is zero on average, the variance is simply the expectation of the squared prediction error:
$$
\operatorname{Var}\left( \hat Z_0 - Z_0 \right) = \mathbf{E}\left[ \left( \hat Z_0 - Z_0 \right)^2 \right] = \mathbf{E}\left[ \left( \lambda^\prime \mathbf{Z} - Z_0 \right)^2 \right] = c_{00} - 2\lambda^\prime \mathbf{c}_0 + \lambda^\prime \mathbf{C} \lambda
$$
where $\mathbf c_0 = (c_{01}, c_{02}, \ldots, c_{0n})^\prime$ is the vector of covariances between $Z_0$ and the $Z_i,\ i \ge 1$, and $c_{00}$ is the variance of $Z_0$.
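In case the last equality is not immediately obvious, it is just the bilinearity of covariance applied to the (zero-mean) prediction error:

$$
\operatorname{Var}\left( \lambda^\prime \mathbf{Z} - Z_0 \right) = \lambda^\prime \operatorname{Var}\left( \mathbf{Z} \right) \lambda - 2\operatorname{Cov}\left( \lambda^\prime \mathbf{Z}, Z_0 \right) + \operatorname{Var}\left( Z_0 \right) = \lambda^\prime \mathbf{C} \lambda - 2\lambda^\prime \mathbf{c}_0 + c_{00}.
$$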
To minimize the variance, introduce a vector of $p$ Lagrange multipliers $\mu$ to enforce the constraint $\mathbf{Y}^\prime \lambda = \mathbf{y}_0$ and differentiate the objective $c_{00} - 2\lambda^\prime\mathbf{c}_0 + \lambda^\prime\mathbf{C}\lambda + 2\mu^\prime\left(\mathbf{Y}^\prime\lambda - \mathbf{y}_0\right)$ with respect to $\lambda$. Setting this derivative to zero gives $\mathbf{C}\lambda + \mathbf{Y}\mu = \mathbf{c}_0$, which together with the constraint itself yields a system of $n+p$ linear equations, written in block-matrix form as
$$
\begin{pmatrix}
\mathbf{C} & \mathbf{Y} \\
\mathbf{Y}^\prime & \mathbf{0}
\end{pmatrix}
\begin{pmatrix}
\lambda \\
\mu
\end{pmatrix}
=
\begin{pmatrix}
\mathbf{c}_0 \\
\mathbf{y}_0
\end{pmatrix}
$$
where $\mathbf 0$ represents a $p$ by $p$ matrix of zeros. Writing $\mathbf 1$ for the $n$ by $n$ identity matrix, the unique solution for $\lambda$ is given by
$$
\lambda = \mathbf{H}^\prime \mathbf{y}_0 + \mathbf{C}^{-1}\left( \mathbf{1} - \mathbf{Y}\mathbf{H} \right)\mathbf{c}_0.
$$
(Readers familiar with multiple regression may find it instructive to compare this solution to the covariance-based solution of the ordinary least squares Normal equations, which looks almost exactly the same, but with no Lagrange multiplier terms.)
This relationship presents the kriging weights $\lambda$ as the sum of a term that depends only on the hat matrix and the covariates at the prediction location, $\mathbf H^\prime \mathbf y_0$, plus a term that depends on the covariances among the data and the predictand $Z_0$. Substituting $\lambda$ into the right-hand side of the variance equation yields the kriging prediction variance, which can be used to construct prediction limits around $\hat z_0$.
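To close the loop, here is a short numerical sketch (illustrative only, using the same assumed exponential covariance model and synthetic data as above) that solves the block system for the kriging weights, checks them against the closed-form expression for $\lambda$, and computes the prediction $\hat z_0$ and its kriging variance:

```python
import numpy as np

# Same illustrative setup as before: assumed exponential covariance model
rng = np.random.default_rng(0)
n, p = 10, 3
coords = rng.uniform(0, 100, size=(n, 2))
Y = np.column_stack([np.ones(n), coords])                  # n-by-p design matrix
sill, range_ = 2.0, 30.0

def cov(a, b):
    """Assumed covariance function c(h) = sill * exp(-h / range_)."""
    h = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sill * np.exp(-h / range_)

C = cov(coords, coords)                                    # covariances among the data
z = rng.multivariate_normal(Y @ np.array([5.0, 0.02, -0.01]), C)

# Prediction location "0": its covariates, covariances with the data, and variance
loc0 = np.array([[50.0, 50.0]])
y0 = np.array([1.0, 50.0, 50.0])                           # covariates at location 0
c0 = cov(coords, loc0).ravel()                             # c_0 = Cov(Z_i, Z_0)
c00 = sill                                                 # Var(Z_0) under this model

# Solve the (n+p)-by-(n+p) block system for the weights and multipliers
A = np.block([[C, Y], [Y.T, np.zeros((p, p))]])
rhs = np.concatenate([c0, y0])
lam, mu = np.split(np.linalg.solve(A, rhs), [n])           # mu is not needed for prediction

# The unbiasedness constraint Y' lambda = y_0 holds to numerical precision
assert np.allclose(Y.T @ lam, y0)

# Check against the closed form lambda = H' y_0 + C^{-1} (1 - Y H) c_0
Cinv_Y = np.linalg.solve(C, Y)
H = np.linalg.solve(Y.T @ Cinv_Y, Cinv_Y.T)                # p-by-n hat matrix
lam_closed = H.T @ y0 + np.linalg.solve(C, (np.eye(n) - Y @ H) @ c0)
assert np.allclose(lam, lam_closed)

# Kriging prediction and its variance
z0_hat = lam @ z
krige_var = c00 - 2 * lam @ c0 + lam @ C @ lam
print(z0_hat, krige_var)
```

Under a Gaussian model, approximate prediction limits would then be $\hat z_0 \pm z_{\alpha/2}\sqrt{\text{krige\_var}}$. This is only a sketch: in practice the covariance model itself must first be estimated, classically from a variogram, a step not shown here.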