This answer consists of an introductory section I wrote recently for a paper describing a (modest) spatio-temporal extension of "Universal Kriging" (UK), which itself is a modest generalization of "Ordinary Kriging." It has three sub-sections: Theory gives a statistical model and assumptions; Estimation briefly reviews least-squares parameter estimation; and Prediction shows how kriging fits into the Generalized Least Squares (GLS) framework. I have made an effort to adopt notation familiar to statisticians, especially visitors to this site, and to use concepts that are well-explained here.
To summarize, kriging is the Best Linear Unbiased Prediction (BLUP) of a random field. What this means is that the predicted value at any unsampled location is obtained as a linear combination of the values and covariates observed at sampled locations. The (unknown, random) value there has an assumed correlation with the sample values (and the sample values are correlated among themselves). This correlation information is readily translated into the variance of the prediction. One chooses coefficients in the linear combination (the "kriging weights") that make this variance as small as possible, subject to a condition of zero bias in the prediction. The details follow.
Theory
UK comprises two procedures – one of estimation and the other of prediction – carried out in the context of a GLS model for a study area. The GLS model supposes that the sample data $z_i,\ (i = 1, 2, ..., n)$ are the result of random deviations around a trend and that those deviations are correlated. A trend is meant in the general sense of a value that can be determined by a linear combination of $p$ unknown coefficients (parameters) $\beta=(\beta_1,\beta_2,\ldots,\beta_p)^\prime$. (Throughout this post, the prime $^\prime$ denotes matrix transpose and all vectors are considered column vectors.)
At any location within a study area there is available a tuple of numerical attributes $\mathbf y = (y_1, y_2, \ldots, y_p)^\prime$ termed “independent variables” or “covariates.” (Typically $y_1 = 1$ is a “constant term,” $y_2$ and $y_3$ may be spatial coordinates, and the remaining components of $\mathbf y$ may represent spatial information as well as other ancillary information that is available at all locations in the study area, such as porosity of an aquifer or distance to a pumping well.) At each data location $i$, in addition to its covariates $y_i = (y_{i1}, y_{i2}, \ldots, y_{ip})^\prime$, the associated observation $z_i$ is considered to be a realization of a random variable $Z_i$. In contrast, the $y_i$ are thought of as values determined by or characterizing the points or small regions represented by the observations (the data “supports”). The $y_i$ are not considered to be realizations of random variables and are required to be unrelated to the properties of any of the $Z_i$.
The linear combination
$$
\mathbf{E}\left[ Z_i \right] = \mathbf{y}_i^\prime \beta = y_{i1} \beta_1 + y_{i2} \beta_2 + \cdots + y_{ip} \beta_p \tag{1}
$$
expresses the expected value of $Z_i$ in terms of the parameters $\beta$; this expectation is the value of the trend at location $i$. The estimation process uses the data to find estimates $\hat\beta_i$ of the unknown parameters $\beta_i$, whereas the prediction process uses the data at locations $i = 1, 2, \ldots, n$ to compute a value at an unsampled location, here indexed as $i = 0$. The targets of estimation are fixed (i.e., non-random) parameters, whereas the target of prediction is random, because the value $z_0$ includes a random fluctuation around its trend $y_0^\prime\beta$. Typically, predictions are made for multiple locations using the same data by varying location $0$. For example, predictions are often made to map out a surface along a regular grid of points suitable for contouring.
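To make the notation concrete, here is a minimal numerical sketch (not part of the paper) showing how the covariates assemble into a design matrix and determine the trend. The choice of a constant term plus the two spatial coordinates, and all numerical values, are purely illustrative assumptions:

```python
import numpy as np

# Purely illustrative: n sample locations in the plane, with covariates
# y = (1, first coordinate, second coordinate)', so p = 3.
rng = np.random.default_rng(0)
n, p = 10, 3
coords = rng.uniform(0, 100, size=(n, 2))      # the n data locations
Y = np.column_stack([np.ones(n), coords])      # n-by-p design matrix of covariates
beta = np.array([5.0, 0.02, -0.01])            # hypothetical trend coefficients
trend = Y @ beta                               # E[Z_i] = y_i' beta at each data location
```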
Estimation
Classical kriging assumes the random fluctuations of the $Z_i$ around the trend have expected values of zero and that their covariances are known. Write the covariance between $Z_i$ and $Z_j$ as $c_{ij}$. Using these covariances, estimation is performed with GLS. Its solution is the following:
$$
\hat\beta = \mathbf{H}\mathbf{z},\quad \mathbf{H} = \left( \mathbf{Y}^\prime \mathbf{C}^{-1} \mathbf{Y} \right)^{-1} \mathbf{Y}^\prime \mathbf{C}^{-1}
$$
where ${\bf z} = (z_1, z_2, \ldots, z_n)^\prime$ is the $n$-vector of the observations, ${\bf Y} = (y_{ij})$ (the “design matrix”) is the $n$ by $p$ matrix whose rows are the vectors $y_i^\prime,\ 1 \le i \le n$, and $\mathbf C = (c_{ij})$ is the $n$ by $n$ covariance matrix, which is assumed to be invertible (Draper & Smith (1981), section 2.11). The $p$ by $n$ matrix $\mathbf H$, which projects the data $\mathbf z$ onto the parameter estimates $\hat \beta$, is called the “hat matrix.” Writing $\hat\beta$ as the hat matrix applied to the data explicitly shows how the parameter estimates depend linearly on the data. The covariances $\mathbf C = (c_{ij})$ are classically computed using a variogram, which gives the covariance in terms of the data locations, although it is immaterial how the covariance is actually calculated.
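The following sketch (again illustrative only, not from the paper) carries the same synthetic setup through the GLS formula above. For the sake of having a concrete, invertible $\mathbf C$, it assumes an exponential covariance model $c(h) = \sigma^2 e^{-h/r}$; the names `sill` and `range_` and all numerical values are assumptions of the example:

```python
import numpy as np

# Synthetic setup (assumed, for illustration only)
rng = np.random.default_rng(0)
n, p = 10, 3
coords = rng.uniform(0, 100, size=(n, 2))
Y = np.column_stack([np.ones(n), coords])                  # n-by-p design matrix

# Assumed exponential covariance model: c(h) = sill * exp(-h / range_)
sill, range_ = 2.0, 30.0
h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
C = sill * np.exp(-h / range_)                             # n-by-n covariance matrix

# Fake observations consistent with the model (trend plus correlated deviations)
z = rng.multivariate_normal(Y @ np.array([5.0, 0.02, -0.01]), C)

# GLS estimation: beta_hat = H z with H = (Y' C^{-1} Y)^{-1} Y' C^{-1}
Cinv_Y = np.linalg.solve(C, Y)                             # computes C^{-1} Y
H = np.linalg.solve(Y.T @ Cinv_Y, Cinv_Y.T)                # the p-by-n hat matrix
beta_hat = H @ z                                           # linear in the data z
```

Note that $\mathbf C^{-1}$ is never formed explicitly; `np.linalg.solve` is used instead, which is the usual, numerically safer way to apply the GLS formulas.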
Prediction
UK similarly predicts $z_0$ by means of a linear combination of the data
$$
\hat z_0 = \lambda_1 z_1 + \lambda_2 z_2 + \cdots + \lambda_n z_n = \lambda^\prime \mathbf{z}.
$$
The $\lambda_i$ are called the “kriging weights” for the prediction of $z_0$. UK accomplishes this prediction of $z_0$ by meeting two criteria. First, the prediction should be unbiased, which is expressed by requiring that the linear combination of the random variables $Z_i$ equals $Z_0$ on average:
$$
0 = \mathbf{E}\left[ \hat Z_0 - Z_0 \right] = \mathbf{E}\left[ \lambda^\prime \mathbf{Z} - Z_0 \right].
$$
This expectation is taken over the joint $(n+1)$-variate distribution of $Z_0$ and $\mathbf Z = (Z_1, Z_2, \ldots, Z_n)^\prime$. Linearity of expectation together with the trend assumption (1) implies:
$$
\begin{aligned}
0 &= \mathbf{E}\left[ \lambda^\prime \mathbf{Z} - Z_0 \right] = \lambda^\prime \mathbf{E}\left[ \mathbf{Z} \right] - \mathbf{E}\left[ Z_0 \right] = \lambda^\prime \left( \mathbf{Y}\beta \right) - \mathbf{y}_0^\prime \beta = \left( \lambda^\prime \mathbf{Y} - \mathbf{y}_0^\prime \right)\beta \\
&= \beta^\prime \left( \mathbf{Y}^\prime \lambda - \mathbf{y}_0 \right)
\end{aligned}
$$
no matter what $\beta$ may be. This will be the case provided that
$$\mathbf{Y}^\prime \lambda = \mathbf{y}_0.$$
Among all the possible solutions of this underdetermined system of equations, UK chooses $\lambda$ to minimize the variance of the prediction error $\hat Z_0 - Z_0$. In this sense, UK is “best” among all unbiased linear predictors. Because this last relationship implies the prediction error is zero on average, the variance is simply the expectation of the squared prediction error:
$$
\operatorname{Var}\left( \hat Z_0 - Z_0 \right) = \mathbf{E}\left[ \left( \hat Z_0 - Z_0 \right)^2 \right] = \mathbf{E}\left[ \left( \lambda^\prime \mathbf{Z} - Z_0 \right)^2 \right] = c_{00} - 2\lambda^\prime \mathbf{c}_0 + \lambda^\prime \mathbf{C} \lambda
$$
where $\mathbf c_0 = (c_{01}, c_{02}, \ldots, c_{0n})^\prime$ is the vector of covariances between $Z_0$ and the $Z_i,\ i \ge 1$, and $c_{00}$ is the variance of $Z_0$.
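In case the last equality is not immediately obvious, it is just the bilinearity of covariance applied to the (zero-mean) prediction error:

$$
\operatorname{Var}\left( \lambda^\prime \mathbf{Z} - Z_0 \right) = \lambda^\prime \operatorname{Var}\left( \mathbf{Z} \right) \lambda - 2\operatorname{Cov}\left( \lambda^\prime \mathbf{Z}, Z_0 \right) + \operatorname{Var}\left( Z_0 \right) = \lambda^\prime \mathbf{C} \lambda - 2\lambda^\prime \mathbf{c}_0 + c_{00}.
$$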
To minimize the variance, introduce a vector of $p$ Lagrange multipliers $\mu$ to enforce the constraint $\mathbf{Y}^\prime \lambda = \mathbf{y}_0$ and differentiate the objective $c_{00} - 2\lambda^\prime\mathbf{c}_0 + \lambda^\prime\mathbf{C}\lambda + 2\mu^\prime\left(\mathbf{Y}^\prime\lambda - \mathbf{y}_0\right)$ with respect to $\lambda$. Setting this derivative to zero gives $\mathbf{C}\lambda + \mathbf{Y}\mu = \mathbf{c}_0$, which together with the constraint itself yields a system of $n+p$ linear equations, written in block-matrix form as
$$
\begin{pmatrix}
\mathbf{C} & \mathbf{Y} \\
\mathbf{Y}^\prime & \mathbf{0}
\end{pmatrix}
\begin{pmatrix}
\lambda \\
\mu
\end{pmatrix}
=
\begin{pmatrix}
\mathbf{c}_0 \\
\mathbf{y}_0
\end{pmatrix}
$$
where $\mathbf 0$ represents a $p$ by $p$ matrix of zeros. Writing $\mathbf 1$ for the $n$ by $n$ identity matrix, the unique solution for $\lambda$ is given by
$$
\lambda = \mathbf{H}^\prime \mathbf{y}_0 + \mathbf{C}^{-1}\left( \mathbf{1} - \mathbf{Y}\mathbf{H} \right)\mathbf{c}_0.
$$
(Readers familiar with multiple regression may find it instructive to compare this solution to the covariance-based solution of the ordinary least squares Normal equations, which looks almost exactly the same, but with no Lagrange multiplier terms.)
This relationship presents the kriging weights $\lambda$ as the sum of a term that depends only on the hat matrix and the covariates at the prediction location, $\mathbf H^\prime \mathbf y_0$, plus a term that depends on the covariances among the data and the predictand $Z_0$. Substituting $\lambda$ into the right-hand side of the variance equation yields the kriging prediction variance, which can be used to construct prediction limits around $\hat z_0$.
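To close the loop, here is a short numerical sketch (illustrative only, using the same assumed exponential covariance model and synthetic data as above) that solves the block system for the kriging weights, checks them against the closed-form expression for $\lambda$, and computes the prediction $\hat z_0$ and its kriging variance:

```python
import numpy as np

# Same illustrative setup as before: assumed exponential covariance model
rng = np.random.default_rng(0)
n, p = 10, 3
coords = rng.uniform(0, 100, size=(n, 2))
Y = np.column_stack([np.ones(n), coords])                  # n-by-p design matrix
sill, range_ = 2.0, 30.0

def cov(a, b):
    """Assumed covariance function c(h) = sill * exp(-h / range_)."""
    h = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sill * np.exp(-h / range_)

C = cov(coords, coords)                                    # covariances among the data
z = rng.multivariate_normal(Y @ np.array([5.0, 0.02, -0.01]), C)

# Prediction location "0": its covariates, covariances with the data, and variance
loc0 = np.array([[50.0, 50.0]])
y0 = np.array([1.0, 50.0, 50.0])                           # covariates at location 0
c0 = cov(coords, loc0).ravel()                             # c_0 = Cov(Z_i, Z_0)
c00 = sill                                                 # Var(Z_0) under this model

# Solve the (n+p)-by-(n+p) block system for the weights and multipliers
A = np.block([[C, Y], [Y.T, np.zeros((p, p))]])
rhs = np.concatenate([c0, y0])
lam, mu = np.split(np.linalg.solve(A, rhs), [n])           # mu is not needed for prediction

# The unbiasedness constraint Y' lambda = y_0 holds to numerical precision
assert np.allclose(Y.T @ lam, y0)

# Check against the closed form lambda = H' y_0 + C^{-1} (1 - Y H) c_0
Cinv_Y = np.linalg.solve(C, Y)
H = np.linalg.solve(Y.T @ Cinv_Y, Cinv_Y.T)                # p-by-n hat matrix
lam_closed = H.T @ y0 + np.linalg.solve(C, (np.eye(n) - Y @ H) @ c0)
assert np.allclose(lam, lam_closed)

# Kriging prediction and its variance
z0_hat = lam @ z
krige_var = c00 - 2 * lam @ c0 + lam @ C @ lam
print(z0_hat, krige_var)
```

Under a Gaussian model, approximate prediction limits would then be $\hat z_0 \pm z_{\alpha/2}\sqrt{\text{krige\_var}}$. This is only a sketch: in practice the covariance model itself must first be estimated, classically from a variogram, a step not shown here.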