What effect does data averaging have on the variogram?

Question

What effect does data averaging have on the variogram? To be specific, please see a simple example:

#Simulate a pure random walk data set, call this *PROCESS A*
n <- 1000  #number of data points
t <- 1:n #time
y <- cumsum(rnorm(n)) #data points

# Averaging at every 2 lags, call this *PROCESS B*
t2 <- apply(matrix(t, nrow=n/2, byrow=TRUE), 1, mean)
y2 <- apply(matrix(y, nrow=n/2, byrow=TRUE), 1, mean)

# Compute and plot the empirical variogram
library(geoR)
# Coordinates are (time, 1): a one-dimensional process embedded in the plane
var1 <- as.geodata(data.frame(coords = t(rbind(t, rep(1, n))), data.col=y))
vario <- variog(var1)
var2 <- as.geodata(data.frame(coords = t(rbind(t2, rep(1, n/2))), data.col=y2))
vario2 <- variog(var2)
plot(vario$u, vario$v, type='b', pch=16, cex=.7)
points(vario2$u, vario2$v, col=2, type='b', pch=16, cex=.7)

Theoretically $\gamma(h) = \frac{1}{2} Var(y(t) - y(t+h))$, and for a linear variogram of a simple process without a nugget effect this is $\frac{1}{2}\sigma^2 |h|$, where $\sigma^2$ is the variance of the underlying process. Thus, we expect that (1) the two variograms are parallel and (2) the ratio of the nugget effect of PROCESS A to that of PROCESS B is 4/3. Comparing the plots with the math, (1) is confirmed but (2) is not. Why? Is this a simple example of the 'change of support' problem in geostatistics? Thank you in advance.

whuber · Accepted Answer · 2014-04-28T22:14:46.407

It is difficult to say what you mean by "(2) is not [confirmed]," so let's begin by describing what is going on.

The code begins by generating realizations of $n$ independent standard normal variates $X_t,$ $t=1,2,\ldots, n$, and computing their cumulative sum $Y_t = \sum_{s=1}^t X_s$. Therefore the variogram of $Y$ is

$$\gamma_Y(t, h) = \frac{1}{2}\mathbb{E}\left(\left(Y_{t+h} - Y_t\right)^2\right) = \frac{1}{2}\mathbb{E}\left(\left(\sum_{s=t+1}^{t+h}X_s\right)^2\right) = \frac{h}{2}$$

for $h=0, 1, 2, \ldots, n-t$ and $t=0, 1, 2, \ldots, n.$ Notice that $Y$ has no nugget for any value of $t$: $\gamma(t,0) = 0$ for all relevant $t$.
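As a rough numerical check of $\gamma_Y(t,h) = h/2$, the sketch below applies the classical method-of-moments estimator to a single long random walk using base R only. The helper name `emp_variogram`, the seed, and the series length are illustrative assumptions, not part of the original script, and the estimator averages over $t$ (which is reasonable here because the increments of $Y$ are stationary).

```r
set.seed(1)                 # arbitrary seed, only for reproducibility
n <- 1e5                    # illustrative length; longer series give tighter estimates
y <- cumsum(rnorm(n))       # random walk Y_t

# Hypothetical helper: classical variogram estimate at integer lag h,
# i.e. half the mean squared increment over all pairs separated by h.
emp_variogram <- function(x, h) {
  d <- x[(1 + h):length(x)] - x[1:(length(x) - h)]
  mean(d^2) / 2
}

lags <- 0:10
est  <- sapply(lags, function(h) emp_variogram(y, h))

# Compare the estimates with the theoretical value h / 2
print(round(cbind(lag = lags, estimate = est, theory = lags / 2), 3))
```

The estimate at lag $0$ is exactly $0$, matching the absence of a nugget, and the remaining estimates should lie close to the line $h/2$.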

The code next generates data from a derived process y2 which I will call $Z$, where

$$Z_{t + 1/2} = \frac{Y_{t} + Y_{t+1}}{2} = Y_{t} + \frac{1}{2}X_{t+1}$$

for $t=1, 3, 5, ..., 2\lfloor\frac{n}{2}\rfloor - 1.$

The variogram of $Z$ therefore is

$$\begin{aligned}\gamma_Z(t+1/2, h) &= \frac{1}{2}\mathbb{E}\left(\left(Z_{t+1/2+h} - Z_{t+1/2}\right)^2\right) \\ &= \frac{1}{2}\mathbb{E}\left(\left(\frac{1}{2}X_{t+1}+\sum_{s=t+2}^{t+h}X_s + \frac{1}{2}X_{t+h+1}\right)^2\right) = \frac{h - 1/2}{2}\end{aligned}$$

for $h=2, 4, 6, \ldots$ and $t=1, 3, \ldots.$ When $h=0$ the variogram is obviously $0$. Thus the second plot (of an estimate of $\gamma_Z$ based on the single realization of $Z$) should, on the average, be approximately $1/4$ lower than the first plot (of an estimate of $\gamma_Y$). This difference is too small to detect reliably in the plots, which is why they look almost identical. For instance, re-running the R script after executing

set.seed(17)
n <- 10^4

for reproducible results with a larger realization and computing

mean(vario$v - vario2$v)

results in an estimate of $0.22$ for this difference of $1/4$. More runs of the code (without resetting the seed) yield $0.43, 0.19, 0.26, 0.17, -0.01, 0.76, 0.23, 0.26, 0.08$. The mean of all 10 iterations is $0.258$ with a standard error of $0.067$: significantly different from zero but indistinguishable from $1/4$, consistent with the foregoing calculations.
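The $1/4$ offset can also be checked without geoR by averaging over many independent realizations rather than over time. This is only a sketch under assumed settings: the seed, the walk length `n_steps`, the replicate count `n_rep`, and the helper names `gamma_Y_hat` and `gamma_Z_hat` are all illustrative choices.

```r
set.seed(2)                 # arbitrary seed, only for reproducibility
n_steps <- 40               # length of each walk (must be even)
n_rep   <- 20000            # number of independent realizations

# Each column of Y is one realization Y_1, ..., Y_n of the random walk.
Y <- apply(matrix(rnorm(n_steps * n_rep), nrow = n_steps), 2, cumsum)

# Average non-overlapping pairs: row k of Z is (Y_{2k-1} + Y_{2k}) / 2,
# located at time 2k - 1/2.
odd <- seq(1, n_steps, by = 2)
Z   <- (Y[odd, ] + Y[odd + 1, ]) / 2

# Monte Carlo estimates of the variograms; a lag of h time units
# corresponds to h / 2 rows of Z.
gamma_Y_hat <- function(t, h) mean((Y[t + h, ] - Y[t, ])^2) / 2
gamma_Z_hat <- function(k, h) mean((Z[k + h / 2, ] - Z[k, ])^2) / 2

lags <- c(2, 4, 6, 10)
g_Y  <- sapply(lags, function(h) gamma_Y_hat(t = 5, h = h))
g_Z  <- sapply(lags, function(h) gamma_Z_hat(k = 3, h = h))

print(round(rbind(lag = lags, gamma_Y = g_Y, gamma_Z = g_Z,
                  difference = g_Y - g_Z), 3))
```

Under these assumptions the `difference` row should hover around $0.25$ at every lag, while `gamma_Z` should track $(h - 1/2)/2$.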

This is indeed a (very simple) example of change of support, which uses the same kinds of calculations to relate the variogram of a process $Y$ to the variogram of a related process obtained from the convolution of $Y$ with some (fixed, deterministic) kernel $K$. In this case the kernel is

$$K(-1/2)=K(1/2)=1/2.$$
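To make the convolution view concrete, the following sketch applies the two-point kernel with `stats::filter` and then keeps every second value, which should reproduce the pairwise averages computed in the question. The short series length and the variable names are assumptions chosen only for illustration.

```r
set.seed(3)                 # arbitrary seed, only for reproducibility
y <- cumsum(rnorm(20))      # a short random walk

# Two-point averaging kernel with weights (1/2, 1/2).
# With sides = 1, out[i] = 0.5 * y[i] + 0.5 * y[i - 1]; out[1] is NA.
smoothed <- stats::filter(y, filter = c(0.5, 0.5),
                          method = "convolution", sides = 1)

# Subsample: positions 2, 4, ... hold (y[1] + y[2]) / 2, (y[3] + y[4]) / 2, ...
z_conv <- as.numeric(smoothed[seq(2, length(y), by = 2)])

# The same quantity computed by averaging the rows of a two-column matrix,
# as the question's script does.
z_pairs <- rowMeans(matrix(y, ncol = 2, byrow = TRUE))

print(all.equal(z_conv, z_pairs))
```

Seen this way, PROCESS B is the convolution of $Y$ with $K$ followed by subsampling at every second time point, so other supports can be explored by swapping in a different kernel.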

Brilliant answer, thanks so much. Could you kindly provide some references on your last point, about the COSP and kernel convolution? – T_D, Apr 28 '14 at 22:22
Any geostats text that covers change of support should explain this. One of the oldest (and still useful) is Journel & Huijbregts, *Mining Geostatistics*. Cressie obtains a general result for block averages in *Statistics for Spatial Data.* The kernel convolution formulation of the concept of support appears in the introduction to Diggle & Ribeiro Jr., *Model-based Geostatistics.* – whuber, Apr 28 '14 at 22:27
I'm unsure what proportion you are taking, but similar calculations will enable you to work out the nugget for any process linearly derived from $Y$. In order to discuss this topic well, we would need first to distinguish a true nugget from measurement error: the two will enter differently into the calculations. Diggle & Ribeiro Jr. (*op. cit.*) do a good job of distinguishing these terms. – whuber, Apr 28 '14 at 22:30
The variogram can never be negative: it is defined, after all, as the expectation of a squared quantity, and squares are never negative. What happens is that the variogram of the convolution with a local kernel is itself not only lowered, but also averaged out locally in some sense. As the lag approaches $0$, the averaging prevents the variogram from becoming negative. – whuber, Apr 30 '14 at 13:13

score 2 · Answer 2 · answered Apr 28 '14 at 15:12

The variogram is, among other things, a function of the correlation function, and the correlation function is affected by data averaging. Look up the Slutsky-Yule effect: an old result in time series analysis stating that a moving average of white noise induces autocorrelation in the transformed series. Visually, moving-averaged white noise can appear to have quasi-periodic behavior. It is tempting to take a moving average of the data in the hope of revealing genuine features, but Slutsky-Yule warns that the averaged data will have features that are artifacts of the smoothing. I would expect something similar to happen to the variogram of spatial data.
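A small simulation can illustrate the induced autocorrelation described above. This is a sketch with assumed settings: the window width `m`, the series length, and the seed are arbitrary, and the `theory` column uses the standard result that an equal-weight moving average of width $m$ applied to white noise has autocorrelation $(m-k)/m$ at lags $k < m$ and $0$ beyond.

```r
set.seed(4)                 # arbitrary seed, only for reproducibility
n <- 1e5                    # illustrative series length
m <- 5                      # illustrative moving-average window width

e <- rnorm(n)                                     # white noise
s <- stats::filter(e, rep(1 / m, m), sides = 2)   # centered moving average
s <- as.numeric(s[!is.na(s)])                     # drop the undefined end values

# Sample autocorrelations at lags 1..6 (the lag-0 entry is dropped)
acf_raw    <- acf(e, lag.max = 6, plot = FALSE)$acf[-1]
acf_smooth <- acf(s, lag.max = 6, plot = FALSE)$acf[-1]
theory     <- pmax(m - 1:6, 0) / m

print(round(cbind(lag = 1:6, white_noise = acf_raw,
                  moving_average = acf_smooth, theory = theory), 3))
```

The raw series should show autocorrelations near zero, while the smoothed series should show clearly positive autocorrelation out to lag $m - 1$, even though the input had none.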
