What statistics are preserved under aggregation?

Question

If we have a long, high-resolution time series with lots of noise, it often makes sense to aggregate the data to a lower resolution (say, daily to monthly values) to get a better understanding of what's going on, effectively removing some of the noise.

I've seen at least one paper that then applies some statistics to the aggregated data, including an $r^2$ for a linear regression on a separate variable. Is that valid? I would have thought that the averaging process would modify the result a fair bit, due to the reduced noise.

In general, are some statistics able to be applied to aggregated time series data, and others not? If so, which ones? Ones that are linear combinations, maybe?

Related, see the [ecological fallacy](http://en.wikipedia.org/wiki/Ecological_fallacy). — Andy W, Oct 02 '13 at 12:29
Regarding the comment from @cbeleites, I think there is a theoretical answer here: an expansion of your suggestion that linear combinations are preserved. However, in practical terms it is very hard to draw a general conclusion on the validity of an approach, and there would need to be a specific example. — Jonathan, Oct 09 '13 at 02:34

score 6 · Answer 1 · answered Oct 02 '13 at 10:20

I think the question as posed in the headline is too broad to be answered in a useful way, all the more so because the answer will probably depend on both the aggregation method and the statistic in question.

This will even apply to the "mean": do you try to preserve signal shape and intensity (e.g. Savitzky-Golay filters), or do you try to preserve the area under the signal (e.g. loess)?
Noise-related statistics are obviously affected: that is usually the purpose of the aggregation.

I've seen at least one paper that then applies some statistics to the aggregated data [...] Is that valid? I would have thought that the averaging process would modify the result a fair bit, due to the reduced noise.

This modification is most probably the purpose of the aggregation.
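To make the effect on $r^2$ concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration (a shared monthly signal, independent daily noise on both variables, 30-day months, the helper names `r_squared` and `block_means`); it is not taken from the paper mentioned in the question.

```python
import random
from statistics import fmean

def r_squared(x, y):
    """Squared Pearson correlation, i.e. r^2 of a simple linear regression."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def block_means(values, m):
    """Average consecutive, non-overlapping blocks of length m."""
    return [fmean(values[i:i + m]) for i in range(0, len(values), m)]

rng = random.Random(0)       # fixed seed so the run is reproducible
m, n_months = 30, 120        # assumed: 30-day months, 10 years of data

x_daily, y_daily = [], []
for month in range(n_months):
    level = rng.gauss(0.0, 1.0)          # signal shared by x and y within a month
    for day in range(m):
        x_daily.append(level + rng.gauss(0.0, 3.0))        # noisy daily x
        y_daily.append(2.0 * level + rng.gauss(0.0, 3.0))  # noisy daily y

x_monthly = block_means(x_daily, m)
y_monthly = block_means(y_daily, m)

r2_daily = r_squared(x_daily, y_daily)
r2_monthly = r_squared(x_monthly, y_monthly)
print(f"daily r^2   = {r2_daily:.3f}")
print(f"monthly r^2 = {r2_monthly:.3f}")

# The grand mean is a linear statistic: with equal-length blocks it is unchanged.
print(abs(fmean(x_daily) - fmean(x_monthly)) < 1e-9)
```

In this setup averaging shrinks the noise variance by a factor of about $m$ while leaving the monthly signal intact, so the monthly $r^2$ comes out much larger than the daily one, whereas the overall mean is unaffected. The two $r^2$ values answer different questions (daily versus monthly relationship), which is why the aggregation step needs to be reported.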

In general, you are allowed to do a whole lot of things to your data, but you need to

say what you are doing (and preferably also why you do it)
show the quality of the resulting model (test with independent data)

What is a valid aggregation will also depend on your application.
E.g.: I'm working with spectroscopic data. It is very common to aggregate single spectra into average spectra: the measurement process imposes certain limits on the quality of spectra I can obtain "in one shot". However, for many applications it is perfectly valid to specify an acquisition procedure in which $n$ repeated measurements are always taken and averaged. On the other hand, if the application is real-time/online or inline analytics such as FIA (flow injection analysis), this restricts the possible aggregation schemes.

score 5 · Answer 2 · answered Oct 02 '13 at 13:00

In a regression setting you can actually test whether simple aggregation is the correct choice. Suppose you have monthly data $Y_t$ and daily data $X_\tau$ (with a fixed number $m$ of days in each month). Suppose you are interested in the regression:

$$Y_t=\alpha+\beta \bar X_t +u_t, \tag{1}$$

where $$\bar X_t=\frac{1}{m}\sum_{h=0}^{m-1}X_{tm-h}.$$

Here we assume that for each month $t$ the daily observations are $X_{m(t-1)+1},\dots,X_{mt}$. Model (1) gives each day the same weight, which is clearly a restriction. So we can assume that a more general model holds:

$$Y_t=\alpha+\beta \bar X_{t}^{(w)} +u_t, \tag{2}$$

with

$$\bar X_t^{(w)}=\sum_{h=0}^{m-1}w_hX_{tm-h}.$$
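As a small sketch of the weighted aggregate, the function below computes $\sum_h w_h X_{tm-h}$ with $h$ running over $0,\dots,m-1$, so that equal weights $1/m$ reproduce the plain monthly mean of model (1). The function name `aggregate`, the toy data, and the "recent-heavy" weights are hypothetical and only serve as an illustration.

```python
def aggregate(x_daily, weights):
    """Weighted low-frequency aggregate of a high-frequency series.

    weights[h] multiplies X_{tm-h}, so weights[0] applies to the last
    day of period t. Daily data are 1-indexed in the formulas, hence
    X_{tm-h} is x_daily[t*m - h - 1] in Python.
    """
    m = len(weights)
    n_periods = len(x_daily) // m
    out = []
    for t in range(1, n_periods + 1):
        out.append(sum(weights[h] * x_daily[t * m - h - 1] for h in range(m)))
    return out

m = 3                                      # toy "month" of three days
x_daily = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]   # two periods of made-up data

equal = [1.0 / m] * m                      # model (1): every day counts equally
recent_heavy = [0.6, 0.3, 0.1]             # hypothetical: latest day counts most

agg_equal = aggregate(x_daily, equal)      # approximately [2.0, 6.0]
agg_recent = aggregate(x_daily, recent_heavy)  # approximately [2.5, 7.3]
print(agg_equal)
print(agg_recent)
```

The second period shows why the weights matter: the jump to 9.0 on the last day moves the recent-heavy aggregate well above the plain mean.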

There are a lot of articles which explore different possible choices of $w_h$. Usually it is assumed that $w_h=g(h,\gamma)$ for some function $g$ which depends on a parameter vector $\gamma$ (not to be confused with the intercept $\alpha$ above). This type of regression model is called MIDAS (MIxed DAta Sampling) regression.
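One commonly used choice of $g$ in the MIDAS literature is the normalized exponential Almon lag, $w_h \propto \exp(\theta_1 h + \theta_2 h^2)$, with the weights scaled to sum to one. The sketch below is only an illustration of that functional form; the function name and the parameter names `theta1`, `theta2` are my own labels and not the interface of any package.

```python
import math

def exp_almon_weights(m, theta1, theta2):
    """Normalized exponential Almon weights for lags h = 0, ..., m-1."""
    raw = [math.exp(theta1 * h + theta2 * h * h) for h in range(m)]
    total = sum(raw)
    return [r / total for r in raw]

# theta1 = theta2 = 0 gives equal weights 1/m, i.e. plain averaging.
flat = exp_almon_weights(4, 0.0, 0.0)

# A negative theta1 (hypothetical value) puts more weight on recent days (small h).
decaying = exp_almon_weights(4, -0.5, 0.0)

print(flat)
print(decaying)
```

With two parameters this family covers equal, decaying, and hump-shaped weight profiles, which is what makes the equal-weight case a testable special case rather than a separate model.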

Model (2) nests model (1), so it is possible to test the hypothesis that $w_h=\frac{1}{m}$. One such test is proposed in this article (I am one of the authors, sorry for the shameless plug; I also wrote an R package, midasr, for estimating and testing MIDAS regressions, in which this test is implemented).
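To illustrate the nesting idea only, here is a sketch of a generic nested-model F test in Python with NumPy. It is not the test from the article and not what midasr does: it compares the equal-weight regression against an unrestricted regression of $Y_t$ on all $m$ daily values, so the null hypothesis is that all $m$ slope coefficients are equal. The function name, the simulated data, and the weight vector `w_decay` are assumptions made for the example.

```python
import numpy as np

def equal_weight_f_stat(y, x_blocks):
    """F statistic for H0: all m within-period weights are equal.

    x_blocks has shape (T, m); row t holds the m daily values of period t.
    """
    T, m = x_blocks.shape
    ones = np.ones((T, 1))
    restricted = np.hstack([ones, x_blocks.mean(axis=1, keepdims=True)])
    unrestricted = np.hstack([ones, x_blocks])

    def rss(design):
        # Ordinary least squares fit, then residual sum of squares.
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    q = m - 1           # number of equality restrictions
    dof = T - (m + 1)   # residual degrees of freedom of the unrestricted fit
    return ((rss_r - rss_u) / q) / (rss_u / dof)

rng = np.random.default_rng(0)
T, m = 400, 5                                       # toy sizes, assumed
x = rng.normal(size=(T, m))
noise = rng.normal(scale=0.5, size=T)
w_decay = np.array([0.5, 0.25, 0.15, 0.07, 0.03])   # hypothetical unequal weights

y_equal = 1.0 + 2.0 * x.mean(axis=1) + noise        # model (1) is true
y_decay = 1.0 + 2.0 * x @ w_decay + noise           # model (1) is false

f_equal = equal_weight_f_stat(y_equal, x)
f_decay = equal_weight_f_stat(y_decay, x)
print(f"F when equal weights hold:     {f_equal:.2f}")
print(f"F when weights actually decay: {f_decay:.2f}")
```

Under the null and the usual linear-model assumptions the statistic is compared with an $F(m-1,\,T-m-1)$ distribution. With real daily data $m$ is large, which is one reason to restrict the weights through a low-dimensional function $g$ instead of estimating every coefficient freely.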

In a non-regression setting there are results which show that aggregation can change the properties of the time series. For example, if you aggregate AR(1) processes, which have short-term memory (the correlation between two observations of the time series dies off quickly as the distance between them increases), you can get a process with long-term memory.
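A rough numerical illustration, assuming the result meant here is the one about summing many independent AR(1) processes with different coefficients: the autocorrelation of the sum is a weighted average of the individual $\phi_i^k$, and it decays more slowly than that of a single AR(1) process with the same lag-1 autocorrelation. The coefficient grid below is hypothetical, and a finite sum is only suggestive of long memory, not a proof of it.

```python
def aggregate_acf(phis, lag):
    """Autocorrelation at `lag` of a sum of independent AR(1) processes.

    Each process has coefficient phi in `phis` and unit innovation
    variance, so its autocovariance at lag k is phi**k / (1 - phi**2).
    """
    gamma_k = sum(p ** lag / (1.0 - p * p) for p in phis)
    gamma_0 = sum(1.0 / (1.0 - p * p) for p in phis)
    return gamma_k / gamma_0

# Hypothetical mix of coefficients, a few of them close to 1.
phis = [0.05 * i for i in range(1, 20)] + [0.97, 0.99]

rho1 = aggregate_acf(phis, 1)
for lag in (1, 10, 50):
    single = rho1 ** lag  # single AR(1) with the same lag-1 autocorrelation
    print(lag, round(aggregate_acf(phis, lag), 4), round(single, 4))
```

The gap between the two columns widens with the lag because the components with coefficients near 1 dominate the sum at long horizons.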

So, to sum up: whether it is valid to apply a statistic to aggregated data is itself a statistical question. Depending on the model, you can construct a testable hypothesis about whether the application is valid or not.
