
I'm reviewing work by some colleagues that correlates the daily number of tweets on a certain topic with some other indicators.

They used loess smoothing to reduce the noise in the variables. They did this with a trick I didn't know:

(R pseudocode)

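y <- original.data   # daily series (placeholder for the real data)
x <- seq_along(y)    # day index 1, 2, ..., length(y)
y.loess <- loess(y ~ x, data = data.frame(x = x, y = y), span = 0.35)
y.predict <- predict(y.loess, data.frame(x = x))   # smoothed values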

They fitted a loess regression using an evenly spaced index vector (1, 2, ..., n) as the regressor. Clever indeed.

The Pearson correlation of the data after smoothing is ~0.8-0.9, while before smoothing it was ~0.6.

I asked them for the data and ran some tests. I found that the correlation between the variables depends strongly on the value used as span in the loess smoothing. Here I plot the change in correlation as span increases (NB: this is based on a series of span values given by (5:100/20)^2):

[Figure: plot(series$tweet.x.sum)]

[Figure: Change in correlation for different loess span parameter values]
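
For anyone who wants to try the span sweep without the (unpublished) data, here is a minimal sketch. The two series are simulated, independent random walks, so `tweets`, `indicator`, `smooth_series` and `cor_by_span` are hypothetical names for illustration only, and the resulting numbers say nothing about the real study.

```r
# Simulated stand-ins for the two daily series (NOT the real data).
set.seed(1)
n <- 200
tweets    <- cumsum(rnorm(n))   # hypothetical series 1
indicator <- cumsum(rnorm(n))   # hypothetical series 2, independent of series 1

# Smooth a series against its day index with a given loess span.
smooth_series <- function(y, span) {
  d <- data.frame(x = seq_along(y), y = y)
  fit <- loess(y ~ x, data = d, span = span)
  predict(fit)   # fitted values at the observed days
}

# The span grid quoted above.
spans <- (5:100 / 20)^2

# Pearson correlation between the two smoothed series, one value per span.
cor_by_span <- sapply(spans, function(s) {
  cor(smooth_series(tweets, s), smooth_series(indicator, s))
})

# Correlation of the unsmoothed series, for reference.
raw_cor <- cor(tweets, indicator)

plot(spans, cor_by_span, type = "l",
     xlab = "loess span", ylab = "Pearson correlation of smoothed series")
abline(h = raw_cor, lty = 2)   # dashed line: unsmoothed correlation
```

Replacing the two simulated vectors with the real series should give the same kind of curve as the plot above.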

So clearly the value of the span parameter influences the correlation statistic. My questions are then: is it correct to smooth data this way before computing statistics? How should one choose a value for span: cross-validation? Or is this kind of smoothing only good for nicer plots, and using it for correlations simply cheating?

Thanks

Bakaburg
  • The index `1:length(y)` has a simple interpretation as the time in days, hence they are smoothing `y` with respect to time to extract a smooth non-linear trend from the original series. They then use the fitted values of this smoother, the trend, in some correlation, the details of which you haven't supplied. – Gavin Simpson Mar 03 '15 at 18:12
  • OK, but is it legitimate? And what about the effect the span parameter has on the correlation? Note that both predictor and outcome in the correlation are smoothed the same way (I can't say more about the outcome because it's an unpublished study). – Bakaburg Mar 03 '15 at 20:29
  • If they are correlating two smooth trends, then this will make things look better than they are, because the smoothing throws away a lot of the noise (how much depends on how much smoothing they do) and then correlates the de-noised variables. I've seen strong admonishments in some areas against smoothing data and then doing stats on the smoothed variables for that reason. – Gavin Simpson Mar 03 '15 at 22:52
  • If you want to see the ultimate in nonsense, take a bunch of random walks, compute their (meaningless) correlation matrix, then smooth them, and look at the new correlation matrix. Amazing! Wait, what's the right word for it? Oh, nonsense, that's it. Smoothed spurious regressions ... *spuriouser and spuriouser*, as Alice almost said. – Glen_b Mar 04 '15 at 03:07
  • In fact, take a look [here](http://stats.stackexchange.com/questions/133155/how-to-use-pearson-correlation-correctly-with-time-series) where I discuss such examples. When dealing with relationships among time series, you have to be very careful to avoid the nonsense relationships naive approaches generate, and when smoothing, doubly so. – Glen_b Mar 04 '15 at 03:13
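
The random-walk experiment suggested in the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: the number of walks, the series length, the seed and the helper name `smooth_col` are all arbitrary choices, and the span of 0.35 is borrowed from the question's pseudocode.

```r
# Independent Gaussian random walks: by construction there is no real
# relationship between any pair of columns.
set.seed(42)
n_days  <- 300   # arbitrary length
n_walks <- 5     # arbitrary number of series
steps <- matrix(rnorm(n_days * n_walks), nrow = n_days, ncol = n_walks)
walks <- apply(steps, 2, cumsum)

# Loess-smooth one series against its time index (hypothetical helper).
smooth_col <- function(y, span = 0.35) {
  d <- data.frame(x = seq_along(y), y = y)
  predict(loess(y ~ x, data = d, span = span))
}
smoothed <- apply(walks, 2, smooth_col)

# Correlation matrices before and after smoothing.
raw_cor    <- cor(walks)
smooth_cor <- cor(smoothed)

# Compare the off-diagonal entries side by side.
off_diag <- upper.tri(raw_cor)
print(round(cbind(raw = raw_cor[off_diag],
                  smoothed = smooth_cor[off_diag]), 2))
```

Any sizeable entries in either column are spurious, since the walks were generated independently; rerunning with different seeds shows how variable they are.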
