I'm reviewing work by some colleagues that correlates the daily number of tweets on a certain topic with some other indicators.
They used loess smoothing to reduce the noise in the variables. They did this with a trick I didn't know:
(R pseudo-code)
y <- original.data
x <- seq_along(y)  # time index: 1, 2, ..., length(y)
y.loess <- loess(y ~ x, data = data.frame(x = x, y = y), span = 0.35)
y.predict <- predict(y.loess, newdata = data.frame(x = x))  # smoothed values
They fitted a loess regression using the evenly spaced time index as the regressor. Clever indeed.
The Pearson correlation between the variables after smoothing is ~0.8-0.9, while before smoothing it was ~0.6.
I asked them for the data and ran some tests. I found that the correlation between the variables depends strongly on the value used as span in the loess smoothing. Here I plot the change in correlation as span increases (NB: this is based on a series of span values given by (5:100/20)^2):
plot(series$tweet.x.sum)
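To make the span sweep concrete, here is a minimal sketch of how it could be reproduced. The two series are simulated stand-ins (not my colleagues' data), and the names `a`, `b` and `smooth_loess` are made up for illustration; only `loess`, `predict` and `cor` are real functions from base R.

```r
# Simulated stand-ins for the two daily series (NOT the original data).
set.seed(1)
n <- 200
a <- cumsum(rnorm(n))            # hypothetical series 1 (a random walk)
b <- 0.5 * a + rnorm(n, sd = 3)  # hypothetical series 2, noisily related to a

# Smooth a series against its time index with a given span.
smooth_loess <- function(y, span) {
  d <- data.frame(x = seq_along(y), y = y)
  fit <- loess(y ~ x, data = d, span = span)
  predict(fit)  # fitted values at the original time points
}

# Same span grid as in the question.
spans <- (5:100 / 20)^2

# Pearson correlation of the two smoothed series for each span.
cors <- sapply(spans, function(s) cor(smooth_loess(a, s), smooth_loess(b, s)))

# Unsmoothed correlation, for reference.
raw_cor <- cor(a, b)

plot(spans, cors, type = "l", xlab = "span", ylab = "correlation after smoothing")
abline(h = raw_cor, lty = 2)  # dashed line: correlation before smoothing
```

The dashed line marks the correlation of the unsmoothed series, so the plot shows how far each span moves the statistic away from it.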
So the value of the span parameter clearly influences the correlation statistic. My questions are then: is it correct to smooth data this way before computing statistics? How should one decide on a value for span: by cross-validation? Or is this smoothing only good for nicer plots, and using it for correlations is simply cheating?
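For the cross-validation idea, this is roughly what I have in mind: a rough sketch only, on simulated data, with `cv_error` being a hypothetical helper of mine. It picks the span that best predicts held-out points of a single series, which is not the same thing as picking the span that maximizes the correlation, and I am not sure plain k-fold is appropriate for autocorrelated daily data.

```r
# Simulated daily series (NOT the original data).
set.seed(1)
n <- 120
d <- data.frame(x = seq_len(n))
d$y <- sin(d$x / 15) + rnorm(n, sd = 0.4)

# Assign each observation to one of k folds, once, so that every span
# is evaluated on the same splits.
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# Mean squared prediction error of loess on held-out points for one span.
cv_error <- function(span) {
  sq_err <- numeric(0)
  for (i in seq_len(k)) {
    fit <- loess(y ~ x, data = d[fold != i, ], span = span)
    pred <- predict(fit, newdata = d[fold == i, ])
    sq_err <- c(sq_err, (d$y[fold == i] - pred)^2)
  }
  # loess does not extrapolate by default, so held-out end points
  # give NA predictions; they are dropped here.
  mean(sq_err, na.rm = TRUE)
}

spans <- seq(0.1, 1, by = 0.05)
errs <- sapply(spans, cv_error)
best_span <- spans[which.min(errs)]  # span with the lowest CV error
```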
Thanks