I am sampling from a random process $X$ and I would like to calculate $R^2$ for the cumulative sum of the samples $x_1, \ldots, x_n$: $$y_n=\sum_{i=1}^n x_i$$

$$R^2_n=\mathrm{RSQ}([1,2,\ldots,n], [y_1,y_2,\ldots,y_n])$$

The calculation becomes increasingly slow as $n$ grows. Do you know any incremental way to update $R^2$ at every new sample, without recalculating it from the beginning every time?

elemolotiv
  • You really don't want something as badly behaved as that formula. That's frequently a disastrous way to calculate variance. There are much more stable ways to calculate variance. Note that $R^2$ can be written as a ratio of two sums of squares – Glen_b Oct 01 '19 at 12:41
  • thanks @Glen_b I edited away the analogy of incremental variance calculation – elemolotiv Oct 01 '19 at 12:46
  • You could adapt the online updating approach [here](https://stats.stackexchange.com/a/410471/805), but instead of calculating $r$, calculate its square, i.e. $r^2 = \frac{N_{n+1}^2}{D_{n+1}E_{n+1}}$. You can speed it up further than that (e.g. by using ideas from [Welford's algorithm](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm) and the equivalent for [covariance](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online), and by taking advantage of the simple form of the 1,2,3... values and hence their mean and sum of squares), ...ctd – Glen_b Oct 01 '19 at 13:08
  • ctd ... but you'll probably find that first method sufficient just applied directly to the y's and the 1,2,3... values – Glen_b Oct 01 '19 at 13:21
  • I'm in two minds whether it counts as effectively a duplicate of that first link or whether there's enough in the special structure of this problem to leave it. – Glen_b Oct 01 '19 at 13:29
  • See [Efficient online regression](https://stats.stackexchange.com/questions/6920/efficient-online-linear-regression/6923#6923) for how to update *everything* efficiently. – whuber Oct 01 '19 at 14:50
  • I'm having trouble understanding what your $R^2$ refers to. $R^2$ measures the goodness of fit of predictions to known target values. But, what are your target values, what are your predictions, and how are you generating them? – user20160 Oct 01 '19 at 15:35
  • @Glen_b thanks for the comments, there is enough info to work on. Up to you whether to keep or discard my question. I have saved the links in your comments – elemolotiv Oct 01 '19 at 17:23
  • @user20160 When there's only one predictor for a linear regression model $R^2$ will simply be the squared correlation between the two series of values. One needn't even identify which is the DV and which is the IV to calculate the correlation between them; the question offers enough information (identifying the two series) to answer the question. – Glen_b Oct 01 '19 at 23:42
  • @Glen_b True, but the question didn't mention anything about linear regression; I had hoped the OP could be more explicit since this doesn't hold for nonlinear models. In any case, I guess it doesn't matter much at this point. – user20160 Oct 02 '19 at 00:15
  • On the other hand, R^2 doesn't really make so much sense for a nonlinear model; I'd expect that would be explicitly mentioned (and defined) if it were the case; secondly the OP mentioned `RSQ` which is an [Excel function](https://docs.microsoft.com/en-us/office/troubleshoot/excel/statistical-functions-rsq) which does the squared-correlation calculation I discussed. – Glen_b Oct 02 '19 at 00:26
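
The comments above suggest Welford-style online updates and note that, with a single predictor, $R^2$ is the squared correlation between the index series 1,2,...,n and the cumulative sums. A minimal Python sketch of that idea follows. The class name `RunningRSquared` and its attributes are hypothetical and chosen only for illustration. It keeps running means and centered sums of squares and cross products, so each new sample costs O(1). It is one possible reading of the suggestion, not the exact method from the linked answers, and it does not use the closed-form mean and sum of squares of 1,2,...,n mentioned as a further speed-up.

```python
class RunningRSquared:
    """Hypothetical helper: O(1)-per-sample R^2 between t = 1..n and
    the cumulative sum y_t = x_1 + ... + x_t (Welford-style updates)."""

    def __init__(self):
        self.n = 0          # number of samples seen so far
        self.y = 0.0        # current cumulative sum y_n
        self.mean_t = 0.0   # running mean of the indices 1..n
        self.mean_y = 0.0   # running mean of y_1..y_n
        self.s_tt = 0.0     # sum of (t - mean_t)^2
        self.s_yy = 0.0     # sum of (y - mean_y)^2
        self.s_ty = 0.0     # sum of (t - mean_t) * (y - mean_y)

    def update(self, x):
        """Add one new sample x and return the updated R^2."""
        self.n += 1
        self.y += x
        t = float(self.n)

        # Deviations from the *old* means.
        dt = t - self.mean_t
        dy = self.y - self.mean_y

        # Move the means to include the new point.
        self.mean_t += dt / self.n
        self.mean_y += dy / self.n

        # Welford-style updates: old-mean deviation times new-mean deviation.
        self.s_tt += dt * (t - self.mean_t)
        self.s_yy += dy * (self.y - self.mean_y)
        self.s_ty += dt * (self.y - self.mean_y)
        return self.r_squared()

    def r_squared(self):
        """Squared correlation; NaN when it is undefined."""
        denom = self.s_tt * self.s_yy
        if self.n < 2 or denom == 0.0:
            return float("nan")
        return self.s_ty * self.s_ty / denom


if __name__ == "__main__":
    tracker = RunningRSquared()
    for sample in [0.5, -1.2, 2.0, 0.3, 1.7, -0.4]:
        print(tracker.n + 1, tracker.update(sample))
```

The result is NaN for the first sample and whenever the cumulative sums seen so far are all equal, since the correlation is undefined in those cases.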
