Given a sample from a population, and a model $y=f(x)+\epsilon$ fit to that sample, what is the sampling distribution of $R^2$ (i.e., the scaled Brier score)? I could compute its variance easily enough, but $R^2$ can't exceed 1, and a confidence interval based only on its mean and variance might. I'm looking for an approach that works for different $\epsilon$ distributions (e.g., binomial, gamma, etc.) and different $f$'s (e.g., arbitrary algorithms). I have a use-case where bootstrapping is costly.
For clarity, the quantity that I'm calling $R^2$ (or the scaled Brier score) is defined as $$ 1-\frac{\sum_i \left(y_i-\hat{f}(x_i)\right)^2}{\sum_i \left(y_i-\bar{y}\right)^2} $$
where $\hat{f}$ is the fitted model and $\bar{y}$ is the mean of $y$. Assume for simplicity that $y$ is not categorical, though I'd be interested if this could generalize to multinomial logits.
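For concreteness, here's a minimal sketch of how I compute this quantity (the function name `r_squared` is just illustrative):

```python
import numpy as np

def r_squared(y, y_hat):
    """Scaled Brier score / R^2: 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A perfect fit gives 1, and a model that just predicts $\bar{y}$ gives 0; the statistic is bounded above by 1 but unbounded below, which is exactly why a symmetric mean-and-variance interval is awkward here.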
A related question here discusses the distribution of $R^2$ under the null hypothesis that the coefficients of a regression model are all zero.