3

I want to make a quantitative statement like "There is a 90% chance that this $X$-$Y$-data follows a linear model (with some noise added on top)". I can't find this kind of statement discussed in standard statistics textbooks, such as James et al.'s "An Introduction to Statistical Learning" (asking as a physicist with rudimentary statistics knowledge).

To be more precise: I'm assuming that some data is generated from $Y = f(X) + \epsilon$, where $f(X)$ is some exact relationship, e.g. the linear model $f(X) = \beta_0 + \beta_1 X$, and $\epsilon$ is noise drawn from a normal distribution with some unknown standard deviation $\sigma$. I want to calculate the probability that some proposed $\hat f(X)$ matches the actual $f(X)$.

I can do a least-squares fit to determine the estimate $\hat f(X)$. Now, if the model is correct ($\hat f(X) = f(X)$), then the residuals of the fit should exactly correspond to $\epsilon$. At the very least, if the data fits the model, there should be no correlation between the residuals and $X$. To be more quantitative, though, I would want to check that the residuals are in fact from a normal distribution with unknown $\sigma$ (although the residual standard error, RSE, will be an estimate for $\sigma$, so I could also assume that $\sigma$ is actually known). Isn't there some way to calculate a p-value for whether some given values (the residuals) are from a given distribution (a normal distribution with the RSE as its standard deviation)?
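The workflow described above (fit, residuals, test against a normal with $\sigma$ = RSE) can be sketched like this; the data here is simulated, and the choice of the Kolmogorov-Smirnov test is just one option among the normality tests mentioned in the comments:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # simulated linear data

# Least-squares fit and residuals
beta1, beta0 = np.polyfit(x, y, deg=1)
residuals = y - (beta0 + beta1 * x)

# Residual standard error (n - 2 degrees of freedom for a two-parameter fit)
rse = np.sqrt(np.sum(residuals**2) / (x.size - 2))

# One-sample KS test: H0 = residuals drawn from N(0, rse)
stat, p_value = stats.kstest(residuals, "norm", args=(0.0, rse))
print(stat, p_value)
```

One caveat: because the RSE is estimated from the same residuals being tested, the plain KS p-value is conservative (this is the problem the Lilliefors correction addresses).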

I'm not looking for the $R^2$ statistic, which tells me how linear the data is but also takes the noise into account (a larger $\sigma$ lowers the $R^2$ value). In my case, I don't care how noisy the data is, as long as it's normally distributed around the fit $\hat f(X)$.

  • 3
    have you looked at normality tests? https://en.wikipedia.org/wiki/Normality_test – seanv507 Jan 01 '18 at 11:27
  • 2
    Checking the residuals has little to do with the other questions about goodness of fit of $f$ to the data. It may also help to be aware that the question as you phrase it can be answered only by supplying a prior probability distribution for $f$ and using Bayesian techniques--which eschew p-values. BTW, $R^2$ (by itself) [tells you little about linearity.](https://stats.stackexchange.com/questions/13314/is-r2-useful-or-dangerous/13317#13317) – whuber Jan 01 '18 at 15:15
  • 1
    Yeah, it took me some time to get clarity on what I'm asking. I'm indeed looking for a normality test on the residuals, and I'd then interpret the probability that the residuals are normal as the probability that the fitted $\hat f$ matches the actual $f$. So my question is really how to get a single p-value for the normality. – Michael Goerz Jan 01 '18 at 19:38
  • There are several diagnostics to run on a regression model; normality _per se_ of residuals might be among the least important. See [this page](https://stats.stackexchange.com/q/32600/28500) for hints on diagnostics. In frequentist analysis you can't get a _p_-value for having a given distribution, only a _p_-value that your data don't fit a given distribution; this _p_-value is often "significant" (non-normality) but the distribution may be close enough to normal for practical purposes. See [this page](https://stats.stackexchange.com/q/2492/28500) on the usefulness of normality testing. – EdM Jan 01 '18 at 20:03

2 Answers

0

One difficulty with presenting a single number as you describe is that the statistical confidence in the fitted model (here, a straight line) can differ between the low and high ends of the data range. A single number does not capture this, as you can see in these animations of 95% confidence intervals for different curve-fitting problems:

http://zunzun.com/CommonProblems/

James Phillips
0

QQ Plot

  • To check whether the residuals come from a normal distribution, you can use a QQ plot.
  • The X-axis shows the theoretical quantiles of the normal distribution and the Y-axis the quantiles obtained from the data set. If the points in the figure lie on a straight line, then the model is useful and satisfies the normality assumption of linear regression.

[Figure: QQ plot of sample quantiles of the residuals against theoretical normal quantiles]
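A QQ plot like the one above can be produced with `scipy.stats.probplot`, which also returns the correlation coefficient of the points against the fitted straight line (close to 1 when the residuals are near-normal). The residuals here are simulated as a stand-in for residuals from an actual fit:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(scale=2.0, size=200)  # stand-in for fit residuals

# probplot returns the ordered (theoretical, sample) quantile pairs and a
# least-squares line through them; r near 1 means the points lie on a line.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(r)
```

Passing `plot=plt.gca()` (with Matplotlib) would draw the figure directly.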

KL divergence (Kullback-Leibler divergence)

  • For comparing two normal distributions $N_1 = \mathcal{N}(\mu_1, \sigma_1)$ and $N_2 = \mathcal{N}(\mu_2, \sigma_2)$:
  • $KL(N_1 \,\|\, N_2) = \log{\frac{\sigma_2}{\sigma_1}} - \frac{1}{2} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^{2}}$ (lower means a closer match between the two distributions)

    If the two distributions are the same, the KL divergence is zero. For your use case, set $\mu_1 = 0$, $\mu_2 = 0$, and choose $\sigma_2$ as you wish.
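The formula above translates directly into a few lines of Python (the function name is illustrative):

```python
import numpy as np

def kl_normal(mu1, sigma1, mu2, sigma2):
    """KL(N1 || N2) for two univariate normal distributions."""
    return (np.log(sigma2 / sigma1) - 0.5
            + (sigma1**2 + (mu1 - mu2)**2) / (2.0 * sigma2**2))

print(kl_normal(0.0, 1.0, 0.0, 1.0))  # identical distributions -> 0.0
print(kl_normal(0.0, 1.0, 0.0, 2.0))  # mismatched sigmas -> positive
```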

For p-value

user3808268
  • Are you proposing to split the sample randomly into two groups, and use KL, or Levene's test, to get a p-value that the two groups have the same standard deviations? That's not *quite* "probability that residuals are normal", but it's pretty close. – Michael Goerz Jan 01 '18 at 19:26
  • Yes, but here one group will be the residuals from the data ($\mu_1, \sigma_1$) and the other group will be generated by drawing an equal number of samples from a normal distribution with the given mean and standard deviation (the mean and unknown sigma you are talking about). – user3808268 Jan 01 '18 at 20:32
  • Generally speaking, you can use an F-test to compare the variances of 2 data sets ($\sigma_1, \sigma_2$), but the F-test is **very** sensitive to non-normality. Thus I suggested Levene's test. The F-test is very easy to use, though, and it's also generally used for comparing variances. [F-test for comparing variances](http://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/f-test/). At the end it also gives a p-value. – user3808268 Jan 01 '18 at 20:37
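The comparison proposed in these comments (residuals vs. a synthetic reference sample drawn from $\mathcal{N}(0, \text{RSE})$, compared with Levene's test) can be sketched as follows; the data here is simulated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
residuals = rng.normal(scale=1.5, size=100)           # residuals from the fit
reference = rng.normal(loc=0.0, scale=1.5, size=100)  # samples from N(0, RSE)

# Levene's test: H0 = both groups have equal variance.
# It is much less sensitive to non-normality than the classical F-test.
stat, p_value = stats.levene(residuals, reference)
print(stat, p_value)
```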