Do OLS residuals tell us anything about the distribution of the error term?

Question

I'd like some intuition on a question that has long confused me. Suppose we have a data-generating process

$$y_i = x_i' \beta + \varepsilon_i $$

where $\mathbb{E}[\varepsilon_i] = 0$, $\varepsilon_i \perp x_i$, and $\varepsilon_i$ is drawn i.i.d. from a probability distribution $U$.

When we regress $y$ on $x$ in a finite sample, we will get a set of residuals $\hat{\varepsilon}_i$. My question is, do these residuals tell us anything about $U$?

It seems obvious that the greater the variance of $U$ (i.e. when the DGP is noisier), the greater we can expect the empirical variance of $\hat{\varepsilon}_i$ to be, so clearly they must be somewhat related. Moreover, if you indeed had $y$ and $x$ for the entire population, there would be no estimation error, so the empirical distribution of $\hat{\varepsilon}_i$ would match $U$ exactly. So, for finite samples, what can residuals tell us about the shape of the unknown distribution $U$?

The answer depends on *how* you perform the regression. Are you thinking of ordinary least squares regression? Note that even when you have the entire population, the regression estimates and residuals might still be wrong: you also need the right model and you need a suitable regression technique for fitting it. — whuber, Apr 24 '19 at 02:39
If you fit the 'true' model, your residuals are linear combinations of errors $r = (I-H)\epsilon$ (fitting by least squares and assuming it has an intercept). Typically (where leverages are small) residuals are highly correlated to errors -- but each residual is in general a different combination (each with its different distribution); if you combine those residuals you typically end up with a kind of "smeared" version of the original. It's easy to investigate behaviour via simulation for various hat-matrices ($H=X(X^\top X)^{-1}X^\top$) and error-distributions. — Glen_b, Apr 24 '19 at 02:57
@whuber: yes, I should have mentioned this in the body of my question but from the title I did mean OLS. Further, as implied from my use of the term "data-generating process," the model is correctly specified. — Kenneth, Apr 24 '19 at 20:22
@Glen_b: you're on the right track and I've done a bit of simulating with the $H$ from OLS that you provided, but actually got very little correlation between residuals and errors! — Kenneth, Apr 24 '19 at 20:25
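
The identity in Glen_b's comment can be checked directly. The following is a minimal sketch, not code from the thread: the sample size, the uniform regressor, the coefficients `c(2, 3)`, and the centred exponential errors are all illustrative assumptions. It builds the hat matrix by hand, compares $(I-H)\varepsilon$ with the residuals from `lm`, and computes the correlation between residuals and errors.

```r
set.seed(1)

n   <- 200                               # assumed sample size
x   <- runif(n)                          # assumed regressor
X   <- cbind(1, x)                       # design matrix with an intercept column
H   <- X %*% solve(crossprod(X)) %*% t(X)  # hat matrix H = X (X'X)^{-1} X'

eps <- rexp(n) - 1                       # assumed skewed errors with mean zero
y   <- drop(X %*% c(2, 3)) + eps         # correctly specified linear DGP

r     <- resid(lm(y ~ x))                # OLS residuals
r_hat <- drop((diag(n) - H) %*% eps)     # residuals implied by (I - H) eps

max_diff <- max(abs(r - r_hat))          # should be numerically zero
cor_re   <- cor(r, eps)                  # correlation of residuals with errors
```

With only two coefficients and 200 observations the leverages are small, so under these assumptions `cor_re` should come out close to 1. A much lower value would suggest checking how the errors and residuals were paired up, or whether a few points carry very high leverage.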

score 1 · Answer 1 · edited Jun 11 '20 at 14:32

Residuals from linear regression can tell you something about the underlying distribution of the errors in the data. See *Regression Analysis*, 7th edition, by William Mendenhall, p. 400 (figure not reproduced here).

In that case the errors are not uniform. However, I prepared a couple of examples that match your case.

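# Sorted Bernoulli(0.5) draws regressed on their index
a <- rbinom(100, 1, .5)
a <- sort(a)  # sort ascending
holder <- data.frame(idx = seq_along(a), a = a)

ols <- lm(a ~ idx, data = holder)

resid <- holder$a - predict(ols)

# Residuals against fitted values
plot(predict(ols), resid)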


normErr.Small <- lapply(1:100, function(x) {
  x + rnorm(n = 1, 1, sd = 1)
  
})

dat <- data.frame(x = 1:length(normErr.Small),y = unlist(normErr.Small))

ols.small <- lm(y ~ x, data = dat)
plot(ols.small, which = 1, main = "normal 1 1")

#larger variance
normErr.Large <- lapply(1:100, function(x) {
  x + rnorm(n = 1, 1, sd = 10000)
  
})

dat <- data.frame(x = 1:length(normErr.Large),y = unlist(normErr.Large))

ols.Large <- lm(y ~ x, data = dat)
plot(ols.Large, which = 1, main = "normal 1 10000")

The first plot is generated with an error term drawn from a normal distribution with mean 1 and sd 1; the second with mean 1 and sd 10000. Although both are homoskedastic, the residuals are much larger in the second plot because the errors are drawn from a distribution with a much larger spread.

While the spread of the error distribution is reflected in the residuals, its center is not.
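
To put a number on that comparison, here is a small sketch (my own illustration, reusing the same setup of `x = 1:100` and normal errors with mean 1; the helper name `fit_sd` and the seed are hypothetical). It fits the same model at both error scales and reads off the residual standard error with `sigma()`.

```r
set.seed(42)

x <- 1:100

# Fit y = x + error for a given error sd and return the residual standard error
fit_sd <- function(sd) {
  y <- x + rnorm(length(x), mean = 1, sd = sd)
  sigma(lm(y ~ x))
}

s_small <- fit_sd(1)       # errors with sd 1
s_large <- fit_sd(10000)   # errors with sd 10000
```

Under these assumptions `s_small` should land near 1 and `s_large` near 10000, which is the sense in which the residuals carry information about the spread of the errors.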

normErr.Off <- lapply(1:100, function(x) {
  x + rnorm(n = 1, 10000, sd = 1)
  
})

dat <- data.frame(x = 1:length(normErr.Off),y = unlist(normErr.Off))

ols.Off <- lm(y ~ x, data = dat)
plot(ols.Off, which = 1, main = "normal 10000 1")

This generates a residual plot that, despite the distribution $U$ having mean 10000 and sd 1, is essentially the same as the first plot, from $U$ with mean 1 and sd 1.

That is because changing the mean of the errors merely shifts y by the new mean, and the fitted intercept absorbs that shift. The residuals stay centered at zero either way.
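
A quick way to see where the mean goes is to look at the fitted intercept. This is a hedged sketch using the same setup as the last example (the seed and variable names are my own): with an intercept in the model, OLS residuals average to zero by construction, so a nonzero error mean shows up in the intercept instead.

```r
set.seed(7)

x <- 1:100
y <- x + rnorm(100, mean = 10000, sd = 1)  # errors centered at 10000

fit <- lm(y ~ x)

b0         <- unname(coef(fit)[1])  # estimated intercept
mean_resid <- mean(resid(fit))      # average residual
```

Here `b0` should sit close to 10000 while `mean_resid` is zero up to floating-point error, so the residuals alone cannot reveal the center of $U$.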

All code is written in R.

Good answer @NicoFish! I am in fact seeking an answer for a general probability distribution $U$, so your excerpt from Mendenhall is useful. Do you know of any statistical/econometric papers that generalize this? — Kenneth, Apr 24 '19 at 20:33
@Kenneth if you are interested in econometrics, look at the multiplicative model, y = E[y]*error. It produces residuals that are shaped like a sideways cone. — NicoFish, Apr 24 '19 at 22:09
