Use residuals as dependent variable

Question

Consider the following two-stage regression (each stage estimated by OLS).

Stage 1: $y_i = \alpha + \beta X_i + u_i$

Stage 2: $u_i = \gamma + \delta Z_i + v_i$,

where $y_i$, $X_i$, and $Z_i$ are random variables, $i$ indexes the observations, $u_i$ and $v_i$ are residuals, and $\alpha$, $\beta$, $\gamma$, and $\delta$ are estimated coefficients.

I have two questions:

(i) Is the standard error of $\delta$ correct? If not, what is the reason?

(ii) How can I get the correct standard error?

Thank you a lot!

I want to generate a measure for a specific phenomenon (represented by the residuals $u_i$). Then I want to study the relationship between this measure and a third variable $Z_i$. — Efissi, Feb 16 '16 at 20:59
Thank you a lot for your help! Unfortunately, I don't see the parallel. In your example, there is measurement error in an independent variable. How is that connected to my question? — Efissi, Feb 16 '16 at 21:07
Related: https://stats.stackexchange.com/questions/127001/analysing-the-residuals-themselves — Tim, Feb 16 '16 at 21:29
I have seen that; unfortunately, there is no fully insightful discussion of my two raised questions. — Efissi, Feb 16 '16 at 21:34
Your questions are not very clear. What do you mean by the standard error of $\delta$ being "correct"? Do you mean the standard error for an estimator of $\delta$? If so, it depends on how you're estimating it. Also, why shouldn't we view this as a one-stage model with $y_i = \alpha + \gamma + \beta x_i + \delta z_i + u_i + v_i$? — dsaxton, Feb 22 '16 at 14:27
@dsaxton, how would you estimate such a model? Efissi, maybe the following is relevant (I am not sure): Pagan, Adrian. "Econometric issues in the analysis of regressions with generated regressors." *International Economic Review* (1984): 221-247. — Richard Hardy, Aug 10 '16 at 16:35

jeiroje · Answer 1 · 2016-08-09T09:35:56.637

(Sorry, this is not a full-fledged answer, just my thoughts, posted in the hope of restarting the discussion.)

I am going through the same issue right now. I think what Efissi means by "is the standard error of $\delta$ correct?" is exactly what @dsaxton raised in the comments, i.e., whether the estimates and confidence intervals of the coefficients from the two-stage model are the same (up to some reparametrization) as those from the one-stage model.

My personal view is yes, as long as $X$ and $Z$ are independent of each other. The reason is that, if you picture your data as a cloud in a multidimensional space and run a (univariate) OLS linear regression on $X$, you are projecting the data onto the (one-dimensional) subspace spanned by $X$, but you are not applying any transformation to the rest of the space. Thus the relationship to any other covariate should remain unchanged.
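The geometric picture can be written out with projection matrices. This is only a sketch, in notation that does not appear in the question: $\mathbf{1}$ is a column of ones, $P_X$ is the orthogonal projection onto the span of $(\mathbf{1}, X)$, $M_X = I - P_X$, and $\tilde Z = Z - \bar Z\,\mathbf{1}$ is the centered $Z$. The stage-1 residuals and the stage-2 slope are then

$$
\hat u = M_X y, \qquad \hat\delta_{\text{two-stage}} = \frac{\tilde Z^\top \hat u}{\tilde Z^\top \tilde Z} = \frac{(M_X \tilde Z)^\top y}{\tilde Z^\top \tilde Z}.
$$

If $X$ and $Z$ have zero sample correlation, then $\tilde Z$ is orthogonal to both $\mathbf{1}$ and $X$, so $M_X \tilde Z = \tilde Z$ and the first stage leaves the $Z$ direction untouched. If they are correlated in the sample, $M_X$ also removes part of $Z$, and the second stage no longer sees the same regressor as the one-stage model does.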

Mathematically, substituting the model for $u_i$ in the model for $y_i$:

$$
\begin{split}
y_i &= \alpha + \beta X_i + u_i \\
    &= \alpha + \beta X_i + (\gamma + \delta Z_i + v_i) \\
    &= (\alpha + \gamma) + \beta X_i + \delta Z_i + v_i
\end{split}
$$

Thus, in particular, $\delta$ should have the same properties in the two-stage model as in the one-stage model $Y \sim X + Z$.

Does it make sense or am I missing something? In particular, is the independence condition necessary?

UPDATE

A short check with real data:

> dat <- mtcars

> lm1 <- lm(mpg ~ hp + qsec, data=dat)
> summary(lm1)$coef
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 48.32370517 11.10330633  4.352191 1.526469e-04
hp          -0.08459304  0.01393281 -6.071497 1.309333e-06
qsec        -0.88657962  0.53458538 -1.658443 1.080072e-01


> lm2a <- lm(mpg ~ hp, data=dat)
> summary(lm2a)$coef
              Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 30.09886054  1.6339210 18.421246 6.642736e-18
hp          -0.06822828  0.0101193 -6.742389 1.787835e-07

> lm2b <- lm(lm2a$residuals ~ dat$qsec)
> summary(lm2b)$coef
              Estimate Std. Error   t value  Pr(>|t|)
(Intercept)  7.8871608  6.8116279  1.157897 0.2560418
dat$qsec    -0.4418887  0.3797911 -1.163505 0.253795

For comparison, the two covariates are clearly related to each other:

> summary(lm(hp ~ qsec, data=dat))$coef
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 631.70375  88.699525  7.121839 6.382739e-08
qsec        -27.17368   4.945556 -5.494565 5.766253e-06

> summary(lm(qsec ~ hp, data=dat))$coef
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 20.55635402 0.542424287 37.897186 6.728254e-27
hp          -0.01845831 0.003359377 -5.494565 5.766253e-06

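So with real data the two approaches disagree: the `qsec` coefficient is $-0.887$ (SE $0.535$) in the one-stage model but $-0.442$ (SE $0.380$) in the two-stage model, and `hp` and `qsec` are strongly correlated. By contrast, a simulation with artificially created independent $X$ and $Z$ delivered exactly the same values in the two models. Thus, the difference between the estimates could be interpreted as some measure of dependence or correlation between the covariates, but honestly I don't know how to quantify and/or use it.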
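One possible way to quantify the gap, offered as a sketch and not a full treatment: by the Frisch-Waugh-Lovell theorem, the one-stage coefficient on $Z$ is obtained by regressing the stage-1 residuals on $Z$ *after* $Z$ has also been residualized on $X$. Regressing them on the raw $Z$ instead shrinks the point estimate by the factor $1 - r_{XZ}^2$, where $r_{XZ}$ is the sample correlation of $X$ and $Z$, i.e. $\hat\delta_{\text{two-stage}} = \hat\delta_{\text{one-stage}}\,(1 - r_{XZ}^2)$. For the `mtcars` fit above this gives roughly $-0.887 \times (1 - 0.50) \approx -0.442$. The helper `two_stage_check` and the simulation design below are illustrative assumptions, not part of the original check.

```r
# Sketch: compare the one-stage fit y ~ x + z with the two-stage fit
# (residuals of y ~ x regressed on z). Function and variable names are
# hypothetical, chosen only for this illustration.
two_stage_check <- function(y, x, z) {
  full   <- lm(y ~ x + z)     # one-stage model
  stage1 <- lm(y ~ x)         # stage 1: y on x
  u      <- resid(stage1)     # stage-1 residuals
  stage2 <- lm(u ~ z)         # stage 2: residuals on raw z
  r2_xz  <- cor(x, z)^2       # squared sample correlation of x and z
  c(delta_full      = unname(coef(full)["z"]),
    delta_two_stage = unname(coef(stage2)["z"]),
    predicted       = unname(coef(full)["z"]) * (1 - r2_xz),
    se_full         = summary(full)$coefficients["z", "Std. Error"],
    se_two_stage    = summary(stage2)$coefficients["z", "Std. Error"])
}

# Real data: same variables as in the check above
res_real <- with(mtcars, two_stage_check(mpg, hp, qsec))
print(res_real)

# Simulated data with independently drawn covariates (assumed design)
set.seed(1)
n <- 1000
x <- rnorm(n)
z <- rnorm(n)
y <- 1 + 2 * x - 0.5 * z + rnorm(n)
res_sim <- two_stage_check(y, x, z)
print(res_sim)
```

In both cases `delta_two_stage` should agree with `predicted` up to floating-point error. In the simulated case the sample correlation is close to, but not exactly, zero, so the two point estimates are nearly identical. The relation covers only the point estimate. The second-stage standard error treats the residuals as ordinary data and uses $n - 2$ degrees of freedom, so it need not coincide with the one-stage standard error.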
