Regression where a subset of observations are missing data on an independent variable

Question

Consider the regression equations below:

\begin{align} Y_i &= \beta_0 + \beta_1 X_{i1} + \varepsilon_i \\ Y_j &= \beta_0 + \beta_1 X_{j1} + \beta_2 X_{j2} + \varepsilon_j \end{align}

where $Y_i,\ X_{i1},\ \varepsilon_i,\ Y_j,\ X_{j1},\ X_{j2},$ and $\varepsilon_j$ are vectors, and the subscripts $i$ and $j$ index distinct sets of observations. The $i$ respondents did not meet a qualification criterion and hence were not asked the question that corresponds to $X_2$.

The dependent variable and the first independent variable are the same in both regression equations, but the second equation has an independent variable that is not present in the first. Obviously, I can estimate the two regressions separately, but that will not be efficient. Therefore, I was considering rewriting the first one as:

$$ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i $$

where $X_{i2}$ is a vector of $0$s.

Then I can obtain the parameter estimates by applying OLS to the stacked equation below:

$$ \begin{bmatrix} Y_i \\ Y_j \end{bmatrix} = \begin{bmatrix} {\bf 1} & X_{i1} & X_{i2} \\ {\bf 1} & X_{j1} & X_{j2} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \varepsilon_i \\ \varepsilon_j \end{bmatrix} $$

In the above equation, ${\bf 1}$ stands for a vector of $1$s of the appropriate dimension.
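For concreteness, the stacked regression above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the data are simulated, and the group sizes, the "true" coefficients, and the helper names `stack_design` and `fit_stacked` are hypothetical and not part of the original setup.

```python
import numpy as np


def stack_design(x_i1, x_j1, x_j2):
    """Stacked design matrix [1, X1, X2], with X2 set to 0 for group i."""
    x_i1, x_j1, x_j2 = (np.asarray(a, dtype=float) for a in (x_i1, x_j1, x_j2))
    x1 = np.concatenate([x_i1, x_j1])
    # Group i was never asked the X2 question, so its block is all zeros.
    x2 = np.concatenate([np.zeros_like(x_i1), x_j2])
    return np.column_stack([np.ones_like(x1), x1, x2])


def fit_stacked(y_i, x_i1, y_j, x_j1, x_j2):
    """OLS estimates (beta_0, beta_1, beta_2) from the pooled observations."""
    X = stack_design(x_i1, x_j1, x_j2)
    y = np.concatenate([np.asarray(y_i, dtype=float), np.asarray(y_j, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Simulated data (assumption: both groups really share beta_0 and beta_1).
rng = np.random.default_rng(0)
n_i, n_j = 100, 150
beta_true = np.array([1.0, 2.0, 3.0])
x_i1 = rng.normal(size=n_i)
x_j1 = rng.normal(size=n_j)
x_j2 = rng.normal(loc=2.0, size=n_j)
y_i = beta_true[0] + beta_true[1] * x_i1 + rng.normal(scale=0.5, size=n_i)
y_j = (beta_true[0] + beta_true[1] * x_j1 + beta_true[2] * x_j2
       + rng.normal(scale=0.5, size=n_j))

beta_hat = fit_stacked(y_i, x_i1, y_j, x_j1, x_j2)
print(beta_hat)
```

Note that with $X_{i2}$ set to zero, the pooled fit forces both groups to share $\beta_0$ and $\beta_1$. Whether that restriction is reasonable is a modeling assumption that the code does not check.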

Is the above a standard approach to obtaining efficient estimates? Is there a name for this method of estimation?

In what sense is estimating a multiple regression equation both w/ the original numbers for a variable & w/ all $0$'s "more efficient" than just estimating the model w/ & w/o the variable? Have you tried estimating a model w/ a vector of $0$'s? It makes no sense, the $\hat\beta_2$ can equally well be any value. — gung - Reinstate Monica, Oct 31 '13 at 14:55
The dependent and the independent variables in the two equations correspond to different respondents. In other words, I am not estimating two different models on the same data. For example, the first equation corresponds to respondents 1 to 100 for whom there is no corresponding value for the second variable whereas the second equation corresponds to another set of respondents (say from 101 to 250). Thus, only $X_{i2}$ is set to $0$s and $X_{j2}$ is a vector of non-zero values. Estimating the two separately is likely to give me higher estimates for the error variance, no? — user32139, Oct 31 '13 at 15:08
Thanks for clarifying. I missed the point about the different respondents. What you have is [missing data](http://en.wikipedia.org/wiki/Missing_data). Writing in $0$'s for every observation is a very poor form of [imputation](http://en.wikipedia.org/wiki/Imputation_(statistics)). What you need to know about are proper imputation strategies to use here. — gung - Reinstate Monica, Oct 31 '13 at 15:23
I am aware of data imputation ideas. But, in this case, the respondents in the first group did not meet a qualification criterion and hence were not asked the question that corresponds to the second variable. So, strictly speaking, we do not have missing data as the respondents never got a chance to see the question in the first place. — user32139, Oct 31 '13 at 15:37
From a statistical perspective, those are still missing values. But this information is really important for potential answerers to know in order to be able to provide the appropriate answer. If there is any other relevant info about your situation, you should edit your Q & add it. — gung - Reinstate Monica, Oct 31 '13 at 15:42
@gung These are not missing values in the usual sense: they are a particular form of interdependent response. All you have to do is introduce a dummy variable to distinguish the two cases and interact it with $X_{\cdot 2}$ as described at http://stats.stackexchange.com/a/1795 and http://stats.stackexchange.com/a/1795. That is effectively what is proposed in this question. — whuber, Oct 31 '13 at 19:27
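The dummy-plus-interaction idea in the last comment can be sketched as follows. This is one possible reading of that suggestion, not code taken from the linked answer: the indicator `asked` (1 for respondents who got the $X_2$ question), the helper name `dummy_interaction_design`, the simulated data, and the coefficient values are all hypothetical.

```python
import numpy as np


def dummy_interaction_design(x1, x2, asked):
    """Design matrix [1, X1, D, D*X2], where D = 1 if the X2 question was asked."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    d = np.asarray(asked, dtype=bool)
    # D*X2: the observed X2 where asked, 0 otherwise (NaN placeholders are dropped).
    interaction = np.where(d, x2, 0.0)
    return np.column_stack([np.ones_like(x1), x1, d.astype(float), interaction])


# Simulated data: the first 100 respondents were not asked, the next 150 were.
rng = np.random.default_rng(1)
n_i, n_j = 100, 150
asked = np.r_[np.zeros(n_i, dtype=bool), np.ones(n_j, dtype=bool)]
x1 = rng.normal(size=n_i + n_j)
x2 = np.where(asked, rng.normal(loc=2.0, size=n_i + n_j), np.nan)

# Assumed coefficients: intercept, X1 slope, intercept shift for D = 1, X2 slope.
coef_true = np.array([1.0, 2.0, -0.5, 3.0])
X = dummy_interaction_design(x1, x2, asked)
y = X @ coef_true + rng.normal(scale=0.5, size=n_i + n_j)

coef_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef_hat)
```

The interaction column here is identical to the zero-filled $X_2$ column from the question. The extra column $D$ lets the group that was asked have its own intercept; dropping it gives back the question's design.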
