5

Suppose you had a method for estimating the population covariance of a vector-valued random variable given observations of that random variable, say $f(Z) \rightarrow C$, where the rows of $Z$ are observations of the random variable. Can one abuse this process to perform a least squares regression $y = x^T\beta + \epsilon$, for $n$-dimensional vector $x$? The idea: given a vector of observations of $y$, call it $Y$, and a matrix of paired observations of $x$, call it $X$, form $Z = [X\; Y]$ (concatenate $Y$ to $X$ as an extra column), compute $C = f(Z)$, and then let $\hat{\beta} = C_{1:n,1:n}^{-1} C_{1:n,n+1}$.

A few questions:

  1. Will this work under optimistic conditions? (A simple simulation in Matlab suggests it does: the precision is not great, but the results agree to about 4 significant figures.)
  2. Is this a known trick? If so, does it have a name I can search for, or is it so trivial that it doesn't require a name?
  3. Most importantly, if $f$ can deal with input where some values are missing (say, MCAR: missing completely at random), under what conditions will this technique behave reasonably for regression with missing values?

**Edit:** I am assuming that $x$ is drawn from a zero-mean process and that the regression has no intercept term.
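For concreteness, here is a minimal sketch of the trick in NumPy (standing in for the Matlab simulation), taking $f$ to be the ordinary sample covariance `np.cov`. The coefficients and dimensions below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5000, 3                          # m observations, n-dimensional x
beta = np.array([1.5, -2.0, 0.5])       # arbitrary true coefficients

X = rng.standard_normal((m, n))         # zero-mean x, no intercept term
Y = X @ beta + 0.1 * rng.standard_normal(m)

Z = np.column_stack([X, Y])             # concatenate Y as an extra column
C = np.cov(Z, rowvar=False)             # stand-in for f(Z)

# beta_hat = C_{1:n,1:n}^{-1} C_{1:n,n+1}
beta_hat = np.linalg.solve(C[:n, :n], C[:n, n])
```

With this setup `beta_hat` recovers `beta` to several significant figures, consistent with what the Matlab experiment showed.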

shabbychef
  • 10,388
  • 7
  • 50
  • 93

1 Answer

3

Your "trick" is the solution to the so-called normal equations for multiple regression, which is the usual least-squares answer in multiple regression.
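To make the connection explicit (assuming $f$ is the usual sample covariance, with $m$ observations and the columns of $Z$ mean-centered, which your zero-mean edit grants):

$$C = \frac{1}{m-1} Z^\top Z = \frac{1}{m-1}\begin{bmatrix} X^\top X & X^\top Y \\ Y^\top X & Y^\top Y \end{bmatrix},$$

so that

$$\hat{\beta} = C_{1:n,1:n}^{-1}\, C_{1:n,n+1} = (X^\top X)^{-1} X^\top Y,$$

since the $\frac{1}{m-1}$ factors cancel; that is exactly the solution of the normal equations $X^\top X \hat{\beta} = X^\top Y$.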

As for missing data: what $f$ do you have in mind that knows how to estimate $C$ in that case?

There are methods, such as imputation, for filling in missing values. Perhaps Little and Rubin (*Statistical Analysis with Missing Data*) can give further information on the issues involved.
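One concrete choice of $f$ that tolerates MCAR data is a pairwise-complete sample covariance: each entry $(i, j)$ is computed from only the rows where both columns $i$ and $j$ are observed. This is a sketch, not a recommendation; note that a pairwise-complete matrix need not be positive semidefinite, which can make the inversion step misbehave when missingness is heavy:

```python
import numpy as np

def pairwise_cov(Z):
    """Pairwise-complete covariance: entry (i, j) uses only rows where
    both column i and column j are observed (non-NaN)."""
    _, k = Z.shape
    C = np.empty((k, k))
    for i in range(k):
        for j in range(i, k):
            ok = ~np.isnan(Z[:, i]) & ~np.isnan(Z[:, j])
            zi = Z[ok, i] - Z[ok, i].mean()
            zj = Z[ok, j] - Z[ok, j].mean()
            C[i, j] = C[j, i] = zi @ zj / (ok.sum() - 1)
    return C

# Hypothetical MCAR experiment: knock out 10% of entries at random.
rng = np.random.default_rng(1)
m, n = 20000, 2
beta = np.array([1.0, -0.5])
X = rng.standard_normal((m, n))
Y = X @ beta + 0.1 * rng.standard_normal(m)
Z = np.column_stack([X, Y])
Z[rng.random(Z.shape) < 0.1] = np.nan   # MCAR missingness

C = pairwise_cov(Z)
beta_hat = np.linalg.solve(C[:n, :n], C[:n, n])
```

Under MCAR with plenty of data, each pairwise entry is a consistent estimate of the corresponding population covariance, so the plug-in $\hat{\beta}$ remains consistent; under MAR or MNAR missingness that guarantee breaks down.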

ronaf
  • 371
  • 2
  • 6