4

A statement from the book An Introduction to Statistical Learning with Applications in R doesn't quite make sense to me. It says, "In cases where the number of predictors is greater than the number of instances, we cannot even fit the multiple linear regression model using least squares." Why is that?

gung - Reinstate Monica
GeneX
  • This leads to multicollinearity. I have not read that book but I guess they mean $X^TX$ cannot be inverted, so the formula $\beta=(X^TX)^{-1}X^Ty$ cannot be used. – user112758 Jan 18 '17 at 03:30
  • You can pick the minimum-norm least squares solution to break the infinite-way tie among least squares solutions. Read this https://see.stanford.edu/materials/lsoeldsee263/08-min-norm.pdf (a short sketch of this appears after these comments). As for @gung's answer, identifiability, schmidentifiability. If we're trying to make predictions, rather than trying to estimate coefficients, we don't need no stinkin' identifiability. – Mark L. Stone Jan 18 '17 at 04:59
  • The quoted text says that the F-statistic can't be used when the number of predictors is greater than the number of instances. But I think it makes too strong a statement, or at least a statement open to misinterpretation, in saying that a least squares model cannot be fit when the number of predictors is greater than the number of instances (data points). – Mark L. Stone Jan 18 '17 at 05:11
  • @MarkL.Stone gung's answer is very informative at the level of the OP, but I truly would appreciate your giving a formal answer discussing how this is modified when regularization comes into play, for instance. I also think further discussion on matrix invertibility and null spaces could be enlightening (this at a more basic level than your comments). Best wishes! – Antoni Parellada Jan 18 '17 at 13:20
  • @Antoni Parellada I'll let you have the honor/get the glory and points from providing the answer. Thanks for the suggestion though. – Mark L. Stone Jan 18 '17 at 13:32
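To make the two comments above concrete, here is a small R sketch (illustrative only; the simulated data with n = 5 instances and p = 10 predictors, and the use of MASS::ginv for the Moore-Penrose pseudoinverse, are just one way to show it):

```r
set.seed(1)
n <- 5; p <- 10                      # more predictors (p) than instances (n)
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

XtX <- t(X) %*% X                    # 10 x 10, but its rank is at most n = 5
qr(XtX)$rank                         # 5: XtX is singular
# beta <- solve(XtX) %*% t(X) %*% y  # would fail: system is computationally singular

# lm() runs, but it can only estimate n of the p coefficients; the rest come back NA
coef(lm(y ~ X - 1))

# The minimum-norm least squares solution via the pseudoinverse (MASS::ginv)
beta_mn <- MASS::ginv(X) %*% y
max(abs(X %*% beta_mn - y))          # ~ 0: it reproduces the data exactly
```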

1 Answer

7

Such a model is not identifiable. That is, you cannot tell which variable contributed what amount, because there are infinitely many possible combinations that are equally good. For starters, skip the regression model and just consider a simple sum:
$$ 5 = x_1 + x_2 $$ What numbers do you think $x_1$ and $x_2$ might be? They really could be anything. Here are some possibilities:
\begin{align} x_1 &= 2,\qquad &x_2 &= 3 \\ x_1 &= 3,\qquad &x_2 &= 2 \\ x_1 &= 5,\qquad &x_2 &= 0 \\ x_1 &= -10000,\qquad &x_2 &= 10005 \\ &\text{etc.} \end{align} There is no way to decide which set of estimated values is more likely.

The situation with multiple regression is the same. In fact, that is a multiple regression, albeit an exceedingly simple one. You have 1 instance and 2 predictors. We can make more elaborate examples with 2 instances, so that there is a nominal slope, but the fundamental problem is the same. More generally, whenever the number of predictors exceeds the number of instances, the design matrix $X$ has more columns than rows, so $X^TX$ is singular and the normal equations have infinitely many solutions, all of which fit the data equally well; least squares cannot single out one of them.
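A tiny R sketch of the same point (again only an illustration, reusing the numbers from the sum above):

```r
# One instance, two predictors: 5 = b1*x1 + b2*x2 with x1 = x2 = 1
x <- c(1, 1)
y <- 5

# Several very different coefficient pairs, all fitting the single observation exactly
candidates <- list(c(2, 3), c(3, 2), c(5, 0), c(-10000, 10005))
sapply(candidates, function(b) y - sum(b * x))     # every residual is 0

# lm() cannot identify both coefficients: it estimates one and reports the other as NA
d <- data.frame(y = 5, x1 = 1, x2 = 1)
coef(lm(y ~ x1 + x2 - 1, data = d))
```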

gung - Reinstate Monica