
I am fitting an lm() model to a data set that includes indicators for the financial quarter (Q1, Q2, Q3, making Q4 the default). Using lm(Y ~ ., data = data) I get an NA as the coefficient for Q3, and a warning that one variable was excluded because of singularities.

Do I need to add a Q4 column?

dpel
Fraijo

2 Answers


NA as a coefficient in a regression indicates that the variable in question is an exact linear combination of the other variables. In your case, this means that $Q3 = a \times Q1 + b \times Q2 + c$ for some $a, b, c$. If this is the case, then there's no unique solution to the regression without dropping one of the variables. Adding $Q4$ is only going to make matters worse.
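A small sketch (with made-up data, so the variable names and values are only illustrative) of why adding a $Q4$ column makes things worse: with an intercept in the model, the four dummies sum to 1 for every row, so the design matrix is rank deficient.

```r
# Hypothetical data for illustration: with an intercept, the four quarter
# dummies satisfy Q1 + Q2 + Q3 + Q4 = 1, so lm() must drop one of them.
set.seed(1)
quarter <- rep(c("Q1", "Q2", "Q3", "Q4"), 25)
d <- data.frame(Y  = rnorm(100),
                Q1 = as.numeric(quarter == "Q1"),
                Q2 = as.numeric(quarter == "Q2"),
                Q3 = as.numeric(quarter == "Q3"),
                Q4 = as.numeric(quarter == "Q4"))
coef(lm(Y ~ ., data = d))  # the aliased dummy (Q4 here) comes back as NA
```

Because lm() uses a pivoting QR decomposition, it is the last linearly dependent column that gets aliased and reported as NA.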

Martin O'Leary
  • I agree... there seems to be a problem with the dummy variable definitions. – Dominic Comtois Apr 03 '12 at 23:42
  • (+1). NA more generally means that the coefficient is not estimable. This can happen due to exact collinearity, as you've mentioned. But it can also happen due to not having enough observations to estimate the relevant parameters (e.g. if $p > n$). If your predictors are categorical and you're adding interaction terms, an NA can also mean that there are no observations with that combination of levels of the factors. – Macro Apr 04 '12 at 00:59
  • $p > n$ is just a special case of collinearity: if there are fewer observations than predictors, collinearity is a given. You're right about interaction terms, though, although I'm pretty sure that's not what's happening here. – Martin O'Leary Apr 04 '12 at 01:03
  • The variables are not linearly related, as Q3 = 1 iff Q1 = Q2 = 0. Moreover, using stepAIC() and forcing the model to include all three of those variables causes no problems. Also, I have roughly three times as many observations as variables. My best guess is that there is collinearity between Q3 and some other variable, presumably one not included by stepAIC(). – Fraijo Apr 04 '12 at 04:43
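The $p > n$ case Macro mentions can be reproduced directly (made-up data; the variable names are only illustrative):

```r
# Sketch of the p > n case from the comments: three predictors plus an
# intercept is four parameters, but only three observations, so at least
# one coefficient is not estimable and lm() reports it as NA.
set.seed(2)
d <- data.frame(y = rnorm(3), x1 = rnorm(3), x2 = rnorm(3), x3 = rnorm(3))
coef(lm(y ~ ., data = d))  # the x3 coefficient is NA
```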

I found this behavior when attempting to fit observations against time, where time was given as POSIXct. Both lm() and lsfit() determined that the x's were collinear. The problem was solved by subtracting the mean time before doing the fit.

This appears to be a deficiency in the underlying code -- there must be some single-precision operations, or a non-optimal order of operations. I have never seen it before, so it may be new.
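The centering fix described above can be sketched as follows (made-up hourly data; the dates and values are only illustrative):

```r
# Hourly POSIXct times: as.numeric(t) gives seconds since 1970, ~1.6e9.
t <- as.POSIXct("2021-09-17", tz = "UTC") + (0:99) * 3600
y <- (0:99) + rnorm(100)

fit_raw      <- lm(y ~ as.numeric(t))                           # huge x values
fit_centered <- lm(y ~ I(as.numeric(t) - mean(as.numeric(t))))  # centered
```

Depending on the platform and the spread of the times, the raw fit may trigger a collinearity warning or an NA coefficient; the centered fit is well conditioned either way.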

  • The problem [is well known.](https://stats.stackexchange.com/questions/202181) It arises because of a huge condition number, due to the fact that POSIXct values for recent dates are in the billions. When squared--as is necessary in the `lm` calculations--these values wipe out squares of other entries (such as the intercept) even in *double* precision. I suspect everybody gets nailed by this problem at some point in their career ;-). – whuber Sep 17 '21 at 21:23
  • Specifically, there is *no* single precision code involved. R doesn't use single precision anywhere in its computations. – Thomas Lumley Sep 18 '21 at 00:14
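The condition-number explanation in whuber's comment can be checked with base R's kappa() (made-up times, for illustration):

```r
# Compare condition numbers of the design matrix with raw versus
# mean-centered time values (seconds since 1970, on the order of 1.6e9).
t <- as.numeric(as.POSIXct("2021-09-17", tz = "UTC")) + (0:99) * 3600
kappa(cbind(1, t))            # enormous: on the order of billions
kappa(cbind(1, t - mean(t)))  # modest
```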