
My dataset ($N \approx 10,000$) has a dependent variable (DV), five independent "baseline" variables (P1, P2, P3, P4, P5) and one independent variable of interest (Q).

I have run OLS linear regressions for the following two models:

DV ~ 1 + P1 + P2 + P3 + P4 + P5
                                  -> R-squared = 0.125

DV ~ 1 + P1 + P2 + P3 + P4 + P5 + Q
                                  -> R-squared = 0.124

I.e., adding the predictor Q has decreased the amount of variance explained in the linear model. As far as I understand, this shouldn't happen.

To be clear, these are R-squared values and not adjusted R-squared values.

I've verified the R-squared values using Jasp and Python's statsmodels.

Is there any reason I could be seeing this phenomenon? Perhaps something relating to the OLS method?

Ferdi
Cai
  • Numerical issues? The numbers are quite close to each other... –  Dec 04 '17 at 15:20
  • @user2137591 This is what I'm thinking, but I have no idea how to verify this. The absolute difference in R-squared values is 0.000513569, which is small, but not *that* small. – Cai Dec 04 '17 at 15:21
  • I hope you know linear algebra: if $\mathbf{X}$ is the design matrix of the above, could you please compute $\det\mathbf{X}^{T}\mathbf{X}$, where $T$ is the matrix transpose and $\det$ is the matrix determinant? – Clarinetist Dec 04 '17 at 15:25
  • Missing values get auto-dropped? – generic_user Dec 04 '17 at 15:29
  • 0.000513569 is a very small number: it is a 0.41 percent change. It is quite possibly a numerical issue. What Clarinetist is trying to say is that your design matrix may have a poor condition number, so that inverting $\mathbf{X}^{T}\mathbf{X}$ is numerically unstable... –  Dec 04 '17 at 15:32
  • @generic_user Ah yes, this is it. There are missing values for Q but not for the other columns. If you write your comment up as a brief answer, I'll mark it as accepted. Thank you! – Cai Dec 04 '17 at 15:44

1 Answer

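Could it be that you have missing values in Q that are getting auto-dropped? That would change the estimation sample: the model with Q would be fit on fewer rows than the model without it, so the two regressions, and their R-squared values, are not comparable. On an identical set of rows, adding a predictor to an OLS model cannot lower the plain R-squared.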
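One way to check this is to refit the baseline model on only the rows where Q is observed and compare like with like. Below is a minimal sketch using NumPy on simulated data; the data, the 20% missingness rate, and the helper name `r_squared` are all made up for illustration and are not taken from the question's dataset.

```python
import numpy as np

def r_squared(X, y):
    """Plain (unadjusted) R-squared of an OLS fit that includes an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Simulated stand-in for the question's data (assumption, not the real dataset).
rng = np.random.default_rng(0)
n = 10_000
P = rng.normal(size=(n, 5))                     # baseline predictors P1..P5
Q = rng.normal(size=n)                          # predictor of interest
y = P @ np.full(5, 0.17) + rng.normal(size=n)   # weak signal; Q has no real effect

# Hypothetical missingness: Q is absent in roughly 20% of rows.
Q[rng.random(n) < 0.2] = np.nan
keep = ~np.isnan(Q)                             # rows a listwise-deleting fit would use

r2_base_all = r_squared(P, y)                   # baseline model, all rows
r2_base_cc = r_squared(P[keep], y[keep])        # baseline model, complete cases only
r2_full_cc = r_squared(np.column_stack([P, Q])[keep], y[keep])  # with Q, complete cases

print(f"rows: {n} total, {keep.sum()} with Q observed")
print(f"baseline, all rows       : {r2_base_all:.6f}")
print(f"baseline, complete cases : {r2_base_cc:.6f}")
print(f"with Q, complete cases   : {r2_full_cc:.6f}")
```

The last two numbers are the fair comparison: they are computed on the same rows, so the model with Q can never score lower there. The first number comes from a different sample, so it can land on either side of the third, which is the situation described in the question.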

generic_user