Suppose I have the following dataset:

> df <- data.frame(y = c(1, 0, 0, 1), x1 = c(3.5, NA, NA, 4.5), 
                   x2 = c(0, 1, 1, 0))
> df
  y  x1 x2
1 1 3.5  0
2 0  NA  1
3 0  NA  1
4 1 4.5  0

y is my dependent variable, x1 is a continuous predictor, and x2 is a binary indicator that equals 1 when x1 is missing (NA) and 0 otherwise. I've read that, instead of imputing a value such as the median for a continuous feature, it can be better to add an indicator variable that flags the cases where that feature is missing.

However, I'm struggling to implement this model in code. R's default behavior (like that of Python's statsmodels.formula.api) is to drop any record in which a predictor is missing, as shown below. This defeats the purpose of the indicator variable x2, since exactly the records where the indicator equals 1 are the ones that get dropped:

> summary(lm(y ~ ., data=df))

Call:
lm(formula = y ~ ., data = df)

Residuals:
ALL 2 residuals are 0: no residual degrees of freedom!

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.00e+00         NA      NA       NA
x1          3.14e-16         NA      NA       NA
x2                NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:    NaN, Adjusted R-squared:    NaN 
F-statistic:   NaN on 1 and 0 DF,  p-value: NA

> summary(lm(y ~ x1 * x2, data=df))

Call:
lm(formula = y ~ x1 * x2, data = df)

Residuals:
ALL 2 residuals are 0: no residual degrees of freedom!

Coefficients: (2 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.00e+00         NA      NA       NA
x1          3.14e-16         NA      NA       NA
x2                NA         NA      NA       NA
x1:x2             NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:    NaN, Adjusted R-squared:    NaN 
F-statistic:   NaN on 1 and 0 DF,  p-value: NA

Is there some syntactic sugar I'm missing here? How can I best encode my model?
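
For reference, the missing-indicator encoding described above is often sketched as follows: keep the indicator and replace the missing x1 values with a constant, so that no row is removed for missingness. This is only an illustration under stated assumptions, not a recommended analysis. The fill value 0 is an arbitrary choice, and the column name x1_filled is hypothetical (it does not appear in the original data).

```r
# Same toy data as in the question.
df <- data.frame(y = c(1, 0, 0, 1), x1 = c(3.5, NA, NA, 4.5))

# Missingness indicator derived from x1 (equivalent to x2 in the question).
df$x2 <- as.integer(is.na(df$x1))

# Assumption: fill missing x1 with an arbitrary constant (0 here) so that
# lm() no longer drops those rows. x1_filled is a hypothetical helper column.
df$x1_filled <- ifelse(is.na(df$x1), 0, df$x1)

# Fit on the filled column plus the indicator; no NA remains in the predictors.
fit <- lm(y ~ x1_filled + x2, data = df)

nobs(fit)  # number of rows actually used in the fit
coef(fit)  # one coefficient each for the intercept, x1_filled, and x2
```

In this simple model the x2 coefficient absorbs the level shift for rows where x1 is missing, so the x1_filled slope is determined by the rows where x1 was observed. Whether this encoding is statistically appropriate is a separate question from the syntax.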

kjetil b halvorsen
blacksite
