Suppose I have the following dataset:
> df <- data.frame(y = c(1, 0, 0, 1), x1 = c(3.5, NA, NA, 4.5),
x2 = c(0, 1, 1, 0))
> df
y x1 x2
1 1 3.5 0
2 0 NA 1
3 0 NA 1
4 1 4.5 0
y is my dependent variable, x1 is a continuous predictor, and x2 is a binary indicator equal to 1 when x1 is missing (NA) and 0 otherwise. I've read that, rather than imputing something like the median for a continuous feature, it can be better to add an indicator variable that flags the feature as missing for a particular case.
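For reference, an indicator like x2 does not have to be typed in by hand. A minimal sketch, assuming the data starts out with only y and x1, derives it from x1 with is.na():

```r
# Start from the example data with only y and x1
df <- data.frame(y = c(1, 0, 0, 1), x1 = c(3.5, NA, NA, 4.5))

# Missingness indicator: 1 where x1 is NA, 0 otherwise
df$x2 <- as.integer(is.na(df$x1))

df$x2
```

This yields an integer column rather than the numeric one typed above, which makes no difference to lm().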
However, I'm struggling to implement this model in code. R's default behavior (the same as Python's statsmodels.formula.api) is to drop records where any predictor is missing; see below. This effectively defeats the purpose of having the indicator variable x2, since the records for which the indicator would be 1 are dropped:
> summary(lm(y ~ ., data=df))
Call:
lm(formula = y ~ ., data = df)
Residuals:
ALL 2 residuals are 0: no residual degrees of freedom!
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.00e+00 NA NA NA
x1 3.14e-16 NA NA NA
x2 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: NaN, Adjusted R-squared: NaN
F-statistic: NaN on 1 and 0 DF, p-value: NA
> summary(lm(y ~ x1 * x2, data=df))
Call:
lm(formula = y ~ x1 * x2, data = df)
Residuals:
ALL 2 residuals are 0: no residual degrees of freedom!
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.00e+00 NA NA NA
x1 3.14e-16 NA NA NA
x2 NA NA NA NA
x1:x2 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: NaN, Adjusted R-squared: NaN
F-statistic: NaN on 1 and 0 DF, p-value: NA
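The row dropping behind both fits can be checked directly. The sketch below assumes options("na.action") is still at its usual default of "na.omit", and builds the model frame that lm() would use:

```r
df <- data.frame(y = c(1, 0, 0, 1), x1 = c(3.5, NA, NA, 4.5),
                 x2 = c(0, 1, 1, 0))

# Build the model frame the same way lm() does; with the default
# na.action, rows containing NA in any model variable are removed
mf <- model.frame(y ~ ., data = df)
nrow(mf)        # 2 rows remain (rows 1 and 4)

# In the surviving rows x2 is always 0, so it is collinear with the
# intercept and its coefficient is reported as NA
unique(mf$x2)
```

That is consistent with the "(2 observations deleted due to missingness)" note and with the NA coefficient for x2 in the summaries above.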
Is there some syntactic sugar I'm missing here? How can I best encode my model?