
I am currently working with a large data set of about 30 different variables. Almost all of my rows have a missing value in at least one of the variables. I would like to run a regression with several of the variables. From my understanding of R (or any other stats program), it will drop any observation that has at least one NA in the included variables. Is there a way to stop R from doing that? I mean, is it possible to let R ignore the missing values but still run the regression on the remaining ones?

One of my professors once told me that it is possible to use "data flags", i.e. to create dummies that are equal to 1 when the value is NA and zero otherwise. I would create those flags for every variable with NAs, then set the NAs to zero, and afterwards I can just include the flags in the regression. That's what I was told, if I remember correctly. I wanted to google this procedure but could not find anything. Is this a legit approach? Are there any risks or other problems?
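In R, I imagine the procedure would look roughly like this (made-up data frame and variable names, so please correct me if this is not what is meant):

```r
# Rough sketch of the flagging procedure, for one predictor x2 with NAs
# in a hypothetical data frame df:
df$x2_missing <- ifelse(is.na(df$x2), 1, 0)     # flag: 1 if x2 is NA, 0 otherwise
df$x2[is.na(df$x2)] <- 0                        # replace the NAs with zero
fit <- lm(y ~ x1 + x2 + x2_missing, data = df)  # include the flag in the regression
```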

If this approach is problematic, is there another solution? I know about imputation and interpolation, which I can use for some of my variables, but not for all.

Just to make that clear, I do not have any NAs in my dependent variable.

ArOk

3 Answers


The "flagging method"—often called the "dummy variable method" or "indicator variable method"—is used mostly to encode predictors with not applicable values. It can be used to encode predictors with missing values; when you're interested in making predictions for new data-sets rather than inferences about parameters, & when the missingness mechanism is presumed to be the same in the samples for which you're making predictions.

The problem is that you're fitting a different model in which the non-missing slopes don't equate to the "true" slopes in a model in which all predictors are non-missing.† See e.g. Jones (1996), "Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Regression", JASA, 91, 433. (An exception is in experimental studies in which predictors are orthogonal by design.)

Note that you can set the missing values to an arbitrary number, not just zero, for maximum-likelihood procedures.

† Suppose the model of interest is

$$\eta=\beta_0 + \beta_1 x_1 + \beta_2 x_2$$ where $\eta$ is the linear predictor. Now you introduce $x_3$ as an indicator for missingness in $x_2$: the model becomes

$$\eta=\beta'_0 + \beta'_1 x_1 + \beta'_2 x_2 + \beta'_3 x_3$$

When $x_2$ is not missing you set $x_3$ to $0$: $$\eta=\beta'_0 + \beta'_1 x_1 + \beta'_2 x_2$$

When $x_2$ is missing you set $x_3$ to $1$ & $x_2$ to an arbitrary constant $c$: $$\eta=\beta'_0 + \beta'_1 x_1 + \beta'_2 c + \beta'_3$$

Clearly when $x_2$ is missing, the slope of $x_1$ is no longer conditional on $x_2$; overall $\beta'_1$ is an average of conditional & marginal slopes. In general $\beta'_1 \neq \beta_1$.
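For illustration, a quick simulation sketch in R (assuming correlated predictors, values missing completely at random, and $c = 0$) shows $\beta'_1$ being pulled away from $\beta_1$:

```r
set.seed(1)
n  <- 1e4
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)             # x2 correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # true beta_1 = 2

miss   <- rbinom(n, 1, 0.3) == 1      # 30% of x2 missing completely at random
x2_obs <- ifelse(miss, 0, x2)         # arbitrary constant c = 0 where missing
x3     <- as.numeric(miss)            # missingness indicator

coef(lm(y ~ x1 + x2))                 # complete data: slope of x1 close to 2
coef(lm(y ~ x1 + x2_obs + x3))        # indicator method: slope of x1 pulled
                                      # towards the marginal slope
```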

Scortchi - Reinstate Monica
  • How does this encoding scheme affect variance-inflation factors and general multicollinearity in the model? – blacksite Oct 15 '20 at 18:04

There is no way to "ignore" missing data in a regression procedure. You can impute missing data, and there are many reference articles on the topic on Cross Validated. The method you describe does not match a procedure I'm aware of.
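As a sketch of the imputation route, assuming the mice package and a hypothetical data frame df with outcome y and predictors x1, x2, x3:

```r
# Minimal multiple-imputation workflow with mice (hypothetical names):
library(mice)
imp  <- mice(df, m = 5, printFlag = FALSE)  # create 5 imputed data sets
fits <- with(imp, lm(y ~ x1 + x2 + x3))     # fit the regression in each imputed set
summary(pool(fits))                         # pool estimates across imputations
```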

Todd D

I would caution you against replacing missing values with arbitrary values like 1, 0, the mean of the feature, etc. The data are missing, and it is not appropriate to fill them in arbitrarily.

The approach I take that usually works well is to examine your features. It is likely that a few of your features contain the bulk of the missing data. If this is the case, drop them. Although it's usually nice to have more features, if the data is largely missing from them they are not adding much value anyway. Having dropped the features with the most missing values, you may now drop the rows containing the remaining missing values. Usually this will leave you with a sufficient sample size. If not, consider imputation techniques.
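As a rough sketch of that column-first strategy (hypothetical data frame df and an arbitrary 40% threshold):

```r
# Drop mostly-missing columns first, then the remaining incomplete rows:
miss_frac <- colMeans(is.na(df))             # fraction missing per column
df2 <- df[, miss_frac <= 0.4, drop = FALSE]  # keep columns with <= 40% missing
df2 <- df2[complete.cases(df2), ]            # drop rows with remaining NAs
nrow(df2)                                    # check the remaining sample size
```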

Chris