
One variable in my data has 80% missing values. The data are missing because of non-existence (i.e., the variable records how much bank loan the company owes). I came across an article saying that the dummy-variable adjustment method is the solution to this problem. Does that mean I need to transform this continuous variable into a categorical one?

Is this the only solution? I do not want to drop this variable, as I think it is theoretically important to my research question.

– lcl23

2 Answers


Are the data "missing" in the sense of being unknown or does it just mean there is no loan (so the loan amount is zero)? It sounds like the latter, in which case you need an additional binary dummy to indicate whether there is a loan. No transformation of the loan amount is needed (apart, perhaps, from a continuous re-expression, such as a root or started log, which might be indicated by virtue of other considerations).

This works well in a regression. A simple example is a conceptual model of the form

$$\text{dependent variable (Y) = loan amount (X) + constant.}$$

With the addition of a loan indicator ($I$), the regression model is

$$Y = \beta_I I + \beta_X X + \beta_0 + \epsilon$$

with $\epsilon$ representing random errors with zero expectations. The coefficients are interpreted as:

$\beta_0$ is the expectation of $Y$ for no-loan situations, because those are characterized by $X = 0$ and $I = 0$.

$\beta_X$ is the marginal change in $Y$ with respect to the amount of the loan ($X$).

$\beta_I + \beta_0$ is the intercept for the cases with loans.
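
As a concrete sketch, here is one way to set up this coding in Python with statsmodels; the simulated data, the coefficient values, and the log(amount + 1) re-expression are illustrative assumptions, not part of the question:

```python
# Sketch of the dummy-plus-amount coding on simulated data.
# All numbers and names are illustrative; only the coding scheme
# (loan indicator I plus re-expressed amount X) comes from the answer.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1_000

# 20% of companies have a loan; for the rest the amount is 0 (not missing).
has_loan = rng.random(n) < 0.2
amount = np.where(has_loan, rng.lognormal(mean=10.0, sigma=1.0, size=n), 0.0)

X = np.log1p(amount)        # started log: log(amount + 1), equal to 0 when no loan
I = has_loan.astype(float)  # 1 = has a loan, 0 = no loan

# Simulate Y from the model Y = beta_I * I + beta_X * X + beta_0 + eps.
Y = 2.0 * I + 0.5 * X + 1.0 + rng.normal(size=n)

df = pd.DataFrame({"Y": Y, "I": I, "X": X})
fit = sm.OLS(df["Y"], sm.add_constant(df[["I", "X"]])).fit()
print(fit.params)  # const estimates beta_0, I estimates beta_I, X estimates beta_X
```

Note that the no-loan cases contribute to estimating $\beta_0$, which is why they must be coded as $X = 0$ rather than left as NA.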

– whuber
  • "additional binary dummy", meaning I need to create another variable, whereby 1=with loan & 0=no loan? At the same time, still putting in the original Loan variable? If so, 80% of the cases will be treated as missing & only 20% left for analysis. 20% of total sample is too little for me to run logistic regression. (sorry, i'm still in the learning stage) – lcl23 Jan 26 '11 at 07:01
  • 2
    They won't be treated as missing, they'll go into estimating the value for no loan. Maybe you've made no loan 'NA' in which case you need to recode those to 0. – John Jan 26 '11 at 09:10
  • 3
    @John Thank you, that's exactly what I am recommending. The point is to express the loan values ($X$) in any way appropriate (such as log(amount+1)) and set $X=0$ and $I=1$ for any case without a loan. This is a standard technique in regression, including logistic regression. – whuber Jan 26 '11 at 15:10
  • Thanks! I get your point. Anyway, would imputation work in my case? – lcl23 Jan 27 '11 at 02:54
  • 3
    @lcl23 If I understood the situation correctly, imputation makes no sense: your "missing" data aren't missing; they indicate no loan has been taken out. – whuber Jan 27 '11 at 03:44
  • Hi @whuber, I'm confused about which value should be given to the I variable: should it be I = I(X == 0) or I = I(X != 0)? I've seen both in similar questions, but I'm not sure about the interpretation. I was writing down some thoughts about the two situations, but I'm quite lost, so I hope you can give me a primer. Specifically, I'm confused about which baseline is used in the I(X == 0) case and about spurious additive effects in the I(X != 0) case. – Bakaburg Mar 10 '15 at 17:56
  • @Bakaburg The coding and the interpretation of the variables are explained in the last three lines of my answer. – whuber Mar 10 '15 at 18:26
  • The case you described is the I = I(X == 0) case, if I interpreted it correctly. My doubt in this case is how to interpret the coefficients when X != 0, because I understand you are adding a different intercept when I == 1. Wouldn't this change the predicted values of Y compared to a model without the I dummy variable? Then there is the case in which I is I(X != 0), as in http://stats.stackexchange.com/questions/56306/time-spent-in-an-activity-as-an-independent-variable. What does $\beta_0$ represent in that case? – Bakaburg Mar 10 '15 at 18:38
  • DISCLOSURE: Specifically, in my case I have to deal with lower detection limits of some biomarkers, which makes the predictors look zero-inflated. – Bakaburg Mar 10 '15 at 18:40
  • 1
    @Bakaburg I think you might have got it backwards, but it doesn't matter--the two models (using $I(X=1)$ versus $I(X=0)$) will be equivalent. The predicted values in the models with and without such an indicator will differ, so I don't understand what you are trying to ask. Note that "nondetect" differs profoundly from "doesn't exist"! If your detection limits are small enough, there shouldn't be any need to introduce a dummy for them; and if there is a need, then introducing a dummy may be a little too crude. In that case consider methods of analyzing censored or interval-valued data instead. – whuber Mar 10 '15 at 18:49
  • I tried to solve the problem by dividing the predictors into 3 classes: below detection, and then below and above the median of what's detectable. But that is indeed crude, and the result depends on the medians in my sample. Instead, I found your approach with the dummy interesting, since it captures both the categorical effect and the continuous effect of the predictors at the same time. I'm just confused about how to interpret the coefficients for the dummy variable. Should I make a new question? – Bakaburg Mar 10 '15 at 19:58
  • 1
    The answer is right here. When the dummy is $1$, the value $\beta_I$ is added to the prediction. When the dummy is $0$, that value drops out. That's all there is to it. – whuber Mar 10 '15 at 20:25

I think you have misunderstood the article's suggestion, mainly because, as you describe it, the suggestion makes no sense: you would then have two problems, a recoded variable whose values are still missing. What was probably suggested was to create a missingness indicator.

A somewhat relevant approach to handling missing data, which loosely matches this description, is to adjust for a missingness indicator. This is certainly a simple and easy approach, but in general it is biased, and the bias can be arbitrarily large. Effectively, it fits two models and averages their effects together: the first is the fully conditional model, the second a complete factor model. The fully conditional model is the complete-case model, in which every observation with missing values is deleted, so here it is fit on a 20% subset of the data. The second is fit on the remaining 80%, not adjusting for the missing variable at all. This marginal model estimates the same effects as the full model only when there is no unmeasured interaction, the link function is collapsible, and the data are Missing at Random (MAR). The two sets of effects are then combined by a weighted average. Even under ideal conditions, with no unmeasured interactions and data Missing Completely at Random (MCAR), the missing-indicator approach yields biased effects, because the marginal and conditional models estimate different quantities. Even predictions are biased in this case.
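
A small simulation can illustrate this bias; the setup below (a second covariate correlated with the mostly-missing one) and all numbers are my own illustrative assumptions, not taken from the answer:

```python
# Sketch of the bias of the missingness-indicator method under MCAR.
# True model: Y = 1 + X1 + X2 + eps, with X2 correlated with X1
# and 80% of X2 missing completely at random.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100_000

X1 = rng.normal(size=n)
X2 = 0.8 * X1 + rng.normal(scale=0.6, size=n)
Y = 1.0 + X1 + X2 + rng.normal(size=n)

miss = rng.random(n) < 0.8           # MCAR missingness for X2
X2_filled = np.where(miss, 0.0, X2)  # fill missing values with 0
M = miss.astype(float)               # missingness indicator

exog = sm.add_constant(np.column_stack([X1, X2_filled, M]))
fit = sm.OLS(Y, exog).fit()
print(fit.params)
# The X1 coefficient is a blend of the conditional effect (1.0) and the
# marginal effect (1.0 + 0.8 = 1.8), so it is biased away from the true
# conditional effect even though the data are MCAR.
```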

A much better alternative is simply to use multiple imputation. Even when the mostly-missing variable is observed at a very low prevalence, MI does a relatively good job of generating plausible realizations of what the values might have been. The only assumption required is MAR.
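
As a sketch of what MI can look like in practice, here is statsmodels' MICE on simulated MAR data; the variable names and the missingness mechanism are hypothetical, and (per the discussion above) this applies to genuinely missing values, not to the structural no-loan zeros in the question:

```python
# Sketch of multiple imputation with chained equations (MICE) on MAR data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

rng = np.random.default_rng(2)
n = 2_000

x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + x1 + x2 + rng.normal(size=n)

df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# MAR: the probability that x2 is missing depends only on the observed x1.
p_miss = 1.0 / (1.0 + np.exp(-x1))
df.loc[rng.random(n) < p_miss, "x2"] = np.nan

imp = mice.MICEData(df)  # chained-equation imputation model for x2
model = mice.MICE("y ~ x1 + x2", sm.OLS, imp)
results = model.fit(n_burnin=10, n_imputations=10)
print(results.summary())  # estimates pooled across the imputed data sets
```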

– AdamO