27

I have a large set of feature vectors which I will use to attack a binary classification problem (using scikit-learn in Python). Before I start to think about imputation, I would like to determine from the remaining parts of the data whether the missing data are 'missing at random' or 'missing not at random'.

What is a sensible way to approach this question?


It turns out a better question is whether the data are 'missing completely at random' or not. What is a sensible way to determine that?

Andre Silva
graffe
  • If there is no association between the missing indicator and any observed variable, then the missing data mechanism is MCAR. – Randel Sep 19 '15 at 16:30
  • @Randel What is a good test to apply in practice to test this ? – graffe Sep 19 '15 at 16:43
  • Nothing more special than a correlation test or regression. – Randel Sep 19 '15 at 17:53
  • @Randel Would you mind saying a bit more, maybe as an answer? My feature vectors have many features. I could build a classifier separately for each feature with the classes "missing" or "not missing". Is that what you had in mind? It doesn't seem so obvious how to do this if different features can be missing in different vectors. – graffe Sep 19 '15 at 18:25
  • I believe that an author proposed a test to check if data is MCAR vs MAR. I can't seem to find it however. It isn't widely used though. As for your data, there is always the chance that it will be MNAR, but if you think that any of your data can help you predict missingness, then MAR should be a valid assumption. – RayVelcoro Sep 25 '15 at 00:34
  • It is not something you test, it is something you *assume*. – Tim Sep 25 '15 at 15:20
  • Just to be clear: missing *completely at random* means that the missingness probability is a constant, it depends on nothing. Missing *at random* means that missingness depends on some measured factors, like age or sex, so that you can use some models to fill in the missing patterns. Missing *not at random* means missingness depends on things you *did not* measure. In the question OP **says** NMAR vs. MAR but OP **means** MAR vs MCAR. – AdamO Feb 09 '18 at 22:18
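
For reference, the three mechanisms described in the last comment are usually formalized along the following lines (after Rubin). The notation is an assumption introduced here, not taken from the thread: $R$ is the matrix of missingness indicators, $Y_{\text{obs}}$ and $Y_{\text{mis}}$ are the observed and missing parts of the data, and $\psi$ holds the parameters of the missingness model.

$$
\begin{aligned}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \psi) = P(R \mid \psi) \\
\text{MAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \psi) = P(R \mid Y_{\text{obs}}, \psi) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \psi) \text{ does not simplify; it still depends on } Y_{\text{mis}}
\end{aligned}
$$

Only the first two lines differ in terms of quantities that are actually observed, which is why MCAR can be checked against MAR, whereas MAR cannot be checked against MNAR.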

5 Answers

19

This is not possible unless you manage to retrieve the missing data. You cannot determine from the observed data whether the missing data are missing at random (MAR) or missing not at random (MNAR). You can only tell whether the data are clearly not missing completely at random (MCAR). Beyond that, you can only appeal to the plausibility of MCAR or MAR as opposed to MNAR, based on what you know (e.g. reported reasons for why data are missing). Alternatively, you might be able to argue that it does not matter much, because the proportion of missing data is small and, under MNAR, very extreme scenarios would have to occur for your results to be overturned (see "tipping point analysis").
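
To make the tipping point idea concrete, here is a minimal sketch of one simple variant: a binary outcome compared between two groups, with a handful of outcomes missing in each. All counts are invented for illustration, and the helper `two_sided_p` (a two-proportion z-test with a normal approximation) is a hypothetical name chosen here, not part of any library.

```python
from math import erf, sqrt


def two_sided_p(x1, n1, x2, n2):
    """Two-sided p-value of a two-proportion z-test (normal approximation)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))


# Hypothetical study: observed successes, observed cases, missing cases per group.
treat = {"success": 60, "observed": 90, "missing": 10}
control = {"success": 40, "observed": 90, "missing": 10}
n_treat = treat["observed"] + treat["missing"]
n_control = control["observed"] + control["missing"]

# Fill in the missing outcomes in every possible way and record the
# scenarios in which the group difference stops being significant.
flips = [
    (mt, mc)
    for mt in range(treat["missing"] + 1)      # extra successes assumed in treatment
    for mc in range(control["missing"] + 1)    # extra successes assumed in control
    if two_sided_p(treat["success"] + mt, n_treat,
                   control["success"] + mc, n_control) >= 0.05
]

total = (treat["missing"] + 1) * (control["missing"] + 1)
print(f"{len(flips)} of {total} fill-in scenarios lose significance")
print("example scenarios (treatment, control):", flips[:5])
```

If the conclusion only flips in scenarios that look implausible for your data (here, almost all missing controls being successes while almost no missing treated cases are), that supports the argument that the missingness mechanism does not matter much.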

Björn
  • Thank you very much. What's a good way to tell if the data is MCAR? – graffe Sep 13 '15 at 16:53
  • @Björn, above may need to be reworded to reflect essential asymmetries in what we can learn from data. Although it is possible to *falsify* a hypothesis that data are MCAR (viz., by building a model that exploits the observed covariates to account for some part of the missingness), it is not possible to *confirm* MCAR or any other such hypothesis. – David C. Norris Sep 24 '15 at 19:53
  • Good point. I made that clearer. – Björn Sep 24 '15 at 23:57
8

I found the information I was talking about in my comment.

From van Buuren's book (page 31):

"Several tests have been proposed to test MCAR versus MAR. These tests are not widely used, and their practical value is unclear. See Enders (2010, pp. 17–21) for an evaluation of two procedures. It is not possible to test MAR versus MNAR since the information that is needed for such a test is missing."

RayVelcoro
  • The question asks about MAR vs MNAR, but your answer is about MCAR vs MAR. MCAR is completely different to MNAR. – Tim Feb 23 '17 at 04:12
  • If you can determine the data is MAR then that should suffice. As Bjorn says, it is not possible to tell if it is MAR/MNAR, but this answer is a good proxy for his question I believe. If you were to do Enders test and find that it is MCAR, then you wouldn't need imputation. If you find that it is MAR, then you can impute, or take a hard look at your data to see if there is reason to believe it may be MNAR. – RayVelcoro Feb 23 '17 at 23:08
  • @RayVelcoro It is an identifiability issue: it's possible for NMAR data to appear MCAR. Tim is right that NMAR (or the converse) is not something for which we test, it's something we assume. To your point about MCAR vs MAR, the more (most?) important thing is: if the data are MCAR and you use MAR methods, is there really any net effect on the data? I don't think so. Given the penetration, availability, and ease of use for MAR methods, maybe it's better to just use the non-parametric weighting or imputation procedure than engage in a rhetorical goose chase of tests and tests. – AdamO Feb 09 '18 at 22:21
4

This sounds quite doable from a classification standpoint.

You want to classify missing versus non-missing data using all other features. If you get significantly better than random results, then your data are not missing completely at random (MCAR).
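
A minimal sketch of this idea with scikit-learn, under several assumptions: the data are synthetic, the predictors `X` are fully observed, `is_missing` is a hypothetical indicator for one feature whose missingness is driven by the first predictor, and logistic regression stands in for whatever classifier you prefer. `permutation_test_score` compares the cross-validated score with scores obtained after shuffling the labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, permutation_test_score

rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 4))  # synthetic, fully observed predictors

# Hypothetical mechanism: the feature of interest is missing more often
# when the first predictor is large (so the data are not MCAR).
p_missing = 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))
is_missing = (rng.random(n) < p_missing).astype(int)

clf = LogisticRegression()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validated AUC, plus the same score on label-shuffled copies of the data.
score, perm_scores, p_value = permutation_test_score(
    clf, X, is_missing, scoring="roc_auc", cv=cv,
    n_permutations=100, random_state=0,
)
print(f"cross-validated AUC = {score:.2f}, permutation p-value = {p_value:.3f}")
```

A small p-value is evidence against MCAR. A large one does not confirm MCAR: the classifier may simply have missed the pattern, and missingness that depends on unobserved values cannot show up here at all. With real data the other features usually have gaps of their own, so they would need to be handled first, for example by using an estimator that accepts missing values.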

Firebug
2

You want to know whether a value being missing in one feature is correlated with the values of any of the other features.

For each feature, create a new feature indicating whether the value is missing or not (call these the "is_missing" features). Compute your favourite correlation measure (I suggest mutual information here) between the is_missing features and the rest of the features.

Note that even if you find no correlation between any pair of features, missingness may still depend on a group of features (for example, a value is missing as a function of the XOR of ten other features).

If you have a large set of features and a large number of values, you will get false correlations due to randomness. Besides the usual ways of coping with that (a validation set, a high enough threshold), you can check whether the correlations are symmetric and transitive. If they are, they are likely to be real and you should investigate them further.
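
A minimal sketch of the is_missing approach, assuming synthetic data, hypothetical feature names, fully observed predictors, and a single feature whose missingness is driven by `income`. It uses scikit-learn's `mutual_info_classif` for the mutual information and, as one possible guard against chance findings (a swapped-in technique, not the symmetry check described above), a permutation null distribution.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
n = 500
names = ["age", "income", "height", "score"]  # hypothetical feature names
X = rng.normal(size=(n, len(names)))          # synthetic, fully observed predictors

# Hypothetical mechanism: the target feature is missing more often when "income" is high.
is_missing = (rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * X[:, 1]))).astype(int)

# Mutual information between the missingness indicator and each predictor.
observed_mi = mutual_info_classif(X, is_missing, random_state=0)

# Null distribution: shuffling the indicator breaks any real association,
# so what remains is the mutual information produced by chance alone.
n_perm = 100
null_mi = np.array([
    mutual_info_classif(X, rng.permutation(is_missing), random_state=0)
    for _ in range(n_perm)
])
p_values = (1 + (null_mi >= observed_mi).sum(axis=0)) / (n_perm + 1)

for name, mi, p in zip(names, observed_mi, p_values):
    print(f"{name:>7}: MI = {mi:.3f}, permutation p = {p:.3f}")
```

With many features, these p-values would still need a multiple-comparison correction, and a per-feature measure like this one cannot detect missingness that depends only on combinations of features.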

DaL
2

A method I use is a shadow matrix, in which the dataset consists of indicator variables where a 1 is given if a value is missing and a 0 if it is present. Correlating these with each other and with the original data can help determine whether variables tend to be missing together (MAR) or not (MCAR). An example in R (borrowing from the book "R in Action" by Robert Kabacoff):

```r
# Load dataset
data(sleep, package = "VIM")

x <- as.data.frame(abs(is.na(sleep)))

# Elements of x are 1 if a value in the sleep data is missing and 0 if non-missing.
head(sleep)
head(x)

# Extracting variables that have some missing values.
y <- x[which(sapply(x, sd) > 0)]
cor(y)

# We see that variables Dream and NonD tend to be missing together.
# To a lesser extent, this is also true with Sleep and NonD, as well as Sleep and Dream.

# Now, looking at the relationship between the presence of missing values
# in each variable and the observed values in other variables:
cor(sleep, y, use = "pairwise.complete.obs")

# NonD is more likely to be missing as Exp, BodyWgt, and Gest increase,
# suggesting that the missingness for NonD is likely MAR rather than MCAR.
```

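
Since the question mentions Python, here is a rough pandas counterpart of the R example. It is a sketch under assumptions: the data frame is a synthetic stand-in (not the VIM `sleep` data), the column names are hypothetical, and the missingness mechanism is invented so that there is something to find.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    "body_wgt": rng.normal(size=n),
    "gest": rng.normal(size=n),
    "dream": rng.normal(size=n),
    "non_d": rng.normal(size=n),
})  # synthetic stand-in with hypothetical column names

# Invented mechanism: rows with large body_wgt are more likely to lack both sleep measures.
drop = rng.random(n) < 1.0 / (1.0 + np.exp(-(df["body_wgt"] - 1.0)))
df.loc[drop, ["dream", "non_d"]] = np.nan
df.loc[rng.random(n) < 0.05, "gest"] = np.nan  # a little unrelated missingness

# Shadow matrix: 1 where a value is missing, 0 where it is observed.
shadow = df.isna().astype(int)
# Keep only the variables that actually have some missing values.
shadow = shadow.loc[:, shadow.std() > 0].add_suffix("_missing")

# Do variables tend to be missing together?
print(shadow.corr().round(2))

# Does missingness in one variable track the observed values of the others?
# DataFrame.corr excludes missing values pairwise, like use="pairwise.complete.obs".
cross = pd.concat([df, shadow], axis=1).corr().loc[df.columns, shadow.columns]
print(cross.round(2))
```

Entries such as `dream` against `dream_missing` come out as NaN, because a variable is only observed where its own indicator is 0.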
Phil
  • In [VIM](https://cran.r-project.org/web/packages/VIMGUI/vignettes/VIM-Imputation.pdf), you can also check out spineplots. They give a histogram of two variables with the missingness in each. We can plot two variables, and see how the missingness in one varies with the other. For example, if we plot survival time and treatment assignment and see a right-skewed distribution of missingness, we can posit that lower survival times are associated with more missingness, i.e. that missingness in treatment is MAR because it depends on the observed variable survival time. – RayVelcoro Nov 23 '15 at 19:10
  • The question asks about MAR vs MNAR, but your answer is about MCAR vs MAR. MCAR is completely different to MNAR. – Tim Feb 23 '17 at 04:13
  • @Tim As [AdamO](https://stats.stackexchange.com/questions/172316/a-statistical-approach-to-determine-if-data-are-missing-at-random/174180#comment621148_172316) stated in a comment below the question, OP meant MAR vs MCAR. – Phil Aug 12 '18 at 00:22