I found the following code in R. I'm not sure how well it serves this purpose, but I want to implement it in Python. How would it translate? I also want to distinguish between the three categories of missingness: MCAR, MAR, and MNAR.
link:
statistical approach to determine if data are missing at random
The following code is from one of the answers to the question linked above:
#Load dataset
data(sleep, package = "VIM")
x <- as.data.frame(abs(is.na(sleep)))
#Elements of x are 1 if a value in the sleep data is missing and 0 if non-missing.
head(sleep)
head(x)
#Extracting variables that have some missing values.
y <- x[which(sapply(x, sd) > 0)]
cor(y)
#We see that variables Dream and NonD tend to be missing together. To a lesser extent, this is also true with Sleep and NonD, as well as Sleep and Dream.
#Now, looking at the relationship between the presence of missing values in each variable and the observed values in other variables:
cor(sleep, y, use="pairwise.complete.obs")
#NonD is more likely to be missing as Exp, BodyWgt, and Gest increases, suggesting that the missingness for NonD is likely MAR rather than MCAR.
So far I have written the following code:
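def chkIfDataMissingAtRandom(df):
    # 1 where a value is missing, 0 where it is observed
    df_binary = df.isnull().astype(int)
    # Keep only the columns (not rows) whose indicator varies, i.e. some values are missing
    y = df_binary.loc[:, df_binary.std(axis=0) > 0]
    return y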
I'm not sure how to extend it further.
I'm also not set on implementing only the method above; I'm open to other, more robust ideas.
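For reference, a fairly direct translation of the R snippet into pandas might look like the sketch below. The function name `missingness_correlations` and the toy columns `a`, `b`, `c` are made up for illustration (the `sleep` dataset from VIM is not bundled with pandas), and the missingness in `b` is deliberately generated to depend on `a`. `Series.corr` skips pairs with missing values, which is assumed here to play the role of `use="pairwise.complete.obs"`.

```python
import numpy as np
import pandas as pd

def missingness_correlations(df):
    # x in the R code: 1 where a value is missing, 0 otherwise
    indicators = df.isnull().astype(int)
    # y in the R code: indicator columns that actually vary
    y = indicators.loc[:, indicators.std() > 0]
    # cor(y): which variables tend to be missing together
    co_missing = y.corr()
    # cor(sleep, y, use="pairwise.complete.obs"):
    # observed values of each variable vs. missingness of each variable
    numeric = df.select_dtypes(include="number")
    vs_observed = pd.DataFrame(
        {col: numeric.apply(lambda s, c=col: s.corr(y[c])) for col in y.columns}
    )
    return co_missing, vs_observed

# Hypothetical toy data: b is more likely to be missing when a is large,
# c is missing completely at random.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = rng.normal(size=n)
b_missing = rng.random(n) < 1 / (1 + np.exp(-2 * a))
c_missing = rng.random(n) < 0.2
df = pd.DataFrame({
    "a": a,
    "b": np.where(b_missing, np.nan, b),
    "c": np.where(c_missing, np.nan, c),
})

co_missing, vs_observed = missingness_correlations(df)
print(co_missing)
print(vs_observed)
```

Rows of `vs_observed` are the observed variables and columns are the missingness indicators. The entry for a variable against its own indicator comes out as NaN, since the indicator is constant wherever that variable is observed.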
I also found another approach in the following link:
how-to-check-missing-data-is-missing-at-random-or-not
One of the answers (I'm not sure how feasible this is):
"Here is one way to test the missingness-at-random assumption.
Suppose the question on participant's income has some missing entries. Run a logistic regression with income as your response and everything else as predictors. Your response would be 1 if it's missing, 0 otherwise. The p-value of the predictors should give you an idea whether this MAR assumption is any good."