I have a very large dataset in which about 5% of the values are missing at random. The variables are correlated with each other. The following R code is just a toy example, with dummy data standing in for my real, correlated data.
set.seed(123)
# matrix of M variables (correlated in the real data; the toy
# values below are just random)
xmat <- matrix(sample(-1:1, 2000000, replace = TRUE), ncol = 10000)
colnames(xmat) <- paste("M", 1:10000, sep = "")
rownames(xmat) <- paste("sample", 1:200, sep = "")

# knock out ~5% of the cells at random (round(runif(...)) can repeat
# an index, so the realized rate is just under 5%)
N <- 2000000 * 0.05
inds <- round(runif(N, 1, length(xmat)))
xmat[inds] <- NA
> xmat[1:10,1:10]
         M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
sample1  -1 -1  1 NA  0 -1  1 -1  0  -1
sample2   1  1 -1  1  0  0  1 -1 -1   1
sample3   0  0  1 -1 -1 -1  0 -1 -1  -1
sample4   1  0  0 -1 -1  1  1  0  1   1
sample5  NA  0  0 -1 -1  1  0 NA  1  NA
sample6  -1  1  0  1  1  0  1  1 -1  -1
sample7  NA  0  1 -1  0  1 -1  0  1  NA
sample8   1 -1 -1  1  0 -1 -1  1 -1   0
sample9   0 -1  0 -1  1 -1  1 NA  0   1
sample10  0 -1  1  0  1  0  0  1 NA   0
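A quick check of the realized missingness rate (slightly under 5%, since round(runif(...)) can draw the same index twice):

mean(is.na(xmat))   # proportion of missing cells, just under 0.05
sum(is.na(xmat))    # count of missing cells out of 2,000,000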
Is there a (best) way to impute missing values in this situation? Is the Random Forest algorithm helpful? Any working solution in R would be much appreciated.
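For example, would something along these lines be sensible? This is only a sketch using the missForest package (which imputes with random forests); I have not checked whether it is tractable with 10,000 variables:

library(missForest)

## each column is treated as numeric here -- converting the columns
## to factors would make missForest treat them as categorical instead
imp <- missForest(xmat, maxiter = 5, ntree = 100, verbose = TRUE)

xmat.complete <- imp$ximp   # the imputed data matrix
imp$OOBerror                # out-of-bag estimate of imputation error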
Edits:
(1) Missing values are randomly distributed across the variables and samples. The number of variables is very large (10,000 in the example), while the number of samples is small (about 200 in the dummy example above). Looking at any one sample across all 10,000 variables, there is a high chance of a missing value in at least one variable, simply because there are so many variables. So deleting incomplete samples is not an option.
(2) A variable can be treated as either quantitative or qualitative (categorical) in the imputation; the only criterion is how well we can predict it (accuracy). So a prediction like 0.98 instead of 1 might be acceptable, rather than forcing a hard 0-vs-1 or -1-vs-1 call. I might need to trade off computing time against accuracy (see the kNN sketch after this list).
(3) My concern is how overfitting could affect the results, since the number of variables is large compared to the number of samples.
(4) The total amount of missing data is about 5%, and it is random, not concentrated in particular variables or samples (as a precaution, I already removed the variables and samples with very high proportions of missing values).
(5) Making the data complete for analysis is the first objective; accuracy is secondary, so I am not too sensitive to it.
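Given points (2) and (5), a cheaper alternative I am considering is kNN imputation treating the variables as quantitative, e.g. with the impute package from Bioconductor. A minimal sketch, assuming the -1/0/1 coding can be averaged as numeric (impute.knn expects variables in rows and samples in columns, hence the transposes):

library(impute)   # Bioconductor package

## after t(xmat), each variable's missing entries are imputed from
## its k nearest variables; transpose back to samples-in-rows
knn.out  <- impute.knn(t(xmat), k = 10)
xmat.imp <- t(knn.out$data)

## kNN averages lie in [-1, 1]; snap back to -1/0/1 if the
## downstream analysis needs discrete values
xmat.disc <- round(xmat.imp)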