I have a very large dataset in which about 5% of the values are missing at random. The variables are correlated with each other. The following R code is just a toy example, with dummy data standing in for my real, correlated data.
set.seed(123)
# matrix of M variables (correlated in the real data; the toy
# values below are just random)
xmat <- matrix(sample(-1:1, 2000000, replace = TRUE), ncol = 10000)
colnames(xmat) <- paste("M", 1:10000, sep = "")
rownames(xmat) <- paste("sample", 1:200, sep = "")

# knock out ~5% of the cells at random (round(runif(...)) can repeat
# an index, so the realized rate is just under 5%)
N <- 2000000 * 0.05
inds <- round(runif(N, 1, length(xmat)))
xmat[inds] <- NA
> xmat[1:10,1:10]
         M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
sample1  -1 -1  1 NA  0 -1  1 -1  0  -1
sample2   1  1 -1  1  0  0  1 -1 -1   1
sample3   0  0  1 -1 -1 -1  0 -1 -1  -1
sample4   1  0  0 -1 -1  1  1  0  1   1
sample5  NA  0  0 -1 -1  1  0 NA  1  NA
sample6  -1  1  0  1  1  0  1  1 -1  -1
sample7  NA  0  1 -1  0  1 -1  0  1  NA
sample8   1 -1 -1  1  0 -1 -1  1 -1   0
sample9   0 -1  0 -1  1 -1  1 NA  0   1
sample10  0 -1  1  0  1  0  0  1 NA   0
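A quick check of the realized missingness rate (slightly under 5%, since round(runif(...)) can draw the same index twice):

mean(is.na(xmat))   # proportion of missing cells, just under 0.05
sum(is.na(xmat))    # count of missing cells out of 2,000,000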
Is there a (best) way to impute missing values in this situation? Is the Random Forest algorithm helpful? Any working solution in R would be much appreciated.
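For example, would something along these lines be sensible? This is only a sketch using the missForest package (which imputes with random forests); I have not checked whether it is tractable with 10,000 variables:

library(missForest)

## each column is treated as numeric here -- converting the columns
## to factors would make missForest treat them as categorical instead
imp <- missForest(xmat, maxiter = 5, ntree = 100, verbose = TRUE)

xmat.complete <- imp$ximp   # the imputed data matrix
imp$OOBerror                # out-of-bag estimate of imputation error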
Edits:
(1) Missing values are randomly distributed across the variables and samples. The number of variables is very large (10,000 in the example), while the number of samples is small (about 200 in the dummy example above). Looking at any one sample across all 10,000 variables, there is a high chance of a missing value in at least one variable, simply because there are so many variables. So deleting incomplete samples is not an option.
(2) A variable can be treated as either quantitative or qualitative (categorical) in the imputation; the only criterion is how well we can predict it (accuracy). So a prediction like 0.98 instead of 1 might be acceptable, rather than forcing a hard 0-vs-1 or -1-vs-1 call. I might need to trade off computing time against accuracy (see the kNN sketch after this list).
(3) My concern is how overfitting could affect the results, since the number of variables is large compared to the number of samples.
(4) The total amount of missing data is about 5%, and it is random, not concentrated in particular variables or samples (as a precaution, I already removed the variables and samples with very high proportions of missing values).
(5) Making the data complete for analysis is the first objective; accuracy is secondary, so I am not too sensitive to it.
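Given points (2) and (5), a cheaper alternative I am considering is kNN imputation treating the variables as quantitative, e.g. with the impute package from Bioconductor. A minimal sketch, assuming the -1/0/1 coding can be averaged as numeric (impute.knn expects variables in rows and samples in columns, hence the transposes):

library(impute)   # Bioconductor package

## after t(xmat), each variable's missing entries are imputed from
## its k nearest variables; transpose back to samples-in-rows
knn.out  <- impute.knn(t(xmat), k = 10)
xmat.imp <- t(knn.out$data)

## kNN averages lie in [-1, 1]; snap back to -1/0/1 if the
## downstream analysis needs discrete values
xmat.disc <- round(xmat.imp)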