
I am using pairwise deletion to compute the correlation matrix of a data set. I think this approach is appropriate because:

  • I have well under 10% missing values (~2%)
  • I have only around 50% complete cases (so casewise deletion disregards too much data)
  • Missing values are distributed evenly across cases and evenly across variables. (I have had difficulty running a proper test for MCAR as I have too many variables)

I am using the correlation matrix to perform a PCA, and while I know there are no massive issues with the results, I am concerned that running significance tests based on the original n is not correct. I also feel that I should be reporting some sort of adjusted 'post-deletion' n.

Is there any way to measure how much "information" (for want of a better term) I lose by using pairwise deletion compared to having a complete data set? In my case it does not really affect my results, but I would like to know for the future, in case I have data with perhaps 9-10% missing values that is MCAR. Should I be looking at some kind of imputation-based method? Are there industry standards or rules of thumb?
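
To make the 'post-deletion' n concrete, here is a minimal sketch of the kind of summary I have in mind. It counts, for each pair of variables, how many cases have both values observed, then reports a few descriptive summaries of those pairwise counts. The function names (`pairwise_n`, `summarize`), the toy data, and the choice of summaries (minimum, harmonic mean, mean fraction retained) are my own assumptions for illustration, not an established standard.

```python
from itertools import combinations
from statistics import harmonic_mean


def pairwise_n(rows):
    """Map each variable pair (i, j) to the number of rows where both are observed.

    Missing values are assumed to be encoded as None (an assumption of this sketch).
    """
    n_vars = len(rows[0])
    return {
        (i, j): sum(1 for r in rows if r[i] is not None and r[j] is not None)
        for i, j in combinations(range(n_vars), 2)
    }


def summarize(rows):
    """Descriptive summaries of how many cases pairwise deletion actually uses."""
    counts = list(pairwise_n(rows).values())
    n_total = len(rows)
    n_complete = sum(1 for r in rows if all(v is not None for v in r))
    return {
        "n_total": n_total,                      # original n
        "n_complete": n_complete,                # n under casewise deletion
        "min_pairwise_n": min(counts),           # most conservative per-pair n
        "harmonic_mean_pairwise_n": harmonic_mean(counts),
        # average share of the original n used per correlation
        "mean_fraction_retained": sum(counts) / (len(counts) * n_total),
    }


# Hypothetical toy data: 4 cases, 3 variables, None marks a missing value.
rows = [
    [1.0, 2.0, 3.0],
    [2.0, None, 1.0],
    [3.0, 1.0, None],
    [4.0, 5.0, 6.0],
]
summary = summarize(rows)
print(summary)
```

Something like the minimum pairwise n could serve as a conservative n for significance tests, but whether that is the accepted practice is exactly what I am asking.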

Happy to hear opinions or be referred to papers/textbooks on this topic.

bmrn
  • Why can't you impute the missing values perhaps by multiple imputation? – Michael R. Chernick Jun 22 '17 at 00:40
  • I can, although in this case it is not necessary. I was interested in knowing if there are any methods for adjusting n/measuring how the lost data affects my calculations. I guess this is not the done thing so I might be barking up the wrong tree. – bmrn Jun 27 '17 at 02:28
