I am using pairwise deletion to compute the correlation matrix of a data set. I think this approach is appropriate because:
- I have well under 10% missing values (~2%)
- I have only around 50% complete cases (so casewise deletion disregards too much data)
- Missing values are distributed evenly across cases and evenly across variables. (I have had difficulty running a proper test for MCAR because I have too many variables.)
I am using the correlation matrix to perform a PCA. While I know there are no massive issues with the results, I am concerned that running significance tests based on the original n is not correct. I also feel that I should be reporting some sort of adjusted 'post-deletion' n.
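To make the 'post-deletion' n concrete, here is a minimal sketch in Python (NumPy and pandas) of what I mean. It runs on simulated data, not my real data set: the 200 cases, 30 variables and 2% missing rate are made-up values, chosen only because they give roughly 50% complete cases, and the missingness is MCAR by construction. It builds the pairwise-deletion correlation matrix and counts how many cases each correlation is actually based on.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated stand-in for the real data (sizes are assumptions, not my actual data).
n_cases, n_vars = 200, 30
X = pd.DataFrame(rng.normal(size=(n_cases, n_vars)))

# Remove about 2% of cells completely at random (MCAR by construction).
missing = rng.random(X.shape) < 0.02
X = X.mask(missing)

# Pairwise-deletion correlation matrix: DataFrame.corr() drops missing
# values separately for each pair of columns.
R = X.corr()

# Number of cases observed on both variables, for every pair of variables.
observed = X.notna().astype(int)
pairwise_n = observed.T @ observed

# Compare with the n that casewise (listwise) deletion would leave.
complete_cases = int(X.notna().all(axis=1).sum())
off_diag = pairwise_n.values[~np.eye(n_vars, dtype=bool)]

print("original n:      ", n_cases)
print("complete cases:  ", complete_cases)
print("min pairwise n:  ", off_diag.min())
print("mean pairwise n: ", off_diag.mean())
```

Every entry of the correlation matrix has its own n, somewhere between the complete-case count and the original n, which is why I am unsure which single number to report or to use in significance tests.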
Is there any way to measure how much "information" (for want of a better term) I lose by using pairwise deletion compared with having a complete data set? In my case it does not really affect my results, but I would like to know for the future, in case I have data with perhaps 9-10% missing values that are MCAR. Should I be looking at some kind of imputation-based method? Are there industry standards or rules of thumb?
Happy to hear opinions or be referred to papers/textbooks on this topic.