I used the prcomp() function to perform a PCA (principal component analysis) in R. However, the na.action parameter does not seem to work in that function, which looks like a bug to me. I asked for help on Stack Overflow; two users there offered two different ways of dealing with NA values. However, the problem with both solutions is that when a row contains an NA value, that row is dropped and not considered in the PCA. My real data set is a 100 x 100 matrix and I do not want to lose a whole row just because it contains a single NA value.
The following example shows that the prcomp() function does not return any principal component scores for row 5, as it contains an NA value.
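# Example data: 10 rows, 3 columns, no missing values
d <- data.frame(V1 = sample(1:100, 10), V2 = sample(1:100, 10),
                V3 = sample(1:100, 10))
result <- prcomp(d, center = TRUE, scale. = TRUE, na.action = na.omit)
result$x  # scores for all 10 rows

# Introduce a single NA in row 5
d$V1[5] <- NA
# Formula interface, so that na.action is applied
result <- prcomp(~ V1 + V2, data = d, center = TRUE, scale. = TRUE, na.action = na.omit)
result$x  # row 5 has been dropped from the scores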
I was wondering if I can set the NA values to a specific numerical value when center and scale are set to TRUE, so that the prcomp() function works and does not remove rows containing NAs, but the replacement also does not influence the outcome of the PCA.
I thought about replacing each NA value with the median of its column, or with a value very close to 0. However, I am not sure how that influences the PCA.
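To make the median idea concrete, here is a minimal sketch of what I have in mind. The helper name impute_median and the seed are my own choices for illustration, not anything provided by prcomp() or base R, and I am not claiming this is a statistically sound approach.

```r
# Sketch only: impute_median is a hypothetical helper, not part of base R
set.seed(1)  # arbitrary seed so the example is reproducible
d <- data.frame(V1 = sample(1:100, 10), V2 = sample(1:100, 10),
                V3 = sample(1:100, 10))
d$V1[5] <- NA

# Replace each NA with the median of the observed values in its column
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
d_imputed <- as.data.frame(lapply(d, impute_median))

# No NAs remain, so the default (non-formula) interface can be used
result <- prcomp(d_imputed, center = TRUE, scale. = TRUE)
nrow(result$x)  # row 5 is kept, so the score matrix has one row per observation
```

This keeps row 5 in the score matrix, but the imputed value is made up, so it presumably shifts the column's centre and variance a little. Whether that distortion is acceptable is exactly what I am unsure about.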
Can anybody think of a good way of solving that problem?