Construct artificial slightly overlapping data for PCA plot

Question

I am trying to construct artificial data that show two distinct groups in a PCA plot. However, the two groups should still overlap slightly. The following approach came the closest, but I am still not happy with it. Even changing the parameters in rnorm and runif didn't help: one group always has a much bigger spread than the other.

d1 <- {}
d2 <- {}
for(i in 1:20) {    
  d1 <- cbind(d1, rnorm(20, mean=100, sd=10))   
  d2 <- cbind(d2, rnorm(20, mean=110, sd=5))    
  # d1 <- rbind(d1, runif(100, 1,4))    
  # d2 <- rbind(d2, runif(100, 1,3))
}

d <- as.data.frame(rbind(d1,d2))
res <- prcomp(d, center = TRUE, scale = TRUE, na.action = na.omit)
d$group <- as.factor(c(rep("g1", 20), rep("g2", 20))) 
library(ggplot2)
ggplot(d, aes(V1,V2)) +
  geom_point(aes(color = group), size=4)

I don't follow what you are trying to do here. You have 2 orthogonal vectors, not 2 groups represented on 2 different dimensions each; that's not a setup that's going to yield anything, although as I say, I don't grasp what you ultimately want to show. — gung - Reinstate Monica, Aug 24 '12 at 19:01
Well, two groups and two vectors are pretty much the same. I think it is pretty clear what I am trying to achieve. I want two data clouds that partially overlap in a PCA plot. However, the (magnitude of) spread of the two clouds should be similar. I thought of something like this: http://3.bp.blogspot.com/-C1wH5zqefoo/T9-Yl45-bdI/AAAAAAAABFo/oLOknGxZuKw/s1600/color.png. Instead of having 3 groups I want just two, like the ones in green and blue, which also partially overlap. — user969113, Aug 24 '12 at 19:12
Two groups-two vectors is not the same, which is why you're having difficulty. The picture helps. Let's say you get two groups that overlap somewhat on the first 2 principal components, what do you want to do with them? What is the point of this exercise? — gung - Reinstate Monica, Aug 24 '12 at 20:29
Great, we are talking about the same thing now :-) I want to use that plot to explain some biological circumstances. — user969113, Aug 24 '12 at 20:47

whuber · Accepted Answer · 2012-08-24T21:10:22.357

To simulate a PCA, start with the desired result and work backwards: add some random error in the orthogonal directions and rotate that randomly.

The following example stipulates a two-dimensional result (i.e., two principal components) with two "blobs"; it is readily extended to more blobs. (More dimensions would take a bit more work in modifying sigma to accommodate higher-dimensional covariance matrices.)

Let's start with a random rotation matrix.

set.seed(17)
p <- 5     # dimensions
rot <- qr.Q(qr(matrix(rnorm(p^2), p)))

We generate the blobs separately from different multivariate normal distributions. The parameters (their means and covariance matrices) are buried in the mvrnorm arguments. To make it easy and reliable to specify the shape of such a distribution, we create a small function sigma to convert the angle of the principal axis and the two variances into a covariance matrix.

sigma <- function(theta=0, lambda=c(1,1)) {
  cos.t <- cos(theta); sin.t <- sin(theta)
  a <- matrix(c(cos.t, sin.t, -sin.t, cos.t), ncol=2)
  t(a) %*% diag(lambda) %*% a
}
library(MASS)
n1 <- 50   # First group population
n2 <- 75   # Second group population
x <- rbind(mvrnorm(n1, c(-2,-1), sigma(0, c(1/2,1))),
           mvrnorm(n2, c(0,1), sigma(pi/3, c(1, 1/3))))
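
As a quick sanity check on sigma (a minimal sketch, not part of the original answer), the eigenvalues of the matrix it returns should be the two variances passed in lambda, whatever the angle. The function is repeated here only so the snippet runs on its own; the names s and ev are illustrative.

```r
# Same definition as above, repeated so this snippet is self-contained
sigma <- function(theta=0, lambda=c(1,1)) {
  cos.t <- cos(theta); sin.t <- sin(theta)
  a <- matrix(c(cos.t, sin.t, -sin.t, cos.t), ncol=2)
  t(a) %*% diag(lambda) %*% a
}

s  <- sigma(pi/3, c(1, 1/3))      # covariance used for the second blob
ev <- eigen(s, symmetric=TRUE)    # spectral decomposition of the covariance
ev$values                         # the two variances, largest first
ev$vectors                        # columns give the principal axes of the blob
```

Inspecting ev$vectors is a convenient way to see which direction a given theta actually produces before generating any data.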

Adjoin the orthogonal error and rotate:

eps <- 0.25  # Error SD should be small compared to the SDs for the blobs
x <- cbind(x, matrix(rnorm(dim(x)[1]*(p-2), sd=eps), ncol=p-2))
y <- x %*% rot
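
The reason the rotation does not disturb the blobs is that rot is an orthogonal matrix, so multiplying by it preserves all pairwise distances. A small sketch of that check follows; the 10-point stand-in matrix x is an illustrative assumption rather than the dataset above.

```r
set.seed(17)
p <- 5
rot <- qr.Q(qr(matrix(rnorm(p^2), p)))   # random orthogonal matrix, as above

x <- matrix(rnorm(10 * p), ncol=p)       # stand-in data: 10 points in p dimensions
y <- x %*% rot                           # rotated copy

# t(rot) %*% rot should be the identity (up to rounding)
all.equal(crossprod(rot), diag(p))
# Pairwise distances should be unchanged by the rotation
all.equal(as.vector(dist(x)), as.vector(dist(y)))
```

An orthogonal matrix obtained this way may include a reflection as well as a rotation, which is harmless for this purpose.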

That's the simulated dataset. To check it, apply PCA:

fit <- prcomp(y)          # PCA
summary(fit)              # Brief summary showing two principal components
par(mfrow=c(2,2))         # Prepare to plot
plot(fit$x[, 1:2], asp=1) # Display the first two components
plot(x[, 1:2], asp=1);    # Display the original data for comparison
points(x[1:n1,1:2], col="Blue") #...distinguish one of the blobs
screeplot(fit)            # The usual screeplot, supporting the summary
zapsmall(rot %*% fit$rotation, digits=2) # Compare the fitted and simulated rotations.

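[Figure: the first two principal components of the fit, the original data with the first blob in blue, and the screeplot.]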

The final check produces a matrix whose block form (an upper 2 by 2 block and lower 3 by 3 block with near-zeros elsewhere) confirms the accuracy of the PCA estimates:

       PC1   PC2   PC3   PC4   PC5
[1,]  0.58  0.81  0.05  0.00 -0.01
[2,]  0.81 -0.58  0.00  0.00  0.00
[3,] -0.01 -0.01  0.23 -0.62 -0.75
[4,]  0.03  0.03 -0.96 -0.28 -0.06
[5,]  0.00  0.00  0.18 -0.73  0.66

(The upper 2 by 2 block, suitably scaled by the first two eigenvalues, approximately describes the relationship between the points in the "fit" plot and those in the "original components" plot. This one looks like a rotation and a reflection.)
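
For the requirement in the question (two clouds of similar spread that overlap slightly), one possible adaptation is to give both blobs the same covariance and separate their means by a few standard deviations. The sketch below assumes equal group sizes, unit variances, and a separation of 3 SDs; n, sep, and group are illustrative names and values, not part of the recipe above. Because the covariance is the identity here, plain rnorm suffices; an elongated shared shape could instead use mvrnorm with a common sigma(...).

```r
set.seed(17)
p   <- 5      # dimensions
n   <- 50     # points per group (assumed equal)
sep <- 3      # distance between group means, in SDs (tune for more or less overlap)
eps <- 0.25   # SD of the orthogonal error

rot <- qr.Q(qr(matrix(rnorm(p^2), p)))        # random orthogonal matrix

# Two blobs with identical unit spread, shifted apart along the first axis
x <- matrix(rnorm(2 * n * 2), ncol=2)
x[, 1] <- x[, 1] + rep(c(-sep/2, sep/2), each=n)
group <- factor(rep(c("g1", "g2"), each=n))

# Adjoin the orthogonal error and rotate, as in the answer
x <- cbind(x, matrix(rnorm(nrow(x) * (p-2), sd=eps), ncol=p-2))
y <- x %*% rot

fit <- prcomp(y)
plot(fit$x[, 1:2], asp=1, col=c("blue", "red")[group])
tapply(fit$x[, 1], group, sd)                 # within-group spread along PC1
```

Decreasing sep increases the overlap, and the tapply line gives a rough check that the two groups have comparable spread along the first component.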

thanks a lot for this great post! I haven't yet fully understood your way of doing it but it's now much closer to what I'd like to have. I think I can adjust the values in here: x — user969113, Aug 24 '12 at 21:44
