Statistical significance and angular comparison (similarity test) between two vectors, MCMC and resampling approaches

Question

I want to compare statistically if two vectors of PC1 loadings differ in terms of their direction which can be measured by the angle between them (e.g. this CV thread). I would like if someone can help me in understanding H0 used in the statistical tests outlined here.

Dataset and background:

For the demonstration purpose and reproducibility I will use dataset of three linear measurements on 48 turtles (24 males and 24 females) (Jolicoeur and Mosimann, 1960). I will use this dataset as I want to assess if males and females differ in their multivaraite size (e.g. considering all three measurements simultaneously). There is strong theoretical argumentation that PC1 loadings obtained from decomposition of covariance matrix (not correlation) of log transformed raw variables will represent overall (multivaraite) estimation of size (Klingenberg, 1996). Such PC1 vector of logged variables corresponds to a growth-axis, recording both isometric and allometric variations. If groups (e.g. sexes) share the same overall size - PC1 loadings of sexes have similar direction statistically speaking - it is reasonable to treat PC2-PCn as descriptors of overall shape (e.g. they correspond to variation not explained by general growth) of animals and "analytical" size-correction is possible, yet this approach has limitations (see McCoy et al., 2006). This approach assumes that 1) vector of PC1 loadings of each sex is isometric but not allometric (see test 1 below) and more importantly 2) PC1 loadings between sexes are pointing in same direction as otherwise size-correction will be biased just because groups do not share common size axis.

Two tests:

Do sex-specific PC1 loadings differ from theoretical isometric vector (i) which has the same length (number of elements, p) as the length of PC1 loadings but all values in i are identical to 1/sqrt (p) (e.g. comparison of sex-specific PC1 loadings to theoretical isometric vector).
Do PC1 loadings of sexes differ in their direction (e.g. explicit comparison between PC1 loadings of sexes).

Codes:

Test 1:

First test is explained and code provided in the book "Morphometrics with R" written by Julien Claude (pages 103-104). Function name is "coli" and according to text from the book it evaluates following:

"A possible statistic for isometry would correspond to the angle formed between the vector of slopes with a vector of similar size of all elements equal to a given constant. The can be passed to a coli function that checks for collinearity between two vectors. The function evaluates the angle between two vectors (ev1 and ev2) and compares it with the distribution of angles obtained from random vectors of similar size (from uniform distribution)."

Note that the original coli function has error in terms of coding and conclusion from it and it can be found in 1.81 errata of the book, so I provide it here:

        coli <- function (ev1, ev2, nperm = 1000, graph = TRUE) {
          # coli function written by Julien Claude (2008)
          # this one is updated version from Julien Claude RG
          # https://www.researchgate.net/publication/350439729_Rfunctions1_R_functions_for_geometric_morphometrics_developed_in_Claude_J_2008_Morphometrics_with_R
          
          dist <- numeric (nperm)
          n <- length (ev1)

          Angle <- function (v1, v2) {
          # angle is in degrees not radians
          90 - abs ((180 * (acos (sum (v1 * v2) / (sqrt (sum (v1^2)) * sqrt (sum (v2^2))))) / pi) 
         - 90)
          }

          for (i in 1:nperm) {# random vectors of size n generated from uniform distribution
          X1 <- runif (n, -1, 1)
          X2 <- runif (n, -1, 1)
          dist[i] <- Angle (X1, X2)
          }
          
          zobs <- Angle(ev1, ev2) # observed angle

          # significance
          pv <- length (dist[dist < zobs]) / nperm # IN THE BOOK originally it was 
          # "length(dist[dist>zobs])/nperm" but current version is after errata 1.81
          
          if (graph) {
          hist (dist, breaks = 50,
                main = "Distribution of angles between 2 random vectors", xlab = "Z statistic", 
          ylab = "# of random vectors", sub = paste ("Actual z-obs =",round (zobs, 5),": 
          p<", round (pv), 5))
            abline (v = zobs)
          }
          
          list (angle = zobs, p = pv)
}

Test 2:

Second test is based on work of Tzeng and Yeh (1999) which used randomization in order to access statistical significance of observed angle between two vectors. Below is short explanation of the procedure from the paper:

Step 1. Find the 1st eigenvectors for the 2 original data sets, and compute the angle between the 2 first eigenvectors.

Step 2. Combine the 2 data sets, and randomly partition total sample size n1 + n2 into 2 new data sets, of sizes n1 and n2. Find the 1st eigenvectors for the 2 new permuted data sets, and recompute the angle between the 2 new 1st eigenvectors.

Step 3. Repeat step (2) a large number of times (N) to find a sample distribution of the test statistic, 81, We chose N = 5000 times

I wrote the following code accordingly and I hope it's correct so far (yet I didn't use the absolute values of the angles as the PCs are arbitrarily oriented as suggested by the original paper):

coli.perm <- function (data, factor, nperm = 10000, graph = TRUE) {
  # function based on permutations according to Tzeng and Yeh (1999)
  # 'data' argument should be numeric matrix where columns are variables and rows are specimens
  # in the case of linear measurements, variables should be log transformed before the function call
  # 
  # 'factor' argument is a factor of the length same as the number of rows in the 'data'
  # 
  # 'nperm' is the number of permutations (resampling without replacement)
  
  # PCA is based on log transformed variables and their covariance matrix (not correlation) 
  
  dist <- numeric (nperm + 1) # empty vector of length equal to number of permutations + 1
  n <- length (factor) # number of specimens
  n.vars <- ncol (data) # number of raw variables (columns of the 'data')
  
  if (nrow (data) != n) stop ("number of rows in 'data' should be the same as the length of 'factor'")
  
  groups <- split(data, factor) # split the 'data' according to levels of 'factor'
  
  # function that calculates angle between 2 vectors
  Angle <- function (v1, v2) {
    # written by Julien Claude (2008)
    # sum (v1 * v2) = dot (inner) product of vectors v1 and v2; same as v1 %*% v2
    # sqrt(sum(V1^2)) = norm of vector v1
    # sqrt(sum(V2^2)) = norm of vector v2
    # acos(x) = arc-cosine of x
    # angle is in degrees!
    90 - abs ((180 * (acos (sum (v1 * v2) / (sqrt (sum (v1^2)) * sqrt (sum (v2^2))))) / pi) - 90)
  }
  
  # Step 1. Find the 1st eigenvectors for the 2 original
  # data sets, and compute the angle between
  # the 2 first eigenvectors, theta.
  pc1.obs <- lapply(groups, function (x) {
    prcomp(x[,1:n.vars], retx = TRUE, center = TRUE, scale. = FALSE)$rotation[,"PC1"]
  })
  
  if (!identical (names (pc1.obs[[1]]), names (pc1.obs[[2]]))) stop ("PC1 loadings in two vectors are not in the same order")
  
  dist [1] <-  Angle (pc1.obs[[1]], pc1.obs[[2]]) # observed angle between PC1 vectors (PC1 loadings)
  
  # Step 2. Combine the 2 data sets, and randomly
  # partition total sample size n1 + n2 into 2
  # new data sets, of sizes n1 and n2. Find
  # the 1st eigenvectors for the 2 new
  # permuted data sets, and recompute the
  # angle between the 2 new 1st eigenvectors, theta1.
  # 
  # Step 3. Repeat step (2) a large number of times
  # (N) to find a sample distribution of the test
  # statistic, theta1.
  for (i in 2:(nperm+1)){
    # start from 2 as 1 is reserved for observed angle between PC1 vectors (PC1 loadings)
    data.resample <- data [sample (x = 1:n, replace = FALSE), ] # resampling witout replacement
    groups.temp <- split(data.resample, factor) # divide permuted data according to levels of 'factor'
    
    pc1.resample <- lapply(groups.temp, function (x) {
      prcomp(x[,1:n.vars], retx = TRUE, center = TRUE, scale. = FALSE)$rotation[,"PC1"]
    }) # compute theta1 (H0 distribution)
    
    if (!identical (names (pc1.obs[[1]]), names (pc1.obs[[2]]))) stop ("PC1 loadings in two vectors are not in the same order")
    
    dist [i] <-  Angle (pc1.resample[[1]], pc1.resample[[2]]) # except angle in d[1] (theta) all other angles (theta1) are under H0
  }
  
  # Significance. The permutation results involves calculating the proportion
  # (P) of the observed theta1 values that are >= theta
  
  # PLEASE CHECK IF pv IS CORRECTLY CALCULATED
  pv <- sum (dist >= dist[1]) /  (nperm+1) # HERE I AM NOT SURE IF pv IS CORRECTLY CALCULATED 
  # AS THE ORIGINAL PAPER STATED THIS:
  # Since the orientation of the PCA vector is arbitrary,
  # the absolute value of the angle computed is used (and I didn't use 'abs' function).
  
  # plot histogram of H0 distribution and observed angle
  if (graph){
    a <- hist (dist, breaks = 50, plot = FALSE)
    hist(dist, breaks = 50, axes = FALSE, col = "lightseagreen",
         xlim = range (a$breaks), ylim = range (a$counts),
         main = "Distribution of angles between two vectors under H0 \nthat the 2 (eigen)vectors are the same",
         ylab = "Number of permutations",        
         xlab = expression (paste ("Angle"," (",bolditalic (theta*degree), ")", sep = "")))
    axis (side = 1, at = seq (0, max (a$breaks), by = 1), las = 1)
    axis (side = 2)
    
    arrows (x0 = dist[1], y0 = ceiling (max(a$counts/100)) * 100/4,
            x1 = dist[1], y1 = 0, col = "red", lwd = 3.5)
    
    text (x = dist[1],
          y = ceiling (max(a$counts/100)) * 100/4,
          pos = 3,
          labels = paste ("Observed\nangle:\n", round (dist[1], 3), ";\n", "p < ", round (pv, 3)))
    
  }
  
  res <- list (angle = dist [1] , p = pv, distribution = dist)
  res
}

Perform two tests on the dataset:

Test 1 apply to the data:

library (ade4)
data ("tortues")
turtles <- log (tortues[,1:3])
sex <- factor (tortues$sexe)

set.seed (34)
coli.males <- coli (prcomp(turtles[sex == "M", ])[[2]][,1], rep(sqrt(1/3),3), graph = TRUE, nperm = 10000)

set.seed (34)
coli.females <- coli (prcomp(turtles[sex == "F", ])[[2]][,1], rep(sqrt(1/3),3), graph = TRUE, nperm = 10000)
res <- rbind (unlist (coli.males), unlist (coli.females))
row.names(res) <- c ("males", "females")
res

Results of the Test 1:

           angle      p
males   7.843779 0.0099
females 6.831560 0.0078

According to the errata of the book such results can be interpreted as follows:

The two vectors are significantly collinear and we conclude that most shape variation expressed by the measurement is isometric

Questions related to Test 1:

Does it means that the H0 of the Test 1 evaluates the distribution of the angle between two random vectors, so H0 is: "Two vectors are not colinear (not similar) no more than expected by chance (randomness)?" So if we reject H0, we can say that two vectors are colinear and their interpretation makes sense. Is this correct? What if we obtained non significant p value? Or in the other hand, what is the H0 and how it can be interpreted? What are the limitations of the Test 1?

What is puzzling me is the part related to calculation of p value: it is "H0 < observed" but not "H0 > observed" and it is expected that random angles have greater angle (dissimilarity) between them relative to observed, so Test 1 "H0 < observed" evaluates proportion of observed angle that can be observed purely by randomness. Is this correct?

Test 2 apply to the data:

library (ade4)
data ("tortues")
turtles <- log (tortues[,1:3])
sex <- factor (tortues$sexe)

set.seed(34)
test2 <- coli.perm (data = turtles, factor = sex, nperm = 10000, graph = TRUE)
unlist (test2[1:2])

Results of the Test 2:

     angle          p 
7.37502965 0.00129987

According to the original paper H0 for the Test 2 is "The null hypothesis that we want to test is that the 2 first eigenvecters are the same."

Questions related to Test 2:

Does it mean that H0 of the Test 2 evaluates the distribution of angles between PC1's of sexes assuming no grouping of the data. Statistical significance is calculated as proportion of "H0 >= observed". Therefore, significant result of the Test 2 suggests that PC1 of sexes pointing in the different direction. Is this correct?

What if we use coli function to compare PC1 loading of sexes instead of sex-specific PC1 loadings vs isometric theoretical vector:

set.seed (34)
try <- coli(ev1 = prcomp(turtles[sex == "M", ])[[2]][,1],
          ev2 = prcomp(turtles[sex == "F", ])[[2]][,1], graph = TRUE, nperm = 10000)
unlist (try)

  angle       p 
7.37503 0.00880

Does this test "try" have any relations to the Test 2 in terms of conclusions and are they even comparable tests at all? If I apply verbal argumentations Test 2 says "PC1 vectors between sexes are colinear" and test "try" says: "PC1 vectors of sexes are not pointing in the same direction". Note that, significance is differently calculated in Test 1 (H0 < observed) and Test 2 (H0 >= observed) and I am not sure if that should be that way... Also, conclusions from test "try (vectors are colinear) appear contra intuitive to what we got from Test 2 ("vectors are NOT pointing in the same direction").

I appreciate any input.

Statistical significance and angular comparison (similarity test) between two vectors, MCMC and resampling approaches

0 Answers0