
I am used to thinking of correspondence analysis (CA) as dissecting the weighted departure from independence through a singular value decomposition, but I cannot relate this to constrained correspondence analysis.

Say I want to analyse an n-by-p count matrix Y with samples in the rows and variables in the columns. Let R be the diagonal matrix with the row sums of Y on the diagonal, and K the diagonal matrix with the column sums of Y on the diagonal. Then $E = R11^TK/q$, with $1$ a properly sized vector of ones and $q$ the total sum of Y, represents the expected counts of Y under row-column independence. I would then proceed with the singular value decomposition $$R^{-1/2}(Y-E)K^{-1/2} = U \Sigma V^T$$ which gives me a decomposition of how Y departs from E, weighted appropriately.
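To make the notation concrete, here is a minimal NumPy sketch of the simple CA step described above. The count matrix is simulated purely for illustration (its size, the Poisson rate, and the seed are assumptions, not data from any real study), and the variable names mirror the symbols in the text.

```python
import numpy as np

# Hypothetical 6-by-4 count matrix; the +1 only keeps all row/column sums positive.
rng = np.random.default_rng(0)
Y = rng.poisson(5.0, size=(6, 4)).astype(float) + 1.0

r = Y.sum(axis=1)   # row sums    (diagonal of R)
k = Y.sum(axis=0)   # column sums (diagonal of K)
q = Y.sum()         # grand total

# Expected counts under row-column independence: E = R 1 1^T K / q
E = np.outer(r, k) / q

# Weighted departure from independence: R^{-1/2} (Y - E) K^{-1/2}
X = (Y - E) / np.sqrt(np.outer(r, k))

# Singular value decomposition X = U diag(s) V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The squared singular values sum to the Pearson chi-square statistic
# divided by the grand total (the total inertia).
chi2 = ((Y - E) ** 2 / E).sum()
total_inertia = (s ** 2).sum()
```

Because the rows and columns of $Y-E$ sum to zero, the last singular value is zero up to rounding, so at most min(n, p) - 1 dimensions carry information.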

However, I want to perform constrained or canonical correspondence analysis (CCA), with an n-by-d constraining matrix Z of environmental variables. Based on verbal descriptions, I would expect to regress $R^{-1/2}(Y-E)K^{-1/2}$ on Z, with weights equal to the row totals of Y. I would then use the fitted values $$F = Z (Z^T R Z)^{-1}Z^TRR^{-1/2}(Y-E)K^{-1/2}$$ to obtain the part of the departure from independence that can be explained by Z. I suppose the weighting is there because samples with more counts are supposed to carry more information. I would expect to use the singular value decomposition of F directly to make a biplot that represents these departures in a few dimensions. However, "Simple and Canonical Correspondence Analysis Using the R Package anacor" and "History of canonical correspondence analysis" do not use this $R^{-1/2}(Y-E)K^{-1/2}$ matrix but work directly on Y. They decompose $$F' = (Z^T R Z)^{-1/2}Z^TYK^{-1/2} = P \Sigma Q^T$$ and set the column scores to $S = K^{-1/2}Q$ and the variable scores to $B = (Z^T R Z)^{-1/2}P\Sigma$.
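The two constructions can be put side by side numerically. The following is a minimal NumPy sketch under stated assumptions: Y and Z are simulated (sizes, distributions, and seed are hypothetical), Z is used as generated with no centring or standardisation (real implementations may preprocess it), and the inverse square root of $Z^T R Z$ is taken as the symmetric one via an eigendecomposition. It only computes the quantities defined above and does not try to settle how they relate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, d = 8, 5, 2
Y = rng.poisson(5.0, size=(n, p)).astype(float) + 1.0  # hypothetical counts
Z = rng.normal(size=(n, d))                             # hypothetical constraints

r, k, q = Y.sum(axis=1), Y.sum(axis=0), Y.sum()
R = np.diag(r)
K_inv_sqrt = np.diag(1.0 / np.sqrt(k))

E = np.outer(r, k) / q                   # expected counts under independence
X = (Y - E) / np.sqrt(np.outer(r, k))    # R^{-1/2} (Y - E) K^{-1/2}

# Formulation 1: weighted least-squares fit of X on Z with weights R,
# F = Z (Z^T R Z)^{-1} Z^T R X
G = Z.T @ R @ Z
F = Z @ np.linalg.solve(G, Z.T @ R @ X)
sv_F = np.linalg.svd(F, compute_uv=False)

# Formulation 2: F' = (Z^T R Z)^{-1/2} Z^T Y K^{-1/2}
w, V = np.linalg.eigh(G)                 # G is symmetric positive definite here
G_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
F_prime = G_inv_sqrt @ Z.T @ Y @ K_inv_sqrt
P, sigma, Qt = np.linalg.svd(F_prime, full_matrices=False)

S = K_inv_sqrt @ Qt.T                    # column scores,   p-by-d
B = G_inv_sqrt @ P @ np.diag(sigma)      # variable scores, d-by-d

print("singular values of F :", sv_F[:d])
print("singular values of F':", sigma)
```

Printing both sets of singular values, with and without preprocessing of Z, is one way to see empirically how far the two formulations agree on a given data set.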

There are many things I do not really understand about CCA, but my main question is:

"Why is there no departure from independence term (along the lines of $Y-E$) present in this formula anymore?"

It seems to be analyzing the count matrix Y directly rather than the departure from independence. How are the row and column scores then to be interpreted? I am looking for an intuitive explanation, supported by matrix equations.

kjetil b halvorsen
Knarpie
  • I'm not an expert on canonical CA, so I can't answer. But did you try setting the constraining data to a state where there is no constraint? Then the results should presumably match those of simple CA. And if not, then possibly you are doing something wrong in pre-processing? I believe the best way to understand the algorithm is to try to code it yourself (perhaps not optimally, but working). – ttnphns Aug 25 '17 at 15:44
  • Though it isn't of your interest perhaps, here is my explanation of svd-based _simple_ CA seen along with PCA and biplot. https://stats.stackexchange.com/q/141754/3277 – ttnphns Aug 25 '17 at 15:46
  • @ttnphns Thanks for your suggestions, but it is not that I cannot reproduce the result, I am looking for an explanation. In relation to your post: in my eyes, a biplot is a _plot_ to jointly represent two variables and a relationship between them. It is often the result of a dimension reduction, but it is not an analysis technique in its own right. See also the tag of "biplot" – Knarpie Aug 28 '17 at 12:17
  • Dimension reduction is an analysis. – ttnphns Aug 28 '17 at 12:25
  • But plotting it is not – Knarpie Aug 28 '17 at 12:35

0 Answers