
I'm using R to compare samples of varying chemical composition, following on from an article I've read. In it, the authors used CDA to do something very similar to what I want to do, but another researcher has told me (without much of an explanation) that LDA would be better suited. I could go into the specifics of why supervised learning is the chosen avenue, etc., but I won't post that unless someone asks.

After doing some background reading (which hasn't really cleared up the difference between the two), I figured I'd try to explore this myself and compare the results. The primary difference between my data and that in this article is that instead of just using the compositions, I've created 3 new variables (S-, F- and V-) for the CDA that are functions of the original compositional data (see code below).

However, when I run the two analyses I get EXACTLY the same results - identical plots. This doesn't seem possible, but I can't find an error in my coding.

My two questions are:

  1. Is it possible for LDA and CDA to return the exact same result?

  2. What are the practical differences between LDA and CDA?


Data:

library(MASS)      # lda()
library(candisc)   # candiscList()
library(ggplot2)   # plotting

# simulated compositional data for 20 samples from four countries
al2o3 <- runif(20, 5, 10)
sio2  <- runif(20, 10, 30)
feo   <- runif(20, 40, 60)
country <- c(rep("England", 6), rep("Scotland", 6), rep("Wales", 4), rep("France", 4))
df <- data.frame(country, al2o3, sio2, feo)

LDA:

# LDA on the raw compositions
lda_fit <- lda(country ~ feo + sio2 + al2o3, data = df)
plda <- predict(lda_fit, newdata = df)
dataset <- data.frame(country = df[, "country"], lda = plda$x)
ggplot(dataset) + geom_point(aes(lda.LD1, lda.LD2, colour = country))

CDA:

# S-, F- and V-values derived from the compositions
fvalue <- df$al2o3 / df$sio2
svalue <- (2.39 * df$feo) / (df$al2o3 + df$sio2)
vvalue <- df$sio2 / df$feo

# CDA on the raw compositions (the same variables as the LDA above)
mod  <- lm(cbind(feo, sio2, al2o3) ~ country, data = df)
can2 <- candiscList(mod)

# CDA on the derived S-, F- and V-values
mod2 <- lm(cbind(fvalue, svalue, vvalue) ~ country, data = df)
can3 <- candiscList(mod2)

# plot of the raw-composition CDA (can2)
ggplot(can2$country$scores, aes(x = Can1, y = Can2)) + geom_point(aes(color = country))
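
As a sanity check (a quick sketch of my own on the objects created above, not anything from the article), I can correlate the scores directly instead of comparing plots by eye; can2 is the raw-composition CDA and can3 the S/F/V one:

# LDA vs. CDA on the raw compositions: correlations should be +/- 1 if they agree
round(cor(plda$x[, "LD1"], can2$country$scores$Can1), 6)
round(cor(plda$x[, "LD2"], can2$country$scores$Can2), 6)

# CDA on the derived S-, F- and V-values is a genuinely different analysis,
# so this correlation will generally not be +/- 1
round(cor(plda$x[, "LD1"], can3$country$scores$Can1), 6)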
  • Why are you surprised? That's just two names for the same thing. – amoeba Aug 02 '16 at 11:37
  • Thanks for the response amoeba - that's kind of the sneaking suspicion I've had... but then why do the two names exist? Do you know of any citation that the two techniques are identical? The introductory books I've looked through haven't said as much. I think I'd need some kind of reasoning to justify why I call it LDA v.s. CDA for my research. – Scott Aug 02 '16 at 14:36
  • What introductory book does the "CDA" name come from? – amoeba Aug 02 '16 at 14:50
  • It's from this particular article - looking at it again, it actually says "canonical linear discriminant analysis, or CDA". So if the two are the same, then I must have gotten mixed up by not seeing the acronym LDA. Then the presence of the `candisc` function made me even more confused. Thanks - repost your comment as an answer and I'll accept it! – Scott Aug 02 '16 at 14:59
  • I can't be sure what those authors of that article call canonical DA, but modern LDA _is_ canonical LDA (see footnote to [my answer](http://stats.stackexchange.com/a/190821/3277) for example) because the latent roots of $W^{-1}B$ matrix are called "canonical". – ttnphns Aug 02 '16 at 15:04

1 Answer


These are two names for the same thing.

Linear discriminant analysis (LDA) is called a lot of different names. I have seen

  • linear discriminant analysis (LDA),
  • Fisher's linear discriminant analysis,
  • canonical discriminant analysis (CDA),
  • canonical linear discriminant analysis,
  • canonical variate(s) analysis,

and possibly some others. I suspect different names might be used in different applied fields. In machine learning, "linear discriminant analysis" is by far the most standard term and "LDA" is a standard abbreviation.


The reason for the term "canonical" is probably that LDA can be understood as a special case of canonical correlation analysis (CCA). Specifically, the "dimensionality reduction part" of LDA is equivalent to doing CCA between the data matrix $\mathbf X$ and the group indicator matrix $\mathbf G$. The indicator matrix $\mathbf G$ is a matrix with $n$ rows and $k$ columns with $G_{ij}=1$ if $i$-th data point belongs to class $j$ and zero otherwise. [Footnote: this $\mathbf G$ should not be centered.]

This fact is not at all obvious and has a proof, which this margin is too narrow to contain.
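
As a rough numerical check of this correspondence (a sketch only, using base R's cancor() on the toy df and the plda object from the question; scaling conventions differ between implementations, so I only compare correlations of scores):

X <- as.matrix(df[, c("feo", "sio2", "al2o3")])

# 0/1 indicator matrix G, one column per country, NOT centered
G <- model.matrix(~ country - 1, data = df)

# CCA between the (centered) data and the raw indicator matrix
cc <- cancor(X, G, xcenter = TRUE, ycenter = FALSE)

# canonical variates on the X side
Xc <- scale(X, center = TRUE, scale = FALSE)
cca_scores <- Xc[, rownames(cc$xcoef)] %*% cc$xcoef

# each canonical variate should be perfectly correlated (up to sign)
# with the corresponding discriminant from lda()
round(cor(cca_scores[, 1], plda$x[, "LD1"]), 6)   # +/- 1
round(cor(cca_scores[, 2], plda$x[, "LD2"]), 6)   # +/- 1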

  • `n rows and k columns with...` k-1 columns? – ttnphns Aug 05 '16 at 11:11
  • @ttnphns No, I really meant $k$ columns. I am certain that CCA will give LDA result in this case. Of course the $k$ columns are linearly dependent and any one of them can be dropped to make $G$ full rank; I think that after one of the columns is dropped, CCA will still give the same result, but I am not 100% sure at the moment. Can you confirm that? – amoeba Aug 05 '16 at 13:16
  • I don't know how it will give any result at all. The implementation which I [know well](http://stats.stackexchange.com/a/77309/3277) uses Cholesky function which won't allow singularity of any of the two correlation matrices. Besides, if even linear regression (standard algorithm) won't tolerate singularity, why multivariate regression (such as CCA) should allow it? – ttnphns Aug 05 '16 at 16:51
  • @ttnphns Oh, yes. Thanks for bringing it up. I figured out what's going on. The CCA should be applied between $X$ and $G$ (where $G$ has all $k$ columns as I wrote), but without centering $G$. Without centering it is full rank and can be inverted (or one can use Cholesky) without a problem. By the way, CCA-LDA correspondence works because $(G^\top G)^{-1} G^\top X$ will be a matrix filled with class means, so one quickly gets to the between-class scatter matrix via the CCA formulas. For this it is important that $G$ remains non-centered, in its original zeros-and-ones form. Does it make sense? – amoeba Aug 05 '16 at 22:05
  • I've updated [one of my answers](http://stats.stackexchange.com/a/169483/3277) to highlight the matter and how I find it. Please share your thoughts if you've got any. – ttnphns Aug 06 '16 at 16:48
  • @ttnphns I think what you wrote there is correct (by the way, this answer of yours as well as another related one that you now edited too, both have my enthusiastic +1s). For me, non-centering the indicator matrix $G$ and including all $k$ columns is more convenient (and not "less convenient") because it makes the algebra easier, at least in the derivations that I know. Perhaps there are some alternative derivations that would be easier with $k-1$ centered columns. And it might very well be that the latter approach is more convenient in practice. Thanks for this exchange, it was useful. – amoeba Aug 06 '16 at 23:17
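
As a small numerical illustration of the $(G^\top G)^{-1} G^\top X$ point from the comments above (a sketch reusing the X and G objects defined in the cancor() check earlier): the columns of the non-centered indicator matrix are disjoint 0/1 vectors, so $G^\top G$ is the diagonal matrix of class counts and $(G^\top G)^{-1} G^\top X$ is exactly the $k \times p$ matrix of class means.

# (G'G)^{-1} G'X ...
M1 <- solve(t(G) %*% G) %*% t(G) %*% X
# ... equals the per-country means of the compositions
M2 <- rowsum(X, df$country) / as.vector(table(df$country))
all.equal(M1, M2, check.attributes = FALSE)   # TRUE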