
I know of two approaches to LDA: the Bayesian approach and Fisher's approach.

Suppose we have the data $(x,y)$, where $x$ is the $p$-dimensional predictor and $y$ is the dependent variable of $K$ classes.

In the Bayesian approach, we compute the posterior $$p(y_k|x)=\frac{p(x|y_k)p(y_k)}{p(x)}\propto p(x|y_k)p(y_k),$$ and, as the books say, assuming $p(x|y_k)$ is Gaussian, we obtain the discriminant function for the $k$th class as \begin{align*}f_k(x)&=\ln p(x|y_k)+\ln p(y_k)\\&=\ln\left[\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)\right)\right]+\ln p(y_k)\\&=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\ln p(y_k),\end{align*} where the quadratic term $-\frac{1}{2}x^T\Sigma^{-1}x$ and the normalizing constant have been dropped because (with a common $\Sigma$) they are the same for every class and do not affect the comparison. I can see that $f_k(x)$ is a linear function of $x$, so for the $K$ classes we have $K$ linear discriminant functions.
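To check the algebra numerically, here is a small sketch I put together (the names `lda_scores`, `mus`, `Sigma`, and `priors` are just placeholders, not from any book):

```python
import numpy as np

def lda_scores(x, mus, Sigma, priors):
    """Linear discriminant scores
    f_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + ln p(y_k),
    one score per class."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = []
    for mu_k, pi_k in zip(mus, priors):
        scores.append(x @ Sigma_inv @ mu_k
                      - 0.5 * mu_k @ Sigma_inv @ mu_k
                      + np.log(pi_k))
    return np.array(scores)

# Classification: assign x to the class with the largest score, e.g.
# k_hat = np.argmax(lda_scores(x_new, mus, Sigma, priors))
```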

In Fisher's approach, on the other hand, we project $x$ onto a $(K-1)$-dimensional space to extract new features that minimize the within-class variance and maximize the between-class variance; say the projection matrix is $W$, with each column being a projection direction. This approach is more like a dimension-reduction technique.
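Concretely (standard notation, not tied to a particular book), with between-class and within-class scatter matrices $S_B$ and $S_W$, each column $w$ of the projection matrix maximizes the Rayleigh quotient $$J(w)=\frac{w^{T}S_{B}w}{w^{T}S_{W}w},$$ and the maximizers turn out to be the leading eigenvectors of $S_W^{-1}S_B$, of which at most $K-1$ have nonzero eigenvalues.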

My questions are

(1) Can we do dimension reduction using the Bayesian approach? I mean, we can use the Bayesian approach to do classification by finding the discriminant functions $f_k(x)$ and assigning a new point $x^*$ to the class whose $f_k(x^*)$ is largest, but can these discriminant functions $f_k(x)$ also be used to project $x$ onto a lower-dimensional subspace, just as Fisher's approach does?

(2) Do the two approaches relate to each other, and if so, how? I don't see any relation between them, because one seems only able to classify using the $f_k(x)$ values, while the other is primarily aimed at dimension reduction.

UPDATE

Thanks to @amoeba, I found this in the ESL book:

[Excerpt from ESL, section 4.3: the linear discriminant function $\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\log\pi_k$.]

This is the linear discriminant function, derived via Bayes' theorem plus the assumption that all classes have the same covariance matrix $\Sigma$, and it is the SAME as the $f_k(x)$ I wrote above.

Can I use $\Sigma^{-1}\mu_k$ as the direction on which to project $x$, in order to do dimension reduction? I'm not sure about this, since, AFAIK, dimension reduction is achieved by doing the between-within variance analysis.

UPDATE AGAIN

From section 4.3.3, this is how those projections are derived:

[Excerpt from ESL, section 4.3.3: the computational steps for the discriminant coordinates, based on the common within-class covariance matrix $W$ and the class centroids.]

Of course it assumes a shared covariance among classes, that is, the common within-class covariance matrix $W$, right? My problem is how to compute this $W$ from the data, since I would get $K$ different within-class covariance matrices if I computed one per class. Do I have to pool all the classes' covariances together to obtain a common one?

avocado
  • As a side remark, I think Fisher's approach is more related to ANOVA. – avocado Feb 28 '14 at 10:26
  • LDA is indeed related to MANOVA (which is related to ANOVA), see [my answer here](http://stats.stackexchange.com/questions/82959/how-is-manova-related-to-lda/83935#83935). – amoeba Mar 01 '14 at 15:32
  • Your question mixes up two things. I think you haven't digested our conversation over your [previous](http://stats.stackexchange.com/q/87479/3277) question. What you describe first is the Bayesian approach to classification (not a "Bayesian approach to LDA"). This approach can be used (1) with the original variables as classifiers or (2) with the discriminants obtained in LDA as classifiers. What is Fisher's approach then? – ttnphns Mar 06 '14 at 05:37
  • (Cont.) Well, ["Fisher's LDA"](http://stats.stackexchange.com/a/71571/3277) is simply LDA with K=2. When doing classification within such an LDA, Fisher invented his own formulas for classification. [These formulas](http://stats.stackexchange.com/a/31384/3277) also work for K>2. His method of classification is hardly used nowadays because the Bayes approach is more general. – ttnphns Mar 06 '14 at 05:43
  • @ttnphns, yes, I also think the Bayesian approach is just a way to do classification; it has nothing to do with LDA. By *Fisher's approach* I mean the method of maximizing the ratio of `between-group variance over within-group variance`, by which we find the directions onto which to project the original variables $X$ and get the discriminants, right? And I guess this is LDA proper, isn't it? – avocado Mar 06 '14 at 05:49
  • @ttnphns, the reason I'm confused is that almost every book I've referred to talks about LDA using this Bayesian approach, presenting LDA as a generative model; they don't even mention the ratio of between-group variance to within-group variance. – avocado Mar 06 '14 at 05:51
  • I have read rather few data mining texts (I'm more of an old-school classic statistician), so I don't know. I suspect the books you refer to might be speaking about classification outside of LDA, or they just describe LDA very superficially, because it is impossible to treat LDA seriously without mentioning the between-within ratio objective. – ttnphns Mar 06 '14 at 06:02
  • @ttnphns, *between-within variance analysis* is the only authentic way to do LDA, right? I mean, to find those discriminants, I have to use this *between-within variance* approach, don't I? – avocado Mar 06 '14 at 06:18
  • @loganecolss: Have you seen my answer below? Do you have any questions about it? I am a bit confused, because I thought I explained what you are now asking again in the comments. The "between-within variance" approach is mathematically equivalent to the "Bayesian approach" under the assumption of equal covariances. You can think of this as a surprising mathematical theorem, if you want. The proof is given in Hastie's book, which is freely available online, and in some other machine learning textbooks as well. So I am not sure what "the only authentic way to do LDA" could mean; these are two identical ways. – amoeba Mar 06 '14 at 08:01
  • @amoeba, yes, I did read your answer, but I don't think the "Bayesian approach" is equivalent to the "between-within variance" approach. If they were equivalent, I should be able to derive those projection directions using the Bayesian approach, but it seems I can't, just as I said in my question (1). – avocado Mar 06 '14 at 10:55
  • @loganecolss: Believe me, they are equivalent :) Yes, you should be able to derive the projections, but you need an additional assumption of equal covariance matrices (as I wrote in my answer). See my comment below. – amoeba Mar 06 '14 at 11:10

1 Answer


I will provide only a short informal answer and refer you to section 4.3 of The Elements of Statistical Learning for the details.

Update: "The Elements" happen to cover in great detail exactly the questions you are asking here, including what you wrote in your update. The relevant section is 4.3, and in particular 4.3.2-4.3.3.

(2) Do the two approaches relate to each other, and if so, how?

They certainly do. What you call the "Bayesian" approach is more general and only assumes a Gaussian distribution for each class. Your log-likelihood is essentially (up to constants) the squared Mahalanobis distance from $x$ to the centre of each class.
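(To spell that out with standard Gaussian algebra, allowing a class-specific covariance $\Sigma_k$: $$-2\ln p(x\mid y_k)=(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k)+\ln|\Sigma_k|+p\ln 2\pi,$$ where the first term is the squared Mahalanobis distance from $x$ to $\mu_k$.)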

You are of course right that for each class it is a linear function of $x$. However, note that the log-ratio of the likelihoods for two different classes (which is what you use to perform an actual classification, i.e. to choose between classes) is not going to be linear in $x$ if the classes have different covariance matrices. In fact, if one works out the boundaries between classes, they turn out to be quadratic, which is why that procedure is called quadratic discriminant analysis, QDA.
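To see why, write out the log-ratio for two classes with covariances $\Sigma_k$ and $\Sigma_l$ (my notation; $c$ collects the terms that do not depend on $x$): $$\ln\frac{p(x\mid y_k)}{p(x\mid y_l)}=-\frac{1}{2}\,x^{T}\left(\Sigma_k^{-1}-\Sigma_l^{-1}\right)x+x^{T}\left(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l\right)+c.$$ The quadratic term vanishes only when $\Sigma_k=\Sigma_l$.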

An important insight is that the equations simplify considerably if one assumes that all classes have an identical covariance matrix [Update: if you assumed this all along, it might have been part of the misunderstanding]. In that case the decision boundaries become linear, and that is why the procedure is called linear discriminant analysis, LDA.

It takes some algebraic manipulations to realize that in this case the formulas actually become exactly equivalent to what Fisher worked out using his approach. Think of that as a mathematical theorem. See Hastie's textbook for all the math.

(1) Can we do dimension reduction using the Bayesian approach?

If by "Bayesian approach" you mean dealing with different covariance matrices in each class, then no. At least it will not be a linear dimensionality reduction (unlike LDA), because of what I wrote above.

However, if you are happy to assume a shared covariance matrix, then yes, certainly, because the "Bayesian approach" is then simply equivalent to LDA. But if you check Hastie, section 4.3.3, you will see that the correct projections are not given by $\Sigma^{-1} \mu_k$ as you wrote (I don't even see what that would mean: those directions depend on $k$, whereas a projection usually means a way to map all points from all classes onto the same lower-dimensional subspace), but by the leading [generalized] eigenvectors of $\boldsymbol \Sigma^{-1} \mathbf{M}$, where $\mathbf{M}$ is the covariance matrix of the class centroids $\mu_k$.
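To illustrate that recipe, here is a minimal sketch (not code from ESL; the names `lda_directions`, `X`, `y`, and `n_components` are mine, and the between-centroid covariance is weighted by class size here, which is one common convention):

```python
import numpy as np
from scipy.linalg import eigh

def lda_directions(X, y, n_components=None):
    """LDA projection directions: leading generalized eigenvectors of B w = lambda W w,
    i.e. eigenvectors of W^{-1} B, with W the pooled within-class covariance and
    B the (class-size-weighted) covariance of the class centroids."""
    classes = np.unique(y)
    n, p = X.shape
    grand_mean = X.mean(axis=0)

    W = np.zeros((p, p))   # pooled within-class covariance
    B = np.zeros((p, p))   # between-class (centroid) covariance
    for k in classes:
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        W += (Xk - mu_k).T @ (Xk - mu_k)
        d = (mu_k - grand_mean)[:, None]
        B += len(Xk) * (d @ d.T)
    W /= n
    B /= n

    # Generalized symmetric eigenproblem B v = lambda W v (W assumed positive definite);
    # numerically nicer than explicitly forming W^{-1} B.
    eigvals, eigvecs = eigh(B, W)
    order = np.argsort(eigvals)[::-1]       # largest eigenvalues first
    if n_components is None:
        n_components = len(classes) - 1     # at most K-1 useful directions
    return eigvecs[:, order[:n_components]]

# Usage: Z = X @ lda_directions(X, y) gives the discriminant coordinates.
```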

amoeba
  • +1. I might also link to my own answer mentioning QDA: http://stats.stackexchange.com/a/71571/3277. – ttnphns Mar 06 '14 at 05:55
  • +1 for addressing my question (2). I know that by doing the *between-within variance* analysis I can find the best directions onto which to project the original variables $X$ and get those discriminants. What I'm struggling with right now is: *can I find those projection directions using the Bayesian approach, without referring to the between-within variance ratio*? – avocado Mar 06 '14 at 10:58
  • @loganecolss: As I said, you need to assume additionally that all classes have the same covariance matrix! Then starting with your Bayesian approach + this assumption, you can derive the standard LDA projections. The idea is to diagonalize $\boldsymbol \Sigma$. This is written in some detail in The Elements of Statistical Learning, section 4.3. – amoeba Mar 06 '14 at 11:10
  • I'll read that section later. As you said, assuming all classes have the same covariance matrix, I can derive the function $f_k(x)$ I wrote in my post, right? And since $f_k(x)$ is indeed a linear function of $x$, should $\Sigma^{-1}\mu_k$ then be the LDA projection matrix, according to your comment? – avocado Mar 06 '14 at 11:57
  • I updated my post, adding a clip of section 4.3. – avocado Mar 06 '14 at 12:12
  • @loganecolss: Just saw your update. Why don't you keep reading? :) The stuff you are asking about is covered in 4.3.2-4.3.3. I will update my answer later. – amoeba Mar 06 '14 at 12:17
  • I read 4.3.2-4.3.3, and I'm still trying to digest those formulas. If I have any questions about that part, I'll come back again. Thanks for pointing that section out ;-) – avocado Mar 08 '14 at 07:56
  • @amoeba, I read that section again and I have one more question: how do I compute the common covariance matrix? I mean, if I compute covariance matrices from data with $K$ classes, each class will have its own covariance matrix, which might differ from the others. How do I obtain the common covariance? (I updated my post, adding how the projection is done in section 4.3.3.) – avocado Mar 09 '14 at 05:28
  • @loganecolss: Good question. Let's say in each class $k$ you have a scatter matrix $S_k = (X_k-\mu_k)^\top (X_k-\mu_k)$ and a covariance matrix $C_k = S_k / N_k$. Then the common covariance matrix $W$ is $\sum_k S_k / \sum_k N_k = \sum_k N_k C_k / \sum_k N_k$, i.e. a weighted average of the individual covariance matrices; if the number of observations is the same in all classes, it is simply the average covariance matrix. (A minimal sketch of this pooling follows below.) – amoeba Mar 09 '14 at 16:23
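A minimal sketch of the pooling described in the last comment (my variable names; divide by $N-K$ instead of $N$ if you prefer the unbiased convention):

```python
import numpy as np

def pooled_within_covariance(X, y):
    """Common (pooled) within-class covariance: sum of the class scatter
    matrices divided by the total number of observations."""
    classes = np.unique(y)
    p = X.shape[1]
    S = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        Xc = Xk - Xk.mean(axis=0)
        S += Xc.T @ Xc          # class scatter matrix S_k
    return S / len(X)           # or S / (len(X) - len(classes)) for the unbiased version
```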