What is the correct formula for between-class scatter matrix in LDA?

Question

At one point in the process of applying linear discriminant analysis (LDA), one has to find the vector $v$ that maximizes the ratio $vBv'/vWv'$, where $B$ is the "between-class scatter" matrix, and $W$ is the "within-class scatter" matrix.

We are given the following: $k$ sets of $N_{i}$ ($i=1,...,k$) vectors $\mathbf{x}_{ij}$ ($i=1,...,k$; $j=1,...,N_{i}$) from $k$ classes. The class sample means are $\mathbf{\bar{x}}_{i}=\frac{1}{N_{i}}\sum_{j=1}^{N_{i}}\mathbf{x}_{ij}$.

All sources I have looked at define $W$ as follows: $$W = \sum_{i=1}^{k}\sum_{j=1}^{N_{i}}(\mathbf{x}_{ij}-\mathbf{\bar{x}}_{i})(\mathbf{x}_{ij}-\mathbf{\bar{x}}_{i})^{T}$$

However, I have seen two different definitions for $B$. The first one, as described in Hardle et al., Applied Multivariate Statistical Analysis, 2003; Neil H. Timm, Applied Multivariate Analysis, 2002; and others, is: $$B = \sum_{i=1}^{k}N_{i}(\mathbf{\bar{x}}_{i}-\mathbf{\bar{x}})(\mathbf{\bar{x}}_{i}-\mathbf{\bar{x}})^{T}$$

Here, $\mathbf{\bar{x}}$ is the overall mean: $$\mathbf{\bar{x}}=\frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{N_{i}}\mathbf{x}_{ij}=\frac{1}{N}\sum_{i=1}^{k} N_{i}\mathbf{\bar{x}}_{i},$$ with $N=\sum_{i=1}^{k}N_{i}.$

The second one, as described in: Richard A. Johnson, Dean W. Wichern, Applied Multivariate Statistical Analysis 6th Edition, 2007; the Wikipedia article on LDA; the Scholarpedia article; and others, is: $$B^{*} = \sum_{i=1}^{k}(\mathbf{\bar{x}}_{i}-\mathbf{\bar{x}^{*}})(\mathbf{\bar{x}}_{i}-\mathbf{\bar{x}^{*}})^{T}$$ This time, $\mathbf{\bar{x}^{*}}$ is the mean of the means of the classes: $$\mathbf{\bar{x}^{*}} = \frac{1}{k}\sum_{i=1}^{k} \mathbf{\bar{x}}_{i}$$

I have worked out that both versions of $B$ are formulas for sample variance ($B^{*}$ is standard; for $B$, see wikipedia on weighted covariance). Now, I wonder:

Does anyone know the reason for the discrepancy between the formulas?
Which formula is "better"?
The two formulas should be "equivalent" in some sense; but in what sense precisely?

The second formula seems to be wrong (unless all $N_i$ are equal). $B+W$ should be equal to the total scatter matrix, and this will be true only if you are using the first formula, see [Deriving total (within class + between class) scatter matrix](http://stats.stackexchange.com/questions/8625). — amoeba, Nov 11 '14 at 07:54
@ amoeba - good point. I usually use \mathbf{}. Here, I had $\mu$'s instead of the $x_{i}$'s in an earlier draft, and \mathbf{} doesn't work with $\mu$'s. — rtm, Nov 11 '14 at 16:41
@ amoeba - the formula for $B^{*}$ is right here: http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Multiclass_LDA The notation is slightly different, but it's the same formula. — rtm, Nov 11 '14 at 16:42
@ amoeba's first comment: Thank you for that link; it does explain where one of the formulas comes from. The other formula, however, doesn't seem to be wrong, since Johnson & Wichern present it quite confidently... I wonder where it arrives from. — rtm, Nov 11 '14 at 16:53
I see; I usually use `\boldsymbol \beta` for greek letters ($\boldsymbol \beta$), it renders decently. You are right, the formula is indeed mentioned both on wikipedia and on scholarpedia. Obviously the formulas are identical if the number of samples is the same in all classes, but if not, the second formula looks misguided to me. Having $B+W$ equal to the total scatter $T$ is a nice property, and I don't see why one would want to use a formula that ruins it. In all machine learning books I know the first formula is consistently used. — amoeba, Nov 11 '14 at 21:26
In addition to the previous comment: you can look up how between-class variance is defined for usual univariate ANOVA. Everything is simpler there, because there is only one variable, but the formula is analogous to your first formula, not the second. I briefly looked in Johnson & Wichern, and you are right, they use the second formula, but I really cannot see or imagine any possible justification to it. — amoeba, Nov 11 '14 at 22:46
Thank you for sharing your thoughts, amoeba! It's good to know I'm not the only one who finds this discrepancy really strange. I also find the first formula (the one that uses the total mean) more intuitive - because of the ANOVA-like sum of variances that you mentioned, and because it uses a more natural population mean. — rtm, Nov 11 '14 at 23:50
See also this formulation: http://fourier.eng.hmc.edu/e161/lectures/classification/node4.html; I believe this is the "balanced" case — Benjamin, Feb 19 '16 at 20:32

amoeba · Accepted Answer · 2017-12-06T20:42:27.340

Within- and between-class scatter matrices in LDA are direct multivariate generalizations of the within- and between-class sums of squares in ANOVA. So let us consider those. The idea is to decompose the total sum of squares into two parts.

Let $x_{ij}$ be the $j$-th data point from the $i$-th class, which has $n_i$ data points, and let $\bar x_i$ and $\bar x$ denote the mean of class $i$ and the overall mean. The total sum of squares and the within-class sum of squares are given by the obvious expressions:

\begin{align} T &= \sum_{ij} (x_{ij} - \bar x)^2 \\ W &= \sum_{ij} (x_{ij} - \bar x_i)^2 \end{align}

Let us now derive the expression for the between-class sum of squares: \begin{align} x_{ij} - \bar x &= (\bar x_i - \bar x) + (x_{ij} - \bar x_i) \\ (x_{ij} - \bar x)^2 &= (\bar x_i - \bar x)^2 + (x_{ij} - \bar x_i)^2 + 2(\bar x_i - \bar x)(x_{ij} - \bar x_i) \\ \sum_{ij}(x_{ij} - \bar x)^2 &= \sum_{ij}(\bar x_i - \bar x)^2 + \sum_{ij}(x_{ij} - \bar x_i)^2 + 2\sum_i\left[(\bar x_i - \bar x)\sum_j(x_{ij} - \bar x_i)\right] \\ T &= \sum_i n_i (\bar x_i - \bar x)^2 + W \end{align}

The last step uses two facts: the cross term vanishes because $\sum_j (x_{ij} - \bar x_i) = 0$ in every class, and $\sum_{ij}(\bar x_i - \bar x)^2 = \sum_i n_i (\bar x_i - \bar x)^2$ because the summand does not depend on $j$. So we see that a reasonable definition for the between-class sum of squares is $$B = \sum_i n_i (\bar x_i - \bar x)^2,$$ so that $T=B+W$.

The generalization to the multivariate case is straightforward: replace all $x^2$ by $\mathbf x \mathbf x^\top$, and that's it. So the correct expression for LDA is your first formula.
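The decomposition can be checked numerically. Below is a minimal sketch assuming NumPy; the function name `scatter_matrices` and the synthetic data are my own choices for illustration and do not come from any library.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (T, W, B), with B weighted by class size (the first formula)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    grand_mean = X.mean(axis=0)
    d = X.shape[1]
    W = np.zeros((d, d))
    B = np.zeros((d, d))
    for label in np.unique(y):
        Xc = X[y == label]
        class_mean = Xc.mean(axis=0)
        centered = Xc - class_mean
        W += centered.T @ centered                  # within-class scatter
        diff = (class_mean - grand_mean)[:, None]   # column vector
        B += Xc.shape[0] * (diff @ diff.T)          # weighted by n_i
    total_centered = X - grand_mean
    T = total_centered.T @ total_centered           # total scatter
    return T, W, B

# Hypothetical unbalanced data: three classes of sizes 50, 30 and 5.
rng = np.random.default_rng(0)
sizes = [50, 30, 5]
X = np.vstack([rng.normal(loc=3.0 * i, size=(n, 2)) for i, n in enumerate(sizes)])
y = np.repeat(np.arange(len(sizes)), sizes)

T, W, B = scatter_matrices(X, y)
print(np.allclose(T, W + B))  # True: T = B + W holds for unequal class sizes
```

The identity holds for any class sizes, which is what makes the size-weighted $\mathbf B$ the natural counterpart of $\mathbf W$.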

As I said above in the comments, I cannot imagine any justification for the alternative formula (what you called $B^*$). In all the machine learning textbooks I know, the standard formula is always used. See e.g. Bishop's "Pattern Recognition and Machine Learning".

Update

I think I realized when the alternative formula might make sense. If the classes are very different in size, then the between-class scatter matrix $$\mathbf B=\sum_i n_i(\bar{\mathbf x}_i - \bar{ \mathbf x})(\bar{\mathbf x}_i - \bar{\mathbf x})^\top$$ will be dominated by the large classes. Imagine three classes with large $n_1$ and $n_2$, and small $n_3$. Then $\mathbf B$ will hardly be influenced by the third class at all, hence LDA will look for projections separating the first two classes but will not care much about how well the third class is separated. This is not always desired.

One might choose to "re-balance" such an unbalanced case and define $$\mathbf B^* = \bar n \sum_i (\bar{\mathbf x}_i - \bar{ \mathbf x}^*)(\bar{\mathbf x}_i - \bar{\mathbf x}^*)^\top,$$ where $\bar{ \mathbf x}^*$ is the mean of class means and $\bar n = \sum n_i / k$ is the mean number of points per class. This puts all classes on equal footing independent of their size, and might result in more meaningful projections.
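As a rough illustration of "equal footing", here is a minimal sketch assuming NumPy; the function name `between_scatter_balanced` and the toy data are hypothetical. Replicating the points of one class changes its size but not its mean, so $\mathbf B^*/\bar n$ should stay the same.

```python
import numpy as np

def between_scatter_balanced(X, y):
    """Balanced between-class scatter: n_bar * sum_i (m_i - m*)(m_i - m*)^T."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = np.unique(y)
    class_means = np.stack([X[y == label].mean(axis=0) for label in labels])
    mean_of_means = class_means.mean(axis=0)   # unweighted mean of class means
    n_bar = X.shape[0] / len(labels)           # mean number of points per class
    diffs = class_means - mean_of_means
    return n_bar * (diffs.T @ diffs)

# Toy data: three classes with 4 points each.
rng = np.random.default_rng(1)
centers = np.repeat([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]], 4, axis=0)
X_small = rng.normal(size=(12, 2)) + centers
y_small = np.repeat([0, 1, 2], 4)

# Same data, but class 0 is replicated 10 times (sizes 40, 4, 4).
mask = y_small == 0
X_big = np.vstack([np.tile(X_small[mask], (10, 1)), X_small[~mask]])
y_big = np.concatenate([np.zeros(40, dtype=int), y_small[~mask]])

B_small = between_scatter_balanced(X_small, y_small) / 4.0    # n_bar = 12 / 3
B_big = between_scatter_balanced(X_big, y_big) / 16.0         # n_bar = 48 / 3
print(np.allclose(B_small, B_big))  # True: class size does not matter
```

The size-weighted $\mathbf B$ would change under the same replication, because both the weight $n_1$ and the overall mean shift towards the enlarged class.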

Note that this will violate the decomposition of the sum of squares: $\mathbf T = \mathbf B + \mathbf W \ne \mathbf B^* + \mathbf W$, but this can be regarded as no big deal. However, the identity can be restored if the within-class and total scatter matrix are also defined in a "balanced" way:

\begin{equation} \mathbf T^* = \bar n \sum_{i} \frac{1}{n_i} \sum_j (\mathbf x_{ij} - \bar{\mathbf x}^*)(\mathbf x_{ij} - \bar{\mathbf x}^*)^\top \\ \mathbf W^* = \bar n \sum_{i}\frac{1}{n_i}\sum_j (\mathbf x_{ij} - \bar{ \mathbf x}_i)(\mathbf x_{ij} - \bar{\mathbf x}_i)^\top \\ \mathbf B^* = \bar n \sum_i (\bar{\mathbf x}_i - \bar{ \mathbf x}^*)(\bar{\mathbf x}_i - \bar{\mathbf x}^*)^\top. \end{equation}

If all $n_i$ are equal, these formulas will coincide with the standard ones.
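Both claims (the restored identity $\mathbf T^* = \mathbf B^* + \mathbf W^*$ and the agreement with the standard formulas for equal class sizes) can be checked with a minimal sketch, again assuming NumPy; `balanced_scatter` and the data are hypothetical names chosen for illustration.

```python
import numpy as np

def balanced_scatter(X, y):
    """Return (T*, W*, B*) with every class weighted by n_bar / n_i."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = np.unique(y)
    n_bar = X.shape[0] / len(labels)
    class_means = {label: X[y == label].mean(axis=0) for label in labels}
    mean_of_means = np.mean(list(class_means.values()), axis=0)
    d = X.shape[1]
    T = np.zeros((d, d))
    W = np.zeros((d, d))
    B = np.zeros((d, d))
    for label in labels:
        Xc = X[y == label]
        n_i = Xc.shape[0]
        dev_total = Xc - mean_of_means
        dev_within = Xc - class_means[label]
        dev_between = (class_means[label] - mean_of_means)[:, None]
        T += (n_bar / n_i) * (dev_total.T @ dev_total)
        W += (n_bar / n_i) * (dev_within.T @ dev_within)
        B += n_bar * (dev_between @ dev_between.T)
    return T, W, B

rng = np.random.default_rng(2)

# Unequal class sizes: the balanced decomposition still holds.
sizes = [40, 25, 6]
X = np.vstack([rng.normal(loc=2.0 * i, size=(n, 3)) for i, n in enumerate(sizes)])
y = np.repeat(np.arange(len(sizes)), sizes)
T_star, W_star, B_star = balanced_scatter(X, y)
print(np.allclose(T_star, W_star + B_star))  # True

# Equal class sizes: T* coincides with the standard total scatter.
X_eq = np.vstack([rng.normal(loc=2.0 * i, size=(10, 3)) for i in range(3)])
y_eq = np.repeat(np.arange(3), 10)
T_eq, W_eq, B_eq = balanced_scatter(X_eq, y_eq)
centered = X_eq - X_eq.mean(axis=0)
print(np.allclose(T_eq, centered.T @ centered))  # True
```

With equal sizes the weights $\bar n / n_i$ are all 1 and the mean of class means equals the overall mean, which is why the two sets of formulas agree.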

Good. I'll flag your reply as the right answer, because it's quite reasonable and complete. If anyone reading this in the future knows the reasoning behind the definition of the other formula, please post. — rtm, Nov 14 '14 at 18:45
I realized what the reasoning behind the alternative formula might be and made a large update. — amoeba, Jul 03 '15 at 20:14

score 1 · Answer 2 · edited Mar 05 '17 at 08:39

[@ttnphns' remark: This answer looks like a question/comment to @amoeba's answer]

Amoeba, what does $\bar{\mathbf x}_{ij}$ in $\mathbf W^*$ stand for? Should it be just $\mathbf x_{ij}$?

And in $\mathbf B^*$ you wrote $\bar{\mathbf x}^*$, but maybe it should be $\bar{\mathbf x}$ (the overall mean)?

So, should the formulas be the following?

\begin{equation} \mathbf T^* = \bar n \sum_{i} \frac{1}{n_i} \sum_j ({\mathbf x}_{ij} - \bar{\mathbf x})({\mathbf x}_{ij} - \bar{\mathbf x})^\top \\ \mathbf W^* = \bar n \sum_{i}\frac{1}{n_i}\sum_j ({\mathbf x}_{ij} - \bar{ \mathbf x}_i)({\mathbf x}_{ij} - \bar{\mathbf x}_i)^\top \\ \mathbf B^* = \bar n \sum_i (\bar{\mathbf x}_i - \bar{ \mathbf x})(\bar{\mathbf x}_i - \bar{\mathbf x})^\top. \end{equation}

If so, then they coincide exactly with the standard formulas when all $n_i$ are equal to a common value $n$, like so: \begin{equation} \mathbf W^* = \frac{kn}{k}\frac{1}{n} \sum_{i}\sum_j ({\mathbf x}_{ij} - \bar{ \mathbf x}_i)({\mathbf x}_{ij} - \bar{\mathbf x}_i)^\top = \sum_{i}\sum_j ({\mathbf x}_{ij} - \bar{ \mathbf x}_i)({\mathbf x}_{ij} - \bar{\mathbf x}_i)^\top\\ \mathbf B^* = \frac{kn}{k} \sum_i (\bar{\mathbf x}_i - \bar{ \mathbf x})(\bar{\mathbf x}_i - \bar{\mathbf x})^\top = \sum_i n_{[i]}(\bar{\mathbf x}_i - \bar{ \mathbf x})(\bar{\mathbf x}_i - \bar{\mathbf x})^\top. \end{equation}

answered Mar 05 '17 at 04:04 by Divien; edited Mar 05 '17 at 08:39 by ttnphns

Divien, if this answer is in effect an extended comment or criticism to @amoeba's answer (and you can't do without publishing it in the form of a separate answer), you should attract his attention by leaving a comment below his answer, "please look...". – ttnphns Mar 05 '17 at 08:36
Note that comments should be generally posted as comments, not answers. They may be deleted by moderators if posted as answers while they aren't real answers. – ttnphns Mar 05 '17 at 08:46
ttnphns, I would like to do that, but I can't, because I need 50 reputation points to comment. – Divien Mar 05 '17 at 23:56
Hi Divien! Unfortunately I did not notice your answer until now. Indeed there should be no bar over $x_{ij}$; I fixed it now. Thanks a lot. However, $\bar x^*$ I believe was used correctly. If you are still visiting our forum, let me know if it makes sense. (cc @ttnphns). – amoeba Dec 06 '17 at 20:44