According to Bishop's Pattern Recognition and Machine Learning, the cost function for linear discriminant analysis (LDA) with $K>2$ classes is $$J(\mathbf W) = \mathrm{Tr}\left\{\left(\mathbf W \mathbf S_\mathrm W \mathbf W^T\right)^{-1} \left(\mathbf W \mathbf S_\mathrm B \mathbf W^T\right)\right\}.$$ It is stated in the text:
$\mathbf{S}_\mathrm B = \sum_{k=1}^K N_k(\mathbf m_k - \mathbf m)(\mathbf m_k - \mathbf m)^T$ is composed of the sum of $K$ matrices, each of which is an outer product of two vectors and therefore of rank 1. In addition, only $(K-1)$ of these matrices are independent, as a result of the constraint $$\mathbf m = \frac{1}{N}\sum_{n=1}^N \mathbf x_n = \frac{1}{N} \sum_{k=1}^K N_k \mathbf m_k.$$ Thus, $\mathbf S_\mathrm B$ has rank at most equal to $(K-1)$ and so there are at most $(K-1)$ nonzero eigenvalues. This shows that the projection onto the $(K-1)$-dimensional subspace spanned by the eigenvectors of $\mathbf S_\mathrm B$ does not alter the value of $J(\mathbf W)$, and so we are therefore unable to find more than $(K-1)$ linear 'features' by this means (Fukunaga, 1990).
Questions
- Why does the constraint imply that only $(K-1)$ of these matrices are independent?
- I can't see why the projection onto the subspace spanned by the eigenvectors of $\mathbf S_\mathrm B$ does not alter the value of $J(\mathbf W)$, and therefore why we are unable to find more than $(K-1)$ linear features.
- What does it imply that we cannot find more than $(K-1)$ linear features? I can't see why the author is telling me this.
(Could you please make the math explicit? I find the textbook rather hard to follow because many mathematical details are left out.)
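
To make sure I'm at least reading the definitions correctly, here is a small numerical check I put together (Python/NumPy; the dimensions, class sizes, and data are made up). It builds $\mathbf S_\mathrm B$ from random data and confirms that its rank comes out as $K-1$, as the text claims:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 5, 3                       # data dimension and number of classes (arbitrary)
Ns = [20, 30, 25]                 # class sizes N_k (arbitrary)

# Random data per class, shifted so the class means differ
X = [rng.normal(loc=k, size=(Ns[k], D)) for k in range(K)]

m = np.concatenate(X).mean(axis=0)          # overall mean m
m_k = [Xk.mean(axis=0) for Xk in X]         # class means m_k

# Between-class scatter: S_B = sum_k N_k (m_k - m)(m_k - m)^T
S_B = sum(N * np.outer(mk - m, mk - m) for N, mk in zip(Ns, m_k))

print(np.linalg.matrix_rank(S_B))           # prints 2, i.e. K - 1
```

So I can verify the rank claim numerically; what I'm missing is the reasoning behind it and behind the consequences stated above.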