
Resources about LDA usually say that the number of components is bounded by the number of classes minus 1; e.g., in the binary case, only one component can be found.

In LDA, the first discriminant direction $\Phi_1$ is computed as the argmax of $\frac{\Phi_1^T S_b \Phi_1}{\Phi_1^T S_w \Phi_1}$, where $S_b$ and $S_w$ are the between-class and within-class covariance matrices, respectively. Why can't we continue in this way and compute the $i$th direction $\Phi_i$ as the argmax of $\frac{\Phi_i^T S_b \Phi_i}{\Phi_i^T S_w \Phi_i}$ under the constraint of orthogonality to $\Phi_1, \Phi_2, \dots, \Phi_{i-1}$, as is done in PCA?

Ostensibly, in the binary case, where each $\Phi_i$ is a vector, one could do this $n$ times if the inputs are $n$-dimensional vectors.
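
For concreteness, here is a minimal numerical sketch (my own synthetic two-class Gaussian data; the dimension $n = 5$ and the sample sizes are arbitrary choices). The stationary points of the above ratio are eigenvectors of $S_w^{-1} S_b$, so the sketch builds $S_w$ and $S_b$ from their definitions and looks at the eigenvalues of that matrix; in the binary case only one of them comes out numerically nonzero.

```python
# Minimal sketch: binary LDA in n = 5 dimensions (synthetic data, arbitrary choices).
import numpy as np

rng = np.random.default_rng(0)
n = 5                                      # input dimensionality
X1 = rng.normal(loc=0.0, size=(100, n))    # class 1 samples
X2 = rng.normal(loc=2.0, size=(120, n))    # class 2 samples, shifted mean

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)  # class means
m = np.vstack([X1, X2]).mean(axis=0)       # overall mean

# Within-class scatter: sum of the per-class scatter matrices.
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
# Between-class scatter: weighted outer products of (class mean - overall mean).
Sb = len(X1) * np.outer(m1 - m, m1 - m) + len(X2) * np.outer(m2 - m, m2 - m)

# Eigenvalues of Sw^{-1} Sb: only one is non-negligible, the rest are numerically zero.
eigvals = np.linalg.eigvals(np.linalg.inv(Sw) @ Sb)
print(np.sort(np.abs(eigvals))[::-1])
```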

conjectures
user1767774
  • https://stats.stackexchange.com/a/190821/3277 "Then q=g−1=2 independent dimensions _will suffice_ to predict the class membership as precisely as formerly" – ttnphns Feb 02 '20 at 12:23
  • If I understand this answer correctly, $c-1$ directions ($c$ is the number of classes) are enough to maximize accuracy under the normality and same-covariance assumption, but if one wants to perform dimensionality reduction, one can continue calculating all directions $\Phi_1, \Phi_2 \dots \Phi_n$? – user1767774 Feb 02 '20 at 12:48
  • If by dimensionality reduction you mean PCA, then yes, you might extract up to n components. But if you mean dim. reduction by means of linear discriminants, then their max. number is min(n, c-1). This is the dimensionality spanned by the $S_w^{-1}S_b$ matrix. – ttnphns Feb 02 '20 at 13:04
  • I meant dimensionality reduction via the LDA criterion. Can you please elaborate on the last sentence - what do you mean by the dimensionality spanned by $S_w^{-1}S_b$? If I have 2 classes, can't I find the direction $\Phi_2$ which is the *second best* direction in minimizing the LDA loss? – user1767774 Feb 02 '20 at 14:04
  • The number of nonzero eigenvalues of the aforementioned matrix is no greater than min(n, c-1). You could compute any number of eigenvectors, but all those corresponding to nonexistent eigenvalues (dimensions) are just numeric rubbish. – ttnphns Feb 02 '20 at 14:35
  • Algebra of LDA https://stats.stackexchange.com/a/48859/3277 – ttnphns Feb 02 '20 at 14:37
  • `which is the second best direction in minimizing the LDA loss?` Please return to my first link. If you have 2 data clouds of identical cov matrices (i.e., identical shape and space orientation), there is _no_ "LDA loss" beyond the single dimension. One dimension suffices. LDA "loss" is _separability loss_, not variability loss as in PCA. – ttnphns Feb 02 '20 at 15:08
  • OK, thanks. Can you please share a reference regarding the number of nonzero eigenvalues of $S_w^{-1}S_b$? – user1767774 Feb 02 '20 at 15:55
  • To follow your notation, n is the number of variables and c the number of groups. Then the rank of Sw (if there is no multicollinearity) is n, and the rank of Sb (when the data are centered, and LDA centers the data) is c-1. Hence the rank of $S_w^{-1}S_b$ is min(n, c-1). – ttnphns Feb 02 '20 at 16:39
  • Understood, thanks! – user1767774 Feb 02 '20 at 21:12

1 Answer


The rank of the between-class scatter matrix $S_B$ for the whole data set is at most $c-1$ ($c$ is the number of classes). The rank of the individual between-class scatter matrix $S_{Bi}$ for a single class is at most $1$, and the former matrix is the weighted sum of the latter matrices.

Since $\operatorname{rank}(AB)\le\min(\operatorname{rank}(A), \operatorname{rank}(B))$, you have $\operatorname{rank}(S_W^{-1}S_B)\le\operatorname{rank}(S_B)\le c-1$.
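
As a quick numerical check of this rank argument, here is an illustrative sketch (synthetic data; the choices of $n = 6$ variables, $c = 3$ classes, and 80 samples per class are arbitrary) that builds the scatter matrices and verifies the ranks directly:

```python
# Illustrative check: c = 3 classes in n = 6 dimensions (synthetic data, arbitrary choices).
import numpy as np

rng = np.random.default_rng(1)
n, c, per_class = 6, 3, 80
means = 3.0 * rng.normal(size=(c, n))          # one mean vector per class
X = np.vstack([rng.normal(loc=mu, size=(per_class, n)) for mu in means])
y = np.repeat(np.arange(c), per_class)         # class labels

m = X.mean(axis=0)                             # overall mean
Sw = np.zeros((n, n))
Sb = np.zeros((n, n))
for k in range(c):
    Xk = X[y == k]
    mk = Xk.mean(axis=0)
    Sw += (Xk - mk).T @ (Xk - mk)              # within-class scatter
    Sb += len(Xk) * np.outer(mk - m, mk - m)   # each term has rank at most 1

print(np.linalg.matrix_rank(Sb))                       # c - 1 = 2
print(np.linalg.matrix_rank(np.linalg.inv(Sw) @ Sb))   # also 2
```

Both ranks come out as $c-1=2$, so any further discriminant directions correspond to zero eigenvalues of $S_W^{-1}S_B$.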

seralouk