Linear Discriminant Analysis (LDA) and Fisher Linear Discriminant Analysis (FLDA) both project high-dimensional observations onto univariate classification scores, but they do so using different rationales and assumptions. For simplicity, I am considering the two-class case here.
LDA assumes that the observations are normally distributed around the class means with a shared (homoscedastic) covariance. The weight vector that projects the observations onto unidimensional classification scores is derived from the conditional probabilities of the observations under this model. The Wikipedia page on LDA specifies it as:
$$ \vec w = \Sigma^{-1} (\vec \mu_1 - \vec \mu_0) $$
FLDA defines a weight vector that projects the multivariate observations onto univariate classification scores such that the ratio of between-class variance to within-class variance is maximal. The same Wikipedia article specifies it as: $$ \vec w \propto (\Sigma_0+\Sigma_1)^{-1}(\vec \mu_1 - \vec \mu_0) $$
Immediately following the specification of the latter formula (the FLDA weight vector), the Wikipedia article states:
"When the assumptions of LDA are satisfied, the above equation is equivalent to LDA."
However, since $\Sigma=\frac{1}{2}(\Sigma_0+\Sigma_1)$ (with equal class sizes, the pooled covariance is the simple average of the within-class covariances), these two weight vectors always point in the same direction, regardless of whether the assumptions (normality, homoscedasticity) hold.
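A quick numerical sanity check of this claim (a sketch using NumPy; the class means, covariances, and sample sizes are arbitrary choices, deliberately heteroscedastic and with equal class sizes so that the pooled covariance is the plain average):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes with deliberately different covariances (heteroscedastic),
# equal sample sizes, so the pooled covariance is the simple average.
n = 500
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 2.0]], size=n)
X1 = rng.multivariate_normal([2, 1], [[3.0, -0.5], [-0.5, 0.5]], size=n)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S0 = np.cov(X0, rowvar=False)
S1 = np.cov(X1, rowvar=False)
S_pooled = 0.5 * (S0 + S1)  # pooled covariance for equal class sizes

# LDA weight vector: Sigma^{-1} (mu1 - mu0)
w_lda = np.linalg.solve(S_pooled, mu1 - mu0)
# FLDA weight vector: (Sigma0 + Sigma1)^{-1} (mu1 - mu0)
w_flda = np.linalg.solve(S0 + S1, mu1 - mu0)

# Cosine similarity between the two directions
cos = w_lda @ w_flda / (np.linalg.norm(w_lda) * np.linalg.norm(w_flda))
print(cos)  # -> 1.0 (up to floating point), despite heteroscedastic data
```

Since `S_pooled` is exactly half of `S0 + S1`, the two solves differ only by a factor of 2, so the cosine similarity is 1 even though the simulated classes violate homoscedasticity.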
Is the Wikipedia article wrong? Do LDA and FLDA always yield the same solution with respect to the weight vector's direction? Or am I missing some special case?