I was doing some reading on classification, and I read this quote attributed to Fisher in 1936:

vectors in one class behave differently from vectors in the other classes; and the variance within the classes differs maximally from that between the classes.

Please explain what is specifically meant by this.

The Pointer
    This must be explained somewhere in one of the many posts on LDA, linear discriminant analysis, like https://stats.stackexchange.com/questions/82497/can-the-scaling-values-in-a-linear-discriminant-analysis-lda-be-used-to-plot-e – kjetil b halvorsen Nov 01 '20 at 21:03
    @kjetilbhalvorsen Ahh, yes, I see: "LDA is different for a very important reason - it creates new variables (LDs) by maximizing variance between groups." – The Pointer Nov 01 '20 at 22:34
  • @kjetilbhalvorsen Looking back at your comment (and mine), it isn't clear to me that it answers my question. I am fundamentally asking about what it means to say that "the variance within the classes differs maximally from that between the classes." – The Pointer Feb 01 '22 at 14:53
  • @ThePointer, can you provide a reference for the quote so that we can read it in context? – num_39 Feb 08 '22 at 05:24
  • @num_39 I wasn't given the full quote and/or a reference to it either. The only information I have is what I presented. – The Pointer Feb 08 '22 at 05:35
  • @The Pointer: I do not understand your last comment. This is just like calculating an F-ratio, and maximizing that? – kjetil b halvorsen Feb 08 '22 at 16:32

2 Answers


There are two concepts here:

  1. Variance within a class is the variance of the members of that class. For example, if class $C_1$ contains the vectors $v_1, v_2, \ldots, v_n$, with mean $\mu_{1} := \frac{1}{n}\sum_{i=1}^{n}v_i$, then the variance within class $C_1$ is $\frac{1}{n-1}\sum_{i=1}^{n}(v_i-\mu_{1})^2$.
  2. Variance between classes is the variance of the class means. For example, if the means of classes $C_1$, $C_2$, $\ldots$, $C_k$ are $\mu_1, \mu_2, \ldots, \mu_k$ respectively, then the variance between these classes is the variance of $\mu_1, \mu_2, \ldots, \mu_k$ (see the sketch after this list).
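
For a concrete illustration of both definitions, here is a minimal Python sketch (the data values are made up):

```python
import numpy as np

# Made-up 1-D data: three classes (values chosen for illustration only)
classes = {
    "C1": np.array([1.0, 1.2, 0.8, 1.1]),
    "C2": np.array([5.0, 5.3, 4.9, 4.8]),
    "C3": np.array([9.1, 8.8, 9.0, 9.2]),
}

# 1. Variance within each class: the variance of that class's members
within = {name: vals.var(ddof=1) for name, vals in classes.items()}

# 2. Variance between classes: the variance of the class means
means = np.array([vals.mean() for vals in classes.values()])
between = means.var(ddof=1)

print("within-class variances:", within)
print("between-class variance:", between)
```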

The latter clarifies the sentence you quoted. The point is that increasing the variance within a class does not necessarily affect the variance between classes, and vice versa. The following shows Fisher's strategy for discriminant analysis (from the book "Analysis of Multivariate and High-Dimensional Data" by Inge Koch):

Find the direction $e$ which minimises the within-class variance and maximises the between-class variance of $e^TX$, and work with these one-dimensional quantities. As in the previous analyses, the variability or, in this case, the difference in variability within and between classes drives the process.

In sum,

His key idea was to partition the data in such a way that the variability in each class is small and the variability between classes is large.
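
As a minimal sketch of that strategy for the two-class case (my own illustration with synthetic data, not from Koch's book): the direction maximising the ratio of between- to within-class variance of the projections $e^T x$ has the well-known closed form $e \propto S_W^{-1}(\mu_1 - \mu_2)$, where $S_W$ is the pooled within-class scatter matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic Gaussian classes in 2-D (illustrative data only)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(50, 2))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Pooled within-class scatter matrix S_W
S_W = np.cov(X1, rowvar=False) * (len(X1) - 1) + np.cov(X2, rowvar=False) * (len(X2) - 1)

# Fisher's direction: e proportional to S_W^{-1} (mu1 - mu2)
e = np.linalg.solve(S_W, mu1 - mu2)
e /= np.linalg.norm(e)

# Project onto e: the classes separate while staying tight around their means
proj1, proj2 = X1 @ e, X2 @ e
print("projected means:", proj1.mean(), proj2.mean())
print("within-class variances of projections:", proj1.var(ddof=1), proj2.var(ddof=1))
```

Projecting onto $e$ collapses everything to one dimension, which is exactly the "work with these one-dimensional quantities" part of the quote.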

OmG
  • Your variance between classes definition sounds convoluted, and I don't think you sufficiently address the meaning of the sentence as a whole to clarify what it means. – The Pointer Nov 01 '20 at 19:57
  • @ThePointer please see the update. – OmG Nov 01 '20 at 20:01
  • You've provided the definitions of variance within classes and variance between classes, but this still doesn't explain what is meant by "the variance within the classes differs maximally from that between the classes." – The Pointer Nov 01 '20 at 20:03
  • There's nothing here about "maximally differing". – The Pointer Nov 01 '20 at 20:16

First, wording:

  1. A class is just a group or collection of things. In statistics and ML, these things are generally vectors.
  2. By "behave", he simply means that these vectors have different values.

Second, let's assume we have a group of vectors:

  1. The variance answers the following question: "How different is each vector from the average vector?"
  2. If all vectors are identical, then the variance in the group is 0.
  3. The more the vectors differ from one another, the larger the variance.

Third, intuition.

Let's say you want to split that group of vectors into $N$ classes.

  1. You want the vectors within each class to be as similar as they can be. Thus, your first criterion is to minimize the variance within classes.
  2. Now, this alone is not enough; you also want the variance between classes to be maximised: you want that gap to be as big as possible.
  3. "Differs maximally" means maximising the difference between the between-class variance and the within-class variance (see the sketch after this list).

Fourth, the math.

Consider a group of vectors $x_j$, for $j$ from $1$ to $n$. Let $X$ be the matrix with the $x_j$ as its columns: $X^j = x_j$. Assume we have $K_0$ classes, so that each $j$ belongs to exactly one class. The function $K(\cdot)$ gives the class of each $j$.

First we need to compute the class means $\mu_k = \text{mean}\left(X^{j \mid K(j) = k}\right)$.

Then let $U$ be the matrix with the $\mu_k$ as its columns: $U^k = \mu_k$.

Note that "$j \mid K(j) = k$" denotes the set of indices $j$ that satisfy $K(j) = k$.

Then the total variance can always be expressed as $$\text{Var}(X) = \underbrace{\text{Var}(U)}_{\text{between}} + \underbrace{\sum_k \frac{n_k}{n}\,\text{Var}\!\left(X^{j \mid K(j)=k}\right)}_{\text{within}},$$ where $n_k$ is the size of class $k$, all variances use the population convention (divide by the number of points), and $\text{Var}(U)$ weights each mean $\mu_k$ by $n_k/n$. This is the law of total variance.

The total variance can always be decomposed as betweenVariance + withinVariance. This is true no matter how you have built the classes; in other words, it is true for any $K(\cdot)$. But Fisher was looking to construct special classes, namely the ones maximising $$\text{Var}(U) - \sum_k \frac{n_k}{n}\,\text{Var}\!\left(X^{j \mid K(j)=k}\right).$$ In layman's terms: the classes for which $\text{Var}(U)$ differs maximally from $\sum_k \frac{n_k}{n}\,\text{Var}\!\left(X^{j \mid K(j)=k}\right)$.
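
A minimal numerical check of this identity (my own sketch with made-up data): it holds for any labelling $K$, not just a good one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: n points with an arbitrary class label K(j) for each point
x = rng.normal(size=30)
K = rng.integers(0, 3, size=30)      # any labelling works; the identity always holds

ks = np.unique(K)
p = np.array([(K == k).mean() for k in ks])            # class proportions n_k / n
mu = np.array([x[K == k].mean() for k in ks])          # class means (the columns of U)

between = np.sum(p * (mu - x.mean()) ** 2)             # Var(U), means weighted by n_k / n
within = sum(pk * x[K == k].var(ddof=0) for pk, k in zip(p, ks))

print(np.isclose(x.var(ddof=0), between + within))     # True: Var(X) = between + within
```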

Romain