First, wording:
- A class is just a group or collection of things. In statistics and ML, these things are generally vectors.
- By "behave", he simply means that these vectors have different values.
Second, let's assume we have a group of vectors:
- The variance answers the following question: "How different is each vector from the average vector?"
- If all vectors behave identically (they are all equal), then the variance of the group is 0.
- The more differently they behave, the larger the variance, as the sketch below illustrates.
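To make these bullet points concrete, here is a minimal numerical sketch (numpy, with made-up vectors; I take the variance of a group of vectors to mean the average squared distance to the average vector, one common scalar convention, namely the trace of the covariance matrix):

```python
import numpy as np

# Three 2-D vectors that behave similarly (values close to each other)
similar = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9]])
# Three 2-D vectors that behave differently (very different values)
different = np.array([[1.0, 2.0], [5.0, -3.0], [-2.0, 7.0]])

def group_variance(vectors):
    """Average squared distance of each vector to the average vector."""
    mean = vectors.mean(axis=0)
    return np.mean(np.sum((vectors - mean) ** 2, axis=1))

print(group_variance(similar))                      # small
print(group_variance(different))                    # large
print(group_variance(np.tile([1.0, 2.0], (3, 1))))  # identical vectors: exactly 0.0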
Third, intuition.
Let's say you want to split that group of vectors into $N$ classes.
- You want to make sure that vectors are as similar as they can be in each class. Thus, your first criterion will be to say that you want to minimise the variance within classes.
- Now, this is actually not enough; what you also want is that the variance between classes is maximised – you want that gap as big as possible.
- "Differs maximally" means maximising the difference between the between-class variance and the within-class variance (see the sketch after this list).
Fourth, the math.
Consider a group of vectors $x_j$, for $j$ from $1$ to $n$. Let $X$ be the matrix with $x_j$ as its columns: $X^j = x_j$.
Assume we have $K_0$ classes, so that each $j$ belongs to exactly one class. The function $K()$ gives the class of each $j$.
First we need to compute the class means $\mu_k = \text{Mean}(X^{j \mid K(j) = k})$, i.e. the average of the vectors in class $k$.
Then let $U$ be the matrix with columns $\mu_k$: $U^k = \mu_k$.
Note that "$j \mid K(j) = k$" denotes the set of indices $j$ that satisfy $K(j) = k$.
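In code, these objects look roughly as follows (a sketch: the data is made up, and the function $K()$ is represented as an array of labels):

```python
import numpy as np

# Columns of X are the vectors x_j (five made-up 2-D vectors, n = 5)
X = np.array([[1.0, 1.2, 0.9, 5.0, 5.2],
              [2.0, 2.1, 1.8, 7.0, 6.9]])
# K[j] plays the role of K(): the class of column j
K = np.array([0, 0, 0, 1, 1])

# mu_k = Mean(X^{j | K(j) = k}): average the columns belonging to class k
mus = [X[:, K == k].mean(axis=1) for k in np.unique(K)]
# U is the matrix whose k-th column is mu_k
U = np.column_stack(mus)
print(U)  # one column per class: the class means
```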
Then the total variance can always be expressed as $$\text{Var}(X) = \text{Var}(U) + \sum_k \frac{n_k}{n}\,\text{Var}(X^{j \mid K(j)=k}),$$ where $n_k$ is the number of vectors in class $k$ and $\text{Var}(U)$ likewise weights each $\mu_k$ by $n_k/n$ (this is the law of total variance).
The total variance can always be decomposed as betweenVariance + withinVariance. Now, this is true no matter how you have built the classes; in other words, it is true for any $K()$. But Fisher was looking to construct special classes, namely the ones maximising: $$\text{Var}(U) - \sum_k \frac{n_k}{n}\,\text{Var}(X^{j \mid K(j)=k}).$$ In layman's terms: the classes for which $\text{Var}(U)$ differs maximally from $\sum_k \frac{n_k}{n}\,\text{Var}(X^{j \mid K(j)=k})$.
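To check the decomposition numerically, here is a short sketch (the data, the labels array standing in for $K()$, and the helper `var` are all illustrative assumptions; `var` implements the weighted scalar variance used above):

```python
import numpy as np

def var(M, w=None):
    """Scalar variance of the columns of M: (weighted) average squared
    distance of each column to the (weighted) mean column."""
    w = np.full(M.shape[1], 1.0 / M.shape[1]) if w is None else w
    mean = M @ w
    return float(w @ np.sum((M - mean[:, None]) ** 2, axis=0))

X = np.array([[1.0, 1.2, 0.9, 5.0, 5.2],
              [2.0, 2.1, 1.8, 7.0, 6.9]])
K = np.array([0, 0, 0, 1, 1])
classes = np.unique(K)
p = np.array([np.mean(K == k) for k in classes])  # class proportions n_k / n

U = np.column_stack([X[:, K == k].mean(axis=1) for k in classes])
between = var(U, w=p)                             # Var(U)
within = sum(pk * var(X[:, K == k]) for pk, k in zip(p, classes))

print(var(X), between + within)  # equal: total = between + within
print(between - within)          # Fisher's objective: bigger is better
```

Rerunning the last two lines with a shuffled `K` leaves the first print unchanged (the decomposition holds for any $K()$) but shrinks the objective, which is exactly the sense in which Fisher's classes "differ maximally".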