3

Does linear discriminant analysis always project the points onto a line? Most of the graphical illustrations of LDA that I see online use an example of 2-dimensional points which are projected onto a straight line y=mx+c. If the points were each a 10-dimensional vector, does LDA still project them onto a line?

Or would it project them onto a hyperplane with 9 dimensions or fewer?

Another question about projections: if I have a vector Y=[a,b,c,d], the projection of this vector onto a given line is the product of the line's direction vector V and the vector Y. This is equivalent to the dot product transpose(V).Y, which gives just one number (a scalar).

This seems to be how LDA works. So, if I may ask, does LDA map a full n-dimensional vector onto a scalar (a single number)?
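
For concreteness, here is a tiny NumPy sketch of what I mean (the numbers and the direction vector are made up):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])   # a 4-dimensional vector Y = [a, b, c, d]
v = np.array([1.0, 1.0, 0.0, 0.0])
v = v / np.linalg.norm(v)            # unit direction vector V of the line

print(v @ y)                         # transpose(V).Y -> one number: 2.1213...
```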

Apologies in advance for my newbie question.

Minaj

2 Answers

3

LDA seeks to reduce dimensionality while preserving as much of the class-discriminatory information as possible. Assume we have a set of $d$-dimensional observations $X$ belonging to $C$ different classes. The goal of LDA is to find a linear transformation (projection) matrix $L$ that converts the set of labelled observations $X$ into another coordinate system $Y$ such that the class separability is maximized. The dataset is transformed into the new subspace as:

\begin{equation} Y = XL \end{equation}

The columns of the matrix $L$ are a subset of the (in general non-orthogonal) eigenvectors of the square matrix $J$ associated with its $C-1$ largest eigenvalues, where $J$ is obtained as:

\begin{equation} J = S_{W}^{-1} S_B \end{equation}

where $S_W$ and $S_B$ are the within-class and between-class scatter matrices, respectively.

When it comes to dimensionality reduction in LDA, if some eigenvalues have a significantly larger magnitude than others, then we might be interested in keeping only those dimensions, since they contain more information about our data distribution. This becomes particularly interesting because $S_B$ is the sum of $C$ matrices of rank $\leq 1$, and the mean vectors are constrained by $\frac{1}{C}\sum_{i=1}^C \mu_i = \mu$ (Rao, 1948). Therefore, $S_B$ will be of rank $C-1$ or less, meaning that at most $C-1$ eigenvalues will be non-zero. For this reason, even if the dimensionality $k$ of the subspace $Y$ can be chosen arbitrarily, it does not make sense to keep more than $C-1$ dimensions, as they will not carry any useful information. In fact, in LDA the smallest $d - (C-1)$ eigenvalues are zero, and therefore the subspace $Y$ should have exactly $k = C-1$ dimensions.
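
To make this concrete, here is a minimal NumPy sketch of the construction above (the function name, toy data and variable names are just illustrative, not a reference implementation): it builds $S_W$ and $S_B$ from a labelled dataset, solves the eigenproblem of $J = S_{W}^{-1} S_B$, and keeps the leading $C-1$ eigenvectors as the columns of $L$.

```python
import numpy as np

def lda_projection(X, y, k=None):
    """Project X (n x d) onto the k <= C-1 leading eigenvectors of S_W^{-1} S_B."""
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)                         # overall mean

    S_W = np.zeros((d, d))                      # within-class scatter
    S_B = np.zeros((d, d))                      # between-class scatter
    for c in classes:
        X_c = X[y == c]
        mu_c = X_c.mean(axis=0)
        S_W += (X_c - mu_c).T @ (X_c - mu_c)
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += X_c.shape[0] * (diff @ diff.T)   # each class adds a rank-1 term

    # eigen-decomposition of J = S_W^{-1} S_B (J is not symmetric, so use eig)
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]

    if k is None:
        k = len(classes) - 1                    # at most C-1 useful dimensions
    L = eigvecs[:, :k]
    return X @ L, eigvals

# toy example: 3 classes in 10 dimensions -> projection has C-1 = 2 columns
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 10)) + np.repeat(rng.normal(size=(3, 10)), 30, axis=0)
y = np.repeat([0, 1, 2], 30)
Y, eigvals = lda_projection(X, y)
print(Y.shape)                 # (90, 2)
print(np.round(eigvals, 3))    # only the first C-1 = 2 are (numerically) non-zero
```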

Renthal
  • Is it correct that `if my original dataset has 2 classes, the output will have 1 dimension (2 - 1 = 1)`, and likewise, `if my original dataset has 5 classes, the output will have 4 dimensions`? – aan May 08 '20 at 16:28
  • If you choose `L` to contain only the non-zero eigenvectors (meant as those eigenvectors whose corresponding eigenvalue is non-zero), yes, correct. – Renthal May 11 '20 at 08:05
  • thanks. So I can choose any output dimension I want for LDA, but the problem is that the eigenvalues for dimensions `> Class - 1` will be `imaginary or zero` (eigenpairs), which is meaningless. Is that correct? – aan May 11 '20 at 10:37
  • 1
    Your definition is imprecise. Yes, you can choose any output dimension you want (provided it is smaller than or equal to $d$, with the notation above); however, the linear separability of a space with dimension $x$ such that $C-1 < x \leq d$ is not going to be any better than that of a space with dimension $y = C-1$. The eigenvectors with zero eigenvalue (not sure where the imaginary part comes into play?) are in the $J$ matrix, not in the final space $Y$. Hope it helps. – Renthal May 12 '20 at 11:52
  • thanks. Can you explain it in simple English? I couldn't understand it fully. But for better separation, is it best to have the output be `C-1`? Am I correct? – aan May 12 '20 at 19:13
  • For better separation it is best to have the output be $C - 1$. Having more is not harming, but not helping either. – Renthal May 14 '20 at 12:05
  • thanks. How can I explain, or what proof/reference can I cite, to say that `having more than C-1 dimensions will not help separation`? – aan May 14 '20 at 12:08
  • 1
    Because you have only $C-1$ non-zero eigenvalues in matrix $J$. – Renthal May 14 '20 at 12:27
  • thanks a lot. very helpful reply. – aan May 14 '20 at 12:33
  • Are you familiar with standardising data? https://stats.stackexchange.com/questions/466460/what-is-the-meaning-of-standardization-in-lda-fda – aan May 14 '20 at 13:49
  • Can I get the full reference for the Rao (1948) citation in your text above? I couldn't find the paper. – aan May 23 '20 at 09:18
  • 1
    The Utilization of Multiple Measurements in Problems of Biological Classification, C. Radhakrishna Rao, Journal of the Royal Statistical Society. Series B (Methodological), 1948, http://www.jstor.org/stable/2983775 – Renthal May 25 '20 at 07:53
  • thanks for the reference. Are you familiar with the small sample size problem in LDA? I would appreciate your advice here: https://stats.stackexchange.com/questions/468095/linear-discriminant-analysis-have-small-sample-size-problem-sss-is-it-nd – aan May 25 '20 at 08:39
2

LDA projects to (at most) $n_{classes} - 1$ dimensions, so binary (2-class) LDA reduces to 1D (= onto a line).
10 classes would lead to a 9D projection (as long as X is at least 9-dimensional, of course).

  does LDA map a full n-dimensional vector onto a scalar (a single number)?

Not always; see above.

For more details on what the projection step does, see e.g. https://stats.stackexchange.com/a/87509/4598

(Obviously, if you code your classes as numbers then the final class prediction will be a single number)
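
A quick sanity check with scikit-learn's `LinearDiscriminantAnalysis` (just a sketch on synthetic data, to illustrate the output shapes):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # 200 samples, 10 features

y2 = rng.integers(0, 2, size=200)        # 2 classes
Z2 = LinearDiscriminantAnalysis().fit_transform(X, y2)
print(Z2.shape)                          # (200, 1): each sample becomes one scalar

y10 = rng.integers(0, 10, size=200)      # 10 classes
Z10 = LinearDiscriminantAnalysis().fit_transform(X, y10)
print(Z10.shape)                         # (200, 9)
```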

cbeleites unhappy with SX