
Elements of Statistical Learning 4.3.2 elaborates on computation for Linear Discriminant Analysis. https://web.stanford.edu/~hastie/Papers/ESLII.pdf

The procedure is described as follows (a short numpy sketch of both steps appears after the list):

• Sphere the data with respect to the common covariance estimate $\hat{\Sigma}$: $X^{*} \leftarrow D^{-1/2}U^{T}X$, where $\hat{\Sigma} = UDU^{T}$. The common covariance estimate of $X^{*}$ will now be the identity.

• Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities $π_{k}$.
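For concreteness, here is a minimal numpy sketch of those two steps. The function name, the pooled-covariance estimator, and the data layout (rows of `X` are observations, `y` the class labels, priors estimated by class frequencies) are my own illustrative choices, not code from ESL:

```python
import numpy as np

def lda_sphere_and_classify(X, y, x_new):
    """Sketch of the two-step procedure: sphere with the pooled
    covariance, then pick the nearest centroid modulo log-priors."""
    classes = np.unique(y)
    n, p = X.shape
    # Pooled (common) covariance estimate Sigma_hat
    Sigma = sum((np.sum(y == k) - 1) * np.cov(X[y == k].T)
                for k in classes) / (n - len(classes))
    # Eigendecomposition Sigma_hat = U D U^T, then sphere: X* = D^{-1/2} U^T X
    d, U = np.linalg.eigh(Sigma)
    W = np.diag(d ** -0.5) @ U.T
    Xs, xs = X @ W.T, W @ x_new
    # Nearest class centroid in the sphered space, adjusted by log pi_k
    scores = [0.5 * np.sum((xs - Xs[y == k].mean(axis=0)) ** 2)
              - np.log(np.mean(y == k)) for k in classes]
    return classes[int(np.argmin(scores))]
```

Any other choice of priors $\pi_{k}$ could be plugged into the `np.log` term in place of the class frequencies.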

How is the procedure above derived from the expression for the discriminant functions, which, in the case of a covariance matrix common to all classes $k$, reads:

$\delta_{k}(x)=x^{T}\Sigma^{-1}\mu_{k}-\frac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k}+\log{\pi_{k}}$


1 Answer


Sphering (or whitening) the data $X$ means applying a transformation such that, in the new basis, the covariance of the sphered data $X^{*}$ is the identity matrix, i.e. $\operatorname{Cov}(X^{*})=I_{p}$.
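A quick numerical sanity check of that claim, with simulated Gaussian data (the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.8], [0.8, 1.0]], size=5000)

Sigma = np.cov(X.T)                     # empirical covariance of X
d, U = np.linalg.eigh(Sigma)            # Sigma = U D U^T
Xs = X @ (np.diag(d ** -0.5) @ U.T).T   # sphere: X* = D^{-1/2} U^T X

print(np.round(np.cov(Xs.T), 2))        # approximately the identity matrix
```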

We apply this transformation because it makes the computation significantly simpler. As mentioned in ESL 4.3.2, the ingredients of $\delta_{k}(x)$ are

$(x - \hat{\mu}_{k})^{T}\hat{\Sigma}_{k}^{-1}(x - \hat{\mu}_{k}) = [U^{T}_{k} (x - \hat{\mu}_{k})]^{T}D_{k}^{-1}[U^{T}_{k} (x - \hat{\mu}_{k})]$

$\log{|\hat{\Sigma}_{k}|}=\sum_{l}\log{d_{kl}}$

where $\hat{\Sigma}_{k}=U_{k}D_{k}U_{k}^{T}$ is the eigendecomposition of each $\hat{\Sigma}_{k}$, with $U_{k}$ a $p \times p$ orthonormal matrix and $D_{k}$ a diagonal matrix of positive eigenvalues $d_{kl}$.
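Both identities are easy to verify numerically; here is a sketch with a random symmetric positive-definite stand-in for $\hat{\Sigma}_{k}$ (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
A = rng.standard_normal((p, p))
Sigma_k = A @ A.T + p * np.eye(p)        # a random SPD stand-in for Sigma_k
x, mu_k = rng.standard_normal(p), rng.standard_normal(p)

d_k, U_k = np.linalg.eigh(Sigma_k)       # Sigma_k = U_k D_k U_k^T
z = U_k.T @ (x - mu_k)

# (x - mu_k)^T Sigma_k^{-1} (x - mu_k)  ==  z^T D_k^{-1} z
print(np.isclose((x - mu_k) @ np.linalg.solve(Sigma_k, x - mu_k),
                 z @ (z / d_k)))
# log |Sigma_k|  ==  sum_l log d_kl
print(np.isclose(np.linalg.slogdet(Sigma_k)[1], np.log(d_k).sum()))
```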

Let's write out the sphering of the data, expanding the quadratic form and using $D_{k}^{-1}=D_{k}^{-1/2}D_{k}^{-1/2}$:

$[U^{T}_{k} (x - \hat{\mu}_{k})]^{T}D_{k}^{-1}[U^{T}_{k} (x - \hat{\mu}_{k})]$

$=(U^{T}_{k}x)^{T}D_{k}^{-1/2}D_{k}^{-1/2}U^{T}_{k}x + (U^{T}_{k}\hat{\mu}_{k})^{T}D_{k}^{-1/2}D_{k}^{-1/2}U^{T}_{k}\hat{\mu}_{k} - (U^{T}_{k}\hat{\mu}_{k})^{T}D_{k}^{-1/2}D_{k}^{-1/2}U^{T}_{k}x - (U^{T}_{k}x)^{T}D_{k}^{-1/2}D_{k}^{-1/2}U^{T}_{k}\hat{\mu}_{k}$

We apply the suggested change of variables, $X^{*}\leftarrow D^{-1/2}U^{T}X$ and similarly $\hat{\mu}_{k}^{*}\leftarrow D^{-1/2}U^{T}\hat{\mu}_{k}$. (In LDA all classes share the common covariance $\hat{\Sigma}=UDU^{T}$, so the subscript $k$ on $U_{k}$ and $D_{k}$ can be dropped.) The previous expression collapses to

$(x^{*}-\hat{\mu}_{k}^{*})^{T}(x^{*}-\hat{\mu}_{k}^{*})=\|x^{*}-\hat{\mu}_{k}^{*}\|^{2}$

Maximizing the discriminant therefore amounts to minimizing this quantity, that is, finding the class $k$ whose centroid is closest to the data point in the new basis.

The last term in $\delta_{k}(x)$ is $\log{\pi_{k}}$, hence the remark about the effect of the class prior probabilities $\pi_{k}$.
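Putting everything together: $-\frac{1}{2}\|x^{*}-\hat{\mu}_{k}^{*}\|^{2}+\log{\pi_{k}}$ differs from $\delta_{k}(x)$ only by the class-independent term $-\frac{1}{2}x^{T}\Sigma^{-1}x$, so maximizing $\delta_{k}(x)$ and minimizing $\frac{1}{2}\|x^{*}-\hat{\mu}_{k}^{*}\|^{2}-\log{\pi_{k}}$ select the same class. A numerical check of that equivalence (a sketch, with arbitrary parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
p, K = 3, 4
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)           # common covariance Sigma_hat
mus = rng.standard_normal((K, p))         # class centroids mu_k
pis = np.full(K, 1.0 / K)                 # class priors pi_k
x = rng.standard_normal(p)

# delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k
Sinv = np.linalg.inv(Sigma)
delta = (mus @ Sinv @ x
         - 0.5 * np.einsum('kp,pq,kq->k', mus, Sinv, mus)
         + np.log(pis))

# (1/2) ||x* - mu_k*||^2 - log pi_k in the sphered coordinates
d, U = np.linalg.eigh(Sigma)
W = np.diag(d ** -0.5) @ U.T              # sphering matrix D^{-1/2} U^T
dist = 0.5 * np.sum((W @ x - mus @ W.T) ** 2, axis=1) - np.log(pis)

print(np.argmax(delta) == np.argmin(dist))   # True
```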
