This is based on the example in Section 8.3.2 of Izenman's Modern Multivariate Statistical Analysis, using the Wisconsin breast cancer data. It is a binary classification problem: each observation is labeled either "malignant" or "benign."
Here is the R code I have been using, which generates the correct coefficients. The data are available at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data (if that link is down, try https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/code/datasets/wdbc/wdbc.data). Please note the formatting of the data in the code below; if you need clarification on this, let me know.
library(MASS)
library(dplyr)

# Read the raw file (the source data have no header row)
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
                 stringsAsFactors = FALSE,
                 header = FALSE)

# Each of the 10 base features appears three times: mean value (mv),
# standard error (sd), and extreme/"worst" value (ev)
features <- c("radius.", "texture.", "peri.", "area.", "smooth.",
              "comp.", "scav.", "ncav.", "symt.", "fracd.")
colnames(data) <- c("ID", "DIAGNOSIS",
                    paste0(features, "mv"),
                    paste0(features, "sd"),
                    paste0(features, "ev"))

# Log-transform the 30 feature columns
data[, !(colnames(data) %in% c("ID", "DIAGNOSIS"))] <-
  apply(X = data %>% dplyr::select(-ID, -DIAGNOSIS),
        FUN = log,
        MARGIN = 2)

# log(0) gives -Inf; arbitrarily replace -infinities with -500
data <- do.call(data.frame,
                lapply(data, function(x) replace(x, is.infinite(x), -500)))

# Drop the ID column and fit the LDA
data <- data %>% dplyr::select(-ID)
lda(DIAGNOSIS ~ ., data = data)
The output is:
Call:
lda(DIAGNOSIS ~ ., data = data)
Prior probabilities of groups:
B M
0.6274165 0.3725835
Group means:
radius.mv texture.mv peri.mv area.mv smooth.mv comp.mv scav.mv ncav.mv symt.mv fracd.mv radius.sd
B 2.485875 2.862456 4.345779 6.092424 -2.391093 -2.607534 -21.478838 -21.877138 -1.757494 -2.771984 -1.3278779
M 2.843529 3.057882 4.730647 6.819136 -2.281366 -1.998231 -1.944904 -2.510594 -1.655332 -2.776639 -0.6238094
texture.sd peri.sd area.sd smooth.sd comp.sd scav.sd ncav.sd symt.sd fracd.sd radius.ev texture.ev
B 0.09501073 0.6258005 2.973619 -5.012304 -4.062604 -22.059549 -22.733882 -3.934377 -5.795603 2.582422 3.131397
M 0.12148354 1.3337622 4.063965 -5.056042 -3.572136 -3.290033 -4.256156 -3.972701 -5.614785 3.031062 3.361176
peri.ev area.ev smooth.ev comp.ev scav.ev ncav.ev symt.ev fracd.ev
B 4.453511 6.280822 -2.092602 -1.827802 -20.2208063 -20.786509 -1.320487 -2.546305
M 4.930661 7.179920 -1.943383 -1.083191 -0.8842608 -1.740325 -1.153316 -2.416048
Coefficients of linear discriminants:
LD1
radius.mv -30.51574995
texture.mv -0.37242169
peri.mv 29.52376437
area.mv 0.94255639
smooth.mv 0.03451320
comp.mv -1.56939188
scav.mv 1.82685413
ncav.mv 0.25593131
symt.mv -1.18699860
fracd.mv -3.96712759
radius.sd -2.50553731
texture.sd -0.55183996
peri.sd 0.46892773
area.sd 3.10582705
smooth.sd 0.08061433
comp.sd -0.14182425
scav.sd -0.17481014
ncav.sd 0.53816835
symt.sd -0.52520119
fracd.sd -0.50005122
radius.ev 6.36294870
texture.ev 2.25899352
peri.ev -3.11083922
area.ev -2.08205924
smooth.ev 2.02715071
comp.ev 0.33358054
scav.ev -1.32315770
ncav.ev -1.11897286
symt.ev 2.88331787
fracd.ev 4.16723706
I am interested in replicating these coefficients using matrix multiplications and algebraic manipulations (i.e., "from scratch"). I assume these coefficients are correct (they are very close to what Izenman has in his text), so this is more a question about the implementation of the statistical theory than about the R code.
Here is my attempt at the theoretical derivation of these coefficients. My understanding is that they are given by the vector $$\hat{\mathbf{b}} \equiv \hat{\boldsymbol\Sigma}^{-1}_{XX}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2),$$ where $$\bar{\mathbf{x}}_i = n_i^{-1}\sum_{j=1}^{n_i}\mathbf{x}_{ij}, \quad i = 1, 2,$$ is the mean vector of class $i$ ($\{\mathbf{x}_{ij},\ j = 1, \dots, n_i\}$ are the samples placed in class $i$), and $$\hat{\boldsymbol\Sigma}_{XX} = n^{-1}\mathbf{S}_{XX}, \quad \text{with} \quad \mathbf{S}_{XX} = \mathbf{S}_{XX}^{(1)} + \mathbf{S}_{XX}^{(2)}, \quad \mathbf{S}_{XX}^{(i)} = \sum_{j=1}^{n_i}(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)^{T}, \quad i = 1, 2.$$
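Since $\mathbf{S}_{XX}^{(i)} = (n_i - 1)\,\widehat{\operatorname{cov}}(\mathbf{X}_i)$, where $\widehat{\operatorname{cov}}$ is the usual unbiased sample covariance (what R's cov() computes), the unbiased pooled version I actually use in the code below (see the note at the end) can be written as $$\hat{\boldsymbol\Sigma}_{XX} = \frac{(n_1 - 1)\,\widehat{\operatorname{cov}}(\mathbf{X}_1) + (n_2 - 1)\,\widehat{\operatorname{cov}}(\mathbf{X}_2)}{n - 2}.$$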
# Split the (log-transformed) feature matrix by class
data_B <- data %>% filter(DIAGNOSIS == "B") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
data_M <- data %>% filter(DIAGNOSIS == "M") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
# Class mean vectors
mean_B <- colMeans(data_B)
mean_M <- colMeans(data_M)
# Pooled (unbiased) covariance estimate; cov() already divides by n_i - 1
Sigma_XX <- (cov(data_B) * (nrow(data_B) - 1) + cov(data_M) * (nrow(data_M) - 1)) / (nrow(data) - 2)
# Candidate discriminant coefficients
solve(Sigma_XX) %*% (mean_M - mean_B)
With the output:
[,1]
radius.mv -116.0451850
texture.mv -1.4162439
peri.mv 112.2728658
area.mv 3.5843501
smooth.mv 0.1312467
comp.mv -5.9680778
scav.mv 6.9471544
ncav.mv 0.9732547
symt.mv -4.5139140
fracd.mv -15.0861786
radius.sd -9.5280483
texture.sd -2.0985350
peri.sd 1.7832367
area.sd 11.8108280
smooth.sd 0.3065599
comp.sd -0.5393287
scav.sd -0.6647674
ncav.sd 2.0465447
symt.sd -1.9972332
fracd.sd -1.9015930
radius.ev 24.1969986
texture.ev 8.5904925
peri.ev -11.8298883
area.ev -7.9176475
smooth.ev 7.7088415
comp.ev 1.2685389
scav.ev -5.0316994
ncav.ev -4.2552260
symt.ev 10.9646709
fracd.ev 15.8471542
What am I doing wrong?
Note: I divided by $n - 2 = 567$ rather than $n = 569$ to make $\hat{\boldsymbol\Sigma}_{XX}$ an unbiased estimator; this shouldn't make much of a difference. My coefficients are off from the MASS lda coefficients by a factor of around $3.8$, and it is not clear to me why.
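For concreteness, here is a quick check of that factor (a sketch, assuming the objects from the code above are still in the workspace; fit is a name introduced here just for the comparison):
fit <- lda(DIAGNOSIS ~ ., data = data)           # the MASS fit shown above
b_hat <- solve(Sigma_XX) %*% (mean_M - mean_B)   # my hand-computed vector
range(b_hat / fit$scaling[, "LD1"])              # elementwise ratio is essentially constant, about 3.8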