This is based on the example in Section 8.3.2 of Izenman's Modern Multivariate Statistical Analysis, using the Wisconsin breast cancer data. It is a binary classification problem: each observation is labeled either "malignant" or "benign."
Here is the R code I have been using, which generates the correct coefficients. The data are available at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data (if that link is down, try https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/code/datasets/wdbc/wdbc.data). Please note the formatting of the data in the code below; if you need clarification on this, let me know.
library(MASS)
library(dplyr)

# Read the raw file (the source data have no header row)
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
                 stringsAsFactors = FALSE,
                 header = FALSE)

# Each of the 10 base features appears three times: mean value (mv),
# standard error (sd), and extreme/"worst" value (ev)
features <- c("radius.", "texture.", "peri.", "area.", "smooth.",
              "comp.", "scav.", "ncav.", "symt.", "fracd.")
colnames(data) <- c("ID", "DIAGNOSIS",
                    paste0(features, "mv"),
                    paste0(features, "sd"),
                    paste0(features, "ev"))

# Log-transform the 30 feature columns
data[, !(colnames(data) %in% c("ID", "DIAGNOSIS"))] <-
  apply(X = data %>% dplyr::select(-ID, -DIAGNOSIS),
        FUN = log,
        MARGIN = 2)

# log(0) gives -Inf; arbitrarily replace -infinities with -500
data <- do.call(data.frame,
                lapply(data, function(x) replace(x, is.infinite(x), -500)))

# Drop the ID column and fit the LDA
data <- data %>% dplyr::select(-ID)
lda(DIAGNOSIS ~ ., data = data)
The output is:
Call:
lda(DIAGNOSIS ~ ., data = data)
Prior probabilities of groups:
B M
0.6274165 0.3725835
Group means:
radius.mv texture.mv peri.mv area.mv smooth.mv comp.mv scav.mv ncav.mv symt.mv fracd.mv radius.sd
B 2.485875 2.862456 4.345779 6.092424 -2.391093 -2.607534 -21.478838 -21.877138 -1.757494 -2.771984 -1.3278779
M 2.843529 3.057882 4.730647 6.819136 -2.281366 -1.998231 -1.944904 -2.510594 -1.655332 -2.776639 -0.6238094
texture.sd peri.sd area.sd smooth.sd comp.sd scav.sd ncav.sd symt.sd fracd.sd radius.ev texture.ev
B 0.09501073 0.6258005 2.973619 -5.012304 -4.062604 -22.059549 -22.733882 -3.934377 -5.795603 2.582422 3.131397
M 0.12148354 1.3337622 4.063965 -5.056042 -3.572136 -3.290033 -4.256156 -3.972701 -5.614785 3.031062 3.361176
peri.ev area.ev smooth.ev comp.ev scav.ev ncav.ev symt.ev fracd.ev
B 4.453511 6.280822 -2.092602 -1.827802 -20.2208063 -20.786509 -1.320487 -2.546305
M 4.930661 7.179920 -1.943383 -1.083191 -0.8842608 -1.740325 -1.153316 -2.416048
Coefficients of linear discriminants:
LD1
radius.mv -30.51574995
texture.mv -0.37242169
peri.mv 29.52376437
area.mv 0.94255639
smooth.mv 0.03451320
comp.mv -1.56939188
scav.mv 1.82685413
ncav.mv 0.25593131
symt.mv -1.18699860
fracd.mv -3.96712759
radius.sd -2.50553731
texture.sd -0.55183996
peri.sd 0.46892773
area.sd 3.10582705
smooth.sd 0.08061433
comp.sd -0.14182425
scav.sd -0.17481014
ncav.sd 0.53816835
symt.sd -0.52520119
fracd.sd -0.50005122
radius.ev 6.36294870
texture.ev 2.25899352
peri.ev -3.11083922
area.ev -2.08205924
smooth.ev 2.02715071
comp.ev 0.33358054
scav.ev -1.32315770
ncav.ev -1.11897286
symt.ev 2.88331787
fracd.ev 4.16723706
I am interested in replicating these coefficients using matrix multiplications and algebraic manipulations (i.e., "from scratch"). I assume these coefficients are correct (they are very close to what Izenman has in his text), so this is more a question about the implementation of the statistical theory than about the R code.
Here is my attempt at the theoretical derivation of these coefficients. My understanding is that they are given by the vector $$\hat{\mathbf{b}} \equiv \hat{\boldsymbol\Sigma}^{-1}_{XX}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2),$$ where $$\bar{\mathbf{x}}_i = n_i^{-1}\sum_{j=1}^{n_i}\mathbf{x}_{ij}, \quad i = 1, 2,$$ is the mean vector of class $i$ ($\{\mathbf{x}_{ij},\ j = 1, \dots, n_i\}$ are the samples placed in class $i$), and $$\hat{\boldsymbol\Sigma}_{XX} = n^{-1}\mathbf{S}_{XX}, \quad \text{with} \quad \mathbf{S}_{XX} = \mathbf{S}_{XX}^{(1)} + \mathbf{S}_{XX}^{(2)}, \quad \mathbf{S}_{XX}^{(i)} = \sum_{j=1}^{n_i}(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)(\mathbf{x}_{ij}-\bar{\mathbf{x}}_i)^{T}, \quad i = 1, 2.$$
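Since $\mathbf{S}_{XX}^{(i)} = (n_i - 1)\,\widehat{\operatorname{cov}}(\mathbf{X}_i)$, where $\widehat{\operatorname{cov}}$ is the usual unbiased sample covariance (what R's cov() computes), the unbiased pooled version I actually use in the code below (see the note at the end) can be written as $$\hat{\boldsymbol\Sigma}_{XX} = \frac{(n_1 - 1)\,\widehat{\operatorname{cov}}(\mathbf{X}_1) + (n_2 - 1)\,\widehat{\operatorname{cov}}(\mathbf{X}_2)}{n - 2}.$$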
# Split the (log-transformed) feature matrix by class
data_B <- data %>% filter(DIAGNOSIS == "B") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
data_M <- data %>% filter(DIAGNOSIS == "M") %>% dplyr::select(-DIAGNOSIS) %>% as.matrix()
# Class mean vectors
mean_B <- colMeans(data_B)
mean_M <- colMeans(data_M)
# Pooled (unbiased) covariance estimate; cov() already divides by n_i - 1
Sigma_XX <- (cov(data_B) * (nrow(data_B) - 1) + cov(data_M) * (nrow(data_M) - 1)) / (nrow(data) - 2)
# Candidate discriminant coefficients
solve(Sigma_XX) %*% (mean_M - mean_B)
With the output:
[,1]
radius.mv -116.0451850
texture.mv -1.4162439
peri.mv 112.2728658
area.mv 3.5843501
smooth.mv 0.1312467
comp.mv -5.9680778
scav.mv 6.9471544
ncav.mv 0.9732547
symt.mv -4.5139140
fracd.mv -15.0861786
radius.sd -9.5280483
texture.sd -2.0985350
peri.sd 1.7832367
area.sd 11.8108280
smooth.sd 0.3065599
comp.sd -0.5393287
scav.sd -0.6647674
ncav.sd 2.0465447
symt.sd -1.9972332
fracd.sd -1.9015930
radius.ev 24.1969986
texture.ev 8.5904925
peri.ev -11.8298883
area.ev -7.9176475
smooth.ev 7.7088415
comp.ev 1.2685389
scav.ev -5.0316994
ncav.ev -4.2552260
symt.ev 10.9646709
fracd.ev 15.8471542
What am I doing wrong?
Note: I divided by $n - 2 = 567$ rather than $n = 569$ to make $\hat{\boldsymbol\Sigma}_{XX}$ an unbiased estimator; this shouldn't make much of a difference. My coefficients are off from the MASS lda coefficients by a factor of around $3.8$, and it is not clear to me why.
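For concreteness, here is a quick check of that factor (a sketch, assuming the objects from the code above are still in the workspace; fit is a name introduced here just for the comparison):
fit <- lda(DIAGNOSIS ~ ., data = data)           # the MASS fit shown above
b_hat <- solve(Sigma_XX) %*% (mean_M - mean_B)   # my hand-computed vector
range(b_hat / fit$scaling[, "LD1"])              # elementwise ratio is essentially constant, about 3.8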