
Short story:

I have a classification pipeline consisting of some feature extractors and an LDA classifier. Evaluating the pipeline with cross-validation gives a decent test accuracy of 94% (for 19 classes). However, when I evaluate the test accuracy for different amounts of training data I get a strange result: the test accuracy generally increases with more training data (as expected), but for a certain amount of training data the accuracy collapses (see plot):

[Plot: test accuracy vs. number of training samples per class]

The plot shows the test accuracy of the pipeline (y-axis) versus the number of samples per class used for training (x-axis). Each of the 19 classes has 50 samples; testing was done on all samples that were not used for training.
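
For reference, here is a minimal sketch of the evaluation loop (synthetic Gaussian data stands in for my real feature vectors and the feature extractors are omitted, so this will not reproduce the exact numbers):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.RandomState(0)
n_classes, n_per_class, n_features = 19, 50, 176
centroids = rng.randn(n_classes, n_features)
y = np.repeat(np.arange(n_classes), n_per_class)
X = centroids[y] + rng.randn(len(y), n_features)        # class centroid + unit noise

for n_train in range(2, 40):                            # training samples per class
    splitter = StratifiedShuffleSplit(n_splits=40,      # 40 random permutations
                                      train_size=n_train * n_classes,
                                      random_state=0)
    accs = []
    for train_idx, test_idx in splitter.split(X, y):
        lda = LinearDiscriminantAnalysis(solver='svd')  # default solver, no shrinkage
        lda.fit(X[train_idx], y[train_idx])
        accs.append(lda.score(X[test_idx], y[test_idx]))
    print(n_train, np.mean(accs), np.std(accs))
```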

The negative peak at 10-11 makes absolutely no sense to me. Can anyone explain this?

The long story:

To make sure this is not a random effect, I ran the test 40 times with different random permutations of the training samples; the variance shown in the plot comes from these runs. In addition, I repeated the test using only subsets of the features. The results look the same, only the position of the negative peak differs, see the following plots:

[Plot: same experiment using only 110 of the 176 features]

[Plot: same experiment using the other 66 features]

I noticed that the position of the peak changes with the number of features, so I ran the test again for 11 different feature-set sizes. This results in the following plot: [Heat map: x-axis as above; y-axis is the size of the selected feature set divided by 16, so e.g. 5.0 means 80 features; the color encodes the test accuracy]

This suggests that there is a direct relation between the position of the peak and the number of features.

Some further points:

  • I have quite a lot of features (176) for only 19 classes. This is certainly not optimal, but in my opinion it does not explain the peak.
  • Some of the variables are collinear (actually, a lot of them are). Again, not ideal, but it does not explain the peak.

Edit: Here is a plot of the training accuracy: [Plot: training accuracy]

Edit2:

Here is some data. It was calculated using the following settings:

  • K = 19 (number of classes)
  • p = 192 (number of features)
  • n = 2 to 39 (number of training samples PER CLASS; the train/test split is a stratified random split)

The outputs are:


n   test data   train data  test acc    train acc   rank
 2  (912, 192)  (38, 192)        0.626       0.947  19
 3  (893, 192)  (57, 192)        0.775       1.000  38
 4  (874, 192)  (76, 192)        0.783       1.000  57
 5  (855, 192)  (95, 192)        0.752       1.000  76
 6  (836, 192)  (114, 192)       0.811       1.000  95
 7  (817, 192)  (133, 192)       0.760       1.000  114
 8  (798, 192)  (152, 192)       0.786       1.000  133
 9  (779, 192)  (171, 192)       0.730       1.000  152
10  (760, 192)  (190, 192)       0.532       1.000  171
11  (741, 192)  (209, 192)       0.702       1.000  176
12  (722, 192)  (228, 192)       0.727       1.000  176
13  (703, 192)  (247, 192)       0.856       1.000  176
14  (684, 192)  (266, 192)       0.857       1.000  176
15  (665, 192)  (285, 192)       0.887       1.000  176
16  (646, 192)  (304, 192)       0.881       1.000  176
17  (627, 192)  (323, 192)       0.896       1.000  176
18  (608, 192)  (342, 192)       0.913       1.000  176
19  (589, 192)  (361, 192)       0.900       1.000  176
20  (570, 192)  (380, 192)       0.916       1.000  176
21  (551, 192)  (399, 192)       0.907       1.000  176
22  (532, 192)  (418, 192)       0.929       0.995  176
23  (513, 192)  (437, 192)       0.916       0.995  176
24  (494, 192)  (456, 192)       0.909       0.991  176
25  (475, 192)  (475, 192)       0.947       0.992  176
26  (456, 192)  (494, 192)       0.928       0.992  176
27  (437, 192)  (513, 192)       0.927       0.992  176
28  (418, 192)  (532, 192)       0.940       0.992  176
29  (399, 192)  (551, 192)       0.952       0.991  176
30  (380, 192)  (570, 192)       0.934       0.989  176
31  (361, 192)  (589, 192)       0.922       0.992  176
32  (342, 192)  (608, 192)       0.930       0.990  176
33  (323, 192)  (627, 192)       0.929       0.989  176
34  (304, 192)  (646, 192)       0.947       0.986  176
35  (285, 192)  (665, 192)       0.940       0.986  176
36  (266, 192)  (684, 192)       0.940       0.993  176
37  (247, 192)  (703, 192)       0.935       0.989  176
38  (228, 192)  (722, 192)       0.939       0.985  176
39  (209, 192)  (741, 192)       0.923       0.988  176
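
The rank column is the rank of the training data after centering each class on its own mean (the within-class data discussed in the comments and answer below); roughly, it can be computed as in the following sketch (variable names are illustrative, this is not my actual pipeline code):

```python
import numpy as np

def within_class_rank(X_train, y_train):
    """Rank of the training matrix after centering each class on its own mean."""
    Xw = X_train.astype(float)
    for c in np.unique(y_train):
        Xw[y_train == c] -= Xw[y_train == c].mean(axis=0)
    return np.linalg.matrix_rank(Xw)

# e.g. on random full-rank data with n = 11 samples per class and p = 192:
rng = np.random.RandomState(0)
y_tr = np.repeat(np.arange(19), 11)
X_tr = rng.randn(len(y_tr), 192)
print(within_class_rank(X_tr, y_tr))   # 190 = (11 - 1) * 19; my real data caps at 176 (collinear features)
```
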
Johannes
  • 176 variables! How can your LDA work with n*19<=176 cases, n being the number of cases per class? Maybe you are doing some regularized version of the analysis; then the peak could be somehow connected with that... – ttnphns Nov 30 '17 at 19:22
  • I am using the sklearn implementation with the svd solver and no shrinkage. I agree that 176 variables is a lot, but then why does it work with even fewer training samples? (In the first plot >80% accuracy is reached with only 5 samples per class.) – Johannes Nov 30 '17 at 21:34
  • Standard LDA cannot be solved with fewer cases than variables, because of singularity; so I suppose your program is doing some trick to bypass the problem, and your strange finding may be linked with that, as I've suggested. Anyway, I wouldn't do an analysis with so few cases relative to the dimensionality. – ttnphns Nov 30 '17 at 22:06
  • The documentation does not say anything about it. But I will take a look at the source code tomorrow. Thanks. If anyone else knows the sklearn implementation, feel free to step in. – Johannes Nov 30 '17 at 22:38
  • There must be some mechanism in vanilla sklearn LDA that automatically deals with collinear features, as ttnphns laid out. The link with your strange findings is that when you increase the training size, you are actually not improving the fit but making it worse: your procedure is not only increasing the training size but also the number of selected features (adding less discriminating features that increase over-fitting and make the predictions worse). Only once the total training set surpasses the number of features will an increase in the training set improve the fit. – Sextus Empiricus Dec 04 '17 at 10:18
  • Note that your minimum occurs exactly when the sample size in the training data set equals your number of features. – Sextus Empiricus Dec 04 '17 at 10:21
  • @MartijnWeterings Is it really "exactly"? The minimum seems to be at x=11 points per class, which for 19 classes makes training sample size n=11*19=209 which is more than 176 features. – amoeba Dec 05 '17 at 09:17
  • +1 Very interesting phenomenon. I have an answer explaining how SVD solver in scikit works: https://stats.stackexchange.com/a/110697/28666. I looked in the code now https://github.com/scikit-learn/scikit-learn/blob/a95203b/sklearn/lda.py and it seems that when within-class covariance needs to be inverted, they only use non-zero eigenvalues for it (cc @ttnphns). I have no intuition about how this can affect the LDA results... – amoeba Dec 05 '17 at 09:21
  • @ttnphns I am not 100% sure but I *think* I might have figured it out. See my answer. – amoeba Dec 05 '17 at 10:35
  • @MartijnWeterings I *think* I might have figured it out. See my answer. Your comment about the approximate position of the minimum helped me understand what is going on. – amoeba Dec 05 '17 at 10:36

1 Answer


You discovered an interesting phenomenon.

LDA computations rely on inverting the within-class scatter matrix $\mathbf S_W$. Usually the LDA solution is presented as an eigendecomposition of $\mathbf S_W^{-1}\mathbf S_B$, but scikit-learn never explicitly computes the scatter matrices and instead uses SVDs of data matrices to compute the same thing. This is similar to how PCA is often computed directly via the SVD of $\mathbf X$ without ever forming the covariance matrix. What scikit-learn does is compute the SVD of the between-class data $\mathbf X_B$ transformed with $\mathbf S_W^{-1/2}$ ("whitened with respect to the within-class covariance"). And to compute $\mathbf S_W^{-1/2}$, it does an SVD of the within-class data $\mathbf X_W$.

In the situation with fewer training samples than features, the covariance matrix of $\mathbf X_W$ is not full rank, has some zero eigenvalues, and cannot be inverted. What scikit-learn does in this case is simply use only the non-zero singular values for the inversion (see https://github.com/scikit-learn/scikit-learn/blob/a95203b/sklearn/lda.py, linked in the comments above).

In other words, they implicitly do PCA of the within-class data, keep only non-zero PCs, and then do LDA on that.
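
For concreteness, here is a rough sketch of that mechanism. This is not the scikit-learn source (which additionally standardizes the features and handles priors and tolerances); it just illustrates the two SVD steps described above:

```python
import numpy as np

def lda_svd_sketch(X, y, tol=1e-4):
    classes, counts = np.unique(y, return_counts=True)
    means = np.array([X[y == c].mean(axis=0) for c in classes])

    # Within-class step: SVD of the class-mean-centered data X_W.
    Xw = X - means[np.searchsorted(classes, y)]
    _, S, Vt = np.linalg.svd(Xw / np.sqrt(len(y) - len(classes)), full_matrices=False)
    keep = S > tol                                    # implicit PCA: drop zero singular values
    whiten = Vt[keep].T / S[keep]                     # acts like S_W^{-1/2}, restricted to non-zero directions

    # Between-class step: SVD of the whitened (count-weighted) class means X_B.
    Xb = np.sqrt(counts)[:, None] * (means - X.mean(axis=0))
    _, _, Vt_b = np.linalg.svd(Xb @ whiten, full_matrices=False)
    return whiten @ Vt_b.T[:, :len(classes) - 1]      # discriminant directions, shape (p, K-1)
```

Projections onto the discriminant axes are then obtained by multiplying the (centered) data with the returned matrix.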

The question is now, how should we expect this to affect the overfitting? Let's consider the same setting as in your question (but backwards), when the total sample size $N$ is decreasing starting from $N\gg p$ all the way to $N \ll p$. Let $n$ be the sample size per class, so that $N=nK$ where $K$ is the number of classes.

  • In the limit of large sample sizes the PCA step has no effect (all PCs are kept), overfitting is reduced to zero, and the out-of-sample (e.g. cross-validated) performance should be the best.

  • For $N\approx p$, the covariance matrix is still full rank (so the PCA step has no effect), but the smallest eigenvalues are very noisy and LDA will badly overfit. Performance can drop almost to zero.

  • For $N<p$, the PCA step becomes crucial. Only $N-K$ PCs have non-zero variance, so the dimensionality gets reduced to $N-K$. What happens now depends on whether some of the leading PCs have good discriminatory power. They do not have to, but it is often the case that they do.

    • If so, then only using a few leading PCs should work reasonably well. PCA serves as a regularization step and improves the performance.

      Of course no explicit dimensionality reduction is performed by the PCA here, because all available (non-zero) components are kept. So it is not a priori clear that it would improve the performance, but as we see it does, at least in this case.

    • However, if $n$ is so small that some of the important PCs cannot be estimated and are left out, then the performance should decrease again.

I don't think I have seen this discussed anywhere in the literature, but this is my understanding of this curious curve:

[Plot: the test-accuracy curve from the question]

Note that this is a bit of an artifact of how scikit-learn (with the svd solver) deals with the $N<p$ situation.

Update: We can predict the position of the minimum as follows. The within-class covariance matrix of each class has rank at most $n-1$, so the pooled within-class covariance matrix has rank at most $(n-1)K$. The minimum should occur when it becomes full rank (and the PCA step stops having any effect), i.e. for the smallest $n$ such that $(n-1)K>p$: $$n_\mathrm{min} = \Big\lceil\frac{p+K}{K}\Big\rceil = \Big\lceil\frac{p}{K}\Big\rceil+1.$$ This seems to fit all your figures perfectly.
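
Plugging in the question's numbers (a trivial check, but it makes the prediction concrete):

```python
import math

p, K = 176, 19                           # features and classes from the question
n_min = math.ceil(p / K) + 1             # smallest n with (n - 1) * K > p (p is not a multiple of K)
print(n_min)                             # -> 11
print((n_min - 2) * K, (n_min - 1) * K)  # -> 171 190: rank-deficient at n = 10, full rank possible at n = 11
```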

amoeba
  • Amoeba, you have identified the implicit reduction of the features. But as you commented, I had wrongly identified the local minimum at the point n=p; it occurs a bit beyond that point. How can we explain the noise being larger when n is slightly larger than p, roughly when n ~ 19/16 p? (I imagine that possibly the variable train_size_per_class may have been miscalculated due to a typo in the code, but if that is not the case then this effect is something less trivial.) – Sextus Empiricus Dec 05 '17 at 11:13
  • @MartijnWeterings I think the sample size has to be computed as $n-1$ instead of $n$ in each class because of within-class centering (covariance matrix will at most have rank $n-1$), meaning the pooled within-class covariance matrix rank with $n=11$ per class will be 10*19 = 190. That's already quite close to $p=176$. For $n=10$ the rank should be 9*19=171 which is less than 176 so one PC will be kicked out. My hypothesis is that the minimum happens for the smallest $n$ such that $(n-1)K>p$ where $K$ is the number of classes... (In this comment $n$ stands for the sample size PER CLASS.) – amoeba Dec 05 '17 at 11:53
  • I updated my answer to include this computation. – amoeba Dec 05 '17 at 12:29
  • Your explanation sounds plausible. Indeed when you look at the plot of the training accuracy (I added it to my post) the overfitting decreases for large $n$. – Johannes Dec 05 '17 at 12:46
  • @Johannes You can double-check by looking at the shape of `scalings_` output of `LDA`. It's the whitening matrix built from pooled within-class data (or maybe some related transformation matrix). If I am right, it should have $(n-1)K$ rows if that's less than $p$. – amoeba Dec 05 '17 at 14:02
  • @amoeba I printed the `scalings_` attribute of the LDA for all the iterations, and its shape is always $p \times (K-1)$. – Johannes Dec 05 '17 at 15:49
  • $\rightarrow$ I have added some more accurate data to my post. – Johannes Dec 05 '17 at 16:06
  • @Johannes Strange. Even the docs http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html say that `scalings_ : array-like, shape (rank, n_classes - 1)`, so there should be `rank` rows, and if I understand the code correctly, `rank` is the rank of the within-class data, which should not be larger than $(n-1)K$. – amoeba Dec 05 '17 at 16:08
  • @Johannes Ah, no, scratch that. I see that in line 384 https://github.com/scikit-learn/scikit-learn/blob/a95203b/sklearn/lda.py#L384 `rank` is redefined via between-class data. OK, then I don't see anything in the output of `LDA()` that would allow to check the rank of within-class data. Anyway, I still think my explanation is correct. – amoeba Dec 05 '17 at 16:12
  • @amoeba You were right (see my post) – Johannes Dec 05 '17 at 16:35
  • @amoeba, I don't follow your "Update": you seem to be speaking of the _df_'s of the covariance matrices but are calling it "rank". The rank of the class cov and the pooled cov matrices is `p`. – ttnphns Dec 06 '17 at 15:49
  • @ttnphns Yes, Xw is computed as X with cases centered by their class means. So that its scatter matrix is equal to the sums of scatter matrices from each class. Regarding "rank", no: if some class has 10 points in 100 dimensions, then the rank of its covariance matrix is 9, not 100. – amoeba Dec 06 '17 at 16:30
  • @amoeba, ok, maybe you ought to write the rank is min(df, d), for clarity, or note that you are speaking just of the $n<p$ case. – ttnphns Dec 06 '17 at 16:36
  • @ttnphns. I will edit to clarify. – amoeba Dec 06 '17 at 16:37
  • @ttnphns Some time ago I described this procedure in more detail in https://stats.stackexchange.com/a/110697. Take a look there. – amoeba Dec 06 '17 at 16:50
  • Hmmmm. Actually this might be related to https://stats.stackexchange.com/questions/328630. I should think about it after we clarify all the issues over there. – amoeba Feb 21 '18 at 16:24