
I'm doing a data analysis on data with more than 100 dimensions.

Afterwards, different ML algorithms such as neural networks (NN) are applied to it.

When I first do a PCA to reduce the dimensionality to somewhere between 3 and 10, I consistently get better results (fewer mispredictions) than without it.

My expectation was that PCA would just speed up the NN etc., not make it more accurate.

Is this improvement realistic or did I make a mistake with my PCA?


This is how I'm doing it, concretely:

```matlab
Data;         % training input
Test_Data;    % test input
pca_size = 3; % number of principal components to keep

% Scaling and centering of Data
Scaled = (Data - mean(Data))./std(Data);

coeff = pca(Scaled);

Data_Reduced = Data * coeff(:, 1:pca_size);
Test_Data_Reduced = Test_Data * coeff(:, 1:pca_size);
```
  • PCA manages to keep most of the information in your dataset (if it is structured) but condenses it into far fewer variables. Because of the curse of dimensionality your original dataset is not well suited for classification, and because PCA condenses it into a few variables you end up with better results. Seems good to me. – Riff Jan 25 '17 at 09:05
  • Reducing the dimensionality of your input reduces the complexity of the estimated neural network. It's quite plausible that, if the high-variance components obtained from PCA are mostly what matters for classification, the [reduction in variance (overfitting) due to a simpler model may dominate the increase in bias](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff). – Matthew Gunn Jan 25 '17 at 09:20
  • Added sample code showing how I'm doing the PCA. Thank you so far. – SwingNoob Jan 25 '17 at 09:22
  • See http://stats.stackexchange.com/questions/141864 and http://stats.stackexchange.com/questions/142557. It's all covered there. – amoeba Jan 25 '17 at 09:31
  • The two lines starting with `Data_Reduced = Data * coeff(:, 1:pca_size);` are not correct. – Matthew Gunn Jan 25 '17 at 09:48
  • Why are they not? – SwingNoob Jan 25 '17 at 09:50
  • Or could anyone please tell me if and why the code is wrong? – SwingNoob Jan 25 '17 at 10:44
  • Is this `Scaled = (Data - mean(Data))./std(Data);` a Matlab command or rather a pseudo-code? If `Data` is a 2D matrix, then in Matlab `Data - mean(Data)` will not subtract column means, and dividing by `std(Data)` like that won't work at all. – amoeba Jan 25 '17 at 16:32
  • It is indeed Matlab code, and `Data - mean(Data)` does subtract the column means; I just tried it out. I also verified the division by `std(Data)`: it works perfectly fine, computing the standard deviation of each column and dividing each column element by the corresponding column standard deviation. – SwingNoob Jan 25 '17 at 17:25
  • Wow!! You are probably using the 2016b version, aren't you? I googled and found that 2016b indeed [introduced this functionality](https://nickhigham.wordpress.com/2016/09/20/implicit-expansion-matlab-r2016b/), so it *has* been working since September last year. Amazing. I've been using Matlab for years and it *never* worked before, and it used to be so annoying. Thanks for letting me know. – amoeba Jan 25 '17 at 19:11
  • In any case, regarding the last two lines: you should multiply your `Scaled` with PCA `coeff`, not the `Data`. And for the test data, you need to scale it first with the mean/std of the training data. – amoeba Jan 25 '17 at 19:12
  • You're welcome, and yes I am. You had to use `bsxfun` before, hadn't you? But does this have a big influence, especially since I use the unscaled versions of both `Data` and `Test_Data`, so both are prepared the same way? And the PCA is working well. So in sum there is improvement potential in multiplying the scaled data, but in general my solution works? – SwingNoob Jan 25 '17 at 20:42
  • Exactly, was always using `bsxfun`. I don't know if it has a big influence or not. What you are doing is not really PCA (because you compute eigenvectors of the scaled data but transform unscaled data), but as you are doing the same thing with train and test then it's okay. Regarding your main question of how it's possible that results are better, please read http://stats.stackexchange.com/questions/141864/ and follow the links in the top answer. (Also, please include `@amoeba` in your comments, otherwise I am not notified of them.) – amoeba Jan 25 '17 at 23:54
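For reference, the corrected pipeline the comments converge on (fit the scaling and the PCA on the training data only, then project the *scaled* train and test data) can be sketched as follows. This is a hedged NumPy translation of the MATLAB snippet above, not the original code: the data is synthetic, and PCA is done via SVD of the scaled training matrix, whose right singular vectors play the role of MATLAB's `coeff` (signs of the components may differ from MATLAB's `pca`).

```python
import numpy as np

rng = np.random.default_rng(0)
Data = rng.normal(size=(200, 100))       # synthetic stand-in for the training input
Test_Data = rng.normal(size=(50, 100))   # synthetic stand-in for the test input
pca_size = 3                             # number of principal components to keep

# Center and scale using TRAINING statistics only, and apply the
# same mu/sigma to the test data.
mu = Data.mean(axis=0)
sigma = Data.std(axis=0, ddof=1)
Scaled = (Data - mu) / sigma
Test_Scaled = (Test_Data - mu) / sigma

# PCA via SVD of the scaled training data: the columns of Vt.T are
# the principal axes, analogous to MATLAB's `coeff`.
U, S, Vt = np.linalg.svd(Scaled, full_matrices=False)
coeff = Vt.T

# Project the *scaled* data, not the raw `Data`/`Test_Data`.
Data_Reduced = Scaled @ coeff[:, :pca_size]
Test_Data_Reduced = Test_Scaled @ coeff[:, :pca_size]
```

Projecting the raw data with a `coeff` fitted on the scaled data (as in the question) mixes two different coordinate systems; it only coincides with PCA when the features already have zero mean and unit variance.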
