
With my 37 base features, I obtain 7 PCA components whose total explained variance is 62584.661. Below is my code (I use scikit-learn):

from sklearn import decomposition

# Keep only the feature columns (drop the id and the target)
X = set_1.to_dataframe().drop(['user_id', 'TARGET'], axis=1)

pca = decomposition.PCA()
pca.fit(X)

print(pca.explained_variance_)

  [6.14280876e+04   4.54217662e+01   2.72989436e+01   2.67631322e+00
   1.82282196e+00   1.21712136e+00   1.14632304e+00   9.78220983e-01
   5.29859226e-01   4.46864198e-01   2.20896621e-01   1.35477040e-01
   1.24036813e-01   1.16983213e-01   9.77577393e-02   9.52879856e-02
   6.33013163e-02   5.11503915e-02   4.46737594e-02   4.17721621e-02
   3.62836939e-02   2.81101420e-02   2.40227513e-02   1.73247899e-02
   9.59248897e-03   5.97952400e-03   2.32278409e-03   1.53377399e-03
   1.13924785e-03   3.69972626e-04   5.94762833e-28   5.94762833e-28
   5.94762833e-28   5.94762833e-28   5.94762833e-28]

print(pca.explained_variance_[:7].sum())
# 62584.661944243671

Using these 7 PCA components in my linear regression model, I obtain an r2 value of 0.20.
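
For completeness, a minimal sketch of this step, assuming a 0.7/0.3 train/test split as described in the comments below (`set_1` and the column names are from the question; the rest of the setup is an assumption):

from sklearn import decomposition, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Features and target, as in the question
df = set_1.to_dataframe()
X = df.drop(['user_id', 'TARGET'], axis=1)
y = df['TARGET']

# 0.7 (training) / 0.3 (testing) split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Project onto the first 7 principal components (fit on training data only)
pca = decomposition.PCA(n_components=7)
Z_train = pca.fit_transform(X_train)
Z_test = pca.transform(X_test)

# Ordinary least squares on the component scores, scored on the test set
reg = linear_model.LinearRegression().fit(Z_train, y_train)
print(r2_score(y_test, reg.predict(Z_test)))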

To improve the results, I add 39 more features (76 in total). Following the same steps, PCA indicates 71052.64 as the variance explained by the 76 components. However, when I run linear regression using these 76 components, I obtain an r2 of 0.06, which is lower than the 0.20 of the previous model.

I wonder if I am doing something wrong in my experiment. Is this behavior expected? I thought I would get a better model with a PCA that explains more variance.

renakre
  • Are these r-squared values from the training set or from the test set? Or obtained via cross-validation? – amoeba Feb 14 '17 at 10:02
  • @amoeba From the test set. I split the data into 2 partitions: 0.7 (training) and 0.3 (testing). Do you think this behaviour is weird? – renakre Feb 14 '17 at 10:04
  • I see. No, it's not weird. I think your question is a duplicate, wait a second - I will provide a link. – amoeba Feb 14 '17 at 10:05
  • @amoeba thanks for the link! It helps, but I still do not know what my next move should be. – renakre Feb 14 '17 at 10:08
  • What do you mean? Next move in order to achieve what? – amoeba Feb 14 '17 at 10:09
  • @amoeba in order to improve my prediction model? I am including more data about students' activities, but things get worse with more information. – renakre Feb 14 '17 at 10:11
  • Yes, and the link explains why. I don't get your question. You can use cross-validation or train/test split to find the optimal number of PCs to put into your regression (see the first sketch after these comments). – amoeba Feb 14 '17 at 10:12
  • +1 to amoeba. An additional question to the OP: is your R2 observed one or adjusted (population estimate) one? – ttnphns Feb 14 '17 at 10:12
  • @amoeba then I believe the way I chose the number of PCs is not valid (i.e., choosing the ones with positive explained variance)? – renakre Feb 14 '17 at 10:14
  • @ttnphns I am very sorry but I am not sure about your question... I am using MS Azure ML to run the experiment with 600 rows, and r2 is the `coefficient of determination` in the result. – renakre Feb 14 '17 at 10:15
  • Renakre, please glance in [wikipedia](https://en.wikipedia.org/wiki/Coefficient_of_determination) and compare the formulas of the two with the formula in the MS Azure ML documentation. – ttnphns Feb 14 '17 at 10:19
  • Have a look at partial least squares (PLS). You can think of it as a PCA for regression, where instead of maximizing variance it maximizes covariance (see the comparison sketch after these comments). – Georg M. Goerg Feb 14 '17 at 13:41
  • @renakre If your goal is regression, then you should choose the number of PCs based on the r-squared, not based on the explained variance. Explained variance is almost irrelevant for prediction. By the way, if you agree that this is a duplicate question (and I do think it is), please click on the "agree" button (or similar) above. – amoeba Feb 15 '17 at 08:01
  • @amoeba thanks, I did. Do you think PCA is also not suitable for regression? In another thread, I was recommended to use ridge regression. – renakre Feb 15 '17 at 08:19
  • Ridge regression IMHO makes more sense in almost all circumstances. But many people use PCA+regression (it's called "principal component regression", PCR) and if it's standard in your field or if you find it conceptually more attractive it's fine to use it. – amoeba Feb 15 '17 at 09:39
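
Following up on amoeba's suggestion above, a minimal sketch of choosing the number of components by cross-validated r2 rather than by explained variance (`X` and `y` as defined in the question; the candidate range and `cv=5` are assumptions):

from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Cross-validated r2 for each candidate number of components;
# pick the k with the best mean score, not the most explained variance.
for k in range(1, 31):
    pcr = make_pipeline(PCA(n_components=k), LinearRegression())
    scores = cross_val_score(pcr, X, y, cv=5, scoring='r2')
    print(k, scores.mean())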
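
And a sketch of the two alternatives raised in the comments, partial least squares and ridge regression; `n_components=7` and `alpha=1.0` are placeholder values that would themselves be tuned by cross-validation:

from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# PLS: components chosen to maximize covariance with the target
pls_scores = cross_val_score(PLSRegression(n_components=7), X, y,
                             cv=5, scoring='r2')

# Ridge: shrinks all coefficients instead of truncating components
ridge_scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring='r2')

print(pls_scores.mean(), ridge_scores.mean())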

0 Answers