
I have to forecast one variable and I use several different methods to do so. I would like to try a nonlinear alternative as well, and I am considering Artificial Neural Network (ANN) models.

I have many predictors, and I know that with linear models, making no selection among them is not a good idea, because overfitting then becomes highly likely. I don't know whether the same is true for ANNs; perhaps overfitting occurs primarily when we use too many hidden layers and/or nodes.

In any case, I would like to use some reduction/selection rule as well. Now, I know that PCA is one possible technique for data reduction. Actually, I have more than 130 predictors, but with fewer than 30 components (after standardization) I capture more than 70% of the total variability. I can use these components as predictors, but my point is the following.
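For concreteness, here is a minimal sketch of that workflow, assuming scikit-learn (the random matrix is only a stand-in for my real predictors):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 130))        # stand-in for the real predictor matrix

X_std = StandardScaler().fit_transform(X)  # standardize before PCA

pca = PCA().fit(X_std)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# smallest number of components that captures at least 70% of total variance
k = int(np.searchsorted(cum_var, 0.70) + 1)
Z = PCA(n_components=k).fit_transform(X_std)  # candidate predictors for the ANN
```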

I know that ANNs are more effective when there are highly nonlinear links between the predictors and the predicted variable. Now, the PCs are linear combinations of the predictors, so I fear that the components that contain most of the variability tend to obscure the nonlinear relations. If this is true, it seems to me that PCA is not a great idea for reducing the number of predictors in an ANN model. Is this true?

If it is, what method is preferable for variable selection and/or reduction for an ANN? Is it better to use all the predictors that I have?

markowitz
  • What problem are you trying to solve with feature selection? What bad thing are you worried about happening if you don't do it? – Sycorax May 27 '20 at 13:49
  • I’m not sure that predictor selection is better; in fact, I use the full model as well. However, I fear that the argument of parsimony, frequently used in prediction, matters here. I fear this especially because I use many predictors with several lags included, and I chose them without a theoretically justified selection. Even if in pure prediction a theoretical justification for including predictors is not necessary, and even if ANNs let us calibrate a model regardless of the ratio between the number of observations and predictors, it seems hard to justify that more predictors are always better. – markowitz May 28 '20 at 05:32
  • For example, I have seen somewhere that some authors use the BIC criterion to select predictors for linear regression and then use the same selection for an ANN. I fear that a procedure like this is not good, for the reason above: selection techniques designed for linear models can have poor properties in nonlinear ones. I do not know any selection technique designed for ANNs; PCA seems to me a fairly agnostic one. These were the main things I was worried about. – markowitz May 28 '20 at 05:33

1 Answer


Using PCA to project the data to a lower dimension can yield worse results. This is as true for neural networks as it is for any other model. This is because the projection to a lower dimension is not aware of the outcome. All it does is retain the PCs with the largest variance; if the informative features lie in the PCs with the lowest variance, then PCA will make the model worse because you're excluding the signal.
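Here is a toy illustration of this failure mode (made-up data, assuming scikit-learn): the outcome depends only on a low-variance direction, so truncating to one PC throws away the signal.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
noise_dir = 10 * rng.standard_normal(n)   # dominant variance, unrelated to y
signal_dir = rng.standard_normal(n)       # small variance, drives y
X = np.column_stack([noise_dir, signal_dir])
y = signal_dir + 0.1 * rng.standard_normal(n)

Z = PCA(n_components=1).fit_transform(X)  # keeps only the high-variance direction
print(LinearRegression().fit(Z, y).score(Z, y))   # R^2 close to 0
print(LinearRegression().fit(X, y).score(X, y))   # R^2 close to 1
```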

On the other hand, if you only use PCA to rotate the data, but not to project it to a lower dimension, PCA can be useful because it improves the optimization dynamics.

In "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", Sergey Ioffe and Christian Szegedy suggest that whitening transformation are helpful during the optimization steps.

It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and de-correlated.
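As a minimal sketch of that kind of full-rank transformation (synthetic data, assuming scikit-learn): keeping all the components discards nothing, but the inputs come out zero-mean, unit-variance, and decorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic correlated inputs
X = rng.standard_normal((500, 30)) @ rng.standard_normal((30, 30))

# full-rank rotation + whitening: no components are dropped
X_white = PCA(whiten=True).fit_transform(X)

# columns are now decorrelated with unit sample variance
print(np.allclose(np.cov(X_white, rowvar=False), np.eye(30), atol=1e-6))
```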

How to select features for inclusion depends on the domain. The best feature selection is to use domain knowledge to eliminate features which are not related to the outcome. For tabular datasets where domain knowledge is unavailable, generic feature selection methods like Boruta can be helpful, even though they're not foolproof. There's always some risk that you'll leave out important features, or include irrelevant ones. Also, the results obtained from Boruta depend on the model you use to measure feature importance, which introduces additional knobs to turn.
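As an example, here is a sketch of Boruta-style selection using the third-party BorutaPy package with a random-forest importance model (the data and its nonlinear signal are made up for illustration; swapping the estimator is exactly the knob mentioned above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from boruta import BorutaPy  # pip install Boruta

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))                       # stand-in predictors
y = X[:, 0] * X[:, 1] + 0.5 * rng.standard_normal(500)   # nonlinear signal in two columns

# the importance model is a knob: a different estimator may select different features
rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
selector = BorutaPy(rf, n_estimators='auto', random_state=0)
selector.fit(X, y)  # BorutaPy expects numpy arrays, not DataFrames

X_selected = X[:, selector.support_]  # columns confirmed as relevant
```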

See also: Are dimensionality reduction techniques useful in deep learning

Sycorax
  • First of all, thanks for your answer; give me some time to read it more carefully. I knew that standardized predictors are better for ANNs, but I did not know about decorrelating them, thanks. If I understood correctly, retaining all the PCs (no truncation) is a good idea. Given that I am doing supervised prediction with financial data (returns), and that most of the predictors (not all) show low correlations, can you add anything? – markowitz May 28 '20 at 05:34