
I am doing a prediction assignment as part of a machine learning course using loans data. I have just done some exploratory data analysis on my dataset of just over 9,000 rows. There are 11 variables I need to use to predict the output. I have found that 5 pairs of variables are strongly correlated with each other (absolute correlation > 0.5), while 24 pairs have an absolute correlation below 0.1.

With these lines, it's easy to see which is which in my correlation matrix:

```python
# Mask out entries outside each threshold so the weak and strong
# correlations stand out when printing the matrix.
df_corr_small = df_corr.apply(lambda x: [y if abs(y) <= 0.1 else 'more' for y in x])
df_corr_large = df_corr.apply(lambda x: [y if abs(y) >= 0.5 else 'less' for y in x])
```

Can anyone tell me what implications this might have on my further analysis?

I can choose any model. I'm thinking of using PCA to reduce the collinearity problem and extract the features, then something like a decision tree or GradientBoostingClassifier; the dataset is probably too small for a neural network to learn well. Would that be a recommended approach?
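Roughly what I have in mind, sketched with scikit-learn on synthetic data (I can't share the loans data here, so the column names, correlations, and thresholds below are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(9000, 11)),
                 columns=[f"feat_{i}" for i in range(11)])
# Make two pairs strongly correlated, mimicking my dataset.
X["feat_1"] = X["feat_0"] * 0.8 + rng.normal(scale=0.5, size=9000)
X["feat_3"] = X["feat_2"] * 0.7 + rng.normal(scale=0.6, size=9000)
y = (X["feat_0"] + X["feat_2"] + rng.normal(size=9000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scale before PCA, since PCA is sensitive to feature scale.
model = make_pipeline(StandardScaler(),
                      PCA(n_components=0.95),  # keep 95% of the variance
                      GradientBoostingClassifier(random_state=0))
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 3))
```

Passing a float to `n_components` keeps however many components are needed to explain that fraction of variance, which seemed like a reasonable way to let PCA decide how much of the correlated structure to fold together.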

Or is it better to do something else with the strongly correlated variables (corr = 0.54..0.71) before proceeding with any of the classifiers?
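For context, the alternative I'm weighing is simply dropping one variable from each strongly correlated pair before fitting anything. A minimal sketch of what I mean, again on made-up data (column names and the 0.5 cutoff are just illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("abcd"))
df["b"] = df["a"] * 0.9 + rng.normal(scale=0.3, size=1000)  # corr(a, b) is high

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.5).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)  # prints ['b'] because corr(a, b) > 0.5
```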

  • [Does this answer your question?](https://stats.stackexchange.com/questions/50537/should-one-remove-highly-correlated-variables-before-doing-pca) – DYZ Aug 26 '21 at 21:34
  • It sort of says that including or excluding one of each pair of highly correlated variables is subjective, 'based on the analytical objectives and knowledge of the data.' I really can't decide and am looking for advice on what to do with them in my case. I'm tempted to try both including and excluding them and see what happens in terms of accuracy. – Maria Bruevich Aug 27 '21 at 13:14
