I am doing a prediction assignment as part of a machine learning course, using loans data. I have just finished some exploratory data analysis on my dataset of just over 9000 rows. There are 11 predictor variables I need to use to predict the output. I have discovered that 5 pairs of variables are strongly correlated with each other (|corr| > 0.5), while 24 pairs have a correlation of less than 0.1.
With these lines, it's easy to see which is which in my correlation matrix:

# keep only weak correlations (|r| <= 0.1); flag the rest as 'more'
df_corr_small = df_corr.apply(lambda x: [y if abs(y) <= 0.1 else 'more' for y in x])
# keep only strong correlations (|r| >= 0.5); flag the rest as 'less'
df_corr_large = df_corr.apply(lambda x: [y if abs(y) >= 0.5 else 'less' for y in x])
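In case it helps, here is a rough sketch of the same filtering on a toy DataFrame standing in for my loans data (assuming `df_corr` comes from `df.corr()`). Masking the upper triangle lists each pair only once and skips the diagonal:

```python
import numpy as np
import pandas as pd

# toy data standing in for my loans dataset
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
df["e"] = df["a"] * 0.9 + rng.normal(scale=0.1, size=100)  # force one strong pair
df_corr = df.corr()

# upper triangle only: each pair appears once, diagonal excluded
upper = df_corr.where(np.triu(np.ones(df_corr.shape, dtype=bool), k=1))
pairs = upper.stack()                 # MultiIndex Series of (var1, var2) -> r
strong = pairs[pairs.abs() >= 0.5]    # strongly correlated pairs
weak = pairs[pairs.abs() <= 0.1]      # weakly correlated pairs
print(strong)
```

With 5 columns this yields 10 distinct pairs, and the artificially coupled `(a, e)` pair shows up in `strong`.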
Can anyone tell me what implications this might have for my further analysis?
I can choose any model. I'm thinking of using PCA to reduce the collinearity problem and extract features, then something like a decision tree or GradientBoostingClassifier; the dataset may be too small for a neural network to learn well. Would that be a recommended approach?
Or is it better to do something else with the strongly correlated variables (corr = 0.54..0.71) before proceeding with any of the classifiers?
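For reference, this is roughly what I have in mind, sketched on synthetic data (in my real case `X` would be the 11 predictors and `y` the loan outcome):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for my loans data: ~9000 rows, 11 predictors
X, y = make_classification(n_samples=9000, n_features=11,
                           n_informative=6, random_state=0)

clf = make_pipeline(
    StandardScaler(),           # PCA is sensitive to feature scale
    PCA(n_components=0.95),     # keep components explaining 95% of variance
    GradientBoostingClassifier(random_state=0),
)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

Putting PCA inside the pipeline means it is refit on each training fold, so the cross-validation estimate isn't leaking information from the held-out folds.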