I have a dataset with 1216 columns and 104 observations. I want to quantify numerically how much each column influences a change (drop or rise) in the value of the target variable, or at least find out which columns have the most influence on the target.

My first thought was to simply interpret the coefficients of a linear regression model, but since there are many more variables than observations, the model will achieve a perfect fit, and I don't know whether under these conditions the coefficients are meaningful for determining the influence of the variables on the target.

I have also thought about lasso and ridge regression, but these methods force the coefficients to be near 0, so how could I know to what extent a variable influences the target if the coefficients are 0 or near 0?

My question is: which approach should I use here? Which method would be most helpful to determine and quantify in a meaningful way how much a variable influences the target?

I'm using Python for this analysis, so if you could point to or suggest a method that can be implemented in Python, it would also be very helpful.

Thank you very much in advance.

Miguel 2488

1 Answer

I would try PCA. As you probably know, a principal component (PC) is a linear combination of the original features.

Once I have fitted the model on the PCs and obtained the feature importances/coefficients, I would allocate them proportionately among the individual members of each PC.
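
One way to read the allocation step is as principal component regression: regress the target on the first few PCs, then push each PC coefficient back onto the original features through the PC weights (loadings). The snippet below is only a minimal sketch of that idea using NumPy. The function name `pcr_feature_coefficients`, the choice of 10 components, and the synthetic data are assumptions made for illustration, not something taken from the question's dataset.

```python
import numpy as np

def pcr_feature_coefficients(X, y, n_components):
    """Principal component regression, mapped back to the original features."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize columns so that no feature dominates because of its scale.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    Z = (X - mean) / std
    # PCA via SVD: the rows of Vt hold the weights of each feature in each PC.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components]   # shape (n_components, n_features)
    scores = Z @ loadings.T        # shape (n_samples, n_components)
    # Least-squares fit of the centered target on the component scores.
    gamma, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    # Spread each PC coefficient over the features using the PC weights.
    beta = loadings.T @ gamma      # shape (n_features,)
    return beta, gamma, loadings

# Synthetic data with the shape from the question (104 x 1216), illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(104, 1216))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=104)

beta, gamma, loadings = pcr_feature_coefficients(X, y, n_components=10)
# Indices of the 10 features with the largest absolute back-projected coefficient.
top = np.argsort(np.abs(beta))[::-1][:10]
```

Here `beta` is expressed on the standardized feature scale, and it depends on how many components are kept, so it is probably best treated as a rough ranking rather than an exact measure of influence.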

jdsurya
  • Hi @jdsuryap. Thank you for your answer. Could you give a little more detail about the last part of it? How many principal components would be appropriate to use in this case? And when you say "I would use it to proportionately allocate among the individual members of the PCs", how exactly could I allocate these importances to each member of the PCs? Thank you – Miguel 2488 Apr 18 '21 at 14:08
  • Hi @Miguel2488, the choice of the number of PCs should be guided by the elbow method, using a metric like explained variance, as in this post: https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained. Also, regarding "how exactly could I allocate these importances": I would use the weights of the original features in the PCs to allocate the importance/coefficient of each PC. – jdsurya Apr 18 '21 at 15:02
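
The explained-variance criterion mentioned in the last comment could be checked roughly as follows. This is a sketch with NumPy on synthetic data. The helper `explained_variance_ratio` and the 90% cutoff are assumptions used as a simple stand-in for reading the elbow off a plot, and the data is only centered here, so columns on very different scales would need standardizing first.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                   # center each column
    S = np.linalg.svd(Xc, compute_uv=False)   # singular values, largest first
    var = S ** 2                              # proportional to PC variances
    return var / var.sum()

# Synthetic data with the shape from the question (104 x 1216), illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(104, 1216))

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)
# Smallest number of components whose cumulative share reaches 90%.
k90 = int(np.searchsorted(cumulative, 0.90) + 1)
```

Plotting `ratio` or `cumulative` against the component index gives the curve in which one would look for the elbow.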