
I read several posts and resources, including this, this, and this. I understand the collinearity problem in machine learning, and I also get why the LASSO method becomes unreliable when the predictor variables are highly correlated: it tends to shrink the coefficients of all but one of a group of correlated variables to zero, keeping only a single representative in the regression equation.

My question is: what cutoff in collinearity starts to pose a problem for the LASSO method? My use case is looking at a set of ~200 genes and finding which ones are the most important variables for explaining a response variable (such as survival or tissue classification). I have seen papers that used the LASSO method to select features, and I'd like to employ a similar approach to shrink this list down to something workable (10-15 genes). I don't necessarily want the correlated variables to be dropped from the regression, though, since they may be simultaneously important. That sounds like the Ridge method at this point, but the problem there is that Ridge won't perform feature selection the way LASSO does. Is there a correlation cutoff between any pair of variables, like rho = +/- 0.75 or so, that signals the LASSO method is not appropriate for a given dataset?
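For concreteness, here is a minimal sketch of the elastic-net middle ground I keep reading about, using the glmnet package. It assumes a response vector y alongside expression_data (not shown here), uses family = "binomial" as a stand-in for a tissue classification outcome, and alpha = 0.5 is just a placeholder mixing value that would need tuning:

library(glmnet)

x <- scale(expression_data)                        # genes in columns
cvfit <- cv.glmnet(x, y, alpha = 0.5, family = "binomial")
coefs <- as.matrix(coef(cvfit, s = "lambda.min"))  # sparse -> dense, keeps rownames
selected <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")
length(selected)                                   # ideally down to ~10-15 genes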

To give you a sense of the data I'm working with, this is what the correlation matrix of the ~200 genes looks like:


library(pheatmap)  # show_colnames/show_rownames are pheatmap arguments

dat <- scale(expression_data)  # standardize each gene (columns)
pheatmap(cor(dat), show_colnames = FALSE, show_rownames = FALSE)

[Image: correlation matrix heatmap of the ~200 genes]

Thanks!

Atakan
  • If you need to keep the correlated variables, then it is not a matter of prediction (for which you do selection), but rather inference? – StupidWolf May 04 '20 at 21:03
  • I think what you have is more complicated than the usual feature selection. Ridge might work; I think it's used in selecting genetic markers, etc., but you still have the problem of defining a cutoff based on the coefficients. – StupidWolf May 04 '20 at 21:17
  • My main goal is inference rather than prediction at this point. However, I might want to see if the model is meaningful in other datasets later on. – Atakan May 04 '20 at 21:22
  • Ermm, so what if you just correlate each gene with your response variable? That would give you 10-15 genes. – StupidWolf May 04 '20 at 21:24
  • Anyway, the answer to your question is no: there is no universal correlation cutoff; it really depends on your dataset. If you have more observations, the correlation might not hurt so much. – StupidWolf May 04 '20 at 21:27
  • We initially performed some correlation-based analyses to obtain this list of 200 genes. I was wondering if fancy machine learning methods can help me further focus on a subset of variables that are most relevant for the response variable. I will try random forest and elastic net approaches as well. Lasso is an interesting one because it can perform feature selection internally, and I found an interesting manuscript using a lasso approach. Thanks for the comments. – Atakan May 04 '20 at 22:30
  • @StupidWolf, what would be a good approach to understand whether I'm "OK" or not with the dataset I have? Is running both Lasso and Ridge analyses and seeing which of the variables dropped by Lasso have a high-ish coefficient in Ridge a good approach? I assume this would reveal highly correlated variables (equally weighted in Ridge, but all except one dropped in Lasso). Am I wrong in this thinking? (See the sketch after these comments.) – Atakan May 05 '20 at 04:08
  • Yes, that might work. I worked on this some time ago; let me see if I can pull out something that worked halfway. – StupidWolf May 05 '20 at 12:49
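A minimal sketch of the lasso-vs-ridge comparison raised in the comments above, using the same assumed x and y as before; the 0.75 quantile used to flag "high-ish" ridge coefficients is an arbitrary choice:

library(glmnet)

lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")    # pure lasso
ridge <- cv.glmnet(x, y, alpha = 0, family = "binomial")    # pure ridge

b_lasso <- as.matrix(coef(lasso, s = "lambda.min"))[-1, 1]  # drop intercept
b_ridge <- as.matrix(coef(ridge, s = "lambda.min"))[-1, 1]

dropped   <- names(b_lasso)[b_lasso == 0]                   # zeroed out by lasso
big_ridge <- names(b_ridge)[abs(b_ridge) > quantile(abs(b_ridge), 0.75)]
intersect(dropped, big_ridge)  # dropped by lasso yet prominent in ridge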
