I'm trying to do feature selection on a dataset structured like this:

Group     Date         Metric 1   Metric 2   Metric 3
Group 1   2016-03-01   1.0        1.3        2.0
Group 1   2016-03-01   1.5        1.5        2.2
Group 2   2016-03-03   2.0        1.8        2.4
Group 2   2016-03-04   2.5        1.0        1.0
Group 3   2016-03-05   1.0        2.0        1.5
Group 3   2016-03-05   1.1        2.3        1.0

Previously, I used a lasso model for feature selection:

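library(caret)  # provides train() and varImp()
# train_control is assumed to be a trainControl() object defined earlier
model <- train(ATV ~ ., data = data, trControl = train_control, method = "lasso")
importance <- varImp(model, scale = FALSE)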

The problem is that the variables show a different relationship with the response when the groups are pooled than when each group is analyzed separately. The ultimate question is: which explanatory variables are the strongest across groups? (i.e., which features explain the DV well within every group?)

I think this is similar to Simpson's paradox: http://vudlab.com/simpsons/
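
To make the concern concrete, here is a minimal sketch in base R using simulated data (not the data above). The variable names `x`, `y`, and `sim`, and all of the numbers, are assumptions chosen purely for illustration. Within each simulated group `y` decreases with `x`, yet a pooled fit that ignores the grouping reports a positive slope:

```r
# Simulated data (hypothetical, not the data from the question):
# within every group y falls as x rises, but groups with a higher
# average x also have a higher average y.
set.seed(42)
n_per_group <- 50
group_id <- rep(1:3, each = n_per_group)
x <- rnorm(3 * n_per_group, mean = 2 * group_id, sd = 0.5)
y <- 5 * group_id - 1 * x + rnorm(3 * n_per_group, sd = 0.3)
sim <- data.frame(group = factor(group_id), x = x, y = y)

# Pooled fit: ignores the grouping entirely.
pooled_slope <- coef(lm(y ~ x, data = sim))[["x"]]

# Separate fit inside each group.
within_slopes <- sapply(split(sim, sim$group), function(d) {
  coef(lm(y ~ x, data = d))[["x"]]
})

print(pooled_slope)   # positive in this simulation
print(within_slopes)  # negative in every group
```

The sign flip comes from the between-group differences in the averages, which is the pattern that makes a pooled importance ranking hard to interpret.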

Sorry if this seems like a basic question, but here is what I'm trying to understand: how would you recommend doing feature selection that generalizes across these groups? I've considered a few other methods, but would like to hear your opinion before moving forward.

  • Why do you actually need feature selection? – gung - Reinstate Monica Mar 14 '16 at 22:59
  • I would like to build a simplified explanatory model of a response. Therefore, I want to use as few features as possible (there are a lot of features, with collinearity among them) while keeping the most explanatory power. I started with a correlation analysis and dropped features that were highly correlated. I felt feature selection was a strong next step for reducing dimensionality. – andor kesselman Mar 14 '16 at 23:04
  • Feature selection is typically a dangerous thing to do. It may help to read my answer here: [Algorithms for automatic model selection](http://stats.stackexchange.com/a/20856/7290). Moreover, the question of which variable is 'best' is generally unanswerable because variables are typically incommensurate. – gung - Reinstate Monica Mar 15 '16 at 00:03
  • @gung Thank you. That was a fantastic explanation, and I upvoted it. I do have one final question: do you have a link to another CV post that highlights some of the dangers of regression analysis with grouped data (such as above), broken down into multiple levels? How would you recommend approaching a dataset like the one above, with individual days within groups? My concern is that the relationship between groups is different from the relationship within a group, and I want to avoid misinterpreting the analysis. – andor kesselman Mar 15 '16 at 01:18
  • I'm not sure. It sounds like you need to include an interaction term in your model. – gung - Reinstate Monica Mar 15 '16 at 01:40
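
The interaction-term suggestion in the last comment could look roughly like the sketch below. It uses simulated data, not the data from the question; `ATV` is taken from the question's snippet as the response, while the choice of `Metric1` as the single predictor, the per-group slopes in `true_slope`, and the sample sizes are assumptions made only for illustration. The formula `Group * Metric1` lets each group have its own intercept and its own slope, and the nested-model comparison asks whether those group-specific slopes are needed.

```r
# Simulated data (hypothetical): the effect of Metric1 on ATV differs by group.
set.seed(1)
n_per_group <- 40
dat <- data.frame(
  Group = factor(rep(c("Group 1", "Group 2", "Group 3"), each = n_per_group)),
  Metric1 = rnorm(3 * n_per_group)
)
true_slope <- c("Group 1" = 1.0, "Group 2" = -0.5, "Group 3" = 0.2)
dat$ATV <- unname(true_slope[as.character(dat$Group)]) * dat$Metric1 +
  rnorm(nrow(dat), sd = 0.2)

# One common slope for all groups (group-specific intercepts only).
fit_common <- lm(ATV ~ Group + Metric1, data = dat)

# Group-specific slopes: Group * Metric1 expands to
# Group + Metric1 + Group:Metric1.
fit_inter <- lm(ATV ~ Group * Metric1, data = dat)

# F-test comparing the nested models: does the interaction improve the fit?
comparison <- anova(fit_common, fit_inter)
print(comparison)

# The Group:Metric1 coefficients are slope differences from the baseline group.
print(coef(fit_inter))
```

If the interaction terms are clearly non-zero, a single pooled importance score for `Metric1` hides the fact that its effect depends on the group.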
