
I have a feature set for each customer (age, gender, income, lifestyle, and so on) and a binary response variable, say has_repurchased.

  • I fit a logit model; the summary shows income and gender to be significant (p < 0.05).

    logit_res = logit_mod.fit(method='bfgs')  # maximize the likelihood via BFGS

  • When I fit a random forest on the same feature set and rank the features by model.feature_importances_, both income and gender rank low, while the top feature is age (which, per the logit model, is not even significant).

Shouldn't the random forest's feature importances rank the highly significant features near the top? Is this the correct way to interpret these results?
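
For reference, a minimal sketch of my setup (df, the column names, and the hyperparameters below stand in for the actual data):

    # Sketch of the setup; df and the column names are placeholders,
    # and the features are assumed to be numeric / already encoded.
    import statsmodels.api as sm
    from sklearn.ensemble import RandomForestClassifier

    X = df[['age', 'gender', 'income', 'lifestyle']]
    y = df['has_repurchased']

    # Logit summary: income and gender come out with p < 0.05
    logit_mod = sm.Logit(y, sm.add_constant(X))
    logit_res = logit_mod.fit(method='bfgs')
    print(logit_res.summary())

    # Random forest: age tops the impurity-based importance ranking
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    print(sorted(zip(rf.feature_importances_, X.columns), reverse=True))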

Geet
  • Related: https://stats.stackexchange.com/questions/164048/can-a-random-forest-be-used-for-feature-selection-in-multiple-linear-regression/164068#164068 The models are different, so what counts as important is different. There's no universal "importance" metric. – Sycorax May 22 '19 at 01:35
  • The link talks about whether I can feed important features into a linear model (as a form of feature selection) and then interpret the coefficients so I can get some p-values. For feature selection, I use the Lasso. My question is more about the inner workings of the models: essentially, shouldn't the significant features also be important? If not, why? – Geet May 22 '19 at 02:22
  • The question is different, but the exact same reasoning applies: the linear model doesn't pick up $x_1$ or $x_2$ as being important because the effect isn't linear. The linear model ... is linear, so it can only identify linear relationships. But if a linear model doesn't find a linear relationship, that has no bearing on a non-linear relationship; moreover, a tree-based model can find dependencies based on other relationships, which a linear model can only do if it's specified up front. – Sycorax May 22 '19 at 02:43
  • I would suspect that age will be somewhat correlated with income; large p-values may also be caused by multicollinearity, if present, though this wouldn't necessarily mean that the variables are insignificant. Remember that p-values (confidence intervals for ML estimates) inversely depend on the standard errors (Hessian) of the estimates; with multicollinearity present and the sample size fixed, it is harder to fit the correct confidence bounds, hence the larger p-values. – runr May 22 '19 at 09:07

1 Answer


We can extend the reasoning presented in "Can a random forest be used for feature selection in multiple linear regression?" to this context.

The data in this figure are clearly separated by the circle $1 = x_1^2 + x_2^2$, which is a nonlinear boundary in $x_1, x_2$.

[Figure: two classes of points separated by the unit circle, a nonlinear boundary]

Because the relationship is so clear, it shouldn't be surprising that a random forest can do a good job of picking out some approximation to this boundary.

On the other hand, a linear model, such as a logistic regression, has the form $$\mathbb{P}(y=\text{red}\mid x)=f(\beta_0 + \beta_1 x_1 + \beta_2 x_2).$$ Even though $x_1, x_2$ have a nonlinear relationship to the outcome, this linear model can't detect it, precisely because the relationship is nonlinear. For the linear model to find the relationship, you would have to specify the model so that the features enter in a linear form, perhaps through an indicator variable that takes the value 1 whenever $x_1^2 + x_2^2 < 1$. That is easy enough to do here, where we can visualize the data readily. But with noisy data and many features, it is much harder to find the right transformation. This is the appeal of random forests and similar methods: they can find relationships between features and outcomes with little human intervention.
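
To make this concrete, here is a small simulation in the spirit of the figure (a sketch assuming scikit-learn; the sample size and hyperparameters are arbitrary):

    # Sketch: two classes separated by the unit circle.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.uniform(-1.5, 1.5, size=(2000, 2))
    y = (X[:, 0]**2 + X[:, 1]**2 < 1).astype(int)  # red = inside the circle

    # A linear boundary can't approximate the circle, so accuracy stays
    # near the majority-class baseline.
    print(LogisticRegression().fit(X, y).score(X, y))

    # Axis-aligned splits stack up to approximate the circle: the forest
    # classifies well and rates both features as important.
    rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
    print(rf.score(X, y), rf.feature_importances_)

    # Supplying x1^2 + x2^2 up front makes the problem linear again.
    X_aug = np.column_stack([X, X[:, 0]**2 + X[:, 1]**2])
    print(LogisticRegression().fit(X_aug, y).score(X_aug, y))

The last fit illustrates the point above: once the quadratic feature is specified up front, the "linear" model finds the boundary easily.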

In fact, because the boundary has the form $1 = x_1^2 + x_2^2$, it is not only nonlinear but depends on the two features acting together: if $x_1^2$ is small, that alone is not sufficient for a point to be red, because $x_2^2$ may still be large; a point is red only when both $x_1^2$ and $x_2^2$ are small enough.

All of this is a very detailed way to say that what a model considers important is specific to the model itself, and not "universal." Differences in how models go about estimation strongly bear on how the model assigns importance.

There are a number of other caveats about random forest feature importance, beyond the fact that it is not comparable to importance as measured by a linear model. For more information, see https://explained.ai/rf-importance/index.html
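
For instance, the default impurity-based importances can be misleading; a permutation-based check is one common alternative (a sketch assuming scikit-learn, with model, X_test, and y_test standing in for a fitted forest and held-out data):

    # Permutation importance: mean drop in score when each feature is
    # shuffled on held-out data; model/X_test/y_test are placeholders.
    from sklearn.inspection import permutation_importance

    result = permutation_importance(model, X_test, y_test,
                                    n_repeats=30, random_state=0)
    print(result.importances_mean)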

Additionally, some criticism of interpreting larger coefficients as more important can be found in Gary King, "How Not to Lie with Statistics: Avoiding Common Mistakes in Quantitative Political Science".

Sycorax