
Say I have a dataset $X$ consisting of differing RVs, which may be continuous and/or nominal, and my target set, which again may be continuous or nominal. If the number of covariates is large, I would typically want to boil those down into features which hold the greatest importance in my prediction. To that end, I would normally conduct hypothesis tests, however I am under the impression that such tests can be misleading as to whether to reject or fail to reject the null hypothesis, and thus whether to drop or keep the columns in question.

For example, in my ANOVA test, I take as my null hypothesis that all groups of some nominal RV have equal variance and equal mean, i.e. the nominal RV doesn't distinctly categorise between groups. I proceed to conduct my hypothesis test and it suggests strongly, with a p-value of $0.001$, that the groups have equal variance and equal mean, and thus, since the RV doesn't distinctly categorise each group, that it should be dropped. Why, in such a case, would it be a bad idea to drop this RV on the basis of the test?
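For concreteness, here is a minimal sketch of the kind of screening test described above, assuming a continuous target `y` and a nominal covariate `g`; all names and the screening rule itself are illustrative, not a recommendation.

```python
# Sketch of ANOVA-based feature screening, as described in the question.
# Assumes a continuous target y and a nominal covariate g (illustrative names).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g = rng.choice(["a", "b", "c"], size=300)   # nominal RV with three levels
y = rng.normal(size=300)                    # continuous target, unrelated to g here

# One-way ANOVA: null hypothesis is that all group means are equal.
groups = [y[g == level] for level in np.unique(g)]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")

# The screening rule in the question: drop g if the test does not separate
# the groups. The answer below explains why this rule can mislead.
```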

Any help is appreciated.

Jay Ekosanmi
  • In (very) short: because statistical significance does not measure a sort of "importance" of a variable, or what kind of effect it has on your dependent variable; it measures the probability of obtaining a value at least as extreme, assuming a certain sampling distribution. Or in other words, a statistically insignificant result could be understood as "insufficient evidence", which is not the same as "evidence to the contrary". – user2974951 Jan 06 '22 at 09:26
  • Thank you, so what are better estimators of feature importance? – Jay Ekosanmi Jan 06 '22 at 09:28
  • @user2974951 Although statistical significance and importance are different things, in the special case of linear regression it is often recommended to use the t-values of the variables as a measure of importance (the R package `vip`, for instance, returns t-values as "variable importance" by default); see the sketch below. t-values are a monotonic transformation of p-values (or vice versa) and are thus equivalent to using p-values. – cdalitz Jan 06 '22 at 17:02
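A short sketch of the point in the last comment, using `statsmodels` as an illustrative choice (the comment itself refers to the R package `vip`): in ordinary least squares, the absolute t-values give a ranking of coefficients that is equivalent to ranking by p-values.

```python
# t-values from OLS as a rough "importance" ranking (illustrative only).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=200)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.tvalues)   # larger |t| corresponds to a smaller p-value
print(fit.pvalues)
```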

1 Answer


If the number of covariates is large, I would typically want to boil those down into features which hold the greatest importance in my prediction.

Why do you think you would want to do this if you were not constrained in some other way (e.g. cost of acquisition or memory to store the features)?

Anyway, I've answered a similar question here. The main point is that a small p value gives you absolutely no information about the size of the effect, which is the main consideration. Small and negligible effects can be highly significant. As per my example in the linked answer, the variable Z would be included in the model based solely on significance criteria, yet the model performance is nearly identical with or without it, meaning selection using p values can lead you to select unimportant variables.
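A hedged simulation of that point (not the linked answer's exact example): with enough data, a variable with a negligible effect can be highly significant, yet leaving it out barely changes predictive performance. The coefficients and sample size below are made up for illustration.

```python
# A negligible but highly significant effect: significance != importance.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 1.0 * x + 0.01 * z + rng.normal(size=n)   # z has a tiny effect

full = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
reduced = sm.OLS(y, sm.add_constant(x)).fit()

print(f"p-value for z: {full.pvalues[2]:.2g}")   # typically well below 0.05
print(f"R^2 with z:    {full.rsquared:.4f}")
print(f"R^2 without z: {reduced.rsquared:.4f}")  # nearly identical
```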

Demetri Pananos
  • Thank you (+1), so it indeed seems that p values are used in ML for scenarios they were not intended for. On this note, can you give some examples of where using p values in ML would be appropriate for drawing valid inferences? – Jay Ekosanmi Jan 06 '22 at 10:37
  • @jaiyeko P values are about statistical inference, and generally ML does not care about statistical inference. I've yet to see a genuine application in ML where p values are useful. In fact, I'm dubious that p values are useful even in the places they are intended to be used. – Demetri Pananos Jan 06 '22 at 10:40