
Scikit-Learn doesn't report the p-values for your models. I'm used to looking at the p-values - among a few other factors - when choosing the variables to include in my final model. However, p-values don't seem to be a big deal for Scikit-Learn. Why is that? Isn't the p-value important? Can I use a variable even though its p-value is considerably large?

In the question https://stackoverflow.com/questions/59908991/what-is-the-level-of-significance-considered-in-the-logistic-regression-using-sc/59911394#59911394, the user Matias says that p-values are not used in machine learning. Is that true?

trder

2 Answers


$p$-values are used for hypothesis testing. In machine learning you don't have any hypothesis to test, and you don't care about one. You care about making accurate predictions, and hypothesis testing has nothing to do with that.

You can check the The Two Cultures: statistics vs. machine learning? thread for some related discussion.

Tim
  • +1, and just a comment: in machine learning "hypothesis" has another meaning altogether - it is a concrete instance of a model. And the space of all possible models that one can obtain via optimisation is called "the hypothesis space". – Karolis Koncevičius Jan 26 '20 at 22:43

Some Machine Learning techniques are based on p values, e.g., ANOVA feature selection.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html

f_classif (ANOVA) returns a p-value for each feature. As Tim writes, the p-value is used for hypothesis testing. ANOVA tests whether the means of two or more samples are equal. A low p-value shows that at least two samples have different means, which is a good indicator for a feature. Usually values below 0.1, 0.05, or 0.01 mean that the feature could be used. You could use, for example, SelectKBest (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) to keep the 10 features with the lowest p-values.
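A minimal sketch of what this looks like in code; the synthetic dataset and the choice of k=10 are just illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, only 5 of which are actually informative.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# f_classif computes an ANOVA F-statistic and a p-value per feature.
F, p = f_classif(X, y)
print(p.round(3))  # one p-value per feature; informative ones tend to be small

# SelectKBest keeps the k features with the highest F-scores
# (equivalently, the lowest p-values).
selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (200, 10)
```

Note that SelectKBest ranks by score rather than by a fixed significance threshold; if you want a hard cutoff like 0.05, SelectFpr in the same module filters on the p-value directly.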

methus