17

There are several posts about how to select features. One of the methods describes feature importance based on t-statistics. In R, varImp(model) applied to a linear model with standardized features uses the absolute value of the t-statistic for each model parameter. So, basically, we choose a feature based on its t-statistic, meaning how precise its coefficient estimate is. But does the precision of my coefficient tell me anything about the predictive ability of the feature?

Can it happen that a feature has a low t-statistic but would still improve (let's say) the accuracy of the model? If yes, when would one want to exclude variables based on the t-statistic? Or does it just give a starting point for checking the predictive abilities of the non-important variables?
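To make this concrete, here is a small illustration of the behaviour I mean (assuming the caret package is installed; the mtcars model is just a placeholder):

library(caret)

# A linear model with standardized features; mtcars is a stand-in dataset.
fit <- lm(mpg ~ scale(wt) + scale(hp), data = mtcars)

varImp(fit)                              # importance = |t| of each coefficient
abs(coef(summary(fit))[-1, "t value"])   # the same numbers, computed by hand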

Alina
    For a one-sample test of the mean, the t statistic is simply the sample mean divided by the estimated standard error (sample standard deviation divided by square root of sample size). That statistic by itself *doesn't* depend on any particular hypothesis. Deriving a p value from that statistic *does* depend on a hypothesis. – Dan Hicks Jul 12 '17 at 17:50
  • @DanHicks I edited my question. – Alina Jul 12 '17 at 18:18
  • I'm not very familiar with caret, but it seems that `varImp()` is intended to be an informative or diagnostic function and not directly used for feature selection or elimination. – david25272 Jul 13 '17 at 01:46

2 Answers

14

The t-statistic can have next to nothing to say about the predictive ability of a feature, and t-statistics should not be used to screen predictors out of, or allow predictors into, a predictive model.

P-values say spurious features are important

Consider the following setup in R. Let's create two vectors, the first of which is simply $5000$ random draws from a standard normal distribution:

set.seed(154)
N <- 5000
y <- rnorm(N)  # pure noise: N standard normal draws

The second vector is $5000$ observations, each randomly assigned to one of $500$ equally sized random classes:

N.classes <- 500
# cut(1:N, N.classes) splits the indices into 500 equally sized groups
rand.class <- factor(cut(1:N, N.classes))

Now we fit a linear model to predict y from rand.class.

M <- lm(y ~ rand.class - 1) #(*)

The correct value for all of the coefficients is zero; none of them have any predictive power. Nonetheless, many of them come out significant at the 5% level:

ps <- coef(summary(M))[, "Pr(>|t|)"]
hist(ps, breaks=30)

Histogram of p-values

In fact, we should expect about 5% of them to be significant, even though they have no predictive power!
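As a quick sanity check, the realized fraction of significant coefficients is easy to compute from the ps vector above:

# Under the null, roughly 5% of the 500 p-values should fall below 0.05.
mean(ps < 0.05)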

P-values fail to detect important features

Here's an example in the other direction.

set.seed(154)
N <- 100
x1 <- runif(N)
x2 <- x1 + rnorm(N, sd = 0.05)
y <- x1 + x2 + rnorm(N)

I've created two correlated predictors, each with predictive power.

M <- lm(y ~ x1 + x2)
summary(M)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1271     0.2092   0.608    0.545
x1            0.8369     2.0954   0.399    0.690
x2            0.9216     2.0097   0.459    0.648

The p-values fail to flag the predictive power of either variable because the correlation between x1 and x2 limits how precisely the model can estimate their individual coefficients from the data.
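The signal is nonetheless there jointly: an F-test comparing the fitted model against an intercept-only model should come out highly significant here (a quick extra check, reusing the M fit from above):

# Joint F-test: x1 and x2 together explain y even though neither
# individual t-statistic is significant.
anova(lm(y ~ 1), M)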

Inferential statistics do not tell you about the predictive power or importance of a variable; it is an abuse of these measurements to use them that way. There are much better options available for variable selection in predictive linear models; consider using glmnet.
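For example, here is a minimal sketch of lasso-based variable selection with glmnet (assuming the package is installed, and reusing x1, x2, and y from above):

library(glmnet)

# Lasso with a cross-validated penalty: predictors whose coefficients
# are shrunk exactly to zero are effectively dropped from the model.
X <- cbind(x1, x2)
cv.fit <- cv.glmnet(X, y, alpha = 1)
coef(cv.fit, s = "lambda.min")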

(*) Note that I am leaving off an intercept here, so all the comparisons are to the baseline of zero, not to the group mean of the first class. This was @whuber's suggestion.

Since it led to a very interesting discussion in the comments, the original code was

rand.class <- factor(sample(1:N.classes, N, replace=TRUE))

and

M <- lm(y ~ rand.class)

which led to the following histogram

Skewed histogram of p-values

Matthew Drury
    Hmm, why is this p-value distribution not uniform? – amoeba Jul 12 '17 at 20:08
  • @amoeba That's an excellent point! The effect does not appear to go away as I increase the number of data points. Hopefully a better statistician than I will weigh in. – Matthew Drury Jul 12 '17 at 20:14
    Wow, how did you pick the seed number? Any other seed results in nearly uniform ps... – psychOle Jul 12 '17 at 20:47
    I try to always use the same seed for this sort of thing: https://en.wikipedia.org/wiki/154_(album) – Matthew Drury Jul 12 '17 at 20:50
  • @MichaelM I would venture that in the presence of much superior alternatives, yes, it is wrong. – Matthew Drury Jul 12 '17 at 20:56
    You are conducting the wrong tests: you are comparing 499 group means to the first group mean. With the seed 154, the first group mean of 1.18... is unusually high (which can happen because the group size of 5 is so small), so most of the others have significantly negative effects. Fix it by running the model `lm(y ~ rand.class - 1)`. This doesn't change the validity of all your remarks (+1). To be even more convincing, balance the group sizes: `rand.class <- factor(cut(1:N, N.classes))`. – whuber Jul 12 '17 at 21:03
    Of course : / I 100% expected @whuber to drop in, and say something completely clear and obvious that I had missed. I'll fix it up now. – Matthew Drury Jul 12 '17 at 21:25
    Take your M model and run a summary on it. You will get the following result: Residual standard error: 0.9968 on 4500 degrees of freedom; Multiple R-squared: 0.09411; Adjusted R-squared: -0.006547; F-statistic: 0.935 on 500 and 4500 DF; p-value: 0.8369 – Gustavo Mirapalheta Apr 24 '20 at 23:44
    @MatthewDrury, I disagree with your main conclusion: "The t-statistic can have next to nothing to say about the predictive ability of a feature, and t-statistics should not be used to screen predictors out of, or allow predictors into, a predictive model." Indeed, in general, the p-value can successfully be used as an inclusion/exclusion rule for regressors in predictive models. Maybe I can add my reply later. – markowitz Jul 16 '21 at 12:17
2

The t-statistic is influenced by the effect size and the sample size. It might be the case that the effect size is non-zero but the sample size is not big enough to make it significant.

In a simple t-test for zero mean (which is analogous to testing whether a feature's influence is zero), the t-statistic is $t=\left(\frac{\overline{x}}{s}\right) \sqrt{n}$

The ratio $\frac{\overline{x}}{s}$ is the sample estimate of the effect size; if it is small, then the p-value won't indicate significance until the $\sqrt{n}$ term becomes large.
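A quick simulation sketch of this point (my own illustration; the effect size of 0.1 and the sample sizes are arbitrary choices):

set.seed(1)
# The same small true effect (mean 0.1) tends to be insignificant at
# small n and becomes significant once sqrt(n) is large enough.
for (n in c(20, 200, 2000)) {
  x <- rnorm(n, mean = 0.1)
  cat("n =", n, " p-value =", t.test(x)$p.value, "\n")
}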

In your case, any feature with a non-zero effect will improve performance, but you may not have enough data to make that feature's p-value significant.

Hugh
    I do not think it is true that any feature with a non-zero effect will improve performance. Maybe this is true on the training data, but it certainly is not on the *test* data. – Matthew Drury Jul 12 '17 at 19:25
  • @MatthewDrury Are you saying we lack methods for inferring population measurements from samples? – Todd D Jul 13 '17 at 14:57
  • No, but it is true that spurious features can interfere with your capability to do that well. – Matthew Drury Jul 13 '17 at 15:17
    @MatthewDrury, I know this is an old thread but there's one important point being missed here: the p-value of the model as a whole. The value of the whole model can be measured by its F-statistic; another way of seeing it is as the model's p-value. Whenever this p-value is small, you can be sure that at least one of the predictors is linearly correlated with the model's output. What you are saying about the variables' p-values is true, although this does not mean that the p-value concept as a whole is flawed. – Gustavo Mirapalheta Apr 24 '20 at 23:40