
I have a dataframe which contains numerical and categorical features. I am trying to get p-values of these variables using OLS.

I'm creating dummies to get p-values of the categorical features, but this way I get a p-value for every level of each categorical feature. My goal is to get a p-value for the feature as a whole, not for each of its levels.

How should I interpret the OLS result which contains p-values of dummies? Should I use a chi2 test to get the p-value of the categorical features? If yes, can I combine p-values of numerical features from OLS with p-values of categorical features from the chi2 test for the whole dataset?

kjetil b halvorsen
talatccan
  • What software are you using? – user2974951 Dec 23 '19 at 14:02
  • Python @user2974951 – talatccan Dec 23 '19 at 14:03
  • 1
    "Python" is a Turing-Complete programming language. You need to be a lot more specific than that. You had might as well say "what software?" ... "Computers" – Him Dec 23 '19 at 14:54
  • I don't use any specific software like SPSS, Weka etc. I'm developing software using Python, Scikit-Learn, Imblearn, Statsmodels, Numpy etc. – talatccan Dec 24 '19 at 06:41
  • FAQ: https://stats.stackexchange.com/questions/31690/how-to-test-the-statistical-significance-for-categorical-variable-in-linear-regr, – kjetil b halvorsen Jul 25 '20 at 13:54
  • 2
    Does this answer your question? [How to test the statistical significance for categorical variable in linear regression?](https://stats.stackexchange.com/questions/31690/how-to-test-the-statistical-significance-for-categorical-variable-in-linear-regr) – kjetil b halvorsen Jul 25 '20 at 13:58

2 Answers


You can test the fit of a model with that factor and its levels omitted against the full model. That's basically what's going on when the p-value of each individual parameter is calculated.

The model with the factor and its levels is going to fit better, but the test, in some sense, asks whether the increase in fit is worth the additional model complexity.

You’re doing an OLS linear model, so the standard test would be an F-test of the sums of squared errors:

$$ \dfrac{\frac{SSE_{reduced}-SSE_{full}}{p_{full}-p_{reduced}}}{\frac{SSE_{full}}{n-p_{full}}}\overset{H_0}{\sim} F_{p_{full}-p_{reduced}, n-p_{full}} $$

“Reduced” means the model with the variables (the factor and its levels) omitted; “full” means the model with those variables included. “SSE” is the sum of squared errors (technically, residuals) from a model. “$p$” is the number of parameters in the model, “$n$” is the number of observations, and “$\overset{H_0}{\sim}$” gives the distribution under the null hypothesis that the factor does not influence the outcome.

In R, the “lmtest” package has tools to do comparisons like this, and the “anova” function in base R can compare nested models directly.
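Since the question mentions statsmodels, here is a minimal sketch of this nested-model F-test in Python; the data, column names, and factor levels are made up purely for illustration:

```python
# Hypothetical example: one F-test for a whole categorical feature,
# comparing a reduced OLS model (factor dropped) against the full model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "y": rng.normal(size=n),
    "x_num": rng.normal(size=n),
    "cat": rng.choice(["a", "b", "c"], size=n),  # 3-level factor -> 2 dummies
})

# Full model: numeric predictor plus the categorical factor (all its dummies).
full = smf.ols("y ~ x_num + C(cat)", data=df).fit()
# Reduced model: the factor (and therefore all its dummy columns) omitted.
reduced = smf.ols("y ~ x_num", data=df).fit()

# anova_lm on nested models computes the F statistic above and reports a
# single p-value for the factor as a whole.
table = anova_lm(reduced, full)
print(table)
```

The row for the full model carries the F statistic and its p-value; `df_diff` equals the number of dummy columns dropped (here 2), matching $p_{full}-p_{reduced}$ in the formula above.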

Dave

In case you aren't aware, the p-value returned for each category assesses whether that individual dummy differs significantly from your chosen reference group. I believe you are looking for a global test of whether the variable itself is significant. In this case you would use a "chunk test", also known as a "general linear F-test".

See this thread: What are chunk tests?

It is important to note that some dummies within a categorical variable may not be significant on their own, but they should still be left in the model.
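In statsmodels (which the asker mentions), one way to get such a per-variable test is `wald_test_terms`, which reports one joint test per model term rather than one per dummy. A minimal sketch, with made-up data and column names:

```python
# Hypothetical sketch: joint ("chunk") tests per term with statsmodels.
# A multi-level factor gets a single joint p-value instead of one per dummy.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 150
df = pd.DataFrame({
    "y": rng.normal(size=n),
    "x_num": rng.normal(size=n),
    "group": rng.choice(["low", "mid", "high"], size=n),
})

fit = smf.ols("y ~ x_num + C(group)", data=df).fit()

# One row per term: Intercept, C(group) (joint over its dummies), x_num.
terms = fit.wald_test_terms()
print(terms.table)
```

The `C(group)` row tests all of that factor's dummy coefficients jointly, which is the global test described above.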

geoscience123