How to interpret Pr(>|t|) of factor variables?

Question

The reason I am asking is the following:

summary(lm(dta$X.U.FEFF..mpist. ~ dta$matem + dta$aidink + dta$sukup))

gives

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     253.630     12.763  19.873  < 2e-16 ***
dta$matem        20.247      1.626  12.452  < 2e-16 ***
dta$aidink       19.385      2.146   9.035  < 2e-16 ***
dta$sukuptyttö  -24.904      3.903  -6.381 2.69e-10 ***

whereas

summary(lm(dta$X.U.FEFF..mpist. ~ factor(dta$matem) + factor(dta$aidink) + factor(dta$sukup)))

gives

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             541.962     59.549   9.101  < 2e-16 ***
factor(dta$matem)5      -12.744     20.771  -0.614 0.539649    
factor(dta$matem)6        8.502     20.433   0.416 0.677432    
factor(dta$matem)7       28.455     20.479   1.389 0.165007    
factor(dta$matem)8       47.978     20.578   2.332 0.019924 *  
factor(dta$matem)9       76.060     20.791   3.658 0.000267 ***
factor(dta$matem)10      80.058     21.637   3.700 0.000228 ***
factor(dta$aidink)5     -77.283     57.656  -1.340 0.180419    
factor(dta$aidink)6     -47.841     56.304  -0.850 0.395704    
factor(dta$aidink)7     -47.055     56.329  -0.835 0.403712    
factor(dta$aidink)8     -24.940     56.424  -0.442 0.658578    
factor(dta$aidink)9       7.286     56.593   0.129 0.897593    
factor(dta$aidink)10     15.862     57.040   0.278 0.781006    
factor(dta$sukup)tyttö  -26.160      3.895  -6.716 3.15e-11 ***

Notice that dta$matem and dta$aidink have very small Pr(>|t|) values when entered as numeric variables, but wrapping them in factor() "reveals" that most of the individual levels have large Pr(>|t|) values.

I believe that taking factor() is the right way to use these variables, since they are all categorical and hierarchical. So is it just that the non-factor() way is simply wrong?

What about the factor() way? Should I infer from this that the variables dta$matem and dta$aidink are not significant, since the Pr(>|t|) scores (of individual categories) are mostly $> 0.05$?

About my variables:

aidink and matem are grades for native language and mathematics, respectively. They take integer values in $[4,10]$, where $10$ is the best grade and $4$ the worst. sukup indicates gender, so it is two-valued: $\{poika, tyttö \}$ (Finnish for boy and girl).

Also, are the factor() results suggesting that, rather than using all the categories, I might get a better fit by devising a variable that groups the grades into sets other than the individual values? I could, e.g., make a dichotomous variable that separates aidink $\geq 7$ from aidink $\lt 7$. Would this kind of thing improve my model?

Or perhaps I could change the reference using relevel()? But how should I pick the reference?

Also, if I use relevel(), then what should I do with the categories that still get a bad Pr(>|t|) score?

E.g.

summary(lm(dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem),7) + relevel(factor(dta$aidink), 7) + factor(dta$sukup)))

where the second argument 7 selects the seventh level, i.e. grade 10 (and 1 would select grade 4),

produces

Coefficients:
                                Estimate Std. Error t value             Pr(>|t|)    
(Intercept)                      637.881      9.251  68.954 < 0.0000000000000002 ***
relevel(factor(dta$matem), 7)4   -80.058     21.637  -3.700             0.000228 ***
relevel(factor(dta$matem), 7)5   -92.801     10.507  -8.832 < 0.0000000000000002 ***
relevel(factor(dta$matem), 7)6   -71.555      8.870  -8.067  0.00000000000000208 ***
relevel(factor(dta$matem), 7)7   -51.602      8.195  -6.297  0.00000000045636633 ***
relevel(factor(dta$matem), 7)8   -32.080      7.878  -4.072  0.00005029906001530 ***
relevel(factor(dta$matem), 7)9    -3.998      7.610  -0.525             0.599454    
relevel(factor(dta$aidink), 7)4  -15.862     57.040  -0.278             0.781006    
relevel(factor(dta$aidink), 7)5  -93.144     17.089  -5.451  0.00000006347719659 ***
relevel(factor(dta$aidink), 7)6  -63.702     10.518  -6.057  0.00000000197514848 ***
relevel(factor(dta$aidink), 7)7  -62.917      9.288  -6.774  0.00000000002150480 ***
relevel(factor(dta$aidink), 7)8  -40.802      8.505  -4.797  0.00000185649013415 ***
relevel(factor(dta$aidink), 7)9   -8.576      8.227  -1.042             0.297474    
factor(dta$sukup)tyttö           -26.160      3.895  -6.716  0.00000000003152835 ***

You see that matem grade 9 and aidink grades 4 and 9 still get a large Pr(>|t|) value even though the other grades get a small one.
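A minimal sketch of what relevel() does and does not change, using simulated data rather than the data set from the question (the names `grade`, `y`, `fit_ref4` and `fit_ref10` are made up for illustration). Under these assumptions, changing the reference level only re-expresses the same model: the fitted values and R-squared are identical, and the coefficient of level 10 against reference 4 is the negative of the coefficient of level 4 against reference 10, which matches the 80.058 and -80.058 in the two outputs above.

```r
# Simulated stand-in data; NOT the data from the question.
set.seed(3)
n <- 400
grade <- factor(sample(4:10, n, replace = TRUE))
y <- 500 + 15 * as.integer(as.character(grade)) + rnorm(n, sd = 30)

# Same model, two different reference levels.
grade_ref10 <- relevel(grade, ref = "10")
fit_ref4  <- lm(y ~ grade)        # reference level: grade 4
fit_ref10 <- lm(y ~ grade_ref10)  # reference level: grade 10

# The fit itself does not change.
same_fit <- all.equal(unname(fitted(fit_ref4)), unname(fitted(fit_ref10)))
r2 <- c(ref4 = summary(fit_ref4)$r.squared,
        ref10 = summary(fit_ref10)$r.squared)

# "10 vs 4" and "4 vs 10" are the same contrast with opposite sign.
b_10_vs_4 <- unname(coef(fit_ref4)["grade10"])
b_4_vs_10 <- unname(coef(fit_ref10)["grade_ref104"])

print(same_fit)
print(r2)
print(c(b_10_vs_4, b_4_vs_10))
```

Because only the comparisons change, picking the reference that yields the most stars does not make the model any better; it only changes which pairwise differences the summary table happens to display.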

Is it possible to remove only individual (the insignificant) categories from the factors or should this be avoided?

Each P value tests whether the mean of level x differs from the mean of the reference level. In your update this would be 10 vs 4, 10 vs 5, etc. You do not want to drop levels just because they are not significant. I think you need to take a step back and define your hypothesis, because that will guide the analysis. Do you want to compare each level to 10 as the reference? Do you want to look at a linear trend? Do you want to test the differences between all pairs of levels? – Moose Oct 13 '16 at 13:16
@Moose I want to use grades as predictors, since I think that there should be a correspondence between a high mark in a previous subject (math or native language) and the current test's score (the response). I don't know what I should use as the reference, but using 10 gives the best `Pr(>|t|)` values. However, it might also be that I need to formulate better what I'm trying to get out of these predictors. – mavavilj Oct 13 '16 at 13:32
Some possible duplicates: https://stats.stackexchange.com/questions/138768/confused-on-the-interpretation-of-regression-coefficients and https://stats.stackexchange.com/questions/89438/recoding-a-variable-with-three-levels-into-a-dummy-variable – kjetil b halvorsen Apr 04 '19 at 10:05
Use a chunk test for the whole factor, as described at https://stats.stackexchange.com/questions/27429/what-are-chunk-tests – kjetil b halvorsen Jul 12 '21 at 17:46

score 1 · Answer 1 · answered Oct 13 '16 at 06:47

If a variable is categorical, then it should be modelled as a factor. If it is treated as continuous, then you are modelling a linear trend across its values. That is nonsensical if the categories don't have a natural order, even though the coefficient may be highly significant.
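For ordered grades, the numeric (linear trend) model is a special case of the factor model, so the two can be compared with a nested-model F test. The following is only a sketch on simulated data; the data frame `dta` here is a made-up stand-in with a hypothetical response column `score`, not the data from the question.

```r
# Simulated stand-in data; NOT the data from the question.
set.seed(1)
n <- 500
dta <- data.frame(
  matem  = sample(4:10, n, replace = TRUE),
  aidink = sample(4:10, n, replace = TRUE),
  sukup  = sample(c("poika", "tyttö"), n, replace = TRUE)
)
dta$score <- 250 + 20 * dta$matem + 19 * dta$aidink -
  25 * (dta$sukup == "tyttö") + rnorm(n, sd = 40)

# Grades as numeric: one slope per grade variable (linear trend).
fit_lin <- lm(score ~ matem + aidink + sukup, data = dta)

# Grades as factors: one coefficient per non-reference level.
fit_fac <- lm(score ~ factor(matem) + factor(aidink) + sukup, data = dta)

# F test of the extra 10 parameters the factor model spends.
cmp <- anova(fit_lin, fit_fac)
print(cmp)
```

A large p-value in this comparison would suggest that the linear trend describes the grades adequately and the extra factor coefficients are not needed; a small one would suggest the relationship is not linear in the grade.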

Since all the independent variables are categorical, using an ANOVA approach would be much more straightforward than regression. It looks like you would need a 3-way ANOVA.

The regression output is giving you the comparison of each factor level to its reference. See here for explanation.
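As an illustration of that comparison, here is a small sketch with a single simulated factor (the names `grade` and `y` are made up, and the data are not from the question). With R's default treatment (dummy) coding, the intercept is the mean of the reference level and every other coefficient is that level's mean minus the reference mean.

```r
# Simulated stand-in data; NOT the data from the question.
set.seed(2)
grade <- factor(sample(4:10, 300, replace = TRUE))
y <- 500 + 15 * as.integer(as.character(grade)) + rnorm(300, sd = 30)

fit <- lm(y ~ grade)

# The dummy (indicator) columns R builds for the factor;
# level "4" has no column because it is the reference.
print(head(model.matrix(fit)))

# Group means computed directly from the data.
group_means <- sapply(split(y, grade), mean)
diffs <- group_means[-1] - group_means[["4"]]

# Coefficients next to "level mean minus reference mean".
print(cbind(coefficient = coef(fit)[-1], mean_difference = diffs))
```

With several predictors in the model the same reading holds, except that each difference is adjusted for the other predictors rather than being a raw difference of group means.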

To speculate on your final question: In an ANOVA, the factors will likely have a significant effect since these results show that the mean of some levels is quite different from the mean of their reference.
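One way to test a factor as a whole, rather than level by level, is to compare the model with and without it. The sketch below uses simulated data with a hypothetical response column `score`; it is not the data from the question, so the numbers it prints say nothing about the real grades.

```r
# Simulated stand-in data; NOT the data from the question.
set.seed(1)
n <- 500
dta <- data.frame(
  matem  = sample(4:10, n, replace = TRUE),
  aidink = sample(4:10, n, replace = TRUE),
  sukup  = sample(c("poika", "tyttö"), n, replace = TRUE)
)
dta$score <- 250 + 20 * dta$matem + 19 * dta$aidink -
  25 * (dta$sukup == "tyttö") + rnorm(n, sd = 40)

fit_full <- lm(score ~ factor(matem) + factor(aidink) + sukup, data = dta)

# Drop all six matem coefficients at once and test them jointly.
fit_no_matem <- update(fit_full, . ~ . - factor(matem))
chunk <- anova(fit_no_matem, fit_full)
print(chunk)

# The same kind of joint F test for every term in one call.
print(drop1(fit_full, test = "F"))
```

This joint test does not depend on which level is chosen as the reference, which is why it is a more suitable answer to "is this variable significant?" than the individual rows of the summary table.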

Moose

What do you mean by categories having natural order? – mavavilj Oct 13 '16 at 06:55
I also don't understand what you mean by "The regression output is giving you the comparison of each factor level to its reference.". Could you elaborate this? What is the reference? – mavavilj Oct 13 '16 at 06:56
An example of a category with a natural order would be a histology score that indicates the degree of disease (0=no disease, 4 = is worse disease). – Moose Oct 13 '16 at 07:00
The link I referenced will give an elaboration on the comparisons and provides examples in R. See the part on "dummy coding." – Moose Oct 13 '16 at 07:01
Okay, I read the dummy coding part, and while I understand the reference now, what I don't understand is how to interpret the `Pr(>|t|)` values of the different categories relative to the reference. Does it mean that, compared to the reference, the other categories that get `Pr(>|t|)` > 0.05 are *less significant* than the reference? Also, I don't believe that the variables `aidink` and `matem` would not be significant, because inferring from their meaning (performance in previous tests contributes to the measured test scores) they ought to be significant. – mavavilj Oct 13 '16 at 07:08

score 0 · Answer 2 · answered Mar 02 '19 at 14:20

Each p value tells you whether the mean for that level of the factor differs significantly from the mean of the reference level, with the other predictors held fixed. Regression is just fine for this, but you could use ANOVA as well. To interpret the result, go into the summary and add the estimated beta coefficient to the intercept to get the predicted mean for that group when the other factors are at their reference levels. For example, matem 9 (with aidink 4 and sukup poika) would be predicted to have a mean X.U.FEFF..mpist. score of 541.962 + 76.060, or 618.022.
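The arithmetic above can be checked against predict(). This is a sketch on simulated data (the data frame `dta` and its `score` column are made up, not the data from the question). It assumes, as in the outputs in the question, that grade 4 and poika are the reference levels.

```r
# Simulated stand-in data; NOT the data from the question.
set.seed(1)
n <- 500
dta <- data.frame(
  matem  = sample(4:10, n, replace = TRUE),
  aidink = sample(4:10, n, replace = TRUE),
  sukup  = sample(c("poika", "tyttö"), n, replace = TRUE)
)
dta$score <- 250 + 20 * dta$matem + 19 * dta$aidink -
  25 * (dta$sukup == "tyttö") + rnorm(n, sd = 40)

fit <- lm(score ~ factor(matem) + factor(aidink) + sukup, data = dta)
b <- coef(fit)

# Intercept + matem 9 coefficient: matem 9 with the OTHER
# predictors at their reference levels (aidink 4, poika).
manual <- b[["(Intercept)"]] + b[["factor(matem)9"]]
pred_ref <- predict(fit, newdata = data.frame(matem = 9, aidink = 4,
                                              sukup = "poika"))
print(c(manual = manual, predict = unname(pred_ref)))

# Any other combination adds the coefficients of its own levels;
# predict() does that bookkeeping.
pred_other <- predict(fit, newdata = data.frame(matem = 9, aidink = 8,
                                                sukup = "tyttö"))
print(unname(pred_other))
```

For combinations away from the reference levels it is usually less error-prone to let predict() sum the relevant coefficients than to add them by hand.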
