How to interpret Pr(>|t|) of factor variables?
The reason I am asking is the following:
summary(lm(dta$X.U.FEFF..mpist. ~ dta$matem + dta$aidink + dta$sukup))
gives
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     253.630     12.763  19.873  < 2e-16 ***
dta$matem        20.247      1.626  12.452  < 2e-16 ***
dta$aidink       19.385      2.146   9.035  < 2e-16 ***
dta$sukuptyttö  -24.904      3.903  -6.381 2.69e-10 ***
whereas
summary(lm(dta$X.U.FEFF..mpist. ~ factor(dta$matem) + factor(dta$aidink) + factor(dta$sukup)))
gives
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)             541.962     59.549   9.101  < 2e-16 ***
factor(dta$matem)5      -12.744     20.771  -0.614 0.539649
factor(dta$matem)6        8.502     20.433   0.416 0.677432
factor(dta$matem)7       28.455     20.479   1.389 0.165007
factor(dta$matem)8       47.978     20.578   2.332 0.019924 *
factor(dta$matem)9       76.060     20.791   3.658 0.000267 ***
factor(dta$matem)10      80.058     21.637   3.700 0.000228 ***
factor(dta$aidink)5     -77.283     57.656  -1.340 0.180419
factor(dta$aidink)6     -47.841     56.304  -0.850 0.395704
factor(dta$aidink)7     -47.055     56.329  -0.835 0.403712
factor(dta$aidink)8     -24.940     56.424  -0.442 0.658578
factor(dta$aidink)9       7.286     56.593   0.129 0.897593
factor(dta$aidink)10     15.862     57.040   0.278 0.781006
factor(dta$sukup)tyttö  -26.160      3.895  -6.716 3.15e-11 ***
Notice that dta$matem and dta$aidink have very small Pr(>|t|) values before they are converted to factors, but converting them with factor() "reveals" that most of the individual levels have large Pr(>|t|) values.
I believe that using factor() is the right way to handle these variables, since they are all categorical and ordered. So is the non-factor() approach simply wrong?
What about the factor() approach? Should I infer from it that the variables dta$matem and dta$aidink are not significant, since the Pr(>|t|) values of the individual categories are mostly $> 0.05$?
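To make this concrete: as far as I understand, each Pr(>|t|) in the factor() output only compares one level with the reference level, so an alternative might be to test a whole factor at once with an F-test between nested models. Below is a minimal sketch on simulated data. The data frame sim, the column mpist and all effect sizes are made up for illustration and are not my real data.

```r
# Simulated stand-in for my data; names and effect sizes are assumptions.
set.seed(1)
n <- 500
sim <- data.frame(
  matem  = sample(4:10, n, replace = TRUE),
  aidink = sample(4:10, n, replace = TRUE),
  sukup  = sample(c("poika", "tyttö"), n, replace = TRUE)
)
sim$mpist <- 250 + 20 * sim$matem + 19 * sim$aidink -
  25 * (sim$sukup == "tyttö") + rnorm(n, sd = 40)

# Full model with both grades coded as factors.
full <- lm(mpist ~ factor(matem) + factor(aidink) + sukup, data = sim)

# (a) Does matem matter at all? Drop the whole factor and compare.
no_matem <- lm(mpist ~ factor(aidink) + sukup, data = sim)
test_whole <- anova(no_matem, full)

# (b) Is a straight line in matem enough? The numeric coding is nested
#     in the factor coding, so the same kind of F-test applies.
linear_matem <- lm(mpist ~ matem + factor(aidink) + sukup, data = sim)
test_linear <- anova(linear_matem, full)

print(test_whole)
print(test_linear)
```

If I read this correctly, comparison (a) asks whether matem contributes anything as a whole, and comparison (b) asks whether the separate level effects fit noticeably better than a single slope. Neither depends on which grade is the reference level.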
About my variables:
aidink and matem are grades for native language and mathematics, respectively. They take integer values in $[4,10]$, where $10$ is the best grade and $4$ the worst. sukup indicates gender, so it has two values, $\{\text{poika}, \text{tyttö}\}$ (boy, girl).
Also, do the factor() results suggest that, rather than using all the categories, I might get a better fit by devising a variable that groups the grades into sets other than the individual values? For example, I could make a dichotomous variable that distinguishes aidink $\geq 7$ from aidink $< 7$. Would this kind of grouping improve my model?
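The dichotomy I have in mind could be coded roughly as follows. This is only a sketch on made-up grades; the vector aidink here is simulated and the name aidink_hi is hypothetical.

```r
# Made-up grades standing in for dta$aidink.
set.seed(2)
aidink <- sample(4:10, 200, replace = TRUE)

# Two-level factor: grades below 7 versus grades of 7 and above.
aidink_hi <- factor(aidink >= 7,
                    levels = c(FALSE, TRUE),
                    labels = c("<7", ">=7"))

# Cross-tabulate to check that every grade lands in the intended group.
print(table(aidink, aidink_hi))
```

The new variable would then replace factor(dta$aidink) in the model formula.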
Or perhaps I could change the reference level using relevel()? But how should I pick the reference?
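As far as I can tell, relevel() accepts the reference either as a position among the levels or as the level's label. A small sketch with a toy factor (grades is a made-up vector, not my data):

```r
# Toy factor with the same seven grade levels, "4" through "10".
grades <- factor(4:10)

# Reference given by position: the 7th level is "10".
by_position <- relevel(grades, ref = 7)

# Reference given by label: arguably less error-prone to read.
by_name <- relevel(grades, ref = "10")

# Both calls move "10" to the front, making it the baseline in lm().
print(levels(by_position))
print(identical(levels(by_position), levels(by_name)))
```

Only the order of the levels changes, so the fitted model should be the same and only the comparisons reported in the coefficient table differ.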
Also, if I use relevel(), what should I do with the categories that still get a large Pr(>|t|) value?
For example,
summary(lm(dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem),7) + relevel(factor(dta$aidink), 7) + factor(dta$sukup)))
where reference index 7 corresponds to grade 10 (and index 1 would correspond to grade 4), produces
Coefficients:
                                Estimate Std. Error t value             Pr(>|t|)
(Intercept)                      637.881      9.251  68.954 < 0.0000000000000002 ***
relevel(factor(dta$matem), 7)4   -80.058     21.637  -3.700             0.000228 ***
relevel(factor(dta$matem), 7)5   -92.801     10.507  -8.832 < 0.0000000000000002 ***
relevel(factor(dta$matem), 7)6   -71.555      8.870  -8.067  0.00000000000000208 ***
relevel(factor(dta$matem), 7)7   -51.602      8.195  -6.297  0.00000000045636633 ***
relevel(factor(dta$matem), 7)8   -32.080      7.878  -4.072  0.00005029906001530 ***
relevel(factor(dta$matem), 7)9    -3.998      7.610  -0.525             0.599454
relevel(factor(dta$aidink), 7)4  -15.862     57.040  -0.278             0.781006
relevel(factor(dta$aidink), 7)5  -93.144     17.089  -5.451  0.00000006347719659 ***
relevel(factor(dta$aidink), 7)6  -63.702     10.518  -6.057  0.00000000197514848 ***
relevel(factor(dta$aidink), 7)7  -62.917      9.288  -6.774  0.00000000002150480 ***
relevel(factor(dta$aidink), 7)8  -40.802      8.505  -4.797  0.00000185649013415 ***
relevel(factor(dta$aidink), 7)9   -8.576      8.227  -1.042             0.297474
factor(dta$sukup)tyttö           -26.160      3.895  -6.716  0.00000000003152835 ***
Some levels still get large Pr(>|t|) values (grade 9 for both variables, and grade 4 for aidink), even though the other grades get small ones.
Is it possible to remove only the individual (insignificant) categories from the factors, or should this be avoided?