
I have created a linear model with two (interacting) independent nominal variables and one dependent interval variable.

Most of the estimated parameters were statistically significant, but the parameters related to one level of an independent variable came back as insignificant. The F-statistic for the entire model was also significant.

How should I interpret this? Does this mean my fitted model is wrong? Can I still use these coefficients to create a reliable model and make conclusions? What steps can I take to create a better model?

Don't know if it's useful, but I will also describe my data and approach in R. The dependent variable contains measurements of protein counts in the blood. The independent variables are nominal, with two possible values (gender) and three values (a custom group indicating the severity of a specific disease), respectively. These two independent variables are expected to interact.

In R, the following linear model was used:

lm(formula = proteinCount ~ gender + severity + gender:severity)

The coefficients found for the parameters genderMale:severityGroup3 and severityGroup3 both came back as insignificant, while the coefficients for the other parameters all had p-values below 0.001.

HenryP96

1 Answer


Don't get misled by nominally "insignificant" coefficients for particular levels of a multi-level categorical predictor. That's particularly true if your model will be used for prediction, as (if you're not overfitting with too many predictors for your number of observations) there's little to be gained by removing any predictor associated with outcome, and you run a risk of omitted-variable bias if you remove them. But it's also true even if your modeling is restricted to evaluating "significance" of predictors.

You have 3 levels of your severity predictor. With the default coding in R, the coefficients reported for severityGroup3 (individually and in the interaction term) represent differences from the reference level of that predictor, presumably severityGroup1. Nevertheless, the severity predictor seems to have a strong association overall with outcome (again, both in the individual term and in the interaction). This suggests that severityGroup2 might be particularly important. My guess is that if you used severityGroup2 as your reference level you would have found "significant" p-values for both severityGroup1 and severityGroup3.
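To see that the choice of reference level changes only how the coefficients are reported, not the model itself, here's a sketch with simulated data (the variable and level names are illustrative, not your actual data):

```r
set.seed(1)
n <- 60
gender   <- factor(rep(c("Female", "Male"), each = n / 2))
severity <- factor(rep(c("Group1", "Group2", "Group3"), times = n / 3))
proteinCount <- 10 + 2 * (severity == "Group2") + rnorm(n)
d <- data.frame(proteinCount, gender, severity)

## Default: Group1 is the reference level
fit1 <- lm(proteinCount ~ gender + severity + gender:severity, data = d)

## Same model with Group2 as the reference level
d2 <- d
d2$severity <- relevel(d2$severity, ref = "Group2")
fit2 <- lm(proteinCount ~ gender + severity + gender:severity, data = d2)

## The individual coefficients (and their p-values) change,
## but the fitted values -- the model itself -- are identical:
all.equal(fitted(fit1), fitted(fit2))  # TRUE
```

The coefficients answer different questions in the two fits (differences from Group1 vs. differences from Group2), which is why "significance" of individual levels can flip while the model's predictions stay the same.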

You need to gauge the overall contribution of a multi-category predictor to the model, not just the differences of particular levels from the reference that the individual coefficients represent. You need a test that combines information from all levels of the predictor. Your situation has the form of a classic two-way analysis of variance (a continuous outcome evaluated among combinations of two categorical predictors), which can provide such overall measures. If the numbers of observations in the combinations of predictor levels aren't all the same, however, there are some issues you need to consider in how best to do the ANOVA.
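One base-R way to get such an overall test is to compare nested models with anova(), which pools information across all levels of the dropped term into a single F-test. Again using simulated data with illustrative names:

```r
set.seed(1)
d <- data.frame(
  gender   = factor(rep(c("Female", "Male"), each = 30)),
  severity = factor(rep(c("Group1", "Group2", "Group3"), times = 20))
)
d$proteinCount <- 10 + 2 * (d$severity == "Group2") + rnorm(60)

full   <- lm(proteinCount ~ gender + severity + gender:severity, data = d)
no_int <- lm(proteinCount ~ gender + severity, data = d)
no_sev <- lm(proteinCount ~ gender, data = d)

anova(no_int, full)    # overall F-test of the gender:severity interaction
anova(no_sev, no_int)  # overall F-test of severity, all levels at once
```

Note that with unbalanced data the result of the second comparison depends on which other terms are in the smaller model (this is the Type I / Type II / Type III issue alluded to above), so think about which comparison matches your question.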

For related approaches, Frank Harrell's rms package uses a Wald test to get such overall estimates of predictor and interaction significance. You might also look at Russ Lenth's emmeans package, which provides useful ways to examine and present the results of many types of analyses when groups aren't balanced.
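As a hedged sketch of the emmeans approach (this requires installing the emmeans package, which is not part of base R; variable names are again illustrative): joint_tests() reports an overall F-test for each model term, and emmeans() reports estimated marginal means per group, both of which behave sensibly when the design is unbalanced.

```r
set.seed(1)
d <- data.frame(
  gender   = factor(rep(c("Female", "Male"), each = 30)),
  severity = factor(rep(c("Group1", "Group2", "Group3"), times = 20))
)
d$proteinCount <- 10 + 2 * (d$severity == "Group2") + rnorm(60)
fit <- lm(proteinCount ~ gender + severity + gender:severity, data = d)

if (requireNamespace("emmeans", quietly = TRUE)) {
  library(emmeans)
  joint_tests(fit)                   # one overall F-test per model term
  emmeans(fit, ~ severity | gender)  # estimated marginal means by group
}
```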

EdM
  • Thank you for the detailed answer. It was very helpful, especially since I'm pretty new to statistics. – HenryP96 Nov 22 '20 at 19:35