1

Running the anova() function on a linear regression object returns p-values for each parameter in my model. However, I do not understand what specific test it is running (I think it is a Wald Chi-Square) or how it chose it.

The documentation for the anova function is very sparse and does not explain this detail- is the function computing the F ratio of the model with the parameter and without or is it doing something different? How does the Wald Chi-Square test differ from ANOVA?

Edit:

I am running regression (specifically geepack) on a dataset. The output returns the parameters (betas), including each level of the categorical variables, along with p-values.

My first question is what specific test is being run on each of these parameters? Is it testing whether beta = 0 or is it testing if the parameter is contributing to the model (or is this the same thing)?

Next, I eliminate covariates that are p<0.1 by running anova() on the model to determine which variables are significant. Instead of looking at the categorical variables by level as before, it tests the variable overall and determines a p-value.

Again, I am not sure what exact test it is using to get a p-value and what the test is asking (beta = 0 or contributes to model?)

Edit #2:

library(tidyverse)
library(broom)
library(geepack)

    data <- structure(list(Cluster = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 6L, 6L, 6L), Visit = c(0, 18, 
24, 0, 6, 18, 24, 0, 6, 12, 24, 18, 0, 6, 18, 24, 6, 0, 6, 18
), Gender = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Male", "Female"
), class = "factor"), Visit_age = c(37, 38, 39, 22, 23, 24, 24, 
20, 21, 21, 22, 22, 36, 37, 38, 38, 22, 42, 42, 43), AHQ_YearsSimilarFreq = c(13, 
13, 13, 5, 5, 5, 5, 12, 12, 12, 12, 12, 6, 6, 6, 6, 11, 4, 4, 
4), AHQ_Out_Games_MainPos = structure(c(3L, 3L, 3L, 4L, 4L, 4L, 
4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L), .Label = c("Forward", 
"Midfield", "Defense", "Goaltender"), class = "factor"), AlcWeek_Category = structure(c(2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
1L, 1L, 1L), .Label = c("I do not drink", "Light Drinking (M:1-7, F:1-2)", 
"Mod-Heavy Drinking (M:8+, F:3+)"), class = "factor"), CigsYN = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), Concussion = structure(c(3L, 
3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L), .Label = c("0", "1", "2+"), class = "factor"), SchoolYears = c(19, 
19, 19, 16, 16, 16, 16, 15, 15, 15, 15, 15, 20, 20, 20, 20, 15, 
20, 20, 20), MHQ_Heading_Male_Quartile = structure(c(NA, NA, 
NA, 4L, 4L, 3L, 2L, 2L, 3L, 2L, 3L, 2L, 1L, 3L, 2L, 3L, NA, NA, 
NA, NA), .Label = c("0-6", "6.01-15", "15.01-53", "53.01+"), class = "factor"), 
    AHQ_Heading_Male_Quartile = structure(c(NA, NA, NA, 4L, 4L, 
    4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA
    ), .Label = c("0-435", "435.01-861", "861.01-1964", "1964.01+"
    ), class = "factor"), MHQ_Heading_Female_Quartile = structure(c(1L, 
    1L, 2L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    4L, 2L, 3L, 3L), .Label = c("0-2", "2.1-7", "7.1-24", "24.1+"
    ), class = "factor"), AHQ_Heading_Female_Quartile = structure(c(1L, 
    1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    3L, 2L, 2L, 2L), .Label = c("0-108", "108.1-430", "430.1-1278", 
    "1278.1+"), class = "factor"), MHQ_Unintentional_Impacts_Category = structure(c(3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 
    3L, 1L, 3L, 2L), .Label = c("0", "1", "2+"), class = "factor"), 
    AHQ_Unintentional_Impacts_Category = structure(c(3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 
    3L, 3L, 3L), .Label = c("0", "1", "2+"), class = "factor"), 
    ISL_cor = c(27, 33, 30, 29, 28, 28, 29, 33, 28, 32, 27, 25, 
    26, 23, 22, 24, 26, 31, 31, 30), GMCT_mps = c(0.066638, 0.166589, 
    1.06656, 1.83315, 1.39986, 1.366439, 1.732987, 1.89962, 1.599627, 
    1.76649, 1.79964, 1.49935, 0.69993, 0.499967, 1.133031, 1.033575, 
    1.866542, 1.932174, 1.866418, 1.666167), IDN_lmn = c(2.712426, 
    2.666696, 2.691574, 2.60214, 2.66739, 2.666007, 2.683421, 
    2.64958, 2.637634, 2.592968, 2.627395, 2.626299, 2.794555, 
    2.799669, 2.883404, 2.76647, 2.668145, 2.691837, 2.711009, 
    2.665336), ONB_acc = c(NA, 1.570796, 1.570796, NA, 1.2692, 
    1.188903, 1.322056, NA, 1.393086, 1.393086, 1.322056, 1.393086, 
    NA, 1.322056, 1.225939, 1.2692, 1.393086, NA, 1.2692, 1.322056
    ), TWOB_acc = c(1.325818, 1.570796, 1.570796, 1.230959, 1.162158, 
    1.570796, 1.325818, 1.230959, 1.395827, 1.325818, 1.273674, 
    1.395827, 1.021329, 1.230959, 1.273674, 1.325818, 1.230959, 
    1.273674, 1.395827, 1.570796), ISRL_cor = c(9, 10, 12, 11, 
    11, 11, 10, 12, 12, 12, 11, 11, 10, 10, 9, 8, 11, 11, 12, 
    12), PA_Total = c(21, 23, 24, 20, 13, 20, 21, 21, 24, 23, 
    24, 21, 16, 21, 22, 23, 23, 19, 22, 22), Spatial = c(0.451184, 
    0.166667, 0.083333, 0.5, 1.339256, 1.533743, 1.622678, 0.333333, 
    0.166667, 0.416667, 0.333333, 0.25, 0.970857, 1.235702, 2.41257, 
    1.27022, 0.686887, 1.206559, 0.436339, 1.235702), Flanker = c(206.48, 
    106.09, 68.9, 155.93, 102.03, 129.46, 135.85, 89.29, 134.4, 
    125.57, 104.29, 105.33, 182.3, 108.25, 92, 109.75, 93.85, 
    41.9, 75.25, 16.43)), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))


# Start of GEE Modeling --------------------------------------------------------------------------------------------------
covs <- c("Visit_age", "AHQ_YearsSimilarFreq", "AHQ_Out_Games_MainPos", 
         "AlcWeek_Category", "CigsYN", "Concussion", "SchoolYears")
# Fit GEE model
model <- function(new_data, outcome, preds, covs) {
  form = as.formula(paste(outcome, paste(c(preds, covs), collapse = " + "), sep = " ~ "))
  fit <- geeglm(form, family = gaussian(link = "identity"), data = new_data,
                id = Cluster, waves = Visit, corstr = "independence", std.err = "san.se")
  return(fit)
}

run <- function(gender, outcome, preds) {

  # New dataframe to model
  new_data <- data %>%
    arrange(Cluster, Visit) %>%
    filter(Gender == gender) %>%
    select(Cluster, Visit, outcome, preds, covs) %>%
    na.omit() # geeglm only works on complete data

  # First run with all covariates
  fit <- model(new_data, outcome, preds, covs)

  # Retain covariates if p < 0.1
  covs <- tidy(anova(fit)) %>% 
    filter(term %in% covs, p.value < 0.1) %>%
    pull(term)

  #Second run with significant covariates only
  fit <- model(new_data, outcome, preds, covs)
  output <- tidy(fit) %>% 
    filter(str_detect(term, paste(paste0("^", preds), 
                                  collapse = "|")))
  return(list(output, covs))
}
run("Male", "ISL_cor", "MHQ_Heading_Male_Quartile") 
Hank Lin
  • 439
  • 4
  • 12
  • Not absolutely sure, but your difficulty may be that your 'group' variable is of type 'numeric' instead of type 'factor'. Then you would get results for a regression not for a one-way analysis of variance. – BruceET Sep 05 '18 at 17:45
  • 3
    R is object oriented, so the test performed by `anova()` depends also on the class of the model. Anyway, to test simple ANOVA there is also the function `aov()` which is quite handy. – matteo Sep 05 '18 at 18:09
  • This question is probably off-topic because it is about programming rather than statistics, but matteo's comment is pretty critical and that's probably what you are missing. – Bryan Krause Sep 05 '18 at 18:14
  • What exactly R is doing & why depends on your situation, as @matteo notes. Can you say more about your situation, your data, & your model? Without more information, I'm not sure this is answerable. – gung - Reinstate Monica Sep 05 '18 at 18:47
  • @BryanKrause, I worry that this is too unclear to get a correct answer (it could get an answer to a subtly different question, eg). That said, "what specific test", "how to choose", "How does the does the Wald Chi-Square test differ from ANOVA", etc, are all on-topic statistical questions, IMO. – gung - Reinstate Monica Sep 05 '18 at 18:49
  • @gung Yeah, the way I interpret the OP though it seems like they are more confused that anova() in R is doing things that isn't in the documentation for the base anova() function, probably because they don't realize it is being overloaded. – Bryan Krause Sep 05 '18 at 18:52
  • I added more clarification to my comment. If needed, I can also provide the code but that would get messy... – Hank Lin Sep 05 '18 at 19:14
  • 1
    Thanks. Can you say what your variables are? What kind of data is the response? Can you post a small sample from your data? Can you post the output from your model & the `anova()` call? (BTW, eliminating covariates w/ p>0.1 is almost certainly a bad thing to do.) – gung - Reinstate Monica Sep 05 '18 at 19:34
  • Hi gung, I added my code to the post. The predictors are a mix of categorical and continuous variables. Response is continuous. Let me know if you have trouble running the code. I've heard that eliminating covariates with p>0.1 is bad, but that's how I was taught- is there a better way? – Hank Lin Sep 05 '18 at 19:45
  • 1
    In your case, `anova()` calls a class-specific version, as suggested by @matteo, whose code you can read by typing `geepack:::anova.geeglm` at the R command prompt once the package is loaded. (Yes, it is poorly documented.) A quick look at the code suggests that this is a Type I Anova in which terms are evaluated sequentially in order of entry into the formula. See [this page](https://stats.stackexchange.com/q/20452/28500) for an explanation of the different Anova types. Type I can lead to difficulties with inference if you have correlated predictors in your model. – EdM Sep 05 '18 at 20:18
  • @gung. With comments and edits, this clearly become a different question than I intended to answer. I don't see a straightforward answer to the revised question, as I understand it. At least for now, I'm deleting my Answer so it will not distract from getting an answer that helps OP. – BruceET Sep 06 '18 at 00:18
  • @EdM Do you know of any way to run a Type III anova on a gee class object? I have tried both drop1 and Anova() with no success.. – Hank Lin Sep 10 '18 at 20:52
  • I don't have any experience with Type III anova on gee class objects. I would suggest re-writing this question (with a new title and adding the `gee` tag) to ask more specifically about how to do analysis of variance properly on GEE models. The more details you can provide in the question about the specific hypothesis you're testing, the more likely you are to get a useful answer. Also, consider whether a GEE is more appropriate to your problem than is a standard linear mixed model. – EdM Sep 11 '18 at 14:12
  • @EdM I will definitely do that. Is there a way to maybe convert a gee class object to a glm class object or at least extract the necessary sum of squares info to run an anova on? – Hank Lin Sep 11 '18 at 16:13
  • Don't know of any; seems unlikely as there are differences in their approaches. Note that even for standard linear mixed models there is dispute over whether and how to perform significance tests; see [this thread](https://stats.stackexchange.com/q/22988/28500) for instance. I expect that there are similar issues with GEE (I have almost no experience with GEE, just experience trying to figure out R code), which is why I recommend reposting specifically with respect to ANOVA and GEE. And in your revised question ask about proper covariate selection methods for your situation. – EdM Sep 11 '18 at 16:44

0 Answers0