Binomial & numerical variables as dependent and independent + random variable

Question

I am new to statistics and trying to figure out how to analyze my data correctly. I completed a biological study with the following variables (I have converted my binary variables to 1 and 0):

- Type of food (1/0)
- Starvation (1/0)
- Days to hatch (numerical)
- Days to adulthood (numerical)
- Infection (1/0)
- Sex (1/0)
- Production of silk (1/0)
- Weight (numerical)
- Weight silk (numerical)

I proceeded by creating a binary logistic regression model to test for cocoon presence as such:

model1 <- glm(silk ~ diet*starvation*infection.level*Sex, family = binomial(link="logit"), data=cocoon)

I also have a random variable but I was not sure how to add it to the model (see below how I used 'provenience').

Is this correct? Correct me if I am wrong, but I ran a Shapiro-Wilk normality test on the residuals of the model and obtained the following:

data:  model1.res
W = 0.69624, p-value < 2.2e-16

Does that mean my model doesn't fit the data? Lastly, how would I graphically represent the data?

For the numerical data I used the following model:

model2 <- lmer(weight ~ diet*starvation*infection.level*Sex * (1|provenience), data=cocoon, REML = TRUE)

and obtained:

fixed-effect model matrix is rank deficient so dropping 4 columns / coefficients

How many "silk" events of each type do you have? You might be pushing your data too far with all the possible interactions among your 4 binary predictor variables. I wonder whether that might have something to do with the warning message. Also I don't think that there's a reason to suspect that residuals from a logistic regression model would be normally distributed, so the Shapiro-Wilk test wouldn't be appropriate. — EdM, Jan 07 '20 at 18:32
About 100-200 each. That's what I thought as well (for the normality test). But would I need to do it for my second model? the one with "weight". Thanks! — Ares96, Jan 07 '20 at 18:37
Questions: 1) Is provenience also a relevant grouping variable for silk? If so, you would want to use `glmer()`, which has the same syntax structure as `lmer()`. 2) Do you want to run a model with a 4-way interaction? If instead you want to model the predictors such that each is adjusted for the others, you would place a + in between each variable. Either way, in your lmer syntax, replace the last * with a +. — Erik Ruzek, Jan 07 '20 at 23:12

score 1 · Answer 1 · answered Jan 08 '20 at 17:26

Briefly, as noted in comments: You can specify a random effect for logistic regression in the glmer() function in the same way as you did for linear regression in lmer(). Residuals in a logistic regression would not be expected to be normally distributed. (For reference, in linear regression it's good to visualize the residuals as a function of predicted values rather than rely on Shapiro-Wilk.) An introduction to validation of logistic regression models is on this page.
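As a minimal sketch of the two points above, assuming the lme4 package is installed: the data below are simulated stand-ins (the real `cocoon` data are not available here), the number of provenience groups and all effect sizes are invented, and only two of the predictors are used to keep it short.

```r
# Sketch only: simulated data stand in for the asker's `cocoon` data frame.
library(lme4)

set.seed(1)
n <- 400
cocoon <- data.frame(
  diet        = rbinom(n, 1, 0.5),
  starvation  = rbinom(n, 1, 0.5),
  # Hypothetical grouping: 8 provenience levels
  provenience = factor(sample(paste0("site", 1:8), n, replace = TRUE))
)
site_effect <- rnorm(8, sd = 0.5)
site_idx    <- as.integer(cocoon$provenience)

# Invented effects, used only to generate plausible outcomes
eta <- -0.3 + 0.8 * cocoon$diet - 0.6 * cocoon$starvation + site_effect[site_idx]
cocoon$silk   <- rbinom(n, 1, plogis(eta))
cocoon$weight <- 50 + 5 * cocoon$diet - 4 * cocoon$starvation +
  2 * site_effect[site_idx] + rnorm(n, sd = 3)

# The random intercept is added with "+", the same way in glmer() and lmer()
m_silk <- glmer(silk ~ diet + starvation + (1 | provenience),
                family = binomial(link = "logit"), data = cocoon)
m_weight <- lmer(weight ~ diet + starvation + (1 | provenience),
                 data = cocoon, REML = TRUE)

# For the linear mixed model, look at residuals against fitted values
plot(fitted(m_weight), resid(m_weight),
     xlab = "Fitted weight", ylab = "Residual")
abline(h = 0, lty = 2)
```

A roughly even scatter around the dashed line, with no funnel or curve, is what one would hope to see for the weight model; no such plot is meaningful for the raw 0/1 residuals of the silk model.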

With 4 binary predictors in the logistic regression model (ignoring the random effect for now) you have only 16 possible combinations of their values, so showing the numbers or fractions of outcomes in some type of tabular display of the predictor values could be a useful representation.
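One way such a table could be built in base R is sketched here. The data are simulated, with the column names taken from the question's formula, so the counts shown are not the asker's.

```r
# Sketch only: simulated stand-in for the asker's data.
set.seed(2)
n <- 400
cocoon <- data.frame(
  diet            = rbinom(n, 1, 0.5),
  starvation      = rbinom(n, 1, 0.5),
  infection.level = rbinom(n, 1, 0.5),
  Sex             = rbinom(n, 1, 0.5)
)
# Invented effects, used only to generate a 0/1 outcome
cocoon$silk <- rbinom(n, 1, plogis(-0.3 + 0.8 * cocoon$diet - 0.6 * cocoon$starvation))

# One row per observed combination of the four binary predictors:
# number of animals, number producing silk, and the fraction producing silk
tab <- aggregate(
  silk ~ diet + starvation + infection.level + Sex,
  data = cocoon,
  FUN  = function(x) c(n = length(x), events = sum(x), fraction = mean(x))
)
tab <- do.call(data.frame, tab)  # flatten the matrix column into 3 columns
print(tab)
```

Cells with very few animals stand out immediately in the `silk.n` column, which also helps in judging which interactions the data can support.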

Additionally, I see a few issues here based on what's already been presented in the question; more might become obvious if a link to the data becomes available.

First, you might be in danger of over-fitting your logistic regression model with only 100-200 in each of the "silk" event categories. The usual rule of thumb in logistic regression is to evaluate no more than 1 predictor per 15 or so cases in the minority class, unless you are using some method like ridge regression that penalizes regression coefficients. In this context, what counts as a predictor is a binary or continuous variable, each level of a categorical variable beyond the first, and all interactions specified among them in the model.

If the minority class has only 100 cases, you are limited to about 6 or 7 predictors. Your model, however, includes not only the 4 binary predictors but also all 11 possible interactions among them, and incorporation of provenience as a random effect represents at least 1 additional predictor. So unless you can collect more data you need to cut back on the interactions evaluated. As noted in a comment, replacing the "*" operators with "+" in the formulas would restrict analysis to individual effects. If there are specific interactions that you think need to be included in the model based on your understanding of the subject matter, you can denote specific interactions with the ":" operator.
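The counting above can be checked directly from the formulas, without any data. In this sketch the third formula, with a single diet:starvation interaction, is only a hypothetical illustration of the ":" operator, not a recommendation for this study.

```r
# Count the fixed-effect terms implied by each formula.
# With all-binary predictors, each term corresponds to one coefficient.
count_terms <- function(f) length(attr(terms(f), "term.labels"))

full     <- silk ~ diet * starvation * infection.level * Sex
additive <- silk ~ diet + starvation + infection.level + Sex
# Hypothetical choice: main effects plus one specific interaction
chosen   <- silk ~ diet + starvation + infection.level + Sex + diet:starvation

n_full     <- count_terms(full)      # 4 main effects + 11 interactions
n_additive <- count_terms(additive)  # main effects only
n_chosen   <- count_terms(chosen)    # main effects + 1 interaction

# Rule of thumb: about 15 minority-class cases per predictor
minority_cases <- 100
budget <- minority_cases / 15

print(c(full = n_full, additive = n_additive, chosen = n_chosen))
print(budget)
```

The full model asks for more than twice as many coefficients as the rule of thumb allows with 100 minority-class cases, while the additive model fits within it.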

Second, the warning from the second model suggests that some predictors can be expressed as linear combinations of the others in your data set. One place this might be happening is in the combination of diet and starvation: how do you code diet in cases subjected to the starvation treatment? The full data might suggest other sources of this problem.
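To illustrate how such a coding could produce the warning, here is a small base-R sketch. It assumes, purely hypothetically, that starved animals always have diet recorded as 0; whether the real data look like this is unknown.

```r
# Sketch only: simulated data under an assumed coding scheme.
set.seed(3)
n <- 200
starvation <- rbinom(n, 1, 0.5)
# Hypothetical coding: starved animals get no food, so diet is recorded as 0
diet <- ifelse(starvation == 1, 0, rbinom(n, 1, 0.5))
d <- data.frame(diet, starvation)

# The cross-tabulation shows an empty cell (diet = 1, starvation = 1)
print(table(diet = d$diet, starvation = d$starvation))

# Compare the rank of the design matrix with its number of columns
X      <- model.matrix(~ diet * starvation, data = d)
rank_X <- qr(X)$rank
print(c(columns = ncol(X), rank = rank_X))

# Columns that cannot be estimated; lmer() would drop this many
n_dropped <- ncol(X) - rank_X
print(n_dropped)
```

Running the same `table()` and `qr()` checks on the real predictors, pair by pair, should show which combinations are empty and therefore which interaction columns are being dropped.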

Third, I count 5 outcomes in your data; the associations among those outcomes might also be of interest. Your modeling approach doesn't seem to be taking this into account. As you seem to be in an academic or agricultural-research setting, there should be local statistical expertise available to help you with this. There's a limit to the type and amount of help that can be provided in a forum like this. Working directly with a statistician will be in the best long-term interest both of your project and of your starting to learn experimental design and statistical analysis.
