1

I'm trying to make a GLM with binomial distribution to evaluate the effect of the diameter at breast height and the occupation of tree termites in species of tree.

I'm using the following model:

model <- glm(ocupation ~ dap * specie, data = termites, family = binomial(link = logit))

My response variable is the occupation of termites on the tree (data of absence and presence); the independent variable is the dap and the interaction with kind of species.

But when I run the model the following errors appear:

Warning message:
glm.fit: odds adjusted numerically 0 or 1 occurred

Error in seq.default(1, length(levels(eval(as.name(qvar)))) - 1, by = 1) : 
  wrong sign in 'by' argument


Error in solve.default(t(X) %*% W %*% X) : 
  system is computationally singular: reciprocal condition number = 1.93217e-38
> 

Warning messages:
1: not plotting observations with leverage one:
  10, 24, 26, 36, 37, 40, 43, 57, 71, 72, 73, 116, 138 
2: not plotting observations with leverage one:
  10, 24, 26, 36, 37, 40, 43, 57, 71, 72, 73, 116, 138 
3: In sqrt(crit * p * (1 - hh)/hh) : NaNs produced
4: In sqrt(crit * p * (1 - hh)/hh) : NaNs produced

The same thing happens when I rotate the contrasts in the analysis

Error in seq.default(1, length(levels(eval(as.name(qvar)))) - 1, by = 1) : 
  wrong sign in 'by' argument

Can someone help me? I understand very little about R.

conjugateprior
  • 19,431
  • 1
  • 55
  • 83
  • 1
    Could you say some more about how many cases there are with 'occupation' for each species (or, if 'occupation' is more frequent, how many lack 'occupation')? This type of result could either be something fairly simple (like not enough data for this type of analysis) or something more subtle like the [Hauck-Donner effect](https://stats.stackexchange.com/q/45803/28500). – EdM Nov 04 '17 at 20:34
  • I believe @EdM probably has the correct diagnosis. – conjugateprior Nov 05 '17 at 03:53
  • 1
    If `dap` is diameter, as @EdM thinks, shall we assume that the research questions are something like 'what is the effect on infestation probability of having a larger diameter?' (a no interaction model), and then 'what is the increase of probability, if any, from what we would expect based on diameter due to being a particularly tasty or untasty species?' (an interaction model)? – conjugateprior Nov 05 '17 at 17:04
  • @EdM, Yes. The DAP mean diameter. I have 45 cases with occupation and 141 cases without occupation. I think my sample is good, but i'm able to cntinue the analysis. I was doing it that way as you said, seeing the model first without interaction and then if there is to do with the interaction. But the reviewer said that the correct thing would be to evaluate everything in the same model, with DAP interacting with the plant species. – Jakelyne Suélen Bezerra Nov 06 '17 at 13:40
  • How many different species? Do any species either have 0 occupation or 100% occupation? – EdM Nov 06 '17 at 14:03
  • @EdM, I have 28 different species. I have several with 0 occupation. In many cases for both occupation and non-occupation, I have species with few individuals. For example, the species Albizia myiopoides had only two individuals and in no two had occupation of termite mound. – Jakelyne Suélen Bezerra Nov 06 '17 at 14:22
  • OK, so as per my answer below, your options are a) put more information into the model e.g. use a regularized logistic regression model like [`logistf`](https://CRAN.R-project.org/package=logistf) to provide extra information about the range of plausible coefficients, or b) restrict your data to only the species that have enough information to fit your current model. – conjugateprior Nov 06 '17 at 15:02
  • @conjugateprior, whats would be the amount of extra informations for each specie? – Jakelyne Suélen Bezerra Nov 06 '17 at 15:08
  • Hard to say without pointing at the math, e.g. [here](https://cemsiis.meduniwien.ac.at/fileadmin/msi_akim/CeMSIIS/KB/volltexte/Heinze_Schemper_2002_Statistics_in_Medicine.pdf). logistf adds a Jeffreys invariant prior to all the coefficients that has the effect of damping them so that no logits end up going to infinity and the model can be estimated. – conjugateprior Nov 06 '17 at 15:47
  • But, do consider my suggestion b. above, and @EdM's below. If you can group or restrict the species (in ways that make biological sense) then you'll give the model an easier task. – conjugateprior Nov 06 '17 at 15:49

2 Answers2

4

Most likely you have a problem of perfect separation, meaning there are combinations of dap and specie that have all ocupation = 1 or all ocupation = 0. In the first case, the value of $$ \log \frac{p(\texttt{ocupation}=1)}{p(\texttt{ocupation}=0)}, $$ (the quantity that the linear part of your model is attempting to fit) that makes the data most likely is positive infinity (and in the second case, negative infinity). Since this is silly, the model has objected.

This problem occurs because although you know that the coefficients cannot really be large enough to predict such huge values for the logit, a model fit with maximum likelihood does not.

The second answer here lays out your alternatives.

So much for the statistics. Here are some substantive suggestions:

  1. Can you get a more informative dependent variable? Are termites an all or nothing affair, or are some infestations larger than others? Thinking of this quantity as a magnitude rather than presence/absence often opens up other more stable model choices.
  2. Perhaps one or other of specie and dap is actually the driving force. How does the model look if you remove the interaction between them? I would guess that it works better.
  3. We don't know what dap actually is, but if it is some feature of the environment and the 'species packing' assumption is approximately true, then specie will have a preferred range of dap. And if that is true, an interaction will never work well in a binary data context because there will always be missing combinations of variables.
conjugateprior
  • 19,431
  • 1
  • 55
  • 83
  • +1 I infer, however, from the question that `dap` is tree "diameter at breast height," which would be a continuous covariate. In that case evaluating the interaction would be important. – EdM Nov 05 '17 at 09:49
  • Ah, that sounds plausible. So there's probably a colinearity problem in there as well, if it's the case that species grow to species-typical ranges of diameters. – conjugateprior Nov 05 '17 at 16:54
2

Just need to add a bit more to the answer by @conjugateprior than can fit into a comment easily.

The usual rule of thumb for logistic regression is that you need about 15 events per predictor variable that you are considering in the model. For this purpose, "events" are the least frequent of occupied/not occupied, so that's 45 events. Also, all except one of your species counts as a separate predictor, so in the interaction model with 28 different species and the continuous dap predictor you are trying to evaluate 28 individual predictors PLUS 27 interactions between dap and species, a total of 55 predictors. So you don't even have 1 event per variable. Regularization, as suggested by @conjugateprior in a comment, might help, but at this point it seems that you really don't have enough information to proceed this way. Think carefully about what you are trying to accomplish; might there be some way to combine species (e.g., into genus) in some way to cut your number of predictors down? (That combination of species into groups, however, would need to be done without reference to the occupancy data.)

EdM
  • 57,766
  • 7
  • 66
  • 187
  • +1 Grouping in a biologically reasonable manner is perhaps the best simple idea - essentially it's parameter tying for the ungrouped data, so one could also loosen the coupling by switching to random effects. – conjugateprior Nov 06 '17 at 15:53
  • @conjugateprior I like your idea of treating the species as random effects, if that makes sense in terms of the hypothesis being examined; sounds like the interaction of `dap` with species would still have to be included, but that interaction would also then be a random effect. – EdM Nov 06 '17 at 16:26
  • Yes, @EdM, I can try to sort species together by gender or family, that's a good idea. But in case I continue to evaluate everything together in the same model or would be a model for each set of DAP per family? – Jakelyne Suélen Bezerra Nov 06 '17 at 17:06
  • @JakelyneSuélenBezerra, the problem is that _each_ model needs to have about 15 events per predictor, so your 45 total events limit you to only 3 predictors (including interactions). If your primary interest is in the effect of `dap` on occupancy, you might succeed with a single mixed model that allows for random effects of species (or probably better, groups of species, as species with 0 events might lead to problems) and an interaction between `dap` and the (random-effects) groups. Using random, not fixed, effects would greatly cut down on the effective number of predictors in the model. – EdM Nov 06 '17 at 18:18
  • @EdM, how can I assemble a mixed model with each DAP data and plant group? I still do not know how to make mixed models. – Jakelyne Suélen Bezerra Nov 06 '17 at 19:03
  • @JakelyneSuélenBezerra, before you try to use mixed models you should make sure to try to understand them. There are many threads with the `[mixed-model]` tag on this site. As you are using R, the `glmer()` function in the `lme4` package should do what you need. Note that hints on how to write mixed models for `lmer()` also apply to `glmer()`, as the latter is simply the extension of `lmer()` linear mixed models to generalized models like logistic. I think that something like `~ dap +(dap|plant_group)` for the predictors would do what you want, but I haven't thought it through completely. – EdM Nov 06 '17 at 22:54
  • @EdM, I'm going to try to run a mixed model using the species as a random factor. Thank you all for the help. – Jakelyne Suélen Bezerra Nov 07 '17 at 01:57
  • @JakelyneSuélenBezerra Our pleasure. Also, since we're on a Stack Exchange, you might want to 'accept' an answer if you thought it was helpful to you. Have fun with random effects... – conjugateprior Nov 07 '17 at 18:49