
My data contain a bounded continuous variable (a score between 0 and 10) representing the efficacy of a given method for controlling an invasive species. As there are more high scores than low ones, the variable is skewed to the left. My purpose is to identify which independent variables explain this dependent variable; in other words, I would like to know which variables affect the efficacy of the method.

According to some answers I found on CV (e.g. here), a model with a beta-distribution family could be used to model the relationship between my variables. However, considering that my dependent variable is fairly skewed and that I have a small sample size ($n = 98$), I was wondering:

  1. Would a model with a beta-distributed response be the most appropriate option here? (Provided that I transform my DV so that it lies between 0 and 1, with no exact 0s and no exact 1s, right? A sketch of what I have in mind follows below.)
  2. Is there a rule of thumb for the minimum number of observations per predictor that can be included in this type of model?
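For reference, here is a sketch of the kind of beta regression I have in mind, using the `betareg` package; `dat`, `x1`, and `x2` are placeholders for my data frame and candidate predictors:

```r
library(betareg)

## Hypothetical data frame `dat` with the 0-10 efficacy score and candidate predictors x1, x2.
## Rescale the score to (0, 1) and shrink it away from the boundaries
## (Smithson & Verkuilen, 2006), since the beta density excludes exact 0s and 1s.
n <- nrow(dat)
dat$score01 <- (dat$score / 10 * (n - 1) + 0.5) / n

## Beta regression with a logit link for the mean
fit_beta <- betareg(score01 ~ x1 + x2, data = dat)
summary(fit_beta)
```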
Fanfoué
  • I don't follow why small sample size would be any more (or any less) of a problem with your proposed approach than with any other. If the limits of 0 and 10 are attainable in principle, and certainly if either limit is attained in practice, I would tend to use a binomial family as a reference here. Just make sure that your software uses suitable standard error and P-value calculations. When you say "continuous", does that mean that _any_ non-integer score between 0 and 10 is possible? – Nick Cox Jan 06 '21 at 14:30
  • Just a small comment about terminology: a beta regression is not a special case of a generalized linear model (GLM) *sensu stricto.* More on that [here](https://stats.stackexchange.com/questions/304538/why-beta-dirichlet-regression-are-not-considered-generalized-linear-models). – COOLSerdash Jan 06 '21 at 14:43
  • @NickCox Thank you for your comment. Yes, any score between 0 and 10 is possible, but 0s are quite rare in practice. I mentioned my sample size with regard to my 2nd question because I think that minimum sample sizes per predictor vary depending on the type of model used: e.g. with _n_ = 98, you could include more predictors in a model with a normal Y than with a binomial Y. Anyway, I'm working with R; would you have any advice regarding the functions and packages I should use in this case? – Fanfoué Jan 06 '21 at 16:21
  • Nothing from me on R advice. I am not even a routine R user. If you post some sample data I could show some Stata code and you'd be likely to get good advice on R equivalents. – Nick Cox Jan 06 '21 at 17:14
  • @NickCox I will post my data as soon as I can (I'm still wrangling them). I'm asking this question in advance to know where I should start. Back to your comment: does that mean that some software uses inappropriate SE and p-value calculations by default when running logistic regressions? – Fanfoué Jan 06 '21 at 20:29
  • I am strongly familiar only with `glm` in Stata. There, a binomial family (usually with logit link) must be specified with the option `vce(robust)` to get half-decent standard errors if the response variable is really continuous. I would be amazed if any other implementation didn't require some such setting. So-called robust standard errors are often named for one or more of Eicker, Huber or White, who all have some claim to publicity on that score. Sandwich is yet another name. – Nick Cox Jan 06 '21 at 20:41
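A rough R analogue of that Stata setup, sketched here with a hypothetical data frame `dat` and predictors `x1`, `x2` (the `sandwich`/`lmtest` combination is one common way to obtain Eicker-Huber-White standard errors in R):

```r
library(sandwich)
library(lmtest)

## Hypothetical data: `dat` holds the 0-10 score and candidate predictors x1, x2.
## Rescale the score to a proportion in [0, 1]; exact 0s and 1s are allowed here.
dat$prop <- dat$score / 10

## "Fractional logit": binomial family with a logit link applied to a continuous
## proportion; quasibinomial avoids the non-integer-response warnings of
## family = binomial and gives the same point estimates.
fit_frac <- glm(prop ~ x1 + x2, family = quasibinomial(link = "logit"), data = dat)

## Sandwich (Huber-White) standard errors, the analogue of Stata's vce(robust)
coeftest(fit_frac, vcov. = vcovHC(fit_frac, type = "HC0"))
```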

1 Answer


It seems that an ordinal model makes more conceptual sense here. A cumulative logit model is a standard choice, as is an ordinal logit model (there are slight differences between the approaches).
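Since the question mentions R, a minimal sketch of such a model with the `rms` package (hypothetical data frame `dat` with the raw score and two candidate predictors `x1`, `x2`) could look like this:

```r
library(rms)

## Hypothetical data: `dat` holds the raw 0-10 score and candidate predictors x1, x2.
## orm() fits a cumulative-link model (proportional odds by default) and treats the
## score as ordinal, so the response needs no rescaling or binning.
dd <- datadist(dat)
options(datadist = "dd")
fit_orm <- orm(score ~ x1 + x2, data = dat)
fit_orm
```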

In Harrell's *Regression Modeling Strategies* (section 4.4, p. 73), the author presents a few considerations for sample size. He says that the number of variables in the model should be constrained by a limiting sample size $m$: the number of parameters estimated, $p$, should be less than $m/20$.

Just for clarification: in the case of a continuous outcome, $m$ is the total sample size, so with 100 observations a model should have at most $100/20 = 5$ estimated parameters. Remember that a categorical predictor with 7 categories entails $7 - 1 = 6$ estimated parameters.

For the case of an ordinal outcome, the limiting sample size $m$ is given by:

$$ n - \frac{1}{n^2} \sum_{i=1}^{k} n_i^3 $$

where there are $k$ categories in your outcome and $n_i$ is the sample size of category $i=1,...,k$.
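As an illustration, $m$ can be computed directly from the category counts; here is a small R sketch (with a hypothetical grouped outcome column `score_cat`):

```r
## Hypothetical sketch: limiting sample size m from the category counts of the outcome.
n_i <- table(dat$score_cat)          # n_i: number of observations in category i
n   <- sum(n_i)
m   <- n - (1 / n^2) * sum(n_i^3)    # limiting sample size for an ordinal outcome
m / 20                               # rough upper bound on the number of parameters p
```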

Here is a good tutorial for ordinal response modeling in a Bayesian setting.
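If you go the Bayesian route, a minimal `brms` sketch along those lines (hypothetical variable names, default priors kept for brevity) would be:

```r
library(brms)

## Hypothetical sketch of a Bayesian cumulative (ordinal) logit model.
dat$score_ord <- factor(dat$score_cat, ordered = TRUE)
fit_bayes <- brm(score_ord ~ x1 + x2,
                 family = cumulative(link = "logit"),
                 data = dat)
summary(fit_bayes)
```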

Guilherme Marthe
  • The OP's description of their measurements leads me to regard this as too pessimistic. – Nick Cox Jan 06 '21 at 17:17
  • True! But we don't know how many covariates he is considering. Also, if some categories have a very low sample size, he could merge them with adjacent categories. The divisor applied to the limiting sample size (20 above) can be relaxed to 15 or even 10 to be a bit more permissive. – Guilherme Marthe Jan 06 '21 at 17:30