Logistic regression to predict the fraction of small fish among all fish

Question

My question is about which logistic regression model best fits my goal.

My data set contains 641 rows, each of which is one sample with several independent variables (continuous, nominal and ordinal). However, I'm a bit confused about how to classify my response variable. The response variable is constructed as follows:

N-breams (length class 16-40cm) /
  (N-breams (length class 16-40cm) + N-breams (length class 40cm+))

This results in a response variable in the range 0-1, where values higher than 0.5 mean more breams of length class 16-40cm than of 40cm+, and vice versa.

In a normal aquatic system the ratio should be higher than 0.5 (or even 1.0); however, this isn't always the case, and the ratio can be lower than 0.5 (or even 0.0). I'm interested in which environmental variables influence this ratio.

So, initially I thought of a binomial distribution, which looks like this in R (using GLM or GLMM):

glm(y ~ x1 + x2 + x3, family = binomial)

The output predicts a probability (0-1) as a function of the significant independent variables. This is the part where I get confused. Since the 0.5 value is a "tipping point", does every predicted/fitted value (from the output) lower than 0.5 mean more 40cm+ breams than 16-40cm breams? Or are we talking about chances, so that a 0.5 value is a 50% chance?

Question

So my real question is whether the predicted values are chances (%) or remain ratio values (predicted as with the output of a Poisson or normal model). I'm almost certain it is the latter, but somehow I'm still in doubt.

The output of logistic regression is probabilities (see http://stats.stackexchange.com/questions/227009/logistic-regression-how-to-call-the-output/228212#228212 ), and using the rule $\hat y > 0.5$ can be misleading; check: http://stats.stackexchange.com/questions/127042/why-isnt-logistic-regression-called-logistic-classification — Tim, Sep 28 '16 at 10:56
Would beta regression be another option? It is available in R. — mdewey, Sep 28 '16 at 11:22
Agree with @mdewey, beta regression is better. For logistic regression you input counts of trials (successes/failures), so in effect the model you are fitting has 1 row per fish. For beta regression each row truly represents a fish sample count (e.g. one lake), i.e. there is a difference between 2 rows of 20 and 40 and 10 rows of 2 and 4. — seanv507, Sep 28 '16 at 11:51
Thanks for the quick response! This is exactly what I'm looking for. Instead of predicted probabilities as a logistic regression will provide, a beta regression will produce a predicted ratio. Also found a great paper about beta regression: [https://www.google.nl/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwiTyaaqhbLPAhXGrxoKHcRnDMcQFggeMAA&url=https%3A%2F%2Fcran.r-project.org%2Fweb%2Fpackages%2Fbetareg%2Fvignettes%2Fbetareg.pdf&usg=AFQjCNF2yuOADiTPpZVJ9zz36ID3A7cQng&sig2=hp_MmAd8xcukpdfWbl-7DQ] — Mark, Sep 28 '16 at 12:16
However, I'm still wondering how to deal with this 0.5 threshold. Can I still use this threshold? Can I find the value of the independent variable at which the predicted ratio drops below 0.5? — Mark, Sep 28 '16 at 12:26
I would have thought you can put the threshold anywhere you want that makes scientific sense. — mdewey, Sep 28 '16 at 12:40
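As an illustration of how a threshold maps back to a predictor value, assume a logistic model with a single continuous predictor $x$ and fitted coefficients $\hat\beta_0$, $\hat\beta_1$ (a simplification; with several predictors the others have to be held at fixed values). The predictor value at which the fitted proportion crosses a threshold $t$ follows from inverting the logit link:

```latex
\hat p(x) = \frac{1}{1 + e^{-(\hat\beta_0 + \hat\beta_1 x)}} = t
\;\iff\;
\hat\beta_0 + \hat\beta_1 x = \log\frac{t}{1 - t}
\;\iff\;
x^{*} = \frac{\log\frac{t}{1 - t} - \hat\beta_0}{\hat\beta_1},
\qquad
t = 0.5 \;\Rightarrow\; x^{*} = -\frac{\hat\beta_0}{\hat\beta_1}.
```

For $t = 0.5$ the log-odds are zero, so the crossing point depends only on the ratio of the two coefficients.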
You can still do logistic regression if you assume that in a given set of conditions, there is a given probability for a bream to be N+ or N- and that each bream is either N+ or N- independently of all the others, e.g. with no interactions among each other that drive up or down N+ or N- (whether that's reasonable or not is up to your expertise). Then you could just make a bigger dataset with N+ entries with y=1 and N- entries with y=0 for each row. — jwimberley, Sep 28 '16 at 12:52
@jwimberley, I also thought about dividing my response variable into y=1 (>0.5) and y=0 (<0.5). But then I would lose information on the gradient between 0 and 1 (which is also important!) — Mark, Sep 28 '16 at 13:07
@Mark This isn't what I'm suggesting, though. I'm suggesting that each separate bream fish becomes a separate row in the dataset with a value y=0 or y=1, so that you have $\sum_{i=1}^{641} N_i$ rows. *If* the assumptions mentioned above are reasonable, this would also give what you want. There are a few other advantages as well that won't fit in a comment. — jwimberley, Sep 28 '16 at 13:36
@jwimberley +1 to your suggestion, but one does not need to manually increase the size of the dataset; for example in R, `glm` function can deal with proportion data if supplied by the `weights` argument. Mark, this is **not** a beta regression, it is logistic regression. See here: http://stats.stackexchange.com/questions/26762 Your Q might be a duplicate. — amoeba, Sep 28 '16 at 13:39
@amoeba Thanks -- though I think it must be done manually because the row isn't just being duplicated identically, but its shape is being changed as well: a row with ``N- = 5, N+ = 10`` becomes 5 rows with `y=0` and 10 rows with `y=1`. — jwimberley, Sep 28 '16 at 13:40
@jwimberley Depends on the software I guess, but in R you don't need to replicate anything manually. See http://stats.stackexchange.com/questions/26762. — amoeba, Sep 28 '16 at 13:41
@amoeba Ah, thanks; I'd forgotten that fitting to the two-column result matrix was possible, too. — jwimberley, Sep 28 '16 at 13:55
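A minimal sketch of the three equivalent ways of fitting this model in R that are discussed above. The data frame `d` and its columns `x1`, `small` and `large` are made-up illustrations, not the data from the question:

```r
# Hypothetical toy data: one row per sample, counts of small and large bream
d <- data.frame(
  x1    = c(0.2, 0.5, 0.9, 1.4, 1.8, 2.3),
  small = c(12, 9, 10, 6, 4, 2),    # N-breams, length class 16-40cm
  large = c(3, 5, 8, 9, 11, 14)     # N-breams, length class 40cm+
)
d$total <- d$small + d$large
d$y     <- d$small / d$total        # the ratio used as response in the question

# (1) two-column response: (successes, failures)
m_cbind <- glm(cbind(small, large) ~ x1, family = binomial, data = d)

# (2) proportion as response, total count per row as weights
m_wt <- glm(y ~ x1, family = binomial, weights = total, data = d)

# (3) one row per fish: y = 1 for a small bream, y = 0 for a large one
long <- data.frame(
  x1 = rep(rep(d$x1, 2), times = c(d$small, d$large)),
  y  = rep(c(1, 0), times = c(sum(d$small), sum(d$large)))
)
m_long <- glm(y ~ x1, family = binomial, data = long)

# The three fits give the same coefficient estimates (up to numerical tolerance)
rbind(cbind = coef(m_cbind), weights = coef(m_wt), long = coef(m_long))
```

Forms (1) and (2) keep one row per sample, so nothing has to be replicated by hand; form (3) is shown only to make the equivalence visible.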
@jwimberley agreed you can 'use' logistic regression (that's what I said) - but that is not what Mark wants to model - he wants to model aquatic systems not individual fish. — seanv507, Sep 28 '16 at 14:22
@amoeba, if I apply a logistic regression, wouldn't that provide predicted probabilities instead of predicted ratios? Why wouldn't it be a beta regression (or zoib)? — Mark, Sep 28 '16 at 14:28
@Mark The probability tells you the ratio: if $p_+$ is the probability of being $N_+$ then $R = p_+/(1-p_+)$ is the predicted ratio. Perhaps beta regression gives a more unbiased estimate of $R$; is this the motivation for it? But the probability is still just as good of a descriptor of the population. — jwimberley, Sep 28 '16 at 14:33
Ah okay, thanks! So that answers my initial question. However, I'm still confused about applying the `weights` argument in a `GLM`, since the ratio is generated by a/(a+b). Would the "total" then be a+a+b (simply put)? — Mark, Sep 28 '16 at 14:42
Mark, the weight is the total count, i.e. a+b. Your response is the probability of observing a. If the probability is 0.6, then the ratio would be around 0.6. There is no conflict between "ratio" and "probability", it's the same thing. Predicted probability = predicted ratio. — amoeba, Sep 28 '16 at 14:57
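To make the "predicted probability = predicted ratio" point concrete, here is a small sketch on made-up counts (the data frame `d` and its columns are hypothetical). `predict(..., type = "response")` returns the fitted value of small/(small+large), and dividing it by its complement gives the fitted odds small/large mentioned earlier in the thread:

```r
# Hypothetical toy data: counts of small (16-40cm) and large (40cm+) bream
d <- data.frame(
  x1    = c(0.2, 0.5, 0.9, 1.4, 1.8, 2.3),
  small = c(12, 9, 10, 6, 4, 2),
  large = c(3, 5, 8, 9, 11, 14)
)

m <- glm(cbind(small, large) ~ x1, family = binomial, data = d)

p_hat    <- predict(m, type = "response")  # fitted small / (small + large)
odds_hat <- p_hat / (1 - p_hat)            # fitted small / large

# Observed ratio next to the fitted ratio and the fitted odds
data.frame(
  x1       = d$x1,
  observed = d$small / (d$small + d$large),
  fitted   = p_hat,
  odds     = odds_hat
)
```

Fitted values below 0.5 correspond to samples where the model expects more 40cm+ bream than 16-40cm bream.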
@seanv507: The beta regression model doesn't seem very appropriate for counted fractions: the response takes discrete values; & a sample count of 2 small fish & 4 big ones *should* have less weight than one of 20 small fish & 40 big ones. If there are multiple samples taken per aquatic system (& the question isn't very clear on that point) a hierarchical model might be a good idea, or over-dispersion might be allowed for with a quasi-likelihood approach. — Scortchi - Reinstate Monica, Sep 28 '16 at 16:11
@scortchi - you might be right; I just came across beta regression recently. the discrete value issue is a problem, but what I was imagining was a variable dispersion model where the number of fish was an independent variable. [ so that's what I am claiming is the difference to the logistic regression model - we can distinguish between 10 samples of aquatic systems with 50 fish in and 1 sample with 500 fish ie we distinguish the number of independent measurements]. — seanv507, Sep 28 '16 at 16:52
However, the data types that can be modeled using beta regressions also encompass proportions of “successes” from a number of trials, if the number of trials is large enough to justify a continuous model. In this case, beta regression is similar to a binomial generalized linear model (GLM) but provides some more flexibility – *in particular when the trials are not independent*. In such a situation, the fixed dispersion beta regression is similar to the quasi-binomial model but fully parametric. Furthermore, it can be naturally extended to variable dispersion [quoted from the betareg R vignette] — seanv507, Sep 28 '16 at 17:04
Adding the `weights` argument produces a much better fit for the model, great! Quick question: normally I compare models based on AIC values. However, no AIC values are produced with a quasi-binomial distribution. I understand that there are ways to produce AIC values manually, however this is (apparently) debatable. Are there other ways to compare the goodness of fit? — Mark, Sep 29 '16 at 07:46
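One commonly used option for nested quasi-binomial models is an F test on the change in deviance, scaled by the estimated dispersion. The sketch below uses made-up counts and hypothetical predictors `x1` and `x2`; it illustrates the mechanics only and is not a recommendation for this particular data set:

```r
# Hypothetical toy data: counts of small and large bream plus two predictors
d <- data.frame(
  x1    = c(0.2, 0.5, 0.9, 1.4, 1.8, 2.3, 2.7, 3.1),
  x2    = c(1.0, 0.3, 0.8, 0.1, 0.9, 0.4, 0.7, 0.2),
  small = c(14, 7, 12, 5, 6, 2, 5, 1),
  large = c(3, 6, 6, 10, 9, 15, 11, 16)
)

# Nested quasi-binomial fits; the dispersion is estimated rather than fixed at 1
m0 <- glm(cbind(small, large) ~ x1,      family = quasibinomial, data = d)
m1 <- glm(cbind(small, large) ~ x1 + x2, family = quasibinomial, data = d)

summary(m1)$dispersion     # estimated dispersion parameter
AIC(m1)                    # NA: quasi families have no likelihood-based AIC

# F test for the nested models (does adding x2 reduce the deviance enough?)
anova(m0, m1, test = "F")
```

This only compares nested models; for non-nested candidates a different criterion would be needed.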
@seanv507: Yes, I see what you mean - & with large sample sizes the intra-sample variability might well be negligible compared to inter-sample variability. (Sounds rather like the old arcsine transformation.) — Scortchi - Reinstate Monica, Sep 29 '16 at 08:31
@Mark: I'm still not sure exactly what this question was asking; it'd be better to ask further questions as questions rather than stringing out the comment thread even more (& search our site to see if they've already been answered). For advice on what kind of models might be appropriate you'll need to give rather more context than you've given here. — Scortchi - Reinstate Monica, Sep 29 '16 at 08:44