Decide which distribution function to use in GLM for a complex response variable?

Question

I have been searching for similar questions for the past few weeks with no luck, so I hope someone can help.

As part of my research, I'm trying to find the factors that most affect weed infestation in agricultural fields in different growing regions. Each field was sampled on a fixed grid and the infestation level (0-10, ordinal) was evaluated at each node. To characterize the overall severity of weed infestation in a field, I created an index (SV) to be used as the response variable in a GLM:

SV = ML*Mp + b*HL*Hp

where ML is the proportion of the total field area with medium infestation levels (2-5), Mp is the number of patches with medium infestation, b is a factor (integer), HL is the proportion of the total field area with high infestation levels (6-10), and Hp is the number of patches with high infestation. My question is: how do I choose the most appropriate distribution for this response variable? SV combines proportions (share of the field area, binomial-like) with discrete counts (number of patches, Poisson-like). Here is a partial data set of 59 fields (from one year of the research):

SV=c(0.316768323167683,5.34683468346835,19.3534353435344,8.4823,8.46724017205162,6.39245698279312,7.78955791158232,4.87484993997599,9.85118511851185,6.62682536507301,5.23447655234477,6.66360459311033,2.7887943971986,1.15398460153985,4.55078984203159,4.06780678067807,5.888966690007,0.770522947705229,0.148014801480148,26.274376189999,6.55432283858071,0.664965986394558,34.8225645129026,3.01179882011799,1.69233846769354,0.43328335832084,3.60030015007504,7.78377837783778,20.4363690892732,3.263723150358,7.77121697357886,5.64927014597081,6.24397560243976,6.19445833750125,4.10436261757054,6.20895820835833,1.4235,12.78118752499,2.77122287771223,5.0977097709771,0.200620062006201,11.7252824152754,2.59514534012586,0.766923307669233,5.0961903809619,16.0358,17.6547,3.42228445689138,0.8412047228337,3.44388877775555,10.1805,3.18133626725345,6.65284300989308,1.63622724544909,4.48490301939612,2.79158757118593,5.10215913634546,4.05363217930758,4.80845746276117)

I've tried exploring different distributions using fitdist() and gofstat(); it appears that the exponential or gamma distribution fits best. Here is the goodness-of-fit summary for three distributions:

Goodness-of-fit statistics
                                   exp   Weibull     gamma
Kolmogorov-Smirnov statistic 0.1291606 0.1131535 0.1103246
Cramer-von Mises statistic   0.1894177 0.1381864 0.1277033
Anderson-Darling statistic   0.9338532 0.7261726 0.6881040

Goodness-of-fit criteria
                                    exp  Weibull    gamma
Akaike's Information Criterion 339.8609 341.2056 340.8842
Bayesian Information Criterion 341.9384 345.3607 345.0393

Is it safe to assume that either gamma or exponential distributions will fit? And if it is indeed exponential, can I run a GLM with this family? I'm sorry if this question is a bit naive, but I've been reading quite a lot about GLM and haven't come across this type yet.

score 0 · Accepted Answer · answered Jun 14 '21 at 19:34

Yes, a GLM can be fitted with an exponential response distribution: it belongs to the so-called exponential family of distributions that GLMs support.
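As a rough illustration of how such a fit can work: the exponential distribution is a gamma distribution with shape 1, so an exponential GLM has the same coefficient estimates as a gamma GLM and differs only in fixing the dispersion at 1. The sketch below uses only the Python standard library and simulated data. The single predictor `x`, the log link, the true coefficients, and the helper name `fit_log_link_glm` are all assumptions made for illustration; none of them come from the original analysis.

```python
import math
import random

def fit_log_link_glm(x, y, n_iter=50):
    """IRLS for a gamma-family GLM with log link and one predictor.

    With a log link the gamma working weights are constant, so each
    IRLS step is an ordinary least-squares fit of the working response.
    """
    n = len(y)
    b0, b1 = math.log(sum(y) / n), 0.0
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Working response: z = eta + (y - mu) / mu
        z = [b0 + b1 * xi + (yi - mi) / mi for xi, yi, mi in zip(x, y, mu)]
        z_bar = sum(z) / n
        b1 = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / sxx
        b0 = z_bar - b1 * x_bar
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    # Pearson estimate of the dispersion, as a gamma fit would use.
    phi = sum(((yi - mi) / mi) ** 2 for yi, mi in zip(y, mu)) / (n - 2)
    return b0, b1, phi, sxx

# Simulated data (hypothetical): exponential response with mean exp(0.5 + 1.2 x).
rng = random.Random(42)
x = [rng.uniform(0, 1) for _ in range(500)]
y = [rng.expovariate(1.0 / math.exp(0.5 + 1.2 * xi)) for xi in x]

b0, b1, phi_hat, sxx = fit_log_link_glm(x, y)

# Same slope estimate either way; only the standard error changes.
se_gamma = math.sqrt(phi_hat / sxx)      # dispersion estimated from the data
se_exponential = math.sqrt(1.0 / sxx)    # dispersion fixed at 1

print(f"slope = {b1:.3f}, estimated dispersion = {phi_hat:.3f}")
print(f"SE (gamma) = {se_gamma:.3f}, SE (exponential) = {se_exponential:.3f}")
```

If the estimated dispersion is far from 1, that is a hint that the exponential assumption is too restrictive and the more general gamma family may be the safer choice.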

Regarding the choice between the gamma and exponential distributions, one needs uncertainty estimates for the goodness-of-fit (GoF) measures. We could, for example, bootstrap the AIC. If the GoF measures do not differ by much, i.e., their uncertainty intervals overlap, either distribution could be used.

This is a purely empirical approach, however; we would also need to consider how the response is generated in the experiment. For example, if it arises from a Poisson point process, the exponential distribution would be more suitable.

Related thread: How to decide which glm family to use? You have, though, already established that the support of your response is $[0, \infty)$.
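One way the bootstrap idea could look in practice is sketched below, using only the Python standard library. It resamples the SV values from the question (rounded here to four decimals), refits both distributions by maximum likelihood on each resample, and summarizes the AIC difference. The helper names, the number of resamples, the seed, and the percentile interval are choices made for this sketch, not something prescribed by the answer.

```python
import math
import random
import statistics

# SV values from the question, rounded to four decimals.
SV = [
    0.3168, 5.3468, 19.3534, 8.4823, 8.4672, 6.3925, 7.7896, 4.8748,
    9.8512, 6.6268, 5.2345, 6.6636, 2.7888, 1.1540, 4.5508, 4.0678,
    5.8890, 0.7705, 0.1480, 26.2744, 6.5543, 0.6650, 34.8226, 3.0118,
    1.6923, 0.4333, 3.6003, 7.7838, 20.4364, 3.2637, 7.7712, 5.6493,
    6.2440, 6.1945, 4.1044, 6.2090, 1.4235, 12.7812, 2.7712, 5.0977,
    0.2006, 11.7253, 2.5951, 0.7669, 5.0962, 16.0358, 17.6547, 3.4223,
    0.8412, 3.4439, 10.1805, 3.1813, 6.6528, 1.6362, 4.4849, 2.7916,
    5.1022, 4.0536, 4.8085,
]

def digamma(x):
    """Digamma via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def aic_exponential(data):
    n = len(data)
    mean = sum(data) / n
    loglik = -n * (math.log(mean) + 1.0)  # maximized at rate = 1 / mean
    return 2 * 1 - 2 * loglik

def aic_gamma(data):
    n = len(data)
    mean = sum(data) / n
    mean_log = sum(math.log(v) for v in data) / n
    s = math.log(mean) - mean_log
    # The MLE of the shape solves log(k) - digamma(k) = s; bisect on a log scale.
    lo, hi = 1e-3, 1e3
    for _ in range(60):
        mid = math.sqrt(lo * hi)
        if math.log(mid) - digamma(mid) > s:
            lo = mid
        else:
            hi = mid
    shape = math.sqrt(lo * hi)
    scale = mean / shape
    loglik = n * ((shape - 1) * mean_log - mean / scale
                  - math.lgamma(shape) - shape * math.log(scale))
    return 2 * 2 - 2 * loglik

rng = random.Random(2021)
B = 2000  # number of bootstrap resamples (arbitrary choice)
deltas = []
for _ in range(B):
    sample = rng.choices(SV, k=len(SV))  # resample with replacement
    deltas.append(aic_gamma(sample) - aic_exponential(sample))
deltas.sort()

print(f"AIC on full data: exp = {aic_exponential(SV):.2f}, gamma = {aic_gamma(SV):.2f}")
print(f"bootstrap mean of (AIC_gamma - AIC_exp) = {statistics.mean(deltas):.2f}")
print(f"bootstrap standard error = {statistics.stdev(deltas):.2f}")
print(f"95% percentile interval = [{deltas[int(0.025 * B)]:.2f}, {deltas[int(0.975 * B) - 1]:.2f}]")
```

An interval for the AIC difference that comfortably includes 0 would suggest the data do not clearly separate the two candidates. Note that this compares fits to the pooled response only and does not involve any predictors.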

answered Jun 14 '21 at 19:34

msuzen


Thank you @MehmetSuzen, this is very helpful. I've read the thread you suggested, that was what led me to these two distributions (gamma and exp.). Could you please elaborate about obtaining uncertainties of the GoF and the bootstrapping process? – Roni Gafni Jun 15 '21 at 11:25
@RoniGafni By uncertainties, I meant a bootstrap estimate of the standard error. We can obtain many GoF values by refitting on resamples of the data drawn with replacement, which lets us estimate both the mean and the standard error of the GoF measure. For example, see [How to bootstrap in R](https://stackoverflow.com/questions/51341146/how-to-bootstrap-in-r). – msuzen Jun 15 '21 at 18:08
Many thanks @MehmetSuzen! – Roni Gafni Jun 16 '21 at 05:27

score 0 · Answer 2 · answered Jun 14 '21 at 21:22

It seems that the distribution fitting you have done is on the marginal distribution of the response. What you need is the conditional distribution given the predictor variables; see for instance "Family of GLM represents the distribution of the response variable or residuals?" or "GLM with empirical distribution". Since the marginal distribution observed in the data is a mixture of the conditionals, it is difficult to judge the conditional distribution directly from the pooled data.

So you should start by modeling: choose a family/link-function combination (see "Family in GLM - how to choose the right one?" and "GLM: verifying a choice of distribution and link function"), then compare the candidates by AIC, cross-validation, and residual plots.
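A small simulation can make the marginal-versus-conditional point concrete. In the sketch below (Python standard library only), the response is exactly exponential given a hypothetical predictor `x`, yet the pooled response does not look exponential. The predictor, the effect size of 2.0, the sample size, and the use of the coefficient of variation (which equals 1 for an exponential distribution) as a quick check are all assumptions chosen for illustration.

```python
import math
import random
import statistics

rng = random.Random(7)
n = 5000
x = [rng.uniform(0, 1) for _ in range(n)]      # hypothetical predictor
mu = [math.exp(2.0 * xi) for xi in x]          # conditional mean under a log link
y = [rng.expovariate(1.0 / m) for m in mu]     # exponential given x

def cv(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Pooled response: a mixture of exponentials with different means.
cv_marginal = cv(y)
# Response rescaled by its conditional mean: a standard exponential.
cv_conditional = cv([yi / mi for yi, mi in zip(y, mu)])

print(f"CV of pooled response:       {cv_marginal:.2f}")
print(f"CV after conditioning on x:  {cv_conditional:.2f}")
```

The pooled values are more dispersed than an exponential distribution allows, so a fit to the marginal data alone could point away from the family that is actually correct once the predictor is taken into account.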

Thank you so much @kjetilbhalvorsen! I've seen so many remarks regarding the marginal distribution of the response variable, and only now has it sunk in. – Roni Gafni Jun 15 '21 at 11:06
