I have a problem and have been searching for similar issues for the past few weeks with no luck. Hopefully someone can assist me.
As part of my research, I'm trying to find the factors that most affect weed infestation in agricultural fields in different growing regions. Each field was sampled on a fixed grid, and the infestation level (0-10, ordinal) was evaluated at each node. To characterize the overall severity of weed infestation in a field, I created an index (SV) to be used as the response variable in a GLM:
SV = ML*Mp + b*HL*Hp
where ML is the proportion of the total field area with medium infestation levels (2-5), Mp is the number of patches with medium infestation, b is a factor (integer), HL is the proportion of the total field area with high infestation levels (6-10), and Hp is the number of patches with high infestation. My question is: how do I choose the most appropriate distribution for this response variable? SV combines proportion data (fraction of the field area, binomial) with discrete count data (number of patches, Poisson). Here is a partial data set of 59 fields (from one year of the research):
SV=c(0.316768323167683,5.34683468346835,19.3534353435344,8.4823,8.46724017205162,6.39245698279312,7.78955791158232,4.87484993997599,9.85118511851185,6.62682536507301,5.23447655234477,6.66360459311033,2.7887943971986,1.15398460153985,4.55078984203159,4.06780678067807,5.888966690007,0.770522947705229,0.148014801480148,26.274376189999,6.55432283858071,0.664965986394558,34.8225645129026,3.01179882011799,1.69233846769354,0.43328335832084,3.60030015007504,7.78377837783778,20.4363690892732,3.263723150358,7.77121697357886,5.64927014597081,6.24397560243976,6.19445833750125,4.10436261757054,6.20895820835833,1.4235,12.78118752499,2.77122287771223,5.0977097709771,0.200620062006201,11.7252824152754,2.59514534012586,0.766923307669233,5.0961903809619,16.0358,17.6547,3.42228445689138,0.8412047228337,3.44388877775555,10.1805,3.18133626725345,6.65284300989308,1.63622724544909,4.48490301939612,2.79158757118593,5.10215913634546,4.05363217930758,4.80845746276117)
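To make the index definition concrete, here is a minimal sketch of how SV is computed for a single field. The function name `compute_sv` and the input values are hypothetical and chosen only for illustration; they are not taken from my data.

```r
# Severity index: SV = ML*Mp + b*HL*Hp (vectorised over fields).
# ML, HL: proportions of field area (0-1); Mp, Hp: patch counts; b: integer factor.
compute_sv <- function(ML, Mp, HL, Hp, b) {
  stopifnot(all(ML >= 0 & ML <= 1), all(HL >= 0 & HL <= 1))
  ML * Mp + b * HL * Hp
}

# Hypothetical field: 20% medium area in 3 patches, 10% high area in 4 patches, b = 2
sv_example <- compute_sv(ML = 0.20, Mp = 3, HL = 0.10, Hp = 4, b = 2)
print(sv_example)  # 0.6 + 0.8 = 1.4
```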
I've tried exploring different distributions using fitdist() and gofstat() from the fitdistrplus package, and it appears that the exponential or the gamma is the best fit. Here is the summary of the goodness-of-fit statistics for three distributions:
    Goodness-of-fit statistics
                                       exp   Weibull     gamma
    Kolmogorov-Smirnov statistic 0.1291606 0.1131535 0.1103246
    Cramer-von Mises statistic   0.1894177 0.1381864 0.1277033
    Anderson-Darling statistic   0.9338532 0.7261726 0.6881040

    Goodness-of-fit criteria
                                        exp  Weibull    gamma
    Akaike's Information Criterion 339.8609 341.2056 340.8842
    Bayesian Information Criterion 341.9384 345.3607 345.0393
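For reference, the comparison was produced along the following lines. This is a minimal sketch that assumes the fitdistrplus package is installed; to keep it short it uses only the first ten SV values from the data above (rounded to four decimals, stored in a variable I call `sv_sub` here), so its output will not match the table, which is based on all 59 fields.

```r
library(fitdistrplus)

# First ten SV values from the data set above (rounded); illustration only
sv_sub <- c(0.3168, 5.3468, 19.3534, 8.4823, 8.4672,
            6.3925, 7.7896, 4.8748, 9.8512, 6.6268)

# Fit the three candidate distributions by maximum likelihood
fit_exp <- fitdist(sv_sub, "exp")
fit_wei <- fitdist(sv_sub, "weibull")
fit_gam <- fitdist(sv_sub, "gamma")

# Goodness-of-fit statistics (KS, CvM, AD) and criteria (AIC, BIC)
gof <- gofstat(list(fit_exp, fit_wei, fit_gam),
               fitnames = c("exp", "Weibull", "gamma"))
print(gof)
```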
Is it safe to assume that either a gamma or an exponential distribution will fit? And if it is indeed exponential, can I run a GLM with this family? I'm sorry if this question is a bit naive, but I've been reading quite a lot about GLMs and haven't come across this case yet.
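To show what I have in mind, here is a minimal sketch of a gamma GLM in base R. It relies on the fact that the exponential is a gamma distribution with shape 1, so my assumption is that an exponential model could be obtained by fixing the dispersion at 1 in `summary()`. The model is intercept-only because I have not listed my predictors here, and `sv_sub` again holds only the first ten SV values (rounded), so this is an illustration rather than my actual analysis.

```r
# First ten SV values from the data set above (rounded); illustration only
sv_sub <- c(0.3168, 5.3468, 19.3534, 8.4823, 8.4672,
            6.3925, 7.7896, 4.8748, 9.8512, 6.6268)

# Gamma GLM with a log link; real predictors would replace the intercept-only formula
fit_gamma <- glm(sv_sub ~ 1, family = Gamma(link = "log"))

# Default summary: dispersion estimated from the data (gamma model)
print(summary(fit_gamma))

# Exponential case (assumption): same fit, standard errors computed with dispersion = 1
print(summary(fit_gamma, dispersion = 1))

# With only an intercept and a log link, the fitted mean equals the sample mean
fitted_mean <- unname(exp(coef(fit_gamma)))
```

The coefficient estimates are identical in both summaries; only the standard errors change, since they scale with the square root of the dispersion.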