I tested the response of a weed's biomass (dry weight, DW) to a herbicide sprayed at three phenological stages (4-6, 6-8, and 8-10 true leaves). The experimental design also included four populations of this weed (biotypes collected from different geographical regions).

Here is the structure of my data. It's a relatively small data set, with 5 replications for each combination:

str(data)
tibble [120 x 4] (S3: tbl_df/tbl/data.frame)
 $ population: Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 2 2 2 2 2 ...
 $ pheno     : Factor w/ 3 levels "4_6","6_8","8_10": 1 1 1 1 1 1 1 1 1 1 ...
 $ treatment : Factor w/ 2 levels "Control","Herbicide": 2 2 2 2 2 2 2 2 2 2 ...
 $ DW        : num [1:120] 0.69 0.09 0.89 0.84 0.92 0.29 0.6 0.85 0.38 0.15 ...

The response variable DW is usually considered normally distributed, and initially I thought about running a simple LM. However, at the youngest phenological stage (4-6 TL) the herbicide injured the plants, especially in one of the populations (D), which is more sensitive. When I run shapiro.test() on DW for each phenological stage (pooled across populations) in the unsprayed control, the data look normally distributed, and the same holds for the herbicide-treated plants at the less sensitive stages, 6-8 and 8-10 TL. For the herbicide-treated 4-6 TL stage, however, the p-value is 0.0018.
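
For concreteness, a minimal sketch of how such per-group tests can be run (this assumes the `data` object from the `str()` output above; the dplyr grouping is just one way to do it):

library(dplyr)

# Shapiro-Wilk test of DW within each pheno x treatment cell
# (populations pooled within a cell, 20 plants per cell)
data %>%
  group_by(pheno, treatment) %>%
  summarise(shapiro_p = shapiro.test(DW)$p.value, .groups = "drop")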

I'm not a statistician, and I have been trying to grasp the idea of conditional distributions and GLMs for a while now. Somebody suggested I run a GAM, but I read this post saying that a GAM with only categorical variables is essentially a GLM. Is it OK to run an LM even though not all of the conditional distributions are normally distributed? Or should I run a GAM, or a GLM with family=gaussian? There is a strong possibility I have misunderstood the whole conditional-distribution matter altogether, and if that is the case I would appreciate some direction to relevant sources of information!
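
To make the alternatives concrete, these are the three fits I am asking about (a sketch, using the same formula as in the models further down; with only categorical predictors the gam() call contains no smooth terms, which is why it reduces to a GLM):

library(mgcv)

# (1) ordinary linear model
fit_lm  <- lm(DW ~ pheno * treatment + population, data = data)

# (2) GLM with a normal response and identity link: the same fit as (1)
fit_glm <- glm(DW ~ pheno * treatment + population,
               family = gaussian(link = "identity"), data = data)

# (3) "GAM" with only categorical terms: no smooths, so again the same model
fit_gam <- gam(DW ~ pheno * treatment + population,
               family = gaussian(), data = data)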

--Edit--

Given the much-appreciated comments, I would like to add a follow-up question. Fitting a GLM with family=Gamma(link="log") does seem more reasonable, and that is what I did:

Call:
glm(formula = DW ~ pheno * treatment + population, family = Gamma(link = "log"), 
    data = data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.53916  -0.21793  -0.02238   0.18246   0.93584  

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -0.347274   0.122428  -2.837 0.005422 ** 
pheno6_8                      0.373920   0.141367   2.645 0.009353 ** 
pheno8_10                     0.494900   0.141367   3.501 0.000669 ***
treatmentHerbicide           -0.538135   0.141367  -3.807 0.000231 ***
populationB                  -0.087105   0.115426  -0.755 0.452061    
populationC                   0.006129   0.115426   0.053 0.957750    
populationD                  -0.427240   0.115426  -3.701 0.000335 ***
pheno6_8:treatmentHerbicide   0.702339   0.199923   3.513 0.000642 ***
pheno8_10:treatmentHerbicide  0.736103   0.199923   3.682 0.000359 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Gamma family taken to be 0.1998466)

    Null deviance: 64.538  on 119  degrees of freedom
Residual deviance: 45.280  on 111  degrees of freedom
AIC: 154.45

Number of Fisher Scoring iterations: 9

Here are some diagnostic plots for this model: [diagnostic plots: glm with family = Gamma(link = "log")]
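
A sketch of how these diagnostics can be produced (the object name fit_gamma is just a placeholder for the model summarised above):

fit_gamma <- glm(DW ~ pheno * treatment + population,
                 family = Gamma(link = "log"), data = data)

# standard residual diagnostics: residuals vs fitted, QQ plot,
# scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit_gamma)
par(mfrow = c(1, 1))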

However, when I fit a GLM with family=gaussian(link="identity") (which is essentially an LM?), the fit appears to be much better:

Call:
glm(formula = DW ~ pheno * treatment + population, family = gaussian(), 
    data = data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.50146  -0.14168  -0.02424   0.11672   0.54109  

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   0.65541    0.06433  10.187  < 2e-16 ***
pheno6_8                      0.30100    0.07429   4.052 9.45e-05 ***
pheno8_10                     0.40600    0.07429   5.465 2.87e-07 ***
treatmentHerbicide           -0.20995    0.07429  -2.826 0.005588 ** 
populationB                  -0.03333    0.06066  -0.550 0.583729    
populationC                   0.06600    0.06066   1.088 0.278901    
populationD                  -0.20830    0.06066  -3.434 0.000837 ***
pheno6_8:treatmentHerbicide   0.35795    0.10506   3.407 0.000915 ***
pheno8_10:treatmentHerbicide  0.43745    0.10506   4.164 6.21e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.05518593)

    Null deviance: 17.0922  on 119  degrees of freedom
Residual deviance:  6.1256  on 111  degrees of freedom
AIC: 3.5442

Number of Fisher Scoring iterations: 2 

And here are the same plots for the second model: [diagnostic plots: glm with family = gaussian(link = "identity")]

Neither model is without problems, especially for the plants of the sensitive population, which all died (resulting in near-zero DW and very small variance). But judging by what I know to look at, the lower AIC, the deviance residuals, and the diagnostic plots, the normal approximation works better. Why is this happening? And which is the correct choice (if any)?
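
This is the comparison I am making, as a sketch (fit_gamma is from the snippet above; fit_gauss is a placeholder name for the Gaussian fit):

fit_gauss <- glm(DW ~ pheno * treatment + population,
                 family = gaussian(), data = data)

# information criteria side by side
AIC(fit_gamma, fit_gauss)

# QQ plots of the deviance residuals for both fits
par(mfrow = c(1, 2))
qqnorm(residuals(fit_gamma, type = "deviance"), main = "Gamma, log link")
qqline(residuals(fit_gamma, type = "deviance"))
qqnorm(residuals(fit_gauss, type = "deviance"), main = "Gaussian, identity link")
qqline(residuals(fit_gauss, type = "deviance"))
par(mfrow = c(1, 1))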

    [The p-value from a normality test is less useful than one might hope.](https://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless) // A GAM model, without further considerations, just changes how you predict the conditional distribution, but it still assumes a normal conditional distribution. – Dave Dec 07 '21 at 14:25
  • Thank you so much @Dave for your quick response. I took the time to read the post you linked and tried (to the best of my ability) to get a sense of the many perspectives. I can't say I understood all of it, but I would like to know whether I have the right notion of how to analyze my data. I think it is not about testing for normality; rather, I assume that my response is normally distributed (due to its nature), fit an LM, and look at the diagnostics. If the R^2, RMSE, and F statistics are reasonable, then I can go on to inference? I hope I'm not completely off here, just trying to get it right. – Roni Gafni Dec 08 '21 at 10:46
  • Unless your biomass data have a relatively large mean (it doesn't look like it), it isn't too likely that they'll be well approximated by a Gaussian response model. The raw distribution of the response is irrelevant here; what we want to know is the distribution of the individual observations as a function of one or more covariates. As biomass can't be negative, the variance of the data must tend to zero at that point too, and if it decreases to zero when y = 0, it must increase as y tends away from zero. In other words, the variance is non-constant. – Gavin Simpson Dec 08 '21 at 11:02
  • For something like biomass, I would use the Tweedie family in *mgcv* specifically (it allows for 0 biomass as an option, and is `family = tw()`) or use a gamma family via `family = Gamma(link = "log")`. Both are defined on the positive real numbers (with the Tweedie also allowing for 0s, hence defined on the non-negative reals), and the gamma is a special case of the Tweedie when the power parameter of the latter is 2. – Gavin Simpson Dec 08 '21 at 11:05 (see the mgcv sketch after these comments)
  • Thank you @GavinSimpson! Your suggestion makes sense and this is what I did; however, looking at the QQ plot of the `glm(…,family = Gamma(link = "log"))` vs. the LM's, it became even more unclear to me. I've edited my question and added the relevant information, hoping it might shed some light for people like me who obviously were not given the best statistics toolbox but are not giving up :) – Roni Gafni Dec 08 '21 at 19:51
  • [Here](https://stats.stackexchange.com/questions/554346/is-there-an-assumption-free-anova/554474#554474) is a related thread. – Geoffrey Johnson Dec 08 '21 at 20:52
  • The titles in the plots imply different terms in the models (purely additive in the second set, additive plus one second order interaction) while the summary indicates model terms were the same across both models. – Gavin Simpson Dec 08 '21 at 21:11
  • yes, sorry.. corrected the figures. – Roni Gafni Dec 09 '21 at 06:00
  • Thank you @GeoffreyJohnson, I looked into your post and further on GEE. If I understand it correctly, this test assumes different variances but I still have to specify one underlying distribution (i.e. `family=gaussian`). If this is the case, I have to choose Gamma as @GavinSimpson suggested? or will normal distribution be ok? – Roni Gafni Dec 09 '21 at 10:10
  • You can use either. Choosing gamma will use the mean-variance relationship of the gamma distribution when estimating the mean. Choosing gaussian will produce the least-squares estimate of the mean. In either case you also have the choice of specifying a correlation structure and whether to use the empirical sandwich estimator for standard errors. A universally acceptable approach is to specify a gaussian family with an independent correlation structure and request the sandwich estimator for standard errors. – Geoffrey Johnson Dec 09 '21 at 12:29 (see the geepack sketch after these comments)
  • To decide on which link function (identity or log) to use, fit both models and plot the estimated curve on top of the scatter of data points. The log link will produce a non-linear curve while the identity link will produce a straight line. The AIC for the identity link is considerably lower than the AIC for the log link, suggesting the straight line is a better fit to the data. – Geoffrey Johnson Dec 09 '21 at 12:38
  • Could you please be more specific as to how to request the sandwich estimator for the SEs? I also didn't understand the plot you suggested; did you mean fitted vs. observed values for both models? Thanks. – Roni Gafni Dec 09 '21 at 13:52
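
My reading of the mgcv suggestion above, as a sketch (the all-categorical linear predictor is the same as in my models; `tw()` estimates the Tweedie power parameter from the data):

library(mgcv)

# Tweedie: positive biomass with exact zeros allowed,
# power parameter estimated alongside the other parameters
fit_tw <- gam(DW ~ pheno * treatment + population,
              family = tw(), data = data, method = "REML")

# gamma alternative for a strictly positive response
fit_gam_gamma <- gam(DW ~ pheno * treatment + population,
                     family = Gamma(link = "log"), data = data, method = "REML")

summary(fit_tw)
AIC(fit_tw, fit_gam_gamma)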

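And a sketch of the GEE suggestion (package `geepack`; there are no repeated measures in my design, so each plant is its own cluster here, and `plant_id` is a name I made up):

library(geepack)

# one cluster per plant (no repeated measures), independence working
# correlation, and robust (sandwich) standard errors
data$plant_id <- seq_len(nrow(data))

fit_gee <- geeglm(DW ~ pheno * treatment + population,
                  family = gaussian, data = data,
                  id = plant_id, corstr = "independence",
                  std.err = "san.se")

summary(fit_gee)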