Missing data and unbalanced data set in 2-factors design

Question

I'm new to statistics and I'm trying to understand what to do with my data!

I have two factors: tree genotype (10 levels) and soil type (3 levels). For one genotype I have only 2 replicates in soil1 and soil2 but 3 replicates in soil3. For 4 genotypes, I have 3 replicates in soil1 and soil2 but 4 replicates in soil3.

Can I analyse the interaction between the two factors using gls (nlme package) with these data? Or should I remove the fourth replicate to make the design balanced? And should I remove the entire genotype that is missing data, or just the two soils that are missing a replicate?

I tried removing nothing and it worked. Can I trust these results? I thought it wouldn't work since I only have two replicates for some treatments, but it didn't seem to be a problem...

Here's my code:

> my_model <- gls(variable~Soil_type+Genotype+Soil_type:Genotype, data=my_data, na.action = na.omit)

> shapiro.test(resid(my_model, type = "normalized"))

    Shapiro-Wilk normality test

data:  resid(my_model, type = "normalized")
W = 0.99199, p-value = 0.8568

> bartlett.test(resid(my_model, type = "normalized") ~ fitted(my_model, type = "normalized"))

    Bartlett test of homogeneity of variances

data:  resid(my_model, type = "normalized") by fitted(my_model, type = "normalized")
Bartlett's K-squared = 29.118, df = 29, p-value = 0.4589

> anova(my_model)
Denom. DF: 62 
                   numDF   F-value p-value
(Intercept)            1 1664.3700  <.0001
Soil_type              2  121.1435  <.0001
Genotype               9    3.9401  0.0005
Soil_type:Genotype    18    1.3449  0.1930

There are votes to close, which I don't understand. This is on-topic; it is not only about R coding! — kjetil b halvorsen, May 17 '19 at 22:13

kjetil b halvorsen · Accepted Answer · 2019-05-19T20:34:30.290

So you have an unbalanced design. That may not be an optimal design, but it is what you have, so you must analyze the data you have. Throwing away part of the data cannot be a good analysis (unless you have reason to believe those data are wrong). Historically, balanced designs were preferred partly because they led to easier arithmetic (and fewer hours on the calculator ...), but you probably have an electronic slave doing the work anyway.
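To make the layout concrete, here is a small sketch of the cell counts described in the question. It assumes that the five genotypes not mentioned have 3 replicates in every soil, and the labels `G1`...`G10` and `soil1`...`soil3` are placeholders, not the asker's actual level names.

```r
# Hypothetical reconstruction of the design from the question.
# Assumption: genotypes not mentioned have 3 replicates in every soil.
reps <- matrix(3L, nrow = 10, ncol = 3,
               dimnames = list(Genotype  = paste0("G", 1:10),
                               Soil_type = paste0("soil", 1:3)))

# One genotype has only 2 replicates in soil1 and soil2
reps["G1", c("soil1", "soil2")] <- 2L
# Four genotypes have 4 replicates in soil3
reps[paste0("G", 2:5), "soil3"] <- 4L

n_obs    <- sum(reps)        # total number of observations
n_cells  <- length(reps)     # number of genotype x soil combinations
resid_df <- n_obs - n_cells  # residual df when every cell gets its own mean

print(reps)
cat("observations:", n_obs, " cells:", n_cells, " residual df:", resid_df, "\n")
```

Under that assumption there are 92 observations in 30 cells, so the full interaction model leaves 92 - 30 = 62 residual degrees of freedom, which agrees with the `Denom. DF: 62` line in the anova output of the question. Every cell still has at least 2 observations, so all 30 cell means are estimable.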

Modern algorithms such as those implemented in nlme do not assume balanced designs! Your analysis seems OK, but the Bartlett test should be avoided. It is not robust at all (it is very sensitive to departures from normality), so if you need to test constancy of variance, a more robust modern alternative is preferable. See When to use (non)parametric test of homoscedasticity assumption?.
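As a rough illustration, the sketch below fits the same kind of model to simulated data with the unbalanced layout from the question, then checks constancy of variance with the Fligner-Killeen test (`fligner.test` in base R's stats package). That test is picked here only as one example of a more robust alternative to Bartlett; it is not necessarily the one recommended in the linked thread. The data frame `sim`, its values, and the level names are made up, and the five genotypes not mentioned in the question are assumed to have 3 replicates in every soil.

```r
library(nlme)

set.seed(1)

# Simulated stand-in for my_data (made-up values, NOT the asker's data).
cells <- expand.grid(Genotype  = paste0("G", 1:10),
                     Soil_type = paste0("soil", 1:3))
cells$n <- 3
cells$n[cells$Genotype == "G1" & cells$Soil_type != "soil3"] <- 2
cells$n[cells$Genotype %in% paste0("G", 2:5) & cells$Soil_type == "soil3"] <- 4

# Repeat each cell row n times to get one row per observation
sim <- cells[rep(seq_len(nrow(cells)), cells$n), c("Genotype", "Soil_type")]
sim$variable <- rnorm(nrow(sim), mean = 10)

# Same model as in the question; no balancing or row deletion needed
fit <- gls(variable ~ Soil_type * Genotype, data = sim)
print(anova(fit))

# Fligner-Killeen test of equal variances across the 30 treatment cells
sim$cell <- interaction(sim$Soil_type, sim$Genotype)
fk <- fligner.test(x = as.numeric(resid(fit, type = "normalized")),
                   g = sim$cell)
print(fk)
```

With only 2 to 4 observations per cell, any test of equal variances has little power, so a plot of residuals against fitted values is a useful complement.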

So it really doesn't matter that I sometimes have only two replicates of a treatment? — Karelle Rheault, May 19 '19 at 20:30
