"ANOVA on a non-random non-Normal sample from a Normal Population"

Question

How can I run ANOVA or tests for statistical significance on a bi-modal sample that came from a normal population?

Context:

I was tasked with running an ANOVA to see whether genotypes (treatments / factors) had an effect on phenotypes (responses) in a simple fixed-effects model. The standard approach is to fit a fixed-effects linear model of phenotype on each genotype, run an ANOVA, and then apply Tukey's HSD. See here for more context.
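For concreteness, here is a minimal sketch of that pipeline in base R. It runs on simulated data: the data frame `toy`, the genotype labels `AA`/`AB`/`BB`, and the sample size are all made up for illustration and do not come from my dataset.

```r
set.seed(1)

# Hypothetical data: one marker with three genotype classes and a
# phenotype scored on a 1-9 scale (simulated, not the real data).
toy <- data.frame(
  genotype  = factor(sample(c("AA", "AB", "BB"), 120, replace = TRUE)),
  phenotype = pmin(9, pmax(1, round(rnorm(120, mean = 5, sd = 1.5))))
)

# Fixed-effects linear model of phenotype on genotype
fit <- lm(phenotype ~ genotype, data = toy)

# One-way ANOVA table for the genotype factor
aov_tab <- anova(fit)
print(aov_tab)

# Tukey's HSD for all pairwise genotype comparisons
tukey <- TukeyHSD(aov(phenotype ~ genotype, data = toy))
print(tukey)
```

In the real analysis this would be repeated for each genotype (treatment), so some correction for multiple testing would presumably be needed on top of it.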

Unfortunately, of the 416 samples phenotyped, only 356 were genotyped (~86%). Furthermore, the genotyped samples were not chosen at random: most of the excluded samples came from the peak of the distribution, changing the sample from approximately normal to bimodal.

As I understand it, ANOVA assumes normality (strictly, of the model residuals), and this sample is no longer normal.

To fix this, would I use bootstrapping or parametrization? How would I set that up?

Mock-data to show extent of sampling bias (in R):

    # Phenotype counts before selection (all 416 phenotyped samples)
    data_pre_sample <- data.frame(
      phenotype = factor(1:9),
      frequency = c(4, 16, 48, 108, 116, 88, 32, 4, 0)
    )
    # Phenotype counts after selection (the 356 genotyped samples)
    data_post_sample <- data.frame(
      phenotype = factor(1:9),
      frequency = c(4, 16, 48, 108, 64, 80, 32, 4, 0)
    )

Phenotypes:
- 1, 2, 3, 4, 5, 6, 7, 8, 9

Pre-selection phenotype count:
- 4, 16, 48, 108, 116, 88, 32, 4, 0

Post-selection phenotype count:
- 4, 16, 48, 108, 64, 80, 32, 4, 0
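
To quantify the bias from these counts, the fraction of samples retained at each phenotype score can be computed directly. The sketch below also derives inverse-probability weights from those fractions. This rests on an assumption of mine that I have not verified: within each phenotype score, the choice of which samples to genotype did not depend on genotype. The variable names are only for illustration.

```r
scores <- 1:9
pre  <- c(4, 16, 48, 108, 116, 88, 32, 4, 0)  # counts before selection
post <- c(4, 16, 48, 108, 64, 80, 32, 4, 0)   # counts after selection

total_pre      <- sum(pre)
total_post     <- sum(post)
frac_genotyped <- total_post / total_pre

# Fraction retained per score; undefined where no samples existed (score 9)
retained <- ifelse(pre > 0, post / pre, NA_real_)

# Inverse-probability weight for a retained sample with a given score
ipw <- 1 / retained

# Mean phenotype before selection, after selection, and after reweighting
mean_pre        <- sum(scores * pre) / total_pre
mean_post       <- sum(scores * post) / total_post
mean_reweighted <- sum(scores * post * ipw, na.rm = TRUE) /
  sum(post * ipw, na.rm = TRUE)

print(data.frame(score = scores, pre, post, retained = round(retained, 3)))
```

With these mock counts only scores 5 and 6 lose samples, and reweighting recovers the pre-selection mean by construction. Whether the same weights are appropriate inside the ANOVA is exactly what I am unsure about.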

Sources Consulted:

Explaining to laypeople why bootstrapping works
Assumptions regarding bootstrap estimates of uncertainty
Difference between ANOVA and permutation test
- I read the first and third paper mentioned here as well
Correcting biased survey results
Fixing a biased (deliberately) sample
http://www.stat.cmu.edu/~cshalizi/uADA/13/lectures/which-bootstrap-when.pdf

I apologize if my question isn't clear; I am relatively new to stats stackexchange and I am just refreshing and expanding my knowledge in statistics.

Updates June 19th

The underlying distribution of the responses should be Normal. Responses were scored on a 1-9 scale (1 being the lowest performing, 9 the best performing). The geneticists were interested in which treatments (genotypes) corresponded to the worst and best performers, so they left many of the middle-performing samples (score 5) ungenotyped. It is also safe to assume that many of the treatments will have no correlation with the phenotype (so far I have included 739 treatments).

How would I implement sample weights or corrections? Bootstrapping residuals, or am I going down the wrong path entirely?
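
To make the question concrete, this is roughly what I imagine a weighted fit would look like, using the `weights` argument of `lm`. It is a sketch on simulated data, not my real data: the population `pop`, the genotype labels, and the retention probabilities `p_keep` (copied from the mock counts above) are all assumptions. It also assumes that selection depended only on the phenotype score.

```r
set.seed(42)
n <- 416

# Hypothetical full population (simulated): genotype and a 1-9 phenotype
pop <- data.frame(
  genotype  = factor(sample(c("AA", "AB", "BB"), n, replace = TRUE)),
  phenotype = pmin(9, pmax(1, round(rnorm(n, mean = 4.75, sd = 1.3))))
)

# Assumed probability of being genotyped, by phenotype score (from mock counts)
p_keep <- c(1, 1, 1, 1, 64 / 116, 80 / 88, 1, 1, 1)
pop$p  <- p_keep[pop$phenotype]

# Simulate the biased selection, then attach inverse-probability weights
kept   <- pop[runif(n) < pop$p, ]
kept$w <- 1 / kept$p

# Unweighted fit on the biased sample versus the weighted fit
fit_unweighted <- lm(phenotype ~ genotype, data = kept)
fit_weighted   <- lm(phenotype ~ genotype, data = kept, weights = w)

print(coef(fit_unweighted))
print(coef(fit_weighted))
```

One caveat I am aware of: `lm` treats `weights` as precision weights, so the standard errors and F-test from `fit_weighted` would not be valid for sampling weights as they stand. That is part of why I am asking whether bootstrapping is the right way to get the inference.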

Which population are you interested in? The original population, or the genotyped population? I would be much less concerned with non-normality than the apparent systematic sampling bias (i.e. which experimental units were genotyped) which prevents you from generalizing results to the wider population. It sounds like you need to correct for the sampling bias, which will require more information about how it was decided to (or not to) genotype experimental units. — khol, Jun 18 '18 at 23:24
I believe (but am not certain) I'm interested in the original population. I have a second population (with their genotypes) that doesn't have this sampling bias (but the normal distribution is shifted to the right). The sampling bias, to my limited understanding, was because the geneticists were interested in the phenotype's extremes (the 1s and 9s). They assumed that cutting out the middle would result in more genotypes of the traits that demonstrated (or lacked) the phenotype. — A Duv, Jun 19 '18 at 01:52
I expected you would want to make inference on the original population. If you have a second experiment/dataset without sampling bias (a random sample of the population you wish to make inference on), it would likely be preferable to use that. Non-normality is much more easily handled than sampling bias, particularly if you don't have much information about how the sampling was performed. — khol, Jun 19 '18 at 02:06
While I do have a second experiment, it needs to be curated a LOT still (super noisy clustering, even when using HDBSCAN. Clustering was done to find the `genotypes`). I was hoping to run this ANOVA as a preliminary test to see whether it was worth curating the other population. This population has sampling bias literally because the geneticists couldn't genotype all the samples and decided to cut off the ones in the middle so they could have more samples on the extremes (another standard in biology apparently). — A Duv, Jun 19 '18 at 02:23
It would be helpful to have more information about the sampling bias & any assumptions about the underlying distribution so that we may impute or correct for the missed measurements. — khol, Jun 19 '18 at 03:22
What sort of information on sampling bias should I look into, and what additional information should I provide (if possible)? — A Duv, Jun 19 '18 at 16:52