tl;dr: While inspecting the data I found a better model than the one I had originally planned, and I performed a few steps of variable selection/model fine-tuning. I assume this is a (mild) case of inference after model selection.
I performed an experiment that went wrong, but the error made it possible to test another hypothesis, which I formulated upon noticing the error (and before knowing the outcome): parameter A should have been held constant but was not, so I hypothesized that Y depends on A, which is biologically plausible. Because of the repeated measures and heteroskedasticity, I used the following generalized least squares model (gls from the nlme package):
library(nlme)  # provides gls(), corAR1(), varPower(), varIdent()
gls(Y ~ A,
    data = dat,  # 'dat' is a placeholder name for the long-format data set
    correlation = corAR1(form = ~ 1 | individual),
    weights = varPower())
While inspecting the data at the level of the individuals, I noticed a group that behaved differently from the rest. This group corresponded to an actual group in the experiment, but I didn't initially expect this grouping factor (called B in the following) to be relevant. So I updated the model (the AIC improved):
gls(Y ~ A * B,
    data = dat,
    correlation = corAR1(form = ~ 1 | individual),
    weights = varPower())
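For concreteness, here is a simplified sketch of how such an AIC comparison can be set up (not my exact code; `dat` again stands for my data frame). As far as I understand, models that differ in their fixed effects have to be refit with method = "ML", because the default REML likelihoods are not comparable across different fixed-effects structures:

# simplified sketch (placeholder data frame 'dat'): compare the two fixed-effects
# structures by AIC, refitting with maximum likelihood because the default REML
# fits are not comparable when the fixed effects differ
m1 <- gls(Y ~ A,
          data = dat, method = "ML",
          correlation = corAR1(form = ~ 1 | individual),
          weights = varPower())
m2 <- gls(Y ~ A * B,
          data = dat, method = "ML",
          correlation = corAR1(form = ~ 1 | individual),
          weights = varPower())
AIC(m1, m2)    # the A*B model had the lower AIC
anova(m1, m2)  # likelihood ratio test of the nested models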
I also checked whether other correlation and variance structures would lead to a better model, mainly by visually inspecting the residuals, and ended up with this model (the AIC improved again):
gls(Y ~ A * B,
    data = dat,
    correlation = corAR1(form = ~ 1 | individual),
    # the same test was repeated five times in all individuals; B varied across the repetitions
    weights = varIdent(form = ~ 1 | test))
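The comparison of the variance structures looked roughly like this (again a simplified sketch with the placeholder `dat`, not my exact code). If I understand correctly, the default REML fit is fine here because only the variance structure changes while the fixed effects stay the same:

# same fixed effects, different variance structures; the default REML fits are
# comparable here because the mean model does not change
m_pow   <- gls(Y ~ A * B,
               data = dat,
               correlation = corAR1(form = ~ 1 | individual),
               weights = varPower())
m_ident <- gls(Y ~ A * B,
               data = dat,
               correlation = corAR1(form = ~ 1 | individual),
               weights = varIdent(form = ~ 1 | test))
AIC(m_pow, m_ident)  # the varIdent model had the lower AIC
# visual check of the normalized residuals that guided the choice
plot(m_ident, resid(., type = "normalized") ~ fitted(.) | test, abline = 0)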
Finally, I checked whether two other covariates had any effect on the model. I did not expect them to, and they did not (the AIC worsened, and I also inspected the effect graphically), but I expected to be asked whether this had been tested.
Possible solutions I thought of/read about:
- Warn the reader that one limitation of the study, and of the interpretability of the results, is that model selection and inference were performed on the same dataset.
- Split the dataset (66% train / 33% test). However, grouping factor B is not balanced (many more observations come from one of the groups), which could cause other problems, and I fear a loss of power since the dataset is rather small; see the sketch after this list.
- Use the full model for inference (useless parameters and covariates included).
- Add noise to the data and redo the model selection, as I have read here: https://arxiv.org/pdf/1507.06739.pdf. However:
  - this approach might not be entirely honest, since I now know which model is best and might be biased when checking whether the features that drove my decisions are still present in the noisy data
  - this approach might only be applicable with specific model-selection (forward stepwise) and variable-selection (lasso) methods
  - I am unsure how to implement this method (specifically, how to estimate the mean and variance of the noise), and I found no R package that does this.
- Somehow adjust the results for post-selection inference, although I could not find a method applicable to the comparison of manually selected models (and I do not have the knowledge to adapt the conditional-probabilities method described in this article to my needs).
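To make the splitting idea above a bit more concrete, this is roughly what I would try (placeholder names `dat`, `individual`, `B`): split at the level of individuals, so that all repeated measures of one individual end up in the same part, and then check how the unbalanced factor B is distributed over the two parts:

# sketch of a train/test split by individual (placeholder names: dat, individual, B)
set.seed(1)
ids       <- unique(dat$individual)
train_ids <- sample(ids, size = ceiling(2/3 * length(ids)))
train     <- dat[dat$individual %in% train_ids, ]
test      <- dat[!(dat$individual %in% train_ids), ]
# with B unbalanced, check how the groups end up distributed
table(train$B)
table(test$B)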
Resulting questions:
- Does this actually constitute a case of inference after model selection, or did I misunderstand (e.g., is this only relevant when choosing from a larger number of models and/or when performing variable selection)?
- Is the selection of the optimal correlation/variance structure also affected by this problem?
- Do any of the solutions above make sense for overcoming this problem, or do you have another suggestion?
I apologize if this is completely wrong or unclear; I have limited experience in statistics/bioinformatics and no theoretical background...