I am working with a longitudinal survey that has, so far, collected responses on three separate occasions. It follows individuals from childhood into adulthood. One of the questions is "Has your child ever been diagnosed with x?" (while the child is younger than 17) or "Have you ever been diagnosed with x?" (once the child is older than 17).
Although participants are asked an "ever" question, the answer sometimes changes from positive to negative between occasions. In principle, this should not be possible.
In my department there are several ad-hoc hypotheses about why this happens. For example: the original diagnosis was wrong, so it is no longer reported; the condition improved, so it is no longer reported; or the respondent changed from the parent to the child. There may be many other reasons.
So my idea is to take a more exploratory approach and use machine learning methods to find the best predictors (in our data set) of a change in the response from a positive "ever" diagnosis to a negative one. I was thinking about elastic net logistic regression and random forests for this task, and I am looking into techniques for identifying the best predictors.
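To make this concrete, here is a minimal sketch of what I have in mind, using scikit-learn on simulated data. The data, the choice of which two features carry signal, and the tuning values (`C=0.1`, `l1_ratio=0.5`, 300 trees) are placeholders I made up for illustration, not values from my survey.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated stand-in for the survey: 300 people, 10 candidate predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))

# Assumption for illustration: only features 0 and 3 drive the outcome
# (1 = response changed from "ever diagnosed" to "never diagnosed").
logit = 1.5 * X[:, 0] - 1.0 * X[:, 3]
y = (rng.random(300) < 1 / (1 + np.exp(-logit))).astype(int)

# Elastic net logistic regression. Predictors are standardised first so
# that the penalty treats them on a common scale.
enet = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=0.1, max_iter=5000),
)
enet.fit(X, y)
coefs = enet.named_steps["logisticregression"].coef_.ravel()

# Random forest fitted on the same data.
rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X, y)

# Rank predictors: absolute coefficient size vs. impurity-based importance.
enet_rank = np.argsort(-np.abs(coefs))
rf_rank = np.argsort(-rf.feature_importances_)
print("elastic net ranking:", enet_rank)
print("random forest ranking:", rf_rank)
```

In a real analysis I would presumably tune `C` and `l1_ratio` by cross-validation and judge the rankings on held-out data rather than on the training set.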
However, so far I have been thinking in a cross-sectional framework, e.g. running a regression with most predictors taken from the occasion at which the diagnosis is suddenly no longer reported. It might be more accurate to think about it in a longitudinal framework, because the outcome might not be predicted by the absolute value of a feature but by, e.g., its change from t0 to t1.
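One simple way I could imagine bringing the longitudinal aspect into a cross-sectional model is to add change scores as extra features. A minimal pandas sketch, where the column names (`id`, `wave`, `bmi`) and the values are hypothetical:

```python
import pandas as pd

# Hypothetical long-format data: one row per person and wave.
# Person 2 skipped wave 1; person 3 only took part in wave 1.
long_df = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 3],
    "wave": [0, 1, 2, 0, 2, 1],
    "bmi":  [18.0, 19.5, 21.0, 22.0, 20.0, 25.0],
})

# Reshape to one row per person with one column per wave.
wide = long_df.pivot(index="id", columns="wave", values="bmi")
wide = wide.reindex(columns=[0, 1, 2])
wide.columns = [f"bmi_t{w}" for w in wide.columns]

# Change between consecutive waves. The result is NaN whenever either
# of the two waves is missing for that person.
wide["bmi_d01"] = wide["bmi_t1"] - wide["bmi_t0"]
wide["bmi_d12"] = wide["bmi_t2"] - wide["bmi_t1"]

print(wide)
```

The level and change columns could then both be offered to the model, at the cost of even more missing values for people who skipped an occasion.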
My data set contains a lot of features. Since this is exploratory research and I am not testing one theory against another, I would like to "test" many features in an automated process and then filter for the most predictive ones, taking into account that some(!) features are measured at more than one time point. Ideally the approach would also deal with missing data, as not all participants respond at all three time points and I don't want to define a multiple imputation model for every feature in the data set.
In short:
How can I find the best predictors of an event, taking into account that the data were collected repeatedly at three time points (but for some participants only once)?
How can I deal with missing values without needing to define an imputation model for every feature?
What I want to predict
D - response: ever had diagnosis
N - response: never had diagnosis
NA - missing
Positive diagnosis changes to negative: includes the following patterns
D - NA - N
D - N - N
D - N - NA
NA - D - N
N - D - N
Positive diagnosis remains positive: includes the following patterns
D - D - D
D - NA - D
NA - D - D
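The outcome coding above can be sketched as a small function. The rule I use here is: take the first D, then look at the observed responses after it; any N means "changed", otherwise "stable". How it handles patterns that are not in my lists (e.g. D - N - D) simply follows from that rule and is an assumption, as are the labels "changed" and "stable".

```python
from typing import Optional, Sequence


def classify(pattern: Sequence[Optional[str]]) -> Optional[str]:
    """Label one person's responses across the three occasions.

    Each element is "D", "N", or None for a missing occasion.
    Returns "changed", "stable", or None if the person is in neither group.
    """
    observed = [r for r in pattern if r is not None]
    if "D" not in observed:
        return None  # never reported a diagnosis
    after = observed[observed.index("D") + 1:]
    if not after:
        return None  # nothing observed after the first D
    # Assumption: any N after the first D counts as a change,
    # even if a later response is D again.
    return "changed" if "N" in after else "stable"


NA = None
for p in [("D", NA, "N"), ("N", "D", "N"), ("D", NA, "D"), (NA, "D", "D")]:
    print(p, "->", classify(p))
```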
My ultimate goal is to understand better how to interpret this change in the answer to the question, so the model should be interpretable.