Prediction of change in response behavior in survey

Question

I am working with a longitudinal survey that so far has responses from three separate occasions. It follows individuals through childhood and into adulthood. One of the questions asked is: "Has your child ever been diagnosed with x" (asked of the parent while the child is < 17 years old) or "Have you ever been diagnosed with x" (asked of the participant once they are older than 17 years).

Although participants are asked an "ever" question, the answer sometimes changes from positive to negative between occasions. In principle, this should not be possible.

In my department there are several ad-hoc hypotheses as to why this happens: for example, the diagnosis was wrong and is no longer reported, the condition improved and is no longer reported, or the respondent changed (from the parent to the child itself). However, there might be many other reasons.

So my idea is to take a more exploratory approach and use machine learning methods to find the best predictors (in our data set) of a change in the response from a positive "ever" diagnosis to a negative one. I was thinking about elastic net logistic regression and random forests for this task, and I am looking into techniques for identifying the best predictors.

However, so far I have been thinking in a cross-sectional framework, e.g. fitting a regression with most predictors taken from the occasion at which the diagnosis is suddenly no longer reported. It might be more accurate to think about it in a longitudinal framework, because the outcome might be predicted not by an absolute value but by, e.g., the change from t0 to t1.
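A minimal sketch of what "the change from t0 to t1" could look like as a derived feature. The column naming scheme (`<variable>_<wave>`), the wave labels, and the variable `symptom_score` are hypothetical, chosen only for illustration; a change score is left missing whenever either wave is missing.

```python
from typing import Optional


def change_score(earlier: Optional[float], later: Optional[float]) -> Optional[float]:
    """Return later - earlier, or None when either measurement is missing."""
    if earlier is None or later is None:
        return None
    return later - earlier


def add_change_features(record: dict, variables: list, waves=("t0", "t1", "t2")) -> dict:
    """Add one change score per variable and per pair of consecutive waves.

    Assumes (hypothetically) that repeated measures are stored in wide
    format under keys such as "symptom_score_t0", "symptom_score_t1".
    """
    out = dict(record)
    for var in variables:
        for prev, curr in zip(waves, waves[1:]):
            out[f"{var}_delta_{prev}_{curr}"] = change_score(
                record.get(f"{var}_{prev}"), record.get(f"{var}_{curr}")
            )
    return out


# Made-up example record: measured at t0 and t1, missing at t2.
example = {"symptom_score_t0": 4.0, "symptom_score_t1": 1.0, "symptom_score_t2": None}
features = add_change_features(example, ["symptom_score"])
```

Keeping both the absolute values and the change scores as candidate predictors lets the model decide which of the two carries more information.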

My dataset contains a lot of features. Because this is exploratory research and I am not testing one theory against another, I would like to "test" many features in an automated process and then filter for the most predictive ones, taking into account that some(!) features are measured at several time points. Ideally the approach would also be able to deal with missing data, as not all participants report at all three time points and I don't want to define a multiple imputation model for every feature in the data set.

In short:

How can I find the best predictors of an event, taking into account that the data were collected repeatedly at three time points (but sometimes only once)?
How can I deal with missing values without needing to define an imputation model for every feature?

What I want to predict

D - response: ever had diagnosis
N - response: never had diagnosis
NA - missing

From positive to negative diagnosis - includes the following patterns

D - NA - N
D - N - N
D - N - NA
NA - D - N
N - D - N

Positive diagnosis remains positive: includes the following patterns

D - D - D
D - NA - D
NA - D - D
NA - D - D
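The two groups above can be turned into a binary outcome with a simple lookup. This is a minimal sketch: the function name `label_pattern` and the 1/0 coding are illustrative choices, and only the patterns listed above are labeled. Any other sequence (for example D - D - N) returns `None` rather than being guessed.

```python
from typing import Optional, Sequence

# Patterns listed above as "from positive to negative diagnosis".
POSITIVE_TO_NEGATIVE = {
    ("D", "NA", "N"),
    ("D", "N", "N"),
    ("D", "N", "NA"),
    ("NA", "D", "N"),
    ("N", "D", "N"),
}

# Patterns listed above as "positive diagnosis remains positive".
REMAINS_POSITIVE = {
    ("D", "D", "D"),
    ("D", "NA", "D"),
    ("NA", "D", "D"),
}


def label_pattern(responses: Sequence[str]) -> Optional[int]:
    """Return 1 for a positive-to-negative change, 0 for a stable positive,
    and None for any pattern that is not listed in either group."""
    key = tuple(responses)
    if key in POSITIVE_TO_NEGATIVE:
        return 1
    if key in REMAINS_POSITIVE:
        return 0
    return None
```

For example, `label_pattern(("D", "N", "NA"))` gives 1 and `label_pattern(("NA", "D", "D"))` gives 0.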

My ultimate goal is to understand better how to interpret this change in the answer to the question, so the model should be interpretable.

I'm not that familiar with techniques for longitudinal data analysis, but I think that a random forest approach would be a good first step in your exploratory analysis, if your dataset is large enough. Preferably leave yourself a genuinely separate testing dataset if you can. - Izy, Jul 05 '19 at 09:56
Missing values are tricky: could there be a bias in which values are missing? To reflect this, you could keep the 'NAs' as an actual category in your categorical variables. Or you could use an algorithm to impute them (but that comes with potential risks), or use a random forest algorithm that leaves them out altogether. It might be worth having a look at this thread: https://stats.stackexchange.com/q/98953/212689 - Izy, Jul 05 '19 at 09:56
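A minimal sketch of the "keep NA as an actual category" idea from the comment above, in plain Python. The helper name `encode_with_missing` and the example values are hypothetical; the point is only that a missing value gets its own indicator column instead of being dropped or imputed.

```python
def encode_with_missing(values, missing_label="NA"):
    """One-hot encode a categorical variable, treating None as its own category.

    Assumes no observed value is literally equal to `missing_label`;
    otherwise the two would share a column.
    """
    categories = sorted({v for v in values if v is not None}) + [missing_label]
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[missing_label if v is None else v]] = 1
        rows.append(row)
    return categories, rows


# Made-up example: who answered the questionnaire at a given wave.
categories, rows = encode_with_missing(["parent", "self", None, "parent"])
```

If the missing-value column turns out to be predictive, that is itself a hint that missingness is not random.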
