Recurring problem with retrospective data collection study designs I'm seeing

Question

I've noticed a lot of medical research that I am involved in goes as follows:

Collect data on 300-1000 patients, including all sorts of baseline characteristics such as BMI, age and gender, and then outcome-related statistics. Say our outcome is "fracture after operation": we could have angle of fracture, fracture density, pain scores, mobility scores, quality of life scores, etc., and then finally our outcome, whether or not the patient had a fracture after the operation. Often these outcomes are binary, and the goal is to see whether any of the independent variables are associated with fractures.

Now the problem here is we have a binary outcome variable and we often end up with about 30-50 patients who actually had a fracture out of 1000 patients, so the statistics are quite skewed and a lot less powerful than if 500 of the patients had fractures.
The 2nd problem is we have maybe 50 independent variables of diverse types (factors, continuous, binary). Am I correct to assume that in these cases p>N, because the outcome variable only encompasses, say, 30 patients, even though the study size N is 1000?
The 3rd problem is these are often studies made with little previous knowledge on the subject, so it's often hard to manually pick confounders by expert opinion.

Obviously we can't run a large multiple regression with all variables as the model overfits. We can't run 50 (independent variables) multiple regression analyses controlling for say age and gender, because we quickly run into a very grim multiple comparison problem.
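To put a rough number on that multiple comparison problem, here is a minimal sketch. It assumes the 50 tests are independent and that every null hypothesis is true, which real clinical variables will not satisfy exactly; the variable names are only illustrative.

```python
# Back-of-the-envelope multiple-comparison arithmetic.
# Assumptions (not from the data): 50 independent tests, all nulls true.
alpha = 0.05
n_tests = 50

# Probability of at least one false positive across all tests
fwer = 1 - (1 - alpha) ** n_tests

# Expected number of false positives
expected_false_positives = alpha * n_tests

# Per-test threshold under a Bonferroni correction
bonferroni_alpha = alpha / n_tests

print(f"Family-wise error rate: {fwer:.3f}")
print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha:.4f}")
```

Under these assumptions at least one spurious "significant" variable is more likely than not, and a Bonferroni threshold would be very hard to reach with only 30-50 events.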

We can't use regularization models because we are interested in all 50 variables and whether they are associated with our outcome (none are deemed simply controls, which regularization models choose from but do not necessarily add to the model).

From a statistical viewpoint, what would be your way of handling such a study design? Currently I just run logistic regression models controlling for patient characteristics and am transparent about the fact that the p-values are unadjusted.

I should note that these studies are not meant to invent a new method of treatment or change protocols, they are used to see what variables are of interest for future research.

Have you considered hiring a statistician? Anyway, these are not simple problems with simple solutions. 1) High data imbalance, 2) high-dimensional data (sort-of), 3) noise variables, 4) multiple comparisons. There isn't a clear approach to solve this, it depends on the data. — user2974951, Jan 20 '20 at 12:10
We have statisticians available to us. They suggest pooling variables with P<0.2 from a univariate regression of all variables, into a new multiple regression and report the variables under alpha in that 2nd regression model. With 2 dependent variables (say fracture before and after surgery) this would quickly make the multiple comparison issue glaring so I wanted to hear if anyone had something magical here... — Paze, Jan 20 '20 at 12:12
When you say it "depends on the data" is there any way I can better describe the data in my post? — Paze, Jan 20 '20 at 12:15
"They suggest pooling variables with P<0.2 from a univariate regression of all variables, into a new multiple regression". You may need to find new statisticians (maybe that's why you are here?). That is a really bad idea. — Joe King, Jan 20 '20 at 13:07
That is partly why I am here, yes. I don't trust in these methods. — Paze, Jan 20 '20 at 15:06
The strangest thing is these papers do get published, using this method. Worst perpetrator I have seen, that got published mind you, was using this method in unison with stepwise regression. — Paze, Jan 20 '20 at 15:07
Welcome to the world of statistics in medicine. Sadly you have only seen the tip of the iceberg. Colleagues of mine estimate that up to 90% of published work in this field suffers from problems such as these. It is good to see a clinician taking a proper interest in these matters, not just paying it lip service. You are few and far between. Contact me via LinkedIn (see my profile page on here) if you are interested. — Robert Long, Jan 20 '20 at 15:46

score 9 · Accepted Answer · answered Jan 20 '20 at 13:24

You are right that this is a very common scenario in medical research.

"I should note that these studies are not meant to invent a new method of treatment or change protocols, they are used to see what variables are of interest for future research."

OK, I take this to mean that you are interested in causal inference, not in prediction.

And from the comments:

"We have statisticians available to us. They suggest pooling variables with P<0.2 from a univariate regression of all variables, into a new multiple regression and report the variables under alpha in that 2nd regression model."

This is not advisable. For one thing, mediators, which you should not be adjusting for, will be associated with the outcome and so will pass the univariate screen. You might also end up adjusting for colliders and actually inducing bias that would otherwise not be present. See here for things that can go wrong when including variables that have no business being in a regression model.
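The collider problem can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the outcome is continuous rather than binary, ordinary least squares stands in for logistic regression, and all variable and function names are hypothetical. The exposure and outcome are generated independently, and the collider is caused by both.

```python
import random

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals of y after OLS regression on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

def adjusted_slope(x, y, c):
    """Coefficient of x when regressing y on x and c together,
    computed by residualising both on c (Frisch-Waugh-Lovell)."""
    return slope(residuals(c, x), residuals(c, y))

rng = random.Random(0)
n = 20000
exposure = [rng.gauss(0, 1) for _ in range(n)]
outcome = [rng.gauss(0, 1) for _ in range(n)]  # truly unrelated to exposure
# The collider is a common effect of exposure and outcome
collider = [x + y + rng.gauss(0, 1) for x, y in zip(exposure, outcome)]

crude = slope(exposure, outcome)
adjusted = adjusted_slope(exposure, outcome, collider)

print(f"Exposure coefficient, unadjusted: {crude:.3f}")
print(f"Exposure coefficient, adjusted for collider: {adjusted:.3f}")
```

In this setup the unadjusted coefficient is close to zero, as it should be, while adjusting for the collider produces a coefficient near -0.5 even though the exposure has no effect on the outcome. A collider like this is also strongly associated with the outcome, so a P<0.2 univariate screen would readily select it.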

I am sorry to say that there is no substitute for expert knowledge about the subject matter when it comes to causal inference. It is really as simple as that. "Expert" is a relative term. You don't have to have a PhD in the field. I thought I read in another post that you are a medical doctor nearing the end of your training. I would have thought that you would be able to come up with a plausible directed acyclic graph (DAG) for many scenarios. I have been involved in teaching these things to undergraduate medics for a number of years and I usually find that they are able to construct plausible DAGs quite well. It is normal for different people to come up with different DAGs because they make different abstractions and assumptions about the data. Also, when they are completely stumped they are usually able to find information online or from other resources to help inform their DAGs.
