I have genetic SNP data with 50 controls and 120 cases. I have been told that comparing a small number of controls with a larger number of cases will cause spurious associations. The analysis I want to do is just a logistic regression, where the outcome is disease status (case/control) and the predictors are the genotypes (AA, AB, BB). I thought of comparing the 50 controls with 50 randomly chosen cases and then replicating the finding by comparing the controls with the remaining cases. Can anyone comment on this approach?
-
Who said that having 120 cases for 50 controls is bad? I don't see why it would cause any spurious associations. However, there might be other issues (not the N, per se), such as the controls being different from the cases in some way. – Peter Flom Jun 02 '14 at 13:02
-
Down-sampling gives consistent estimates of odds ratios, so it can be handy to sacrifice a little precision when you've so many cases that they're a burden on your computer's memory or processor, but that hardly applies in this case. See [here](http://stats.stackexchange.com/questions/67903/). If 50 controls aren't enough for your model you can't fix things by throwing away cases. – Scortchi - Reinstate Monica Jun 02 '14 at 14:15
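To make the precision point concrete, here is a minimal sketch, not taken from the thread, that compares the full 120-vs-50 design with a 50-vs-50 down-sample. It collapses the genotype to carrier vs non-carrier and works with expected 2x2 counts; the carrier frequencies `P_CASE` and `P_CTRL` are made-up values for illustration, and the standard error is Woolf's large-sample formula for the log odds ratio.

```python
import math

# Hypothetical carrier frequencies, chosen only for illustration.
P_CASE = 0.5
P_CTRL = 0.3


def expected_table(n_cases, n_controls):
    """Expected 2x2 counts: (case carriers, case non-carriers,
    control carriers, control non-carriers)."""
    return (
        n_cases * P_CASE,
        n_cases * (1 - P_CASE),
        n_controls * P_CTRL,
        n_controls * (1 - P_CTRL),
    )


def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Odds ratio of carrying the allele, cases vs controls."""
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)


def log_or_se(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Woolf's large-sample standard error of the log odds ratio."""
    return math.sqrt(1 / case_exp + 1 / case_unexp + 1 / ctrl_exp + 1 / ctrl_unexp)


full = expected_table(120, 50)   # all cases kept
down = expected_table(50, 50)    # cases down-sampled to match controls

or_full, or_down = odds_ratio(*full), odds_ratio(*down)
se_full, se_down = log_or_se(*full), log_or_se(*down)

print(f"full design:  OR = {or_full:.3f}, SE(log OR) = {se_full:.3f}")
print(f"down-sampled: OR = {or_down:.3f}, SE(log OR) = {se_down:.3f}")
```

Under these assumptions the odds ratio is the same in both designs, while the standard error is larger after discarding cases: down-sampling costs precision without removing any bias.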
-
BTW don't you mean "comparing the 50 controls with 50 randomly chosen cases"? – Scortchi - Reinstate Monica Jun 02 '14 at 14:18
-
Yes, that is what I meant. I have edited the question now. – Veera Jun 02 '14 at 15:52
-
Thanks. Do you have any reference for the "spurious associations"? – Scortchi - Reinstate Monica Jun 02 '14 at 15:54
-
No, I don't. I don't know if that's true. – Veera Jun 02 '14 at 16:01
-
It's not. If you could sample only 170 genotypes based on outcome you'd usually choose to maximize power by sampling 85 cases & 85 controls - perhaps that's what you're thinking of. – Scortchi - Reinstate Monica Jun 02 '14 at 16:14
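The balanced-design remark can be checked with a small sketch, again not from the thread. It assumes a fixed budget of 170 samples, a carrier frequency `CARRIER_FREQ` that is the same in cases and controls (that is, the null hypothesis; the value 0.3 is hypothetical), and uses the Woolf standard error of the log odds ratio as a stand-in for power: a smaller standard error means more power.

```python
import math

TOTAL = 170          # fixed number of samples to split between cases and controls
CARRIER_FREQ = 0.3   # hypothetical; assumed equal in both groups (null hypothesis)


def log_or_se(n_cases, n_controls, p=CARRIER_FREQ):
    """Woolf standard error of the log odds ratio from expected cell counts."""
    cells = [n_cases * p, n_cases * (1 - p), n_controls * p, n_controls * (1 - p)]
    return math.sqrt(sum(1 / c for c in cells))


# Standard error for every possible split of the 170 samples.
se_by_split = {n: log_or_se(n, TOTAL - n) for n in range(1, TOTAL)}

# The split with the smallest standard error is the most powerful one.
best_n_cases = min(se_by_split, key=se_by_split.get)

print("best split (cases, controls):", best_n_cases, TOTAL - best_n_cases)
print(f"SE at best split: {se_by_split[best_n_cases]:.3f}")
print(f"SE at 120 cases / 50 controls: {se_by_split[120]:.3f}")
```

Under these assumptions the minimum falls at the 85/85 split, and the 120/50 split has a standard error only about 10% larger. An unbalanced design is therefore somewhat less efficient for a fixed total, but that is a statement about power, not about spurious associations.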