
Say I have a biomarker that is strongly associated with a gene. This biomarker is also strongly associated with another trait, such as glucose, but the gene is not.

  • If I regress glucose on biomarker + gene, both the biomarker and the gene are significant. Is this a spurious effect?
  • And what if I add an interaction term between the biomarker and the gene (biomarker*gene + gene + biomarker) and all the terms are significant?

What does it mean if the second variable (gene) suddenly becomes significant when I add a third variable (biomarker) to the regression? Does it mean that the second variable is then significantly associated with the dependent variable?

chl
Alex
  • Are you familiar with mediating http://en.wikipedia.org/wiki/Mediation_(statistics) or moderating http://en.wikipedia.org/wiki/Moderation_(statistics) variables? Darn I cannot get those end brackets to form part of the link. – Michelle Feb 04 '12 at 03:02
  • Collinearity (which is often related to the topics Michelle mentioned) is a possible explanation. I suggest having a look at some of the dozens of questions on here related to this subject. – Macro Feb 04 '12 at 05:19

1 Answer


The most likely explanation is that the Biomarker is a suppressor variable. A suppressor variable is correlated with another predictor variable in such a way that the predictor is significant when both are entered into a model, but not when it is entered alone. Unfortunately, suppression is just one of those statistical phenomena that aren't very intuitive. This website is fairly long, but very clear and includes a discussion of all the relevant issues with a section on suppressor variables at the end. I also found this American Statistician paper, which is specific to suppressor variables. I haven't read it yet, but it looks quite good.
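To make suppression concrete, here is a minimal simulation sketch. The data-generating rule is an assumption chosen for illustration, not a claim about the real biology: the biomarker is the gene plus an unrelated component, and only that unrelated component drives glucose. The variable names and the closed-form two-predictor OLS helper are likewise illustrative.

```python
import random

def cov(xs, ys):
    """Sample covariance of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def corr(xs, ys):
    """Pearson (zero-order) correlation."""
    return cov(xs, ys) / (cov(xs, xs) * cov(ys, ys)) ** 0.5

def two_predictor_slopes(x1, x2, y):
    """OLS slopes for y ~ 1 + x1 + x2, from the centered normal equations."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(0)
n = 20000
gene = [rng.gauss(0, 1) for _ in range(n)]
other = [rng.gauss(0, 1) for _ in range(n)]      # biomarker component unrelated to the gene
biomarker = [g + u for g, u in zip(gene, other)]
glucose = [u + rng.gauss(0, 1) for u in other]   # assumed: only the non-gene component matters

r_gene_glucose = corr(gene, glucose)             # near zero: no marginal association
r_bio_glucose = corr(biomarker, glucose)         # clearly positive
b_bio, b_gene = two_predictor_slopes(biomarker, gene, glucose)

print(f"corr(gene, glucose)      = {r_gene_glucose:.3f}")
print(f"corr(biomarker, glucose) = {r_bio_glucose:.3f}")
print(f"slopes in glucose ~ biomarker + gene: biomarker = {b_bio:.3f}, gene = {b_gene:.3f}")
```

Under these assumptions the gene shows essentially no marginal correlation with glucose, yet its coefficient is clearly non-zero (and negative) once the biomarker is in the model, because the gene then serves to remove the part of the biomarker that is irrelevant to glucose.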

Another possibility is that the Biomarker is not a suppressor, but it accounts for enough of the residual variance in your response variable (glucose), that the weaker gene - glucose relationship becomes significant. Remember that 'significance' is assessed by the relationship between the variability that a predictor accounts for, and the residual variability. If the Biomarker accounts for a good deal of what would otherwise be residual variability, but consumes only, for example, 1 degree of freedom, this could increase the power of your analysis with respect to the gene. Under this interpretation, you would have simply needed more data to resolve the gene - glucose relationship, but there might not be any correlation between the gene and the Biomarker.
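This second mechanism can also be sketched with a small simulation. Everything here is assumed for illustration: the gene and the biomarker are generated independently, the gene has a small true effect (0.1) on glucose, and the biomarker has a large one (2.0). The point is only to compare the standard error of the gene's coefficient with and without the biomarker in the model.

```python
import random

def ss(xs, ys):
    """Centered sum of cross-products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

def simple_fit(x, y):
    """Slope of x and its standard error for y ~ 1 + x."""
    n = len(y)
    sxx, sxy, syy = ss(x, x), ss(x, y), ss(y, y)
    slope = sxy / sxx
    rss = syy - slope * sxy                      # residual sum of squares
    return slope, (rss / (n - 2) / sxx) ** 0.5

def two_predictor_fit(x1, x2, y):
    """Slope of x1 and its standard error for y ~ 1 + x1 + x2."""
    n = len(y)
    s11, s22, s12 = ss(x1, x1), ss(x2, x2), ss(x1, x2)
    s1y, s2y, syy = ss(x1, y), ss(x2, y), ss(y, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    rss = syy - b1 * s1y - b2 * s2y              # residual sum of squares
    return b1, (rss / (n - 3) * s22 / det) ** 0.5

rng = random.Random(1)
n = 2000
gene = [rng.gauss(0, 1) for _ in range(n)]
biomarker = [rng.gauss(0, 1) for _ in range(n)]  # assumed independent of the gene
glucose = [0.1 * g + 2.0 * b + rng.gauss(0, 1) for g, b in zip(gene, biomarker)]

b_alone, se_alone = simple_fit(gene, glucose)
b_full, se_full = two_predictor_fit(gene, biomarker, glucose)

print(f"gene alone:       slope = {b_alone:.3f}, SE = {se_alone:.3f}")
print(f"gene + biomarker: slope = {b_full:.3f}, SE = {se_full:.3f}")
```

The gene's estimated slope is targeting the same quantity in both fits, but its standard error is much smaller in the second one, because the biomarker has absorbed most of what was residual variability. That is the sense in which adding the biomarker can increase power for the gene without any correlation between the two predictors.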

In neither case would it be correct to call this a spurious correlation. A spurious correlation is when there is a zero-order correlation between two variables, but no direct relationship. The classic situation is where two variables A and B are both caused by a third variable, C, but otherwise have no direct connection. A real-world example I once heard is that when the economy speeds up, it enhances both the birth rate and steel production, but that there is no direct connection between them.
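For contrast, a spurious correlation of the common-cause kind can be sketched as follows. The numbers are simulated under an assumed rule (both variables equal the common cause plus independent noise); they are not real economic data. The partial correlation formula used is the standard first-order one.

```python
import random

def corr(xs, ys):
    """Pearson (zero-order) correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def partial_corr(r_ab, r_ac, r_bc):
    """Correlation between A and B after controlling for C."""
    return (r_ab - r_ac * r_bc) / ((1 - r_ac ** 2) * (1 - r_bc ** 2)) ** 0.5

rng = random.Random(2)
n = 20000
economy = [rng.gauss(0, 1) for _ in range(n)]        # common cause C
births = [c + rng.gauss(0, 1) for c in economy]      # A depends only on C
steel = [c + rng.gauss(0, 1) for c in economy]       # B depends only on C

r_ab = corr(births, steel)
r_partial = partial_corr(r_ab, corr(births, economy), corr(steel, economy))

print(f"zero-order corr(births, steel)       = {r_ab:.3f}")
print(f"partial corr controlling for economy = {r_partial:.3f}")
```

Here the zero-order correlation is substantial, but it essentially vanishes once the common cause is controlled for. That is the opposite pattern to the gene and glucose situation, where the association appears only after the third variable is added.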

An interaction is a third, distinct concept. An interaction obtains when you would describe a situation using the word 'depends'. For instance, if someone asked what is the effect of taking the birth control pill, you might say:

It depends: for women, it suppresses ovulation and so reduces the chance of pregnancy. But for men, since they don't ovulate, it has no effect.

(I acknowledge that this is a rather forced example.)
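The 'depends' idea can be sketched numerically as well. This is a toy simulation under an assumed rule (the pill lowers a generic outcome by 1 unit for women and does nothing for men); the outcome scale and all names are hypothetical. With two binary factors, the interaction is the difference between the two within-group effects.

```python
import random

rng = random.Random(3)
rows = []
for _ in range(20000):
    female = rng.random() < 0.5
    pill = rng.random() < 0.5
    # Assumed rule: the pill shifts the outcome by -1, but only for women.
    outcome = (-1.0 if (female and pill) else 0.0) + rng.gauss(0, 1)
    rows.append((female, pill, outcome))

def pill_effect(is_female):
    """Mean outcome difference (pill minus no pill) within one sex."""
    treated = [y for f, p, y in rows if f == is_female and p]
    control = [y for f, p, y in rows if f == is_female and not p]
    return sum(treated) / len(treated) - sum(control) / len(control)

effect_women = pill_effect(True)
effect_men = pill_effect(False)
interaction = effect_women - effect_men   # difference of differences

print(f"effect of pill among women = {effect_women:.3f}")
print(f"effect of pill among men   = {effect_men:.3f}")
print(f"interaction                = {interaction:.3f}")
```

The question 'what is the effect of the pill?' has no single answer here: it is about -1 for women and about 0 for men, and the non-zero difference between those two effects is what an interaction term in a regression would estimate.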

gung - Reinstate Monica
  • Thank you, this is very clear. The gene is biologically related to glucose, but not directly associated with it (I tested this in a sample of ~60,000 subjects). However, the gene regulates a biomarker (something like insulin) that is associated with (and biologically relevant to) glucose levels, and the gene is significant if I add them both to the model, as is their interaction. – Alex Feb 04 '12 at 16:55
  • I like these replies. These are very fundamental and fruitful topics. You also may want to look up the terms *partial correlation* and *statistical control*. – rolando2 Feb 05 '12 at 22:10