
There is a dataset from a study with two independent binary variables and a continuous dependent variable.

Variable A    Variable B    Measurement C
1             1             32.4
0             1             29.1
1             0             15.8
1             1             25.9
...

The binary variables were manipulated by the researchers. This simple study observed animal behavior by varying two binary variables in the animals' environment and then taking a continuous (ratio-scale) measurement.

1. A and B occur only during the time when this animal behavior (C) occurs and can therefore be measured. The behavior does not occur when A and B are absent.
2. A and B always occur together, but they are independent.
3. The animal behavior being measured (C) can't affect either A or B.

Is it valid to take this dataset and use logistic regression in the following ways?

  1. The independent variable is C; the dependent variable is B.
  2. The independent variables are C and A; the dependent variable is B.

I can write a null hypothesis that says the odds ratio is 1, and test it against the data.

Mathematically, you can do this, but is the causal inference biased?

  • It's really unclear what you're asking, and what you wrote in the title doesn't correspond to what you wrote in the question text. What is your research question, and what is the alternate way you are thinking of it? What does `->` mean in your title and text? Please make this question clearer and we will be able to help you better. – Noah Jul 16 '20 at 01:52
  • Okay. Give me a minute to work on that. –  Jul 16 '20 at 01:53
  • @Noah - Updated. –  Jul 16 '20 at 01:59
  • If you construct a hypothesis from the collected data, and then test the hypothesis on the same data (surprise! the data support the hypothesis!), that is clearly the wrong way forward. Why were the variables collected in the first place (what was that hypothesis)? Why isn't it being tested with the data? – Michelle Jul 16 '20 at 02:17
  • @Michelle I think OP is using the word "result" to mean "outcome variable". They are asking whether it makes sense to regress experimental condition on the observed outcome. It's not about the sequence of hypothesis generation and data collection. – Noah Jul 16 '20 at 02:34
  • @Michelle - You are assuming A and/or B predict C. But assume they do. A, B, and C all occur in nature. So if you measure C, can it tell you about A and B? And how valid is that? There are two issues: does it tell you something beyond the actual data used (the reverse test), and is it valid to check the original test another way, by looking at it in reverse, using C to tell you about B. –  Jul 16 '20 at 03:18
  • @Noah I see your point. The question is so vaguely worded. – Michelle Jul 16 '20 at 20:42

2 Answers


You can do this, but it wouldn't tell you much. I could imagine you might be interested in predicting environmental conditions from animal behavior if it were cheap to observe the behavior and expensive to measure the environment (e.g., using a canary in a coal mine), in which case it might make sense to model the environmental condition using the animal behavior. You wouldn't know the baseline prevalence of the environmental conditions, however, so you wouldn't be able to predict their specific values from the measured behavior of animals in the wild, making the model close to useless.

In science, though, we are often interested in causal relationships, which are associations that have specific properties (temporal precedence, no confounding, etc.). The odds ratio for the relationship between the environmental condition and animal behavior is a measure of association that is free of confounding, but it doesn't represent the temporal precedence of the relationship. The animal's behavior does not cause changes in the environment (or if it does, that is not the relationship you are investigating), so the odds ratio has no causal interpretation and doesn't tell you anything about how the environment would change if you forced changes in an animal's behavior or how the animal's behavior would change if you forced changes in the environment (like you did in the experiment). So this odds ratio would tell you nothing of interest.

The causal parameter of interest is likely the difference in the means of the animal's behavior between the environmental conditions. This parameter has a causal interpretation because it can be identified as the typical change in the animal's behavior that occurs when the environment is intervened upon (i.e., changed by a force external to the animal). This parameter is useful for science because it helps explain animal behavior, which the odds ratio in the previous model does not.
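As a toy sketch of that estimand (all numbers invented here, since the study's data aren't shown): when B is randomized, the estimate is simply a difference of group means.

```python
# Hypothetical data: B is randomized, and C depends on B with a true
# effect of 8.0 in this made-up generating process.
import random
from statistics import fmean

random.seed(1)
rows = [(b, 20.0 + 8.0 * b + random.gauss(0, 2)) for b in [0, 1] * 50]

# Difference in mean C between the B = 1 and B = 0 conditions estimates
# the causal effect of the environmental manipulation on the behavior.
effect = fmean(c for b, c in rows if b == 1) - fmean(c for b, c in rows if b == 0)
print(f"estimated causal effect of B on C: {effect:.2f}")
```

With A also randomized, the same comparison can be made within each level of A, or both factors can be entered in one model.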

Finally, the second regression, which adds A as a predictor of B, makes no sense at all: A and B are independent when manipulated separately by the experimenter, so the coefficient on A reflects no real relationship. In fact, because the model also conditions on C, the coefficient will likely be spuriously nonzero due to collider bias and should not be interpreted (see Elwert & Winship, 2014, for an explanation of this).
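A small simulation (hypothetical data and effect sizes, not the study's) makes the collider problem concrete: A and B are generated independently, yet once you condition on C, which both of them cause, they become strongly associated.

```python
# A and B are independent coin flips; C is caused by both plus noise.
import random

random.seed(0)
n = 20_000
A = [random.random() < 0.5 for _ in range(n)]
B = [random.random() < 0.5 for _ in range(n)]
C = [10 * a + 20 * b + random.gauss(0, 3) for a, b in zip(A, B)]

def phi_corr(xs, ys):
    """Pearson correlation (the phi coefficient for 0/1 variables)."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / m
    vx = sum((x - mx) ** 2 for x in xs) / m
    vy = sum((y - my) ** 2 for y in ys) / m
    return cov / (vx * vy) ** 0.5

marginal = phi_corr(A, B)  # near zero: A and B really are independent
# Conditioning on C: restrict to a narrow band of C values (a collider).
stratum = [(a, b) for a, b, c in zip(A, B, C) if 15 <= c <= 25]
conditional = phi_corr([a for a, _ in stratum], [b for _, b in stratum])
print(f"marginal r = {marginal:+.3f}, conditional-on-C r = {conditional:+.3f}")
```

Within the C band, seeing A = 1 makes B = 1 less likely (the band is "explained" by one cause or the other), so the conditional association is strongly negative even though the marginal one is zero.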


One interesting tidbit is the finding that if two groups are normally distributed with the same variance, the parameters in the logistic regression model predicting group membership from the continuous variable can be directly computed from the means and variances, implying a specific mathematical connection between the difference in means and the odds ratio. This was described here in a very clean and clear derivation and is worth a look.
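As a numerical sanity check (my own parameter choices, not taken from the linked post): the Bayes posterior for class membership under two equal-variance normals is exactly logistic in x, with slope (mu1 - mu0) / sigma^2 and intercept log(p1/p0) + (mu0^2 - mu1^2) / (2 sigma^2).

```python
# Verify that the Bayes posterior under two equal-variance normals
# matches the closed-form logistic model with these coefficients.
import math

mu0, mu1, sigma, p1 = 0.0, 2.0, 1.5, 0.3   # arbitrary illustrative values
p0 = 1 - p1
b1 = (mu1 - mu0) / sigma ** 2
b0 = math.log(p1 / p0) + (mu0 ** 2 - mu1 ** 2) / (2 * sigma ** 2)

def normal_pdf(x, mu, s):
    return math.exp(-((x - mu) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

for x in [-2.0, 0.0, 1.0, 3.5]:
    bayes = p1 * normal_pdf(x, mu1, sigma) / (
        p1 * normal_pdf(x, mu1, sigma) + p0 * normal_pdf(x, mu0, sigma)
    )
    logistic = 1 / (1 + math.exp(-(b0 + b1 * x)))
    assert abs(bayes - logistic) < 1e-12   # identical up to rounding
```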

Noah
  • A and B are independent of each other, they both occur in nature in either state (here, 0 and 1), and they can't be affected by the animal being observed. I will read your references tomorrow morning. Thanks. (I've updated the post with this information.) –  Jul 16 '20 at 03:22
  • Also, the animal can be observed (measuring C) independently of A and B being manipulated, and the natural state of A and B can be noted when taking the measurement. –  Jul 16 '20 at 03:43
  • ... and, the researcher expected A to moderate B. –  Jul 16 '20 at 04:36
  • I read a lot of that article (Elwert). It's very good. I'm not convinced it's relevant. Here, the environment will always consist of one of the four states used in the study. In Elwert's simple example, beauty and/or talent leads to acting success. Turning that around and saying acting success without beauty means there's talent is biased. Intuitively, there could be other explanations, such as charisma or charm - so the weight assigned to talent could be overestimated. -- That is not the case here. There are no other alternatives. I'm updating the description with this information. –  Jul 16 '20 at 17:05
  • $A$ and $B$ are independent in your experiment and cause $C$. Conditioning on $C$ to estimate the effect of $A$ on $B$, which is what you do when you include $C$ in a regression of $B$ on $A$, as in model 2, is the *definition* of conditioning on a collider, which is exactly what the article is about. – Noah Jul 17 '20 at 01:58
  • @Noah I'm not sure that analysis should be black and white. I think it's intended as a rule of thumb. A generalization is not being made outside of A and B being present. –  Jul 17 '20 at 02:38

Great question. As A, B, and C occur together or not at all, I don't think any of the bias categories apply.

Regarding logistic regression: if there is an interaction effect between A and B when predicting C, then the value of A would be useful in predicting B once C is included. Logistic regression relies on the linear separability of classes, so if there is no interaction effect, A won't be helpful in predicting B. (I'm not talking about any expansions beyond just using A, B, and C in a new logistic regression model; whether there's a more effective modeling technique is a separate issue.)

Consider the following dataset, where there is an (extreme) interaction effect between A and B on C. Here C and A together predict B.

A  B  C
2  1  40
2  0  20
3  1  60
3  0  30
4  1  80
4  0  40
0  1  10
...
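Here is a sketch of the generating rule I had in mind for this table (C = 10*b when a = 0; otherwise C = 10*a when b = 0 and C = 20*a when b = 1); under that rule B is a deterministic function of A and C.

```python
# The (assumed) deterministic generating rule behind the table above.
def c_from(a, b):
    if a == 0:
        return 10 * b
    return 20 * a if b == 1 else 10 * a

# Under that rule, B can be recovered exactly from A and C.
def b_from(a, c):
    if a == 0:
        return 1 if c == 10 else 0
    return 1 if c == 20 * a else 0

rows = [(2, 1, 40), (2, 0, 20), (3, 1, 60), (3, 0, 30),
        (4, 1, 80), (4, 0, 40), (0, 1, 10)]
for a, b, c in rows:
    assert c_from(a, b) == c and b_from(a, c) == b
```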
  • when you say "If there is no interaction effect, A won't be helpful, assuming A and B are independent (and they should be)." I don't agree. If you imagine a simple DAG with three nodes A, B, and C and two arrows starting at A and B, respectively, and both pointing to C, then A and B will be conditionally associated given C, even when (i) there is no marginal association between A and B and (ii) A and B do not have an interaction – psboonstra Jul 17 '20 at 03:00
  • @psboonstra The context is logistic regression, which works with classes that are linearly separable. If A isn't affecting B's contribution to C, and B isn't affecting A's contribution to C, how is A (or B) providing information about B (or A), given C? -- I didn't consider other techniques, such as neural networks, which could be capable of finding non-linearly separable associations between the three variables. (I updated my answer to limit my statement to logistic regression.) –  Jul 17 '20 at 15:11
  • I believe my comment still applies. I have in mind something like the DAG on [page 8, equation (29) here](http://www.cs.columbia.edu/~blei/fogm/2016F/doc/graphical-models.pdf). Matching notation, I am thinking that X and Z in that DAG respectively correspond to A and B in this example, and Y corresponds to C. Conditioning on C may induce correlation between A and B. – psboonstra Jul 17 '20 at 15:44
  • The same idea is illustrated [in Example 17.11, p271 of Wasserman's All of Statisticis](https://link.springer.com/content/pdf/10.1007%2F978-0-387-21736-9_17.pdf) – psboonstra Jul 17 '20 at 15:47
  • @psboonstra I'm not seeing anything there that can be picked up with logistic regression. Can you come up with a simple dataset demonstrating this? (With two binary columns and a continuous column.) –  Jul 17 '20 at 16:58
  • `n = 1e4` `a – psboonstra Jul 17 '20 at 17:14
  • a+b is defining an interaction: "if a=0 then rnorm(b)", "if a=1 then rnorm(b+1)". --- In my example dataset in my proposed answer: "if a=0 then b*10, elseif b=0 then a*10, elseif b=1 then a*20". –  Jul 18 '20 at 03:27
  • Stackexchange is telling me that I should 'avoid extended discussions in comments', so if we're still not on the same page, we should move this discussion to a chat. Fortunately we've identified the source of our misunderstanding, namely that we have different understandings of the definition of an interaction. See [Wikipedia's section on Interactions](https://en.wikipedia.org/wiki/Interaction_(statistics)). I'm not saying anything about whether your generating model has an interaction or not; however, my true generating model does *not* contain an interaction. Happy to talk more over chat. – psboonstra Jul 18 '20 at 15:10
  • @psboonstra Your own reference defines your model as an interaction - "In statistics, an interaction ... describes a situation in which the effect of one causal variable on an outcome depends on the state of a second causal variable." As Jason said, if a=0 then c = rnorm(b); if a=1 then c = rnorm(b+1). It's more than an effect: 'a' defines how 'b' will be used. –  Jul 19 '20 at 01:51
  • I'm happy to talk more in a StackExchange chat I created: it's called Interactions and you can find it [here](https://chat.stackexchange.com/) I would first encourage you to scroll down to the Introduction section of the Wikipedia page I linked to, which contrasts an example of a model with two explanatory variables without interactions (which I am using) against a model with two explanatory variables with an interaction. Hopefully that clears up our confusion. – psboonstra Jul 19 '20 at 02:19