
Is there a canonical example of data that are conditionally independent? In other words, $X_1,\ldots,X_p$ are mutually independent given $Y$. This is the foundational assumption of the naive Bayes classifier, but it's not clear to me which data-generating processes it is best suited to.

I know it's easy to come up with examples where the $X$'s are defined as a function of $Y$. For instance, $X_i \mid Y=y \sim N(y,\sigma^2)$ for each $i$ is one obvious case. However, these examples feel unnatural because the $X$'s are typically covariates and $Y$ is typically a response (at least with the naive Bayes classifier), so it's awkward when covariates are defined as a function of the response.

I'm interested only in cases where $Y$ is defined as a function of the covariates. An example of this might be $Y=\underset{i \in \{1,...,p\}} {\operatorname{arg\,max}}\{X_i\}$, though this example doesn't satisfy conditional independence. Are there any good alternative examples which do satisfy the assumption?

jjet
  • Are you looking for a dataset that can be reasonably assumed to have this property, or a theoretical example in terms of random variables? – khol Apr 25 '18 at 13:59
  • I'm only looking for a theoretical example where $Y$ is a function of the $X$'s. – jjet Apr 25 '18 at 14:02

1 Answer


However, these examples feel unnatural because the $X$'s are typically covariates and $Y$ is typically a response (at least with the naive Bayes classifier) (...) it's not clear to me which data-generating processes it is best suited to.

This is not necessarily true. It's easier to think of the Naive Bayes setup as a model where the $X$'s are caused by $Y$. As a canonical example, imagine $Y$ is a disease and the $X$'s are the symptoms. Your task is to predict the disease (cause) from the symptoms (effects). Then the Naive Bayes model would look like:

[Figure: DAG of the Naive Bayes model, with $Y$ as the common parent of $X_1, \ldots, X_p$]

And in this case the $X$'s would be independent conditional on $Y$. Note, however, that even in this case the Naive Bayes model is still very simplified, and it suffers from what is usually called the "single-fault" assumption.
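
For concreteness, here is a minimal sketch of this generative story (the disease names and probabilities are invented purely for illustration): first sample the disease $Y$, then sample each symptom independently given $Y$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical disease/symptom numbers, invented for illustration.
diseases = ["flu", "cold"]
p_disease = [0.3, 0.7]                      # P(Y)
p_symptom = {"flu":  [0.9, 0.8, 0.2],       # P(X_i = 1 | Y = flu)
             "cold": [0.6, 0.3, 0.7]}       # P(X_i = 1 | Y = cold)

n = 100_000
ys = rng.choice(diseases, size=n, p=p_disease)
xs = np.array([rng.random(3) < p_symptom[y] for y in ys]).astype(int)

# Conditional on the disease, the symptoms are independent by construction:
# off-diagonal correlations within each class should be near zero.
print(np.corrcoef(xs[ys == "flu"], rowvar=False).round(2))
```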

I'm interested only in cases where $Y$ is defined as a function of the covariates.

It's easy to see graphically that this will be very hard to do. If you draw a DAG where the $X$'s are the parents of $Y$, this is what you have now.

[Figure: DAG with $X_1, \ldots, X_p$ as parents of $Y$]

You can see that conditioning on $Y$, which is now a common effect (collider) of the $X$'s, would in general make the $X$'s dependent, not the opposite. Thus, to create independence this way, you would need to fine-tune the parameters in order to make the $X$'s independent conditional on $Y$ despite this structure. That is, you would have to generate a distribution that is unfaithful to the graph.

Thus, if you are trying to predict a consequence from its causes using Naive Bayes, this is a set-up where it's almost guaranteed by construction that the conditional independence of the $X$'s given $Y$ will not hold.
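
To see the collider effect concretely, here is a minimal simulation sketch (the linear structural equations and coefficients are illustrative assumptions, not taken from the answer): two marginally independent causes become dependent once we condition on their common effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x1, x2 = rng.normal(size=(2, n))    # marginally independent causes
y = x1 + x2 + rng.normal(size=n)    # common effect (collider)

print(np.corrcoef(x1, x2)[0, 1])    # ~0: independent before conditioning
sel = np.abs(y) < 0.1               # crude way to condition on Y ~ 0
print(np.corrcoef(x1[sel], x2[sel])[0, 1])  # clearly negative: dependence induced
```

Selecting samples with $Y$ near a fixed value is a crude stand-in for exact conditioning, but it makes the induced negative dependence visible.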

As requested, for an example where conditioning on $Y$ makes $X_1$ and $X_2$ independent, consider a multivariate normal model in which all variables have unit variance. Let $R_{x_1x_2.y}$ denote the regression coefficient of $X_1$ on $X_2$ controlling for $Y$; with unit variances it is proportional to the partial covariance $\sigma_{x_1x_2} - \sigma_{yx_1}\sigma_{yx_2}$, so for it to be zero it suffices to make $\sigma_{x_1x_2} = \sigma_{yx_1}\sigma_{yx_2}$.

For a numerical example, consider the model $X_1 = U_{X_1}$, $X_2 = 0.25 X_1 + U_{X_2}$ and $Y = 0.4X_1 + 0.4X_2 + U_Y$, where the variances of the disturbances $U$ are adjusted so that the variables have unit variance. This leads to $\sigma_{x_1x_2} = 0.25$ and $\sigma_{yx_1} = \sigma_{yx_2} = 0.5$; thus $\sigma_{x_1x_2} = \sigma_{yx_1}\sigma_{yx_2}$ and therefore $R_{x_1x_2.y} = 0$.
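
As a quick numerical sanity check of this example (a numpy sketch; the disturbance standard deviations follow from the unit-variance requirement above), the residual correlation of $X_1$ and $X_2$ after partialling out $Y$ should be approximately zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Disturbance variances chosen so X1, X2 and Y all have unit variance:
# Var(U_X1) = 1, Var(U_X2) = 1 - 0.25**2,
# Var(U_Y)  = 1 - 2*0.16 - 2*0.4*0.4*0.25 = 0.6
x1 = rng.normal(0, 1.0, n)
x2 = 0.25 * x1 + rng.normal(0, np.sqrt(1 - 0.25**2), n)
y = 0.4 * x1 + 0.4 * x2 + rng.normal(0, np.sqrt(0.6), n)

# Residualize X1 and X2 on Y (slope = cov/var); the residual correlation
# is the partial correlation of X1 and X2 given Y, which should be ~0.
r1 = x1 - np.cov(x1, y)[0, 1] / np.var(y) * y
r2 = x2 - np.cov(x2, y)[0, 1] / np.var(y) * y
print(np.corrcoef(r1, r2)[0, 1])
```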

Carlos Cinelli
  • "this is a set-up where it's almost guaranteed by construction that the conditional independence of the X's given Y will not hold." Is there any formal/mathematical way to express this? – khol Apr 25 '18 at 22:36
  • Also, "almost guaranteed" acknowledges there may be room for a counterexample. And that's exactly what I'm looking for. Like I said, it's easy to construct conditional independence examples where $X's$ depend on $Y$. I'd like to know if/how the dependence can be reversed. – jjet Apr 26 '18 at 15:03
  • @jjet graphical models are a good formalism to understand why conditioning on a common effect will make causes dependent. You can create examples where conditioning on $Y$ can make variables independent by fine-tuning parameters; I put an example with a multivariate normal in the answer. – Carlos Cinelli Apr 27 '18 at 02:10
  • Thank you. I toyed around with the multivariate normal for a while but couldn't come up with the right example. Yours totally works. I guess more generally, if we have $p$ $X$'s, then the covariance matrix of $(X_1, \ldots, X_p, Y)$ should be such that $\sigma_{i,i}=1$, $\sigma_{i,j}=\sigma_{j,i}=\theta$ if $j=p+1$ and $i \ne p+1$, and $\sigma_{i,j}=\theta^2$ otherwise. Then $X_1,\ldots,X_p$ are independent given $Y$. This is just what I was looking for. – jjet Apr 27 '18 at 16:11
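
As a quick check of the construction proposed in this last comment (a numpy sketch; the values of $p$ and $\theta$ are arbitrary): with unit variances, the conditional covariance of the $X$'s given $Y$ is $\Sigma_{XX} - \sigma_{XY}\sigma_{XY}^\top$, which is diagonal under this construction, so for a multivariate normal the $X$'s are jointly independent given $Y$.

```python
import numpy as np

# Construction from the comment: unit variances, cov(X_i, Y) = theta,
# cov(X_i, X_j) = theta**2 for i != j. p and theta are arbitrary choices.
p, theta = 4, 0.5
S = np.full((p + 1, p + 1), theta**2)   # X-X covariances
S[:, -1] = S[-1, :] = theta             # X-Y covariances
np.fill_diagonal(S, 1.0)                # unit variances

# For a multivariate normal, Cov(X | Y) = S_XX - s_XY s_XY^T / Var(Y):
cond_cov = S[:p, :p] - np.outer(S[:p, -1], S[-1, :p])
print(np.round(cond_cov, 12))           # (1 - theta**2) * I: off-diagonals vanish
```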