
In supervised learning, we refer to the regressors as independent variables and the response variable as dependent, but from a probabilistic standpoint I am having trouble understanding this terminology.

To break down my confusion, I think it makes sense to consider two separate cases: (1) the regressors are fixed / constant / deterministic, and (2) the regressors are random variables.


(1)

Constants can also be viewed as random variables. We know from probability theory that a constant random variable is independent of any other random variable, and we also know that independence is symmetric: if $X$ is independent of $Y$, then $Y$ is independent of $X$. You can see this easily from conditional probability: $P(X,Y) = P(X|Y)P(Y) = P(Y|X)P(X)$. So if $X$ is independent of $Y$, then $P(X|Y) = P(X)$, and substituting this into the identity gives $P(X)P(Y) = P(Y|X)P(X)$, so $P(Y|X)$ must be $P(Y)$.
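Here is a quick numerical illustration of that claim, as a minimal NumPy sketch; the constant value 5 and the normal distribution for $Y$ are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# X is a "constant random variable": every draw equals 5.
x = np.full(100_000, 5.0)
# Y is any other random variable, e.g. standard normal.
y = rng.standard_normal(100_000)

# Independence means the distribution of Y given X equals the marginal
# distribution of Y. With X constant, conditioning on X = 5 selects
# every sample, so the two distributions coincide trivially.
print(np.cov(x, y)[0, 1])            # 0: a constant covaries with nothing
print(y.mean(), y[x == 5.0].mean())  # identical: conditioning changes nothing
```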

But how does this make sense in the context of supervised learning, where we assume that $Y$ depends on $X$ but not vice versa?


(2)

The same idea holds as above, except that $X$ is no longer fixed, as the sketch below illustrates.
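As a companion sketch (again with made-up distributions), when $X$ is random the two variables can be genuinely dependent, and that dependence runs both ways:

```python
import numpy as np

rng = np.random.default_rng(0)

# With random X the two variables can be genuinely dependent, and that
# dependence is symmetric: conditioning either way changes the distribution.
x = rng.standard_normal(100_000)
y = 2.0 * x + rng.standard_normal(100_000)

print(y.mean(), y[x > 1].mean())  # conditioning on X shifts the mean of Y
print(x.mean(), x[y > 1].mean())  # conditioning on Y shifts the mean of X too
```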

24n8

1 Answer


The "dependent" and "independent" terminology for the variables is unfortunate terminology, which is best avoided. Statistical dependence is always bidirectional ---i.e., if a variable is statistically dependent on another variable, then that second variable is also statistically dependent with the first variable. In a regression model the two variables are posited to have a statistical relationship. We treat the explanatory (regressor) variables $\mathbf{x}$ as fixed and we model the regression function $u(\mathbf{x}) = \mathbb{E}(Y|\mathbf{x})$, which is the conditional expected value of the response (regressand) variable $Y$. See this related question for more discussion on the unfortunate terminology.

Ben