> While learning to predict $y$ from $x$, we are learning the joint distribution $p(x,y)$.
Yes and no. If you use a statistical model, then you assume some distribution for your data. On the other hand, not every statistical method assumes such a model. Take ordinary least squares regression as an example: the only thing it does is minimize the squared error, and no distributional assumptions are needed for that. This is even clearer for some machine learning algorithms; for example, $k$-nearest neighbors simply looks for "similar" observations in your data given some distance measure. So you can also learn patterns and similarities instead of distributions.
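
As a minimal sketch of that point (the data and parameter choices here are made up for illustration), the snippet below fits OLS with nothing more than a least-squares solve and makes a $k$-NN prediction using only Euclidean distances; neither step needs a probability model.

```python
import numpy as np

# OLS: the coefficients come from minimizing squared error alone;
# no distributional assumption is needed to compute them.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=100)

X1 = np.column_stack([np.ones(len(X)), X])        # add an intercept column
beta = np.linalg.lstsq(X1, y, rcond=None)[0]      # argmin ||X1 b - y||^2

# k-NN: predict by averaging the targets of the k closest points,
# using only a distance measure, no probability model at all.
def knn_predict(x_new, X, y, k=5):
    d = np.linalg.norm(X - x_new, axis=1)         # Euclidean distances
    return y[np.argsort(d)[:k]].mean()

print(beta, knn_predict(np.array([0.5, -0.2]), X, y))
```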
> Now consider an estimation problem: suppose $x$ follows a Gaussian distribution with unknown $\mu$ and known standard deviation, and $\mu$ follows a Poisson distribution with known parameters. So now if we want to estimate $\mu$ and then predict $x$, is it correct to say that we are learning the joint distribution $p(x,\mu)$?
Making it a little more formal: you assume a Bayesian model with a normal likelihood $P(X\mid\mu)$, known $\sigma$, and a Poisson prior for $\mu$. In that case you are indeed estimating a posterior distribution $P(\mu\mid X)$. Notice that if you want to estimate the joint or conditional distribution of the data and the parameter, you need to treat the parameter as a random variable, so you need a Bayesian model. Otherwise the parameter would be a fixed value and would not have a distribution. You can check this thread to learn more about the difference between the frequentist and Bayesian views of the likelihood.
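
To make the estimation concrete, here is one possible sketch under that setup; the data values and the prior rate are hypothetical. Since the Poisson prior puts mass on non-negative integers, the posterior $P(\mu\mid X)$ can be computed by simply enumerating a truncated grid of integer values for $\mu$.

```python
import numpy as np
from scipy import stats

# Hypothetical setup: Poisson(lam) prior on mu, Normal(mu, sigma) likelihood.
lam, sigma = 4.0, 1.0
x = np.array([5.2, 4.8, 6.1, 5.5])                 # observed data (made up)

mu_grid = np.arange(0, 30)                         # truncated Poisson support
log_prior = stats.poisson.logpmf(mu_grid, lam)
log_lik = stats.norm.logpdf(x[:, None], loc=mu_grid, scale=sigma).sum(axis=0)

log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()                                 # posterior P(mu | X) on the grid

mu_hat = (mu_grid * post).sum()                    # posterior mean of mu
x_pred = mu_hat                                    # prediction for a new x (its mean is mu)
print(mu_hat, x_pred)
```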
> Also, for learning to predict, are we actually learning $p(x,y)$ or $p(y\mid x)$?
It depends on the model: a discriminative model estimates $p(y\mid x)$ directly, while a generative model estimates the joint $p(x,y)$. Moreover, if you know the joint and marginal distributions, you can use them to obtain the conditional probabilities, and vice versa, since
$$ P(A \cap B) = P(A\mid B) \;P(B) \quad \text{and} \quad P(A\mid B) = \frac{P(A \cap B)}{P(B)}$$
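
As a small numeric illustration of that identity, the toy joint distribution below (values chosen arbitrarily) can be split into a marginal and a conditional and then recombined.

```python
import numpy as np

# Toy joint distribution P(A, B) over two binary variables.
joint = np.array([[0.30, 0.10],    # rows: A = 0, 1
                  [0.20, 0.40]])   # cols: B = 0, 1

p_b = joint.sum(axis=0)            # marginal P(B), summing over A
p_a_given_b = joint / p_b          # conditional P(A | B), column-wise

# Recover the joint from the conditional and the marginal:
assert np.allclose(p_a_given_b * p_b, joint)
```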