> While learning to predict $y$ from $x$, we are learning the joint distribution $p(x,y)$.
Yes and no. If you use a statistical model, then you assume some distribution for your data. On the other hand, not every statistical method assumes such a model. Take ordinary least squares regression as an example: the only thing it does is minimize the squared error, and no distributional assumptions are needed for that. This is even clearer for some machine learning algorithms; for example, $k$-nearest neighbors simply looks for "similar" observations in your data given some distance measure. So you can also learn patterns and similarities instead of distributions.
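
As a minimal sketch of that point (the data and parameter choices here are made up for illustration), the snippet below fits OLS with nothing more than a least-squares solve and makes a $k$-NN prediction using only Euclidean distances; neither step needs a probability model.

```python
import numpy as np

# OLS: the coefficients come from minimizing squared error alone;
# no distributional assumption is needed to compute them.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=100)

X1 = np.column_stack([np.ones(len(X)), X])        # add an intercept column
beta = np.linalg.lstsq(X1, y, rcond=None)[0]      # argmin ||X1 b - y||^2

# k-NN: predict by averaging the targets of the k closest points,
# using only a distance measure, no probability model at all.
def knn_predict(x_new, X, y, k=5):
    d = np.linalg.norm(X - x_new, axis=1)         # Euclidean distances
    return y[np.argsort(d)[:k]].mean()

print(beta, knn_predict(np.array([0.5, -0.2]), X, y))
```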
> Now consider an estimation problem: suppose $x$ follows a Gaussian distribution with unknown $\mu$ and known standard deviation, and $\mu$ follows a Poisson distribution with known parameters. So now if we want to estimate $\mu$ and then predict $x$, is it correct to say that we are learning the joint distribution $p(x,\mu)$?
Making it a little more formal: you assume a Bayesian model with a normal likelihood $P(X\mid\mu)$, known $\sigma$, and a Poisson prior for $\mu$. In that case you are indeed estimating a posterior distribution $P(\mu\mid X)$. Notice that if you want to estimate the joint or conditional distribution of the data and the parameter, you need to treat the parameter as a random variable, so you need a Bayesian model. Otherwise the parameter would be a fixed value and would not have a distribution. You can check this thread to learn more about the difference between the frequentist and Bayesian views of the likelihood.
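
To make the estimation concrete, here is one possible sketch under that setup; the data values and the prior rate are hypothetical. Since the Poisson prior puts mass on non-negative integers, the posterior $P(\mu\mid X)$ can be computed by simply enumerating a truncated grid of integer values for $\mu$.

```python
import numpy as np
from scipy import stats

# Hypothetical setup: Poisson(lam) prior on mu, Normal(mu, sigma) likelihood.
lam, sigma = 4.0, 1.0
x = np.array([5.2, 4.8, 6.1, 5.5])                 # observed data (made up)

mu_grid = np.arange(0, 30)                         # truncated Poisson support
log_prior = stats.poisson.logpmf(mu_grid, lam)
log_lik = stats.norm.logpdf(x[:, None], loc=mu_grid, scale=sigma).sum(axis=0)

log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()                                 # posterior P(mu | X) on the grid

mu_hat = (mu_grid * post).sum()                    # posterior mean of mu
x_pred = mu_hat                                    # prediction for a new x (its mean is mu)
print(mu_hat, x_pred)
```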
> Also, for learning to predict, are we actually learning $p(x,y)$ or $p(y\mid x)$?
It depends on the model: a discriminative model estimates $p(y\mid x)$ directly, while a generative model estimates the joint $p(x,y)$. Moreover, if you know the joint and marginal distributions, you can use them to obtain the conditional probabilities, and vice versa, since
$$ P(A \cap B) = P(A\mid B) \;P(B) \quad \text{and} \quad P(A\mid B) = \frac{P(A \cap B)}{P(B)}$$
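
As a small numeric illustration of that identity, the toy joint distribution below (values chosen arbitrarily) can be split into a marginal and a conditional and then recombined.

```python
import numpy as np

# Toy joint distribution P(A, B) over two binary variables.
joint = np.array([[0.30, 0.10],    # rows: A = 0, 1
                  [0.20, 0.40]])   # cols: B = 0, 1

p_b = joint.sum(axis=0)            # marginal P(B), summing over A
p_a_given_b = joint / p_b          # conditional P(A | B), column-wise

# Recover the joint from the conditional and the marginal:
assert np.allclose(p_a_given_b * p_b, joint)
```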