I'm getting lost in the inference step of Gaussian process regression:
We start with a dataset $X \in \mathbb{R}^{n \times d}$ with corresponding observations $f = f(X)$ and some test data points $X_* \in \mathbb{R}^{n_* \times d}$ for which we want to infer function values $f_* = f(X_*)$.
The GP defines the joint distribution $p(f, f_* \mid X, X_*)$:
$$\begin{bmatrix}f\\ f_*\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\mu \\ \mu_*\end{bmatrix}, \, \begin{bmatrix}K & K_*\\ K_*^T & K_{**}\end{bmatrix}\right)$$
with ${\mu} = m(X)$, $\mu_* = m(X_*)$, $K = K(X,X) \, , \, K_* = K(X, X_*) \, , \, K_{**} = K(X_*, X_*)$.
Then we apply the rules of multivariate normal conditioning to obtain
$$f_* \mid f, X, X_* \sim \mathcal{N}\left(\mu_* + K_*^T K^{-1}(f-\mu) \, , \, K_{**} - K_*^T K^{-1} K_*\right)$$
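For concreteness, here is a minimal NumPy sketch of this conditioning step as I understand it (my own illustrative choices, not part of the question: a zero mean function $m(x) = 0$, a squared-exponential kernel, and $f(x) = \sin(x)$ as the underlying function):

```python
import numpy as np

def kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel K(A, B) between two sets of points."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(10, 1))        # training inputs
f = np.sin(X).ravel()                       # observed values f = f(X)
X_star = np.linspace(-3, 3, 50)[:, None]    # test inputs

K = kernel(X, X)                       # K  = K(X, X)
K_star = kernel(X, X_star)             # K_* = K(X, X_*)
K_starstar = kernel(X_star, X_star)    # K_** = K(X_*, X_*)

# Posterior mean and covariance from the conditioning formula above,
# using a Cholesky solve instead of explicitly inverting K (with a
# small jitter on the diagonal for numerical stability).
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(X)))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))  # K^{-1} (f - mu), with mu = 0
mu_star = K_star.T @ alpha                           # mu_* + K_*^T K^{-1} (f - mu)
v = np.linalg.solve(L, K_star)
cov_star = K_starstar - v.T @ v                      # K_** - K_*^T K^{-1} K_*
```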
Up to here everything makes sense, since any finite subset of the (possibly uncountable) family of Gaussian random variables defined by the GP is still jointly Gaussian, but:
Is all the information on the right-hand side known to us, i.e., does it derive directly from the test set $X_*$, for which the evaluations $f(X_*)$ are unknown?
What I don't understand, for example, is that by definition:
$$\mu_* = m(X_*) = \mathbb{E}[f(X_*)]$$
and while this makes sense for known tuples $(x, f(x))$, we don't know the values $f(X_*)$! What am I missing here?
Many thanks,
James