In a regression setting, one wants to identify a model of a process of interest from noisy measurements. The model usually looks like this: $$ y_i = f(x_i, \theta_1) + \varepsilon_i.$$ Here, $y_i$ and $x_i$ are measurements, $\varepsilon_i\sim P_\varepsilon(x_i, \theta_2)$ is noise (summarizing all kinds of disturbances), and $\theta_1, \theta_2$ are parameters to be identified by the regression procedure. $P_\varepsilon(x_i, \theta_2)$ is the distribution of the noise, which sometimes also has to be learned from the data and may depend on the value of $x$.
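To make the notation concrete, here is a minimal toy sketch of data generated according to this model. The particular $f$ and the $x$-dependent noise scale are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200
x = rng.uniform(-3.0, 3.0, size=n)

# f(x, theta_1): the (unknown) process; a made-up example here
f = np.sin(2.0 * x) + 0.5 * x

# P_eps(x, theta_2): Gaussian noise whose standard deviation depends on x,
# i.e., measurements near x = 0 are more reliable than those far away
sigma = 0.1 + 0.3 * np.abs(x)
eps = rng.normal(loc=0.0, scale=sigma)

y = f + eps  # y_i = f(x_i, theta_1) + eps_i
```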
A common approach in ML appears to be to combine
- a super complex and over-parameterized model $f(x, \theta_1)$ for the process, e.g., a deep neural network, and
- a super simple and under-parameterized model $P_\varepsilon(x, \theta_2)$, e.g., $P_\varepsilon(x, \theta_2) = \mathcal{N}(0, \sigma_\varepsilon^2)$ with only the (constant) variance to be learned.
Now I know that there are many approaches with more complex noise models (Gaussian process regression comes to mind), but the above appears to be a standard approach to me.
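As a concrete example of what I mean by the "usual approach" (my own minimal sketch, not a reference implementation): train a flexible network with a plain MSE loss, which, as far as I understand, is exactly maximum likelihood under $\varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2)$ with constant variance:

```python
import torch
import torch.nn as nn

# toy data: y = f(x) + eps, with f and the noise scale made up for illustration
x = torch.linspace(-3.0, 3.0, 200).unsqueeze(1)
y = torch.sin(2.0 * x) + 0.5 * x + 0.3 * torch.randn_like(x)

# over-parameterized f(x, theta_1): a small but already very flexible MLP
model = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # implicitly: eps ~ N(0, sigma^2) with constant sigma

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```

Minimizing the squared error gives a point estimate of $\theta_1$ identical to the Gaussian maximum-likelihood estimate; $\sigma_\varepsilon$ drops out entirely, which is part of what prompts my questions below.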
Questions:
- Is my impression of what's currently "usually" done in deep learning wrong? Do people routinely use more complex noise models? If so, which ones, and how?
- If the above depiction is indeed roughly correct, how is it possible to separate signal and noise in any meaningful way when the noise model is so unrealistic? For instance, whether measurements in a certain region of the input space are assumed to be reliable or not should affect the estimate of $f(x)$, since that scales the influence of the prior over $f(x)$ relative to the data. (A sketch of the kind of input-dependent noise model I have in mind follows below.)
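To clarify what I mean by a "more complex" noise model, here is a hedged sketch (again my own toy construction, with made-up architecture choices): the network predicts both a mean and a log-variance as functions of $x$, and is trained with the Gaussian negative log-likelihood, so residuals in regions it deems noisy are automatically down-weighted.

```python
import torch
import torch.nn as nn

# f(x, theta_1) and log sigma^2(x, theta_2) share a backbone with two output heads
class HeteroscedasticNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.mean_head = nn.Linear(64, 1)
        self.log_var_head = nn.Linear(64, 1)

    def forward(self, x):
        h = self.backbone(x)
        return self.mean_head(h), self.log_var_head(h)

def gaussian_nll(mean, log_var, y):
    # negative log-likelihood of N(mean, exp(log_var)), up to an additive constant
    return 0.5 * (log_var + (y - mean) ** 2 / log_var.exp()).mean()

net = HeteroscedasticNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# toy data with input-dependent noise (same spirit as the first sketch)
x = torch.linspace(-3.0, 3.0, 200).unsqueeze(1)
y = torch.sin(2.0 * x) + 0.5 * x + (0.1 + 0.3 * x.abs()) * torch.randn_like(x)

for step in range(2000):
    opt.zero_grad()
    mean, log_var = net(x)
    loss = gaussian_nll(mean, log_var, y)
    loss.backward()
    opt.step()
```

The $(y_i - \mu(x_i))^2 / \sigma^2(x_i)$ term is what would implement the "reliability weighting" I am asking about; I believe PyTorch even ships a `GaussianNLLLoss` that does essentially this. My question is whether something like this (or richer noise models still) is actually routine practice.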