
Let $Y$ be a real random variable and $X$ be a real random vector. In a nonparametric model with additive noise, we assume the relationship

$$Y = f(X) + \epsilon$$

for some unknown regression function $f$ and noise $\epsilon$. This assumption is in contrast to the general nonparametric model, where no assumption about the additivity of the noise is made. Now I'm wondering why not every model can be written in the former form.

We can always take $f(x) := E[Y \,|\, X = x]$ and $\epsilon = Y - f(X)$. This gives the form

$$Y = f(X) + \epsilon$$

Moreover, we find $E[\epsilon] = 0$ and $E[X\epsilon] = 0$ (see the computation below). Am I missing some assumption for the nonparametric regression model with additive noise that is not satisfied here? Otherwise, it seems to me that the general nonparametric regression model and the nonparametric regression model with additive noise are equivalent.
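For reference, both moment conditions follow from the tower property, assuming $Y$ and $XY$ are integrable:

$$E[\epsilon] = E[Y] - E\big[E[Y \mid X]\big] = 0, \qquad E[X\epsilon] = E[XY] - E\big[E[XY \mid X]\big] = 0.$$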

rkvymvqt
  • Check the first bullet in this document: https://www.stat.cmu.edu/~ryantibs/statml/review/modelbasics.pdf – Zen Jun 19 '21 at 14:53
  • @Zen Thank you for the link! This does seem to agree with my perception of nonparametric regression, right? I'm a little bit confused by the answer below. – rkvymvqt Jun 19 '21 at 16:50
  • @Zen The linked document seems to agree also that the nontrivial assumption in the _additive noise_ model is just the independence of the error term $\epsilon$ and $X$. Therefore, a model is an additive noise model iff $\epsilon$ and $X$ in the above decomposition are independent, right? – rkvymvqt Jun 19 '21 at 16:56
  • The document above assumes you hope to minimize the MSE and uses the Hilbert space projection theorem to find the minimizer $E[Y|X=x]$ under the assumption that the errors are orthogonal. – Ariel Jun 19 '21 at 16:58
  • Note that minimizing MSE is perfectly natural and fits well with the projection interpretation of the conditional expectation, but it does constitute adding some additional structure. – Ariel Jun 19 '21 at 17:06
  • @Ariel When I understand the document correctly, the objective of minimising the MSE is introduced _after_ the initial observation. – rkvymvqt Jun 19 '21 at 17:14
  • I am pretty sure in their first bullet point they note they are trying to minimize the squared error. – Ariel Jun 19 '21 at 17:20
  • @Ariel You are totally right! The first bullet point contains references to the MSE. But that is just an objective, not an assumption right? So no assumption was used to deduce the representation. – rkvymvqt Jun 19 '21 at 17:24
  • Right, but I think choosing this objective may/will have non-trivial implications for the estimation procedure, i.e. the additive error form. – Ariel Jun 19 '21 at 17:26
  • @Zen Can you help us out in the chat? – rkvymvqt Jun 19 '21 at 17:45

1 Answer


Edit: Based on some discussion with rkvymvqt, from a purely agnostic perspective one should always be able to write,

$$Y=f(X)+\epsilon$$

by simply defining $\epsilon = Y - f(X)$, and indeed, by the Doob–Dynkin lemma, we can take $f(x) = E[Y|X=x]$. In some sense, this is just like writing $Y = X + (Y - X)$. I think the issue that we as statisticians are interested in is more the interpretation and recovery of $f$ from $(Y, X)$. In that case, writing $f$ like this does restrict our interpretation of it, so perhaps it is not as general as finding the "true" $f$ that fits $Y = f(X)$ (without error). Thus, the original answer below is more a comment on how we define the joint relationship for the interpretation of our model. Note that this does not mean the two models are equivalent, because we are recovering two different functions, but it does mean we can always represent a general model as a nonparametric regression when we do not care about the interpretation of $f$.
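To make this concrete, here is a minimal simulation sketch (the heteroscedastic model, the $\sin$ function, and the binning estimator are just illustrative choices of mine, not anything from the discussion above): the decomposition $Y = f(X) + \epsilon$ with $f(x) = E[Y \mid X = x]$ satisfies the moment conditions, yet $\epsilon$ need not be independent of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A non-additive data-generating process: the noise scale depends on X,
# so Y = f(X) + eps with f(x) = E[Y | X = x] gives E[eps] = 0 and
# E[X eps] = 0, but eps is NOT independent of X.
n = 100_000
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(3.0 * x) + np.abs(x) * rng.normal(size=n)

# Crude estimate of f(x) = E[Y | X = x] by binning X into 50 cells.
edges = np.linspace(-1.0, 1.0, 51)
idx = np.digitize(x, edges) - 1          # bin index 0..49 for each point
f_hat = np.array([y[idx == k].mean() for k in range(50)])
eps = y - f_hat[idx]                     # residual eps = Y - f_hat(X)

print("E[eps]   ~", eps.mean())          # ~ 0 by construction
print("E[X eps] ~", (x * eps).mean())    # ~ 0 (bins are narrow)
# The residual variance changes with |x|, so eps and X are dependent:
print("Var(eps | |x| < 0.2) ~", eps[np.abs(x) < 0.2].var())
print("Var(eps | |x| > 0.8) ~", eps[np.abs(x) > 0.8].var())
```

The two residual variances come out very different, which is exactly the point: the additive representation always exists, but the error it produces can carry arbitrary dependence on $X$.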

Original Answer:

Yes, you are making a critical assumption that the error is additive. That is a functional form assumption. You are defining the conditional expectation to be,

$$E[Y|X=x]=f(x)$$

This is called a nonparametric regression model or equivalently an additive noise model. We could just as easily believe that our model should have a multiplicative error structure,

$$Y=f(X)\epsilon$$

See, for example, this paper. In this case we would have,

$$y_i = E[Y \mid X = x_i]\,\epsilon_i$$
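As a side note (also raised in the comments below), when $Y$, $f$, and $\epsilon$ are strictly positive, taking logs turns the multiplicative model into an additive one,

$$\log Y = \log f(X) + \log \epsilon,$$

but the additive error $\log \epsilon$ now attaches to $\log Y$ rather than $Y$, so for estimation and inference this is not a trivial transformation of the model.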

In the most general sense the nonparametric model is written as,

$$\{P_f:f\in\mathcal{F}\}$$

where $\mathcal{F}$ is some infinite-dimensional parameter space and our data were generated by the probability distribution $P_f$ for some parameter $f$.
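For example, the additive-noise model with Gaussian errors is the special case where $P_f$ is the law of $(X,\, f(X) + \epsilon)$ with $\epsilon \sim N(0, \sigma^2)$ independent of $X$, and $\mathcal{F}$ is an infinite-dimensional class of functions such as a Hölder or Sobolev smoothness class.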

Ariel
  • Thank you for your answer! How am I _defining_ the conditional expectation to be $E[Y \,|\, X = x] = f(x)$? Doesn't this follow from the [Doob–Dynkin lemma](https://en.wikipedia.org/wiki/Doob–Dynkin_lemma)? – rkvymvqt Jun 19 '21 at 16:47
  • Doob-Dynkin allows us to condition on a random variable via conditioning on the $\sigma$-algebra generated by that random variable. That does imply that we could write something like $E[Y|X]=f(X)$ but this is a representation that depends on the joint distribution of $X$ and $Y$. So, the addition of an additive error here is an assumption that is driven by the loss function we are considering and how we model the DGP. – Ariel Jun 19 '21 at 17:03
  • Thank you for helping me understand this topic! What do you mean by _[...] this is a representation that depends on the joint distribution of $X$ and $Y$_? The representation $Y = f(X) + \epsilon$ also depends on the joint distribution. – rkvymvqt Jun 19 '21 at 17:15
  • The paper you linked also explains that the multiplicative error structure $Y = f(X) \epsilon$ can be expressed in an additive model. – rkvymvqt Jun 19 '21 at 17:17
  • Right, but you take a stand on how that joint distribution is expressed. And sure, if you take logs of $Y=f(X)\epsilon$ you would have an additive structure, or you could use the transformation that they give. However, I could also write $Y=f(X,\epsilon)$ as a potential model. – Ariel Jun 19 '21 at 17:19
  • For some given $f$ you could surely find a way to separate $\epsilon$ but that also implies some transformation of the model. For purposes of estimation and inference this might not be a trivial transform of the data. – Ariel Jun 19 '21 at 17:22
  • Ok, so it is correct to say that for every two random variables, we find a representation $Y = f(X) + \epsilon$ with a suitable function $f$ (e.g. $f = E[Y \,|\, X = \,\cdot\,]$)? – rkvymvqt Jun 19 '21 at 17:22
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/126668/discussion-between-rkvymvqt-and-ariel). – rkvymvqt Jun 19 '21 at 17:25