
Notation

$y_i$ is observation $i$ of some response variable $Y$.

$\hat{y}_i$ is the value of $y_i$ predicted by the regression.

$\bar{y}$ is the average of all observations of the response variable.

$$ y_i-\bar{y} = (y_i - \hat{y_i} + \hat{y_i} - \bar{y}) = (y_i - \hat{y_i}) + (\hat{y_i} - \bar{y}) $$

$$( y_i-\bar{y})^2 = \Big[ (y_i - \hat{y_i}) + (\hat{y_i} - \bar{y}) \Big]^2 = (y_i - \hat{y_i})^2 + (\hat{y_i} - \bar{y})^2 + 2(y_i - \hat{y_i})(\hat{y_i} - \bar{y}) $$

$$ \sum_i ( y_i-\bar{y})^2 = \sum_i(y_i - \hat{y_i})^2 + \sum_i(\hat{y_i} - \bar{y})^2 + 2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]$$

$$ = SSRes + SSReg + Other $$

When $Other = 0$, as it is in (OLS) linear regression, $SSRes$ is a perfectly reasonable proxy for what strikes me as the real quantity of interest: $SSReg$. Since their sum is then fixed at $\sum_i (y_i-\bar{y})^2$, one decreases as the other increases, so we can get a strong model fit (high $SSReg$) by minimizing $SSRes$.
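For completeness, here is a quick sketch of why the cross term vanishes for OLS with an intercept. Writing $e_i = y_i - \hat{y}_i$ for the residuals, the normal equations give $\sum_i e_i = 0$ and $\sum_i e_i \hat{y}_i = 0$, so

$$ \sum_i (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = \sum_i e_i \hat{y}_i - \bar{y}\sum_i e_i = 0. $$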

However, $Other \ne 0$ in general, for instance in nonlinear regressions. A popular nonlinear regression these days is a neural network. While neural nets may be used mostly for classification problems, they are perfectly reasonable for regression problems, and in neural-network regressions I have seen $MSE$ used as the loss function. For instance, sklearn's MLPRegressor uses $SSRes$ as the loss function (same $\arg\min$ as $MSE$).

Minimizing $SSRes$ misses the $Other$ term! $SSRes$ could be very small, yet there could be a major contribution from the $Other$ term showing that the regression model is not actually good.

I've tried it out in Python, using some code I found on Stack Overflow for MLPRegressor. That $Other$ term definitely doesn't drop to zero.

from sklearn.neural_network import MLPRegressor
import numpy as np

np.random.seed(2019)  # seed NumPy (the stdlib random.seed does not affect NumPy/sklearn)

# Noise-free sine curve on [0, 1)
x = np.arange(0.0, 1, 0.001).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel()

# Fit a neural-network regression
nn = MLPRegressor(hidden_layer_sizes=(100,), activation='relu')
n = nn.fit(x, y)
train_y_pred = n.predict(x)

# Elementwise cross term from the decomposition above (without the factor of 2)
Other = (train_y_pred - np.mean(y)) * (y - train_y_pred)
sum(Other)
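As a sanity check of the decomposition itself, the same run can be broken into its pieces (reusing x, y, and train_y_pred from above); the gap between the total sum of squares and $SSRes + SSReg$ is exactly the cross term:

SST = np.sum((y - np.mean(y)) ** 2)               # total sum of squares
SSRes = np.sum((y - train_y_pred) ** 2)           # residual sum of squares
SSReg = np.sum((train_y_pred - np.mean(y)) ** 2)  # "explained" sum of squares
print(SST - (SSRes + SSReg))                      # equals 2 * sum(Other) from above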

Questions

  1. What is the reason for using $SSRes$ or $MSE$ loss when there is that $Other$ term?

  2. This might be more philosophy or perhaps not so different from the first question, but am I off-base to claim that $SSReg$ is the real value of interest and that we use $SSRes$ as a proxy because we're used to minimizing loss rather than maximizing gain?

Code for linear regression:

import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(2019)  # seed NumPy rather than the stdlib random module

# Predictor and response with additive Gaussian noise
X = np.random.normal(10, 1, 100).reshape(-1, 1)
X = np.sin(X)  # already shaped (100, 1); no further reshape needed
e = np.random.normal(0, 0.25, 100).reshape(-1, 1)
y = X + e

reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)
resid = y - y_pred

# Elementwise cross term; its sum is zero (up to floating point) for OLS with an intercept
Other = (y_pred - np.mean(y)) * (y - y_pred)
sum(Other)
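The same check as above for the linear fit (reusing y and y_pred from this block) shows the gap is zero up to floating-point error:

SST = np.sum((y - np.mean(y)) ** 2)
SSRes = np.sum((y - y_pred) ** 2)
SSReg = np.sum((y_pred - np.mean(y)) ** 2)
print(SST - (SSRes + SSReg))  # ~0 for OLS with an intercept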
Dave
  • Hummm. Maybe look at this for further thoughts https://stats.stackexchange.com/a/508394/99274. – Carl Mar 06 '21 at 19:25
  • @Carl I liked the question title when you first posted, but I do not see how my question is particularly related to yours. Could it have gotten lost in a pretty long post you wrote? Could you please point me to which part of your post pertains to my question? – Dave Mar 06 '21 at 23:10
  • R-squared in ANOVA is $R^2 = 1 - \text{SSE} / \text{SST}$. Something is missing in this, and I think you put your finger on it: the "other". Search for the title of your question above. – Carl Mar 07 '21 at 03:54
  • @Carl The “other” in ANOVA is zero. – Dave Mar 07 '21 at 04:15
  • I do not understand how one can have a correlation without an interaction term. Can you explain, please? – Carl Mar 07 '21 at 04:50
  • @Carl Correlation between what and what? Interaction between what and what? I give a code example at the end showing the “other” is zero (or super close to zero because of doing math on a computer) for a linear regression. – Dave Mar 07 '21 at 05:00
  • The errors in measuring $x$ and $y$ are often correlated. Suppose I throw darts at a board: won't my errors in $x$ and $y$ away from the bullseye be correlated as a result of my aim being more in an elliptical pattern than a rhomboid one? If the errors are uncorrelated, the error "cloud" will be a rhomboid, not an ellipsoid. – Carl Mar 07 '21 at 05:15
  • In ANOVA, how do you make an error measuring the categorical $X$? I just don’t follow what you’re thinking. – Dave Mar 07 '21 at 05:24
  • @Carl I don’t even mention an $x$ in my equations. Where is $x$ coming into play? – Dave Mar 07 '21 at 05:47
  • In code, you specify X = np.random.normal... and e = np.random.normal... Two random variates. Specified as independent without correlation. In that case, there is no interaction term to quantify, but is that case a good simulation model? Ah, that is the question, does it happen like that? – Carl Mar 07 '21 at 07:33
  • @Carl Correlation of the predictor variables has no influence on the “other” term being zero in OLS regression. Try out a simulation with np.random.multivariate_normal. – Dave Mar 07 '21 at 07:50
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/120524/discussion-between-dave-and-carl). – Dave Mar 07 '21 at 07:51
  • @Carl You took an interest in this. Perhaps you would be interested in reading an answer (which I believe is reasonable) a few months later. – Dave Nov 11 '21 at 23:58

2 Answers


Two years later, I can answer #1.

If we assume a Gaussian response variable (a common assumption), minimizing $SSRes$ is equivalent to maximum likelihood estimation of the regression parameters, and maximum likelihood feels like a natural way to estimate them.
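To spell that out: writing $f(x_i;\theta)$ for the model's predicted mean and assuming $y_i \sim N\big(f(x_i;\theta), \sigma^2\big)$ independently, the log-likelihood is

$$ \ell(\theta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i \big(y_i - f(x_i;\theta)\big)^2, $$

so for any fixed $\sigma^2$, maximizing $\ell$ over $\theta$ is exactly minimizing $\sum_i \big(y_i - f(x_i;\theta)\big)^2$, i.e., $SSRes$.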

I think that this leads to answering #2 in the negative. We minimize $SSRes$ instead of maximizing $SSReg$ because we aim to find the “most likely” regression parameters via maximum likelihood estimation, not maximize the variance we can explain. (That these two notions do not coincide, however, blows my mind.)

Dave

Humm, well, since you asked. The connection between MLE and ordinary least squares (OLS) has been established for linear regression with normally distributed residuals; the two are not equivalent in most (if not all) other circumstances. In general, OLS has omitted variable bias in the linear case, which is also problematic for nonlinear fitting. The magnitude of omitted variable bias is zero in special cases, e.g., when the $X_i$ are sequentially equidistant.

This can be understood by noting that OLS minimizes error only in the $y$ direction, not in both $x$ and $y$. To do the latter, one needs to perform some other type of regression, e.g., in the linear case, Deming regression, which requires knowledge of the variances in both $y$ and $x$. For a linear fit, Deming regression has no omitted variable bias. OLS minimizes prediction error in $y$; in the presence of omitted variable error, this makes the slope shallower but optimizes the $r$-value. Deming regression will have a steeper slope, as it no longer "splits" the $y$-value error to be too much on one side and too little on the other just to minimize that error globally. With Deming regression the correlation will be of lesser magnitude, but the regression line follows the "cloud" of data points more accurately, and extrapolation beyond the range of the data is more accurate. There are nonparametric procedures that reduce omitted variable bias in the linear case, e.g., Passing-Bablok and Theil-Sen, which unlike Deming regression do not require explicit knowledge of the relative variances of $x$ and $y$.
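To see the slope behaviour concretely, here is a minimal simulation sketch (illustrative only; the error-variance ratio $\delta$ is assumed known, and all variable names are just for this example) contrasting the OLS slope with the closed-form Deming slope when $x$ is observed with error:

import numpy as np

rng = np.random.default_rng(2019)
n = 1000
x_true = rng.normal(10, 1, n)
x_obs = x_true + rng.normal(0, 0.5, n)   # x measured with error
y = x_true + rng.normal(0, 0.5, n)       # true slope is 1
delta = 1.0                              # var(y errors) / var(x errors), assumed known

# Centered sums of squares and cross products
sxx = np.sum((x_obs - x_obs.mean()) ** 2)
syy = np.sum((y - y.mean()) ** 2)
sxy = np.sum((x_obs - x_obs.mean()) * (y - y.mean()))

# OLS slope of y on the noisy x: attenuated toward zero
b_ols = sxy / sxx

# Closed-form Deming slope (errors in both variables, known delta)
b_deming = (syy - delta * sxx + np.sqrt((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)

print(b_ols, b_deming)  # OLS slope is shallower; Deming slope is closer to 1

With these settings the reliability ratio is $1/(1+0.25)=0.8$, so the OLS slope should land near $0.8$ while the Deming estimate should land near the true slope of $1$.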

The formula you wrote, after changing the order of subtraction of $ \hat{y_i}-\bar{y}$, is written as $$ \sum_i ( y_i-\bar{y})^2 = \sum_i(y_i - \hat{y_i})^2 + \sum_i(\hat{y_i} - \bar{y})^2 - 2\sum_i\Big[ (y_i - \hat{y_i})(\bar{y} - \hat{y_i}) \Big]\;\;,$$

and has the same structure as the law of cosines $$c^2=a^2+b^2-2\, a\, b\cos(\theta)\;\;,$$ for the exact same reasons. Moreover, this is (changing the sign back again) the variance of the sum of two correlated variables; in the answers to that question, note the equivalence of the law of cosines, i.e., the vector form, to the variance treatment, where $\cos(\theta)$ and the correlation coefficient $r$ are equivalent.

Finally, error propagation, or at least its first-order Taylor-series approximation, takes the form $$\left(\frac{\sigma_f}{f}\right)^2 \approx \left(\frac{\sigma_a}{a} \right)^2 + \left(\frac{\sigma_b}{b}\right)^2 + 2\left(\frac{\sigma_a}{a}\right)\left(\frac{\sigma_b}{b}\right)\rho_{ab}\;,$$ where $\rho_{ab}$ is the correlation between $a$ and $b$.

To make a long story short, it would seem that for OLS in $y$ we are assuming that the $x$ and $y$ errors are uncorrelated: "Homoscedasticity and independence of the error terms are key hypotheses in linear regression where it is assumed that the variances of the error terms are independent and identically distributed and normally distributed. When these assumptions are not possible to keep, ... the variance of the parameters corresponding to the beta coefficients of the linear model can be wrong and their confidence intervals as well."
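In vector form the correspondence is exact: with $\mathbf{a} = \hat{\mathbf{y}} - \bar{y}\mathbf{1}$ (centered fitted values) and $\mathbf{b} = \mathbf{y} - \hat{\mathbf{y}}$ (residuals), the decomposition in the question is

$$ \|\mathbf{a} + \mathbf{b}\|^2 = \|\mathbf{a}\|^2 + \|\mathbf{b}\|^2 + 2\,\mathbf{a}^\top\mathbf{b} = SSReg + SSRes + 2\,\|\mathbf{a}\|\,\|\mathbf{b}\|\cos(\theta)\;\;, $$

so the question's $Other$ term is $2\,\mathbf{a}^\top\mathbf{b}$, and $\cos(\theta)$ measures how far the residuals are from being orthogonal to the centered fitted values. OLS with an intercept forces $\theta = 90^\circ$; a neural-network fit in general does not.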

Carl
  • There's a lot in here that's interesting, but I don't see how it answers the question I asked two years ago. – Dave Nov 12 '21 at 02:44
  • Dave, I just explained what your "other" is, and filled in the blanks in your questions. – Carl Nov 12 '21 at 02:47