Should I evaluate my regression algorithm using MSE or correlation?

Question

In the papers I read, authors who implement a regression algorithm (mainly with convolutional networks) sometimes evaluate it using correlation, even though they used the mean squared error (MSE) as the loss function. What does correlation give that the MSE cannot? In other words, does correlation give a better understanding of a regression algorithm's performance than MSE does? If so, how?

To illustrate what I usually see in the literature:

X -->|____MODEL____|--> Y'

Where X is the input and Y' is the approximated output. Some papers judge the performance of their model by calculating the correlation between Y (desired output) and Y' (actual output) instead of the MSE between Y and Y'.

Correlation doesn't make much sense for measuring goodness of fit of a regression model, because it's invariant to scaling and translation. That is, you could take the predicted outputs, then add a constant or multiply by a positive scalar such that they become arbitrarily far from the true outputs, and the correlation would remain unchanged. Are you sure they used correlation (written as $R$) and not fraction of variance accounted for (written as $R^2$)? $R^2$ is a common goodness of fit measure. — user20160, Nov 18 '17 at 05:54
I think this question is about the difference between R square (the square of the correlation coefficient) and rmse in linear regression, which has been answered here. — someguyinafloppyhat, Nov 17 '17 at 21:08

score 2 · Answer 1 · answered May 23 '21 at 03:30

The answer depends on how you measure correlation.

My question is what does the correlation give that the MSE cannot?

Let's see, we define the $\text{MSE}$:

\begin{equation} MSE = \frac{1}{n}\sum_{i=1}^{n}{\Big( y_i - \hat{y}_i \Big)^2} \end{equation}

It is a cost function that is convex in the predictions $\hat{y}_i$, which is why it is widely used with gradient-based optimization methods (for a neural network it is generally not convex in the weights). Because the errors are squared, it is dominated by large errors, which may come from outliers that the model couldn't learn. In addition, every sample enters the overall cost with the same weight.
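To make the outlier sensitivity concrete, here is a minimal NumPy sketch. The arrays are made-up illustrative values, not data from any paper, and `mse` is just a helper name chosen for this example.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical targets and two sets of predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

y_small_errors = y_true + 0.1    # every prediction is off by 0.1
y_one_outlier = y_true.copy()
y_one_outlier[-1] += 10.0        # four exact predictions, one off by 10

mse_small = mse(y_true, y_small_errors)   # about 0.1**2 = 0.01
mse_outlier = mse(y_true, y_one_outlier)  # 10**2 / 5 = 20

print(mse_small, mse_outlier)
```

A single large error dominates the average here, even though four of the five predictions are exact.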

On the other hand, a widely used metric for the relationship between two variables is the Pearson correlation:

\begin{equation} \rho(X, Y) = \frac{E \left[(X - \mu_{x}) (Y - \mu_{y}) \right]} {\sqrt{E \left[ (X - \mu_{x})^2 \right] E \left[ (Y - \mu_{y})^2 \right] }} = \frac{cov(X, Y)}{\sigma_{X}\sigma_{Y}} \end{equation}

Because it is a normalized covariance, it measures only the strength of the linear relationship between $X$ and $Y$. Pearson correlation is well known not to be robust: a significant number of outliers can distort it even when the underlying relationship is linear, and it understates non-linear relationships even when there are no outliers.
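The sample version of this formula is easy to check numerically. The sketch below uses synthetic data (an assumption for illustration) and a hand-written `pearson` helper, and shows why correlation alone cannot measure goodness of fit: rescaling and shifting the predictions leaves the coefficient unchanged while the MSE grows enormously.

```python
import numpy as np

def pearson(x, y):
    """Sample Pearson correlation: covariance over product of std devs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

# Synthetic targets and noisy predictions (illustrative only).
rng = np.random.default_rng(0)
y_true = rng.normal(size=200)
y_pred = y_true + 0.3 * rng.normal(size=200)

# Same predictions after a positive rescaling and a shift.
y_shifted = 5.0 * y_pred + 100.0

r_pred = pearson(y_true, y_pred)
r_shifted = pearson(y_true, y_shifted)
mse_pred = np.mean((y_true - y_pred) ** 2)
mse_shifted = np.mean((y_true - y_shifted) ** 2)

print(r_pred, r_shifted)      # identical up to rounding
print(mse_pred, mse_shifted)  # the second is orders of magnitude larger
```

The hand-written helper should agree with `np.corrcoef(y_true, y_pred)[0, 1]`, which is the usual way to get the same number.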

So, with the $\text{MSE}$ every sample carries equal weight, whether the model learned it or not, and with Pearson you can only measure the linear relationship. So what about outliers and non-linearities?

There is another option that addresses both of the previous situations, the Spearman correlation:

\begin{equation} S(X, Y) = \rho_{rg_{X}, rg_{Y}} = \frac{cov(rg_{X}, rg_{Y})}{\sigma_{rg_{X}}\sigma_{rg_{Y}}} \end{equation}

Here $rg_{X}$ and $rg_{Y}$ are the ranks of the variables, so this is in fact a particular case of Pearson applied to ranked versions of the variables. When two variables are related non-linearly but monotonically, Spearman correlation still detects the relationship: a perfectly monotonic relationship gives a coefficient of 1 (or -1). A coefficient of 1 means that every data point with a greater $X$ value than a given data point also has a greater $Y$ value.
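As a small illustration, the sketch below computes Spearman as Pearson on ranks for a monotonic but non-linear relationship. The data ($y = e^x$ on an evenly spaced grid) and the helper names `ranks` and `spearman` are assumptions for this example; the simple ranking used here assumes there are no tied values.

```python
import numpy as np

def ranks(a):
    """Ranks 1..n of the values in `a` (assumes no ties)."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a)
    r = np.empty(len(a), dtype=float)
    r[order] = np.arange(1, len(a) + 1)
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return np.corrcoef(ranks(x), ranks(y))[0, 1]

# A monotonic, strongly non-linear relationship.
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

r_pearson = np.corrcoef(x, y)[0, 1]  # noticeably below 1
r_spearman = spearman(x, y)          # 1 up to rounding

print(r_pearson, r_spearman)
```

With tied values, average ranks are normally used instead; `scipy.stats.spearmanr` handles that case.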

I suggest you consider the following options and choose among them:

1. It does depend on how you calculate the correlation. $\text{MSE}$ gives you a function that is convex in the predictions and well suited to gradient-based methods. It does not consider how the errors vary across samples, only their equally weighted aggregate.
2. Correlation only: use $\text{pearson}$ if you need the predictions to vary in linear proportion to the ground truth. A low value can hint that non-linearities in some regions or subsamples of your data are not being captured.
3. Correlation only: use $\text{spearman}$ if what matters is that the predictions preserve the ordering of the ground truth, even where the errors differ non-linearly across subsamples or regions of your data. Basically, use this one to test whether your predicted values increase or decrease along with the ground truth.
4. Combine the three into a hybrid cost function. That way you get the convexity of $\text{MSE}$, the local linear relationship from $\text{pearson}$ and the global non-linear monotonic relationship from $\text{spearman}$.
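One possible reading of the hybrid idea, sketched as an evaluation score rather than a trainable loss. The function name `hybrid_score`, the additive form, and the weights are all assumptions made for this example; the answer does not specify how the three terms should be combined. Note that the Spearman term is rank-based, so it is piecewise constant in the predictions and gives no useful gradient; training on it would need a differentiable surrogate for ranking.

```python
import numpy as np

def _ranks(a):
    """Ranks 1..n (assumes no ties)."""
    order = np.argsort(a)
    r = np.empty(len(a), dtype=float)
    r[order] = np.arange(1, len(a) + 1)
    return r

def hybrid_score(y_true, y_pred, w_mse=1.0, w_pearson=1.0, w_spearman=1.0):
    """Hypothetical combined score: lower is better, 0 for a perfect fit.

    Weighted sum of MSE, (1 - Pearson) and (1 - Spearman).
    Returns nan if either input is constant (correlation undefined).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    pearson = np.corrcoef(y_true, y_pred)[0, 1]
    spearman = np.corrcoef(_ranks(y_true), _ranks(y_pred))[0, 1]
    return (w_mse * mse
            + w_pearson * (1.0 - pearson)
            + w_spearman * (1.0 - spearman))

y_true = np.array([1.0, 2.0, 3.0, 4.0])

score_perfect = hybrid_score(y_true, y_true)
# Perfectly correlated but badly scaled predictions: only the MSE term
# is non-zero (errors 2, 3, 4, 5 give MSE = 54 / 4 = 13.5).
score_rescaled = hybrid_score(y_true, 2.0 * y_true + 1.0)

print(score_perfect, score_rescaled)
```

In practice the weights would need tuning, because the MSE term is in squared target units while the two correlation terms are bounded between 0 and 2.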

score 0 · Answer 2 · user2522806 · 2017-11-19T01:54:16.260

MSE is the variance of the error in the model (assuming the errors have zero mean; otherwise it also includes the squared mean error). The correlation between Y and Y' is a function of the model's $R^2$, which is the fraction of the variance of Y explained by the model prediction Y'.

According to a regression model, $Y=Y'+\epsilon$ where $Cov(Y',\epsilon)=0$, so $Cov(Y,Y')=Var(Y')$ and $Var(Y)=Var(Y')+Var(\epsilon)$.

$Corr(Y,Y')=\dfrac{Cov(Y,Y')}{\sqrt{Var(Y)Var(Y')}}$

$Corr(Y,Y')= \dfrac{\sigma^2}{\sqrt{\sigma^2(\sigma^2+\sigma^2_{\epsilon})}}$ where $Var(Y')=\sigma^2$ and $Var(\epsilon)=\sigma^2_{\epsilon}$.

$Corr(Y,Y')=\sqrt{\dfrac{\sigma^2}{(\sigma^2+\sigma^2_{\epsilon})}}=\sqrt{\dfrac{Var(Y')}{Var(Y)}}=R$

$R^2$ is used as a measure of model quality as it indicates how much uncertainty in Y is being resolved by knowing Y'.
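The identities above can be verified numerically for an ordinary least-squares fit, where $Cov(Y',\epsilon)=0$ holds exactly in the sample. The data-generating line and noise level below are arbitrary choices for illustration.

```python
import numpy as np

# Synthetic linear data (illustrative assumption).
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + 1.0 + rng.normal(size=500)

# Ordinary least-squares line; polyfit returns the highest degree first.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
resid = y - y_hat

corr = np.corrcoef(y, y_hat)[0, 1]
r2_var_ratio = np.var(y_hat) / np.var(y)  # Var(Y') / Var(Y)
r2_from_sse = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
mse = np.mean(resid ** 2)

print(corr ** 2, r2_var_ratio, r2_from_sse)  # all three agree
print(mse, np.var(resid))                    # equal: OLS residuals have zero mean
```

These equalities rely on the least-squares fit with an intercept; for predictions from another kind of model they need not hold.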

edited Nov 19 '17 at 01:54

answered Nov 17 '17 at 21:21

user2522806


$R^2$ loses the “percentage of variance explained” interpretation in a nonlinear regression, of which a neural network is one example (assuming a nonlinear activation function is in there somewhere, which is a pretty safe assumption). This has to do with the decomposition of square loss. Please see my derivation here: https://stats.stackexchange.com/q/427390/247274. – Dave Aug 15 '20 at 16:01
I said decomposing square loss; I meant decomposing the total sum of squares (which actually is square loss for a model that always guesses the average of all response variable observations pooled together). – Dave Aug 15 '20 at 16:20
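A sketch of the decomposition these comments refer to. Algebraically, $\sum (y_i-\bar{y})^2 = \sum (y_i-\hat{y}_i)^2 + \sum (\hat{y}_i-\bar{y})^2 + 2\sum (y_i-\hat{y}_i)(\hat{y}_i-\bar{y})$, and the cross term vanishes for a least-squares linear fit with an intercept but not for predictions in general. The "other" predictor below is a deliberately shrunken copy of the least-squares predictions, used only as a stand-in for any model that is not the least-squares linear fit; the data are synthetic.

```python
import numpy as np

def decompose(y, y_hat):
    """Return (SST, SSE, SSR, cross) for predictions y_hat."""
    y_bar = y.mean()
    sst = np.sum((y - y_bar) ** 2)        # total sum of squares
    sse = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ssr = np.sum((y_hat - y_bar) ** 2)    # "explained" sum of squares
    cross = np.sum((y - y_hat) * (y_hat - y_bar))
    return sst, sse, ssr, cross

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + 1.0 + rng.normal(size=500)

slope, intercept = np.polyfit(x, y, 1)
y_hat_ols = slope * x + intercept
y_hat_other = 0.5 * y_hat_ols  # hypothetical non-least-squares predictions

sst, sse_ols, ssr_ols, cross_ols = decompose(y, y_hat_ols)
_, sse_other, ssr_other, cross_other = decompose(y, y_hat_other)

print(cross_ols)    # zero up to rounding: SST = SSE + SSR
print(cross_other)  # clearly non-zero: the two R^2 formulas disagree
print(1.0 - sse_other / sst, ssr_other / sst)
```

When the cross term is non-zero, $1 - SSE/SST$ and $SSR/SST$ give different numbers, so neither can be read as "the" fraction of variance explained.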
