Similarities and differences between correlation and regression

Question

If I want to investigate how two continuous variables are linked, what is the difference between calculating the correlation coefficient (Pearson's $r$) versus calculating the (simple linear) regression coefficient?

I see people who, if the regression coefficient is significantly different from zero, talk about the two variables as if they are correlated, which is confusing as it suggests that the two coefficients (correlation, regression) are the same thing.

Having said that, isn't $r$ a measure of the (regression line) slope anyway? I'm confused!!

See [What's the difference between correlation and simple linear regression?](http://stats.stackexchange.com/q/2125/32036) — Nick Stauner, Jul 20 '14 at 22:42
@Alexis Hahah, yup! Seemed like it ought to have been flagged, though. — mkt, May 05 '18 at 16:30
@mkt The accepted answers here and there do seem to differ, though. — Alexis, May 05 '18 at 17:24
@Alexis Agreed - but it does seem like the questions ought to be linked, at least. Happy to be overruled on this if I'm wrong, though - I'm not always clear on the norms when it comes to cases such as these. — mkt, May 05 '18 at 18:10

Alexis · Accepted Answer · 2015-05-07T02:53:55.713

10

OLS regression tells you more than the (linear) correlation coefficient. Also, the latter is one of the things you get from the former. Here's what you get with OLS:

1. A characterization of a linear trend describing how Y relates to X. This trend includes:

1a. The slope (aka beta, effect, coefficient, etc., depending on discipline) of that line, which tells you how much you estimate Y will change given a 1-unit increase in X.

1b. The Y-intercept, which may or may not be of interest, depending on the substantive nature of one's research questions.

2. A characterization of the strength of association: does the line $Y = \beta_{0} + \beta_{X}X$ describe the data really well, or does it only kind of describe the data? In the former case, most of the observed data points lie on or close to the regression line; in the latter case the data points may lie quite a ways off the line. Usually, this is reported as $R^{2}$, which in simple linear regression is the same thing as Pearson's $r^{2}$.

3. Predictions of the value of Y given a value of X, complete with an estimate of the uncertainty of that prediction.

Pearson's correlation coefficient gives one (2), but gives only the sign of the slope in (1a), and does not give intercepts (1b), or predictions (3).
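
To make the comparison concrete, here is a minimal sketch in plain Python. The data and the helper name `ols_and_correlation` are made up for illustration and are not part of the answer. It computes the slope (1a), the intercept (1b), $R^{2}$ as one minus the ratio of residual to total sum of squares (2), and a point prediction (3), alongside Pearson's $r$. The prediction's uncertainty is left out to keep the sketch short.

```python
import math

def ols_and_correlation(x, y):
    """Simple OLS fit of y on x plus Pearson's r, from sums of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx                      # (1a)
    intercept = mean_y - slope * mean_x    # (1b)
    r = sxy / math.sqrt(sxx * syy)         # Pearson's r

    # (2) R^2 = 1 - SSE / SST, computed from the fitted line's residuals
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - sse / syy
    return {"slope": slope, "intercept": intercept, "r": r, "r_squared": r_squared}

# Made-up illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

fit = ols_and_correlation(x, y)
prediction_at_7 = fit["intercept"] + fit["slope"] * 7.0   # (3) point prediction

print(fit)
print(prediction_at_7)
```

Squaring `fit["r"]` reproduces `fit["r_squared"]` up to rounding, and the sign of `r` matches the sign of the slope, but `r` alone cannot produce `prediction_at_7`.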

edited May 07 '15 at 02:53

answered Jul 20 '14 at 22:20

Alexis


Would it be correct to say, then, that R (or R^2) is a measure of goodness of fit, while the slope (β) is a measure of effect size? What puzzles me is that goodness of fit and effect size should (I think) be independent, whereas R and β are linearly dependent. – z8080 Jul 21 '14 at 12:53
Also, a different question: if correlations tell us nothing about the predictive power of variables, then why is it that R^2 is interpreted as the percentage of variance in Y explained by X? To me this definition sounds like it implies that, given a new instantiation of X and of Y, that percentage would still be given by R^2, which surely is a prediction?.. – z8080 Jul 21 '14 at 13:07
I did not write that correlation tells us nothing about predictive *power*, I said it tells us nothing about *predictions*, as in you cannot make predictions (other than sign) using a correlation. – Alexis Jul 21 '14 at 15:04
$R^{2}$ and $\beta$ are dependent in that if $\beta$ is zero, then $R$ is zero (leaving out nuances of $\hat{\beta}$ and "near zero" due to space), and if $\beta$ is not zero, then $R^{2}$ is likewise not zero. If there *is not* a linear association, then a linear function of $X$ cannot "explain" *any* of the variation in $Y$; if there *is* a linear association then $X$ must "explain" *some* variation in $Y$. (Also leaving out nuances about "approximately linear"; also leaving out the unidirectionality of OLS versus the bidirectionality of $r$). – Alexis Jul 21 '14 at 15:12
Forgot: slope *is not* generally a measure of effect size, because the *range* of your explanatory variable may be small, or may be large. Some folks only use the symbol $\beta$ to indicate slope for a standardized explanatory variable, and in such a case (i.e. standardized explanatory variables), the slope measures effect size. – Alexis Jul 21 '14 at 15:58
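
The point about standardization in the comment above can be checked numerically. The following is only a sketch with made-up data and hypothetical helper names. It assumes "standardized" means z-scoring with the sample standard deviation. Standardizing only $X$ gives a slope of $r$ times the standard deviation of $Y$, and standardizing both variables gives a slope equal to $r$ itself.

```python
import math

def mean(v):
    return sum(v) / len(v)

def sample_sd(v):
    m = mean(v)
    return math.sqrt(sum((vi - m) ** 2 for vi in v) / (len(v) - 1))

def standardize(v):
    """z-score: subtract the mean, divide by the sample standard deviation."""
    m, s = mean(v), sample_sd(v)
    return [(vi - m) / s for vi in v]

def ols_slope(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data
x = [10.0, 12.0, 15.0, 19.0, 23.0, 30.0]
y = [1.2, 1.9, 2.1, 3.4, 3.3, 5.0]

r = pearson_r(x, y)
slope_raw = ols_slope(x, y)                                  # units of y per unit of x
slope_x_std = ols_slope(standardize(x), y)                   # units of y per SD of x
slope_both_std = ols_slope(standardize(x), standardize(y))   # unitless

print(slope_raw, slope_x_std, slope_both_std, r)
```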

score 10 · Answer 2 · answered Jul 20 '14 at 22:31

To focus on just one aspect of the question (@Alexis's answer covers the bigger picture well), the sample correlation coefficient between $Y$ and $X$ is

$$r = \frac { \operatorname{\hat Cov}(Y,X)}{\hat \sigma_y\hat \sigma_x}$$

while in a simple regression $Y = \beta_0 + \beta_1X+ u$, the OLS estimator for the slope coefficient is

$$\hat \beta_1 = \frac { \operatorname{\hat Cov}(Y,X)}{\hat \sigma_x^2}$$

Combining, we have the relation

$$\hat \beta_1 = \frac {\hat \sigma_y}{\hat \sigma_x}r$$

Pondering this last relation should provide useful intuition.
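
As a quick numerical check of the three formulas above, here is a small sketch in plain Python. The data and the helper `sample_cov` are made up for illustration. It computes $r$ and $\hat \beta_1$ from their definitions and compares $\hat \beta_1$ with $(\hat \sigma_y / \hat \sigma_x)\, r$.

```python
import math

def sample_cov(a, b):
    """Sample covariance with the n - 1 denominator; sample_cov(a, a) is the variance."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

# Made-up illustrative data
x = [2.0, 4.0, 5.0, 7.0, 8.0, 11.0, 12.0]
y = [30.0, 41.0, 38.0, 55.0, 61.0, 70.0, 83.0]

sd_x = math.sqrt(sample_cov(x, x))
sd_y = math.sqrt(sample_cov(y, y))

r = sample_cov(y, x) / (sd_y * sd_x)          # correlation coefficient
beta1_hat = sample_cov(y, x) / sd_x ** 2      # OLS slope of y on x
beta1_from_r = (sd_y / sd_x) * r              # the combined relation

print(r, beta1_hat, beta1_from_r)
```

The last two printed values agree up to rounding. The slope is $r$ rescaled by the ratio of the two standard deviations, which is also where its units come from.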

answered Jul 20 '14 at 22:31

Alecos Papadopoulos


Note from this answer that the correlation coefficient is a dimensionless number, while the slope estimate $\hat\beta$ has the dimensions $\text{units of $y$}/\text{units of $x$}$. So it is obvious from dimensional analysis that they are different, see https://stats.stackexchange.com/questions/89355/lognormal-distribution-standard-deviation-and-physical-units/89391#89391 and https://stats.stackexchange.com/questions/184848/converting-a-model-from-square-feet-to-square-meter/184901#184901 – kjetil b halvorsen Sep 26 '17 at 20:33

score 5 · Answer 3 · answered Jul 20 '14 at 23:56

> If I want to investigate how two continuous variables are linked, what is the difference between calculating the correlation coefficient (Pearson's r) versus calculating the (simple linear) regression coefficient?

The regression line is $E(Y|X=x)$. Correlation is a quite different object.

A regression slope is in units of Y/units of X, while a correlation is unitless.

> I see people who, if the regression coefficient is significantly different from zero, talk about the two variables as if they are correlated, which is confusing as it suggests that the two coefficients (correlation, regression) are the same thing.

No, only that they are related, which they are. (Their p-values are effectively the same)
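
One way to see why the p-values agree is sketched below with made-up data. In simple linear regression, the usual $t$ statistic for testing a zero slope and the usual $t$ statistic for testing a zero Pearson correlation are the same number, each with $n - 2$ degrees of freedom. The sketch assumes these standard $t$ tests are the ones being compared.

```python
import math

# Made-up data for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.8, 4.1, 5.2, 8.9, 9.7, 12.6, 13.1, 16.4]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# t statistic for H0: correlation = 0
r = sxy / math.sqrt(sxx * syy)
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)

# t statistic for H0: slope = 0, using the residual variance of the fitted line
slope = sxy / sxx
intercept = mean_y - slope * mean_x
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
se_slope = math.sqrt(sse / (n - 2) / sxx)
t_from_slope = slope / se_slope

print(t_from_r, t_from_slope)
```

Both statistics are referred to the same $t$ distribution, so equal statistics mean equal p-values.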

> Having said that, isn't r a measure of the (regression line) slope anyway?

Not of slope, no, as mentioned above. If I change from measuring in meters to measuring in mm, my slope changes by a factor of a million, but my correlation doesn't change at all. But they're related.
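
A small sketch of the unit-change argument, with made-up data and a hypothetical helper `slope_and_r`: here only $x$ is converted from meters to millimeters, which divides the slope by 1000 and leaves the correlation unchanged. The exact factor depends on which variable is rescaled and in which direction.

```python
import math

def slope_and_r(x, y):
    """Return the OLS slope of y on x and Pearson's r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Made-up illustrative data: x in meters, y in some other unit
x_m = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [3.1, 4.8, 7.2, 8.9, 11.3, 12.8]

# The same x values expressed in millimeters
x_mm = [1000.0 * v for v in x_m]

slope_m, r_m = slope_and_r(x_m, y)
slope_mm, r_mm = slope_and_r(x_mm, y)

print(slope_m / slope_mm)   # ratio of the two slopes
print(r_m, r_mm)            # correlations before and after the unit change
```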

Thanks Glen, this is helpful. But if you change your units from meters to mm on both axes, as you say, wouldn't that in fact leave the slope (regression coeff) also unchanged, as the 10^-3 (rather than 10^-6, no?) factor is reduced in the fraction? — z8080, Jul 21 '14 at 09:31
While that's true, changing both units in concert doesn't serve to reveal the distinction between slope and correlation (but instead disguises it), so it's unhelpful as a response to the question at hand. (In general $y$ and $x$ are not even in the same *kind* of unit, so there's no particular reason to expect them to change together.) — Glen_b, Jul 21 '14 at 09:38

score 2 · Answer 4 · answered Feb 26 '16 at 14:33

On the intuitive side, I have been thinking about the following.

The Pearson correlation is a 2-dimensional linear approximation, while linear regression is an n-dimensional linear approximation. Therefore, the latter offers an estimate of the correlation that accounts for many other features that might inflate or deflate the estimate obtained with the Pearson correlation.

Example 1, for the Pearson correlation: consider a map with no altitude information and suppose you can move across it in a straight line (the presence of rivers or cliffs does not matter). You know the time you left point A and the time you reached B, so you compute the speed.

Example 2, for the linear regression: suppose instead that you move on a map with altitude information and have to account for a lot of other information about the ground you are crossing (e.g., rivers or cliffs), but the times you left point A and reached B are the same as in example 1. The speed you compute will be different (very likely higher).

Although the linear regression offers only an approximation of the average speed, it is still better than the initial approximation you got with the Pearson correlation.

Does anyone see anything wrong with this example? (Your answers will be very useful, as I normally use this example in class.)

In any case, I hope this example helped to understand the difference between the two techniques.

I don't follow what you are trying to get at here: the analogy with correlation and regression seems highly indirect at best. While it's certainly true that regression can be applied with several variables, what's key is that one of those is privileged as being the response or outcome. (I know about multivariate regression, which I don't think you're trying to cover.) — Nick Cox, Feb 26 '16 at 14:44
yes, true, as easy as it sounds, I have not thought of that at lesson. Thank you! — Fuca26, Mar 02 '16 at 11:14
