
Let me preface this by saying I'm new to statistics.

I'm working with regression models, attempting to understand transformations a bit more. I'm modeling $Y \sim X$ and I get an $R^2$ of 0.4. I see that the residuals of this fit are left-skewed, so I fit $Y^2 \sim X$, assuming that would correct the issue, but now my $R^2$ is 0.3. Just out of curiosity, I also fit $\log(Y) \sim X$ and got an $R^2$ of 0.5.

I'm really not sure what is going on and not sure what transformation I should use going forward.

asked by madsthaks (edited by Haitao Du)
  • $R^2$ on models for $Y$ and $t(Y)$ for a nonlinear transformation $t$ are not comparable. See (for example) the discussion [here](http://stats.stackexchange.com/questions/90149/pitfalls-to-avoid-when-transforming-data). You also can't compare $s^2$, $AIC$, $BIC$, ... Also see [comments here](http://stats.stackexchange.com/questions/72288/is-adjusted-r-squared-appropriate-to-compare-models-with-different-response-vari) – Glen_b Aug 04 '16 at 01:36
  • Do plot the data again and again, for your own sake to see what is going on, and to allow us to give specific advice. In each case it is an easy scatter plot and a fitted line. – Nick Cox Aug 04 '16 at 08:09
  • It's most unlikely that $Y^2$ ~ $X$ and $\log Y$ ~ $X$ are both serious models for the data. – Nick Cox Aug 04 '16 at 12:04

2 Answers


The total sum of squares $\text{SST}=\sum(y_i-\bar y)^2$ will be altered by transformation.

The total variation available to be explained in the three cases ($Y_0=\log Y, Y_1=Y, Y_2=Y^2$) will be different.

Specifically, if $Y$ tends to be substantially larger than $1$, you'll compress the variation by logging it and similarly expand the variation by squaring it (if $Y$ is positive but tends to be much smaller than $1$ then the log transform will stretch it and the square will compress it).

That stretching/compression may tend to explain the changes in your $R^2$.
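A quick simulation makes this concrete. The data-generating process below (lognormal-style noise, the variable names, and the coefficients) is an assumption chosen for illustration, not the asker's actual data; the point is only that regressing $Y$, $Y^2$, and $\log Y$ on the same $X$ produces three different total sums of squares and hence three non-comparable $R^2$ values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: Y grows with X and has a right tail, so the three
# responses Y, Y^2 and log(Y) have very different total variation.
x = rng.uniform(1, 10, size=200)
y = np.exp(0.3 * x + rng.normal(0, 0.5, size=200))

def r2_and_sst(response, x):
    """Simple-regression R^2 and total sum of squares for a given response."""
    slope, intercept = np.polyfit(x, response, 1)
    fitted = slope * x + intercept
    sst = np.sum((response - response.mean()) ** 2)
    sse = np.sum((response - fitted) ** 2)
    return 1 - sse / sst, sst

for label, resp in [("Y", y), ("Y^2", y ** 2), ("log Y", np.log(y))]:
    r2, sst = r2_and_sst(resp, x)
    print(f"{label:6s}  SST = {sst:14.1f}   R^2 = {r2:.3f}")
```

Each fit is minimizing its residuals relative to a different SST, which is why the printed $R^2$ values cannot be ranked against each other to pick a transformation.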

answered by VCG (edited by Glen_b)
  • This is a bit brief by our standards at the moment. Do you think you could expand on it a little? When you write "the 3 will change", this seems quite ambiguous to me - what is it that the "3" is referring to? – Silverfish Aug 04 '16 at 02:13
  • @Silverfish Sorry ya I just started using this site so I should have made this a comment. Can I change it to go there instead? – VCG Aug 04 '16 at 02:14
  • I've flagged it for you so a moderator can convert it. There is a "flag" button next to the "share" and "edit" buttons - at least for me. – Silverfish Aug 04 '16 at 02:19
  • @Silverfish Thanks and sorry. I'll be more considerate next time. – VCG Aug 04 '16 at 02:35
  • Actually, it's so close to a reasonable answer I'd rather edit it than move it. I hope that's okay with you, VCG. Feel free to edit. If you really would prefer I roll back to your original brief one and make it a comment, that can still be done. – Glen_b Aug 04 '16 at 03:07
  • @Glen_b Thanks for doing that! Now I know what a good answer looks like. – VCG Aug 19 '16 at 15:08
  • The main ideas were there already. It just needed a little more explanation. – Glen_b Aug 20 '16 at 01:25

From the reference I give below, $R^2$ is explained as

$$R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}}$$

where,

1) $R^2$ is always between 0% and 100%.

2) 0% indicates that the model explains none of the variability of the response data around its mean.

3) 100% indicates that the model explains all the variability of the response data around its mean.
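The definition above can be checked numerically. This sketch uses made-up, nearly linear data (not the question's data) and computes the explained and total variation directly; for ordinary least squares with an intercept, their ratio is $R^2$:

```python
import numpy as np

# Illustrative data only: five points lying close to a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, 1)   # least-squares line
fitted = slope * x + intercept

total_variation = np.sum((y - y.mean()) ** 2)           # SST
explained_variation = np.sum((fitted - y.mean()) ** 2)  # SSR

r_squared = explained_variation / total_variation
print(round(r_squared, 4))  # near 1, since the points are almost collinear
```

Because the points are almost collinear, the ratio comes out close to 100%, matching item 3) above.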

Also from another reference: "...The coefficient of determination, $R^2$, is useful because it gives the proportion of the variance (fluctuation) of one variable that is predictable from the other variable. It is a measure that allows us to determine how certain one can be in making predictions from a certain model/graph." (from http://mathbits.com/MathBits/TISection/Statistics2/correlation.htm)

Since expansion and contraction were covered by the other answer, I just want to make some comments on aspects that affect $R^2$. An important part of $R^2$ is the selection of the function used to fit the data. You could have functions with the same expansion or contraction that still give different $R^2$. With the results and $R^2$ values you mention, I would be inclined to try a polynomial (in your independent variable) to see how it fits; a function containing multiple powers of $x$ may fit best, which would mean you can try a polynomial fit. In the most general polynomial case, you would use spline regression to find a fit. The $R^2$ review is the first step in analyzing data. Note: "Pearson Product-Moment Correlation" (which can be found on the Internet) discusses using $R^2$ to determine the "strength of the correlation."
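As a sketch of the polynomial-fit suggestion (the data below are simulated for illustration, not the asker's): comparing $R^2$ across polynomial degrees on the same response is mechanically easy, but note, as the comments under this answer caution, that $R^2$ can only go up as terms are added, so it rewards complexity by construction:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data only: a mildly curved relationship plus noise.
x = rng.uniform(0, 5, size=100)
y = 1.0 + 0.5 * x + 0.3 * x ** 2 + rng.normal(0, 0.5, size=100)

def r2_poly(x, y, degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    sse = np.sum((y - fitted) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

for d in (1, 2, 3):
    print(f"degree {d}: R^2 = {r2_poly(x, y, d):.3f}")
# R^2 is non-decreasing in the degree, because each higher-degree fit
# nests the lower one -- which is why raw R^2 alone is a poor way to
# choose among these functions.
```

The same response is used in every fit here, so the $R^2$ values are at least on a common scale, unlike the transformed-response comparison in the question.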

The general reference on regression (and also on "over fitting") that I mention above a few times, on how to interpret the correlation coefficient, is "Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?" at http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit.

Lastly, I just want to make a general comment on function selection vs. models. A model is developed from the principles and laws of the field of study you are working in. In many cases a model is developed even before data are collected (e.g., in theoretical physics -- I have done this many times). On the other hand, just selecting functions to try to "fit" data from experiments is not classified as modeling: you are just looking for the best fit to the data (again, as a first step) -- then others could study your data and develop/derive a model.

answered by jimmeh
  • It is never valid to use $R^2$ to compare models when you are transforming the *regressor* variable $x$. Although there are principled methods to search for and identify transformations of $y$, $x$, or both that improve a model, the recommendations in this answer are not among them. – whuber Aug 04 '16 at 18:21
  • The correlation coefficient can be used to compare models, because it corresponds to the "best fit" for a model. You just don't ever "over fit" when taking this approach. Here is an example where it was used to get the best model: http://math.usask.ca/~miket/S344D..pdf – jimmeh Aug 04 '16 at 23:14
  • That comment is simply wrong, as is made abundantly clear in hundreds--perhaps thousands--of posts here on model fitting, model comparison, and overfitting. Your reference is a generic textbook account of various methods of model selection with OLS (some of them now outmoded or deprecated), without any evaluation of their properties. There's nothing in there that supports your assertions. – whuber Aug 04 '16 at 23:16
  • I saw posts that use it to get the model. Where are your references? You are just making statements. Anyone would want to find the "best fit" for their data, as long as you don't "over fit," as I mentioned above. My reference specifically has a page where they give correlation coefficients for different models and then select the "best" based on the correlation coefficient closest to 1. Read the entire reference; the title is "Multiple Regression - Selecting the Best Equation." – jimmeh Aug 04 '16 at 23:23
  • "...One of the simplest method for understanding a feature’s relation to the response variable is Pearson correlation coefficient, which measures linear correlation between two variables. The resulting value lies in [-1;1], with -1 meaning perfect negative correlation (as one variable increases, the other decreases), +1 meaning perfect positive correlation and 0 meaning no linear correlation between the two variables..." in http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/ – jimmeh Aug 04 '16 at 23:31
  • Also from the reference I gave above there is "...R-squared is a handy, seemingly intuitive measure of how well your linear model fits a set of observations..." Is it only way to determine the "best model?" No, but it is a start and just using a general comment like "it's wrong to use the correlation coefficient to select a model..." is just throwing out a valuable piece of information. I would suggest you read model entry above more carefully. – jimmeh Aug 04 '16 at 23:37
  • I have to agree with @whuber. The fact that correlation measures linearity in some space e.g. $(x, y)$ doesn't mean that it works well when used to select models by comparing results in some other space e.g. $\log x, \log y$. One of many examples is that a high correlation in one space could be an artefact of an outlier; taking logs could reduce the correlation but on any other grounds produce a configuration better suited to modelling. – Nick Cox Aug 05 '16 at 16:19
  • Like the point I made with him: it's a piece of information that should not just be ignored. When used correctly it provides an initial identification of a function (not a model; that's a different thing, as I mention in my answer) to use to represent the data. Then you move to the next step of looking more closely at the function and the data. That is where your "outlier" point would come in. Also, if you read the references that I provide, they agree with me. – jimmeh Aug 05 '16 at 18:32
  • To continue from previous: anyway, when you look closer is where your "outlier" point could come in. Also, if you read the references that I provide, they agree with me. I see no reference by Nick Cox or whuber on these topics. – jimmeh Aug 05 '16 at 20:17
  • Try to remember that using $R^2$ is the FIRST STEP in determining if a function is the best for the data, so discussing "outliers" first is not working logically. BTW, https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php discusses using $R^2$ for the strength of the relation between variables. Logical thought means that if it can be used to get the strength of relationship between variables, then it can be used as the INITIAL METHOD to compare fitting functions. – jimmeh Aug 05 '16 at 23:18
  • Discussion should be reserved for *statistical* argument, not personal comment. You can expect any comments making personal references to be removed, even ones that might also contain useful argument or references. – Glen_b Aug 05 '16 at 23:56
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackexchange.com/rooms/43549/discussion-on-answer-by-jimmeh-understanding-the-transformation-on-response-vari). – Glen_b Aug 06 '16 at 02:45
  • On this very site there is an answer with a "check mark" where it states: "If, for some reason, you are going to include only one variable in your model, then selecting the predictor which has the highest correlation with $y$ has several advantages..." http://stats.stackexchange.com/questions/138860/is-using-correlation-matrix-to-select-predictors-for-regression-correct – jimmeh Aug 06 '16 at 16:36
  • Continuing on the last comment: http://blog.uwgb.edu/bansalg/statistics-data-analytics/linear-regression/what-is-the-difference-between-coefficient-of-determination-and-coefficient-of-correlation/ "...R squared or coeff. of determination shows percentage variation in y which is explained by all the x variables together. Higher the better. It is always between 0 and 1.." & http://mathbits.com/MathBits/TISection/Statistics2/correlation.htm with linear correlation coefficient, "...The value of r is such that -1 < r < +1. linear correlations and negative linear correlations, respectively..." – jimmeh Aug 07 '16 at 01:31
  • Any added comments relevant to this answer should go in the linked chat room(s). If you have trouble accessing the chat room for any reason, please flag. Comments on other answers do not belong under this answer. Please acquaint yourself with the site before trying to use it further. – Glen_b Aug 07 '16 at 01:31
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/43579/discussion-between-glen-b-and-jimmeh). – Glen_b Aug 07 '16 at 01:32