
I have 2702 records with one target variable (Y) and 11 independent/predictor variables (X1-X11). I am running a multivariable regression to understand whether I can predict Y from the X's, or whether there is any correlation between Y and the X's.

Here is the ANOVA result (regression output screenshot omitted).

Conclusion 1: Since my R-squared is low, I do not have a good model. The independent variables I have are not doing a good job, or are not enough, to explain the variation in Y.

Confusion 1: Since the p-values for the individual variables are significant, can I conclude that there is a strong correlation between Y and X1, X2, X3, X7, X8, X10, X11?

Confusion 2: Even though the p-values are significant, there is no real difference in correlation between the significant and non-significant variables (correlation screenshot omitted).

So I am really confused about how to interpret my results. I feel like saying "there is no relation between Y and the X's" based on Conclusion 1 and Confusion 2, but then how do I interpret Confusion 1?

Any help would be appreciated. I would also like to know what material I can study to fix this gap in my understanding of such scenarios; I feel like I am missing a piece in my understanding of data science.

  • I believe you would profit from reading an introductory level textbook. [We have a helpful list of free statistical textbooks.](https://stats.stackexchange.com/q/170/1352) – Stephan Kolassa Jul 25 '18 at 16:34

3 Answers


Conclusion 1: Since my R-squared is low, I do not have a good model. The independent variables I have are not doing a good job, or are not enough, to explain the variation in Y.

Maybe. Some problems I have encountered consider an R-squared above 0.95 "good"; for other problems, anything above 0.30 was fine. An R-squared of 0.44 means your model accounts for about 44% of the variability in Y. Whether that is good or bad depends on the problem domain. Please consult a domain expert.
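To make the 44% figure concrete, here is a minimal sketch of reading R-squared off a fitted model, assuming your data sit in a hypothetical data.csv with columns Y and X1 through X11 (file and column names are assumptions, not from your post):

```python
# Minimal sketch (hypothetical file and column names): fit the full model
# and read off R-squared and adjusted R-squared.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data.csv")                       # assumed columns: Y, X1..X11
formula = "Y ~ " + " + ".join(f"X{i}" for i in range(1, 12))
fit = smf.ols(formula, data=df).fit()
print(fit.rsquared)      # ~0.44 in your case: share of variance in Y explained
print(fit.rsquared_adj)  # penalized for the number of predictors
```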

Confusion 1: Since the p-values for the individual variables are significant, can I conclude that there is a strong correlation between Y and X1, X2, X3, X7, X8, X10, X11?

Definitely not. Significant predictors are not an indication of the strength of correlation or of the importance of the predictor; http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-identify-the-most-important-predictor-variables-in-regression-models has more information. To add to your confusion, non-significant predictors can also be important. You can't judge this just by looking at a table like that.

Some of your variables might be highly correlated. Look at X4 and X5: they have identical values in your screenshot. They are both non-significant, but that can be caused by multicollinearity. That doesn't mean X4, X5, or a linear combination of them should be omitted from the model.
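One quick way to check this, sketched here under the same hypothetical column names as above, is to compute variance inflation factors (VIFs) and look at the correlation matrix of the predictors; near-duplicate columns such as your X4 and X5 would show up as very large VIFs:

```python
# Sketch (hypothetical column names): variance inflation factors and the
# predictor correlation matrix as quick multicollinearity checks.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("data.csv")
predictors = [f"X{i}" for i in range(1, 12)]
X = sm.add_constant(df[predictors])
vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vifs)                             # VIFs well above ~5-10 flag strong collinearity
print(df[predictors].corr().round(2))   # near-identical X4/X5 would show r close to 1
```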

What you can say:

There is a strong linear relationship in the model, as the F statistic is highly significant. This is a good start. However, the model is a likely victim of multicollinearity and possibly overfitting. Further investigation, such as removing highly correlated variables, is necessary.

SmallChess

You are misinterpreting the p-value and the correlation, and also how they relate to "importance".

The p-value is a measure of how consistent the data are with the combination of no relationship and random chance (a high p-value means the data could easily have come from that combination; a low p-value means the data are unlikely to have arisen by chance under a no-relationship scenario). Sample size and other things play into the p-value, beyond just the correlation. With a large enough sample size you can have a correlation of 0.1 (not much of a relationship) with a p-value < 0.0001; with a small enough sample size you can have a correlation near 1 (or even exactly 1) with a large (non-significant) p-value. The p-value does not tell you the strength of the relationship, just whether the observed relationship can be attributed to chance.
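A small simulation (all numbers are invented purely for illustration) shows both sides of this:

```python
# Illustrative simulation: p-values depend on sample size, not just on how
# strong the correlation is.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Large sample, weak relationship: r is small, yet p is far below 0.0001.
n = 100_000
x = rng.normal(size=n)
y = 0.1 * x + rng.normal(size=n)
r, p = stats.pearsonr(x, y)
print(f"n={n}: r={r:.3f}, p={p:.1e}")

# Tiny sample, strong relationship: r is large, yet p can be non-significant.
n = 5
x = rng.normal(size=n)
y = 0.9 * x + rng.normal(size=n)
r, p = stats.pearsonr(x, y)
print(f"n={n}: r={r:.3f}, p={p:.3f}")
```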

To further complicate things, in multiple regression each p-value is conditional on the other terms being in the model.
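Here is a simulated sketch of that conditioning effect (the variables and coefficients are invented for illustration): a predictor that looks highly significant on its own can lose significance once a correlated predictor enters the model.

```python
# Simulated example: x2's p-value is conditional on whether x1 is in the model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # x2 is nearly a copy of x1
y = 2 * x1 + rng.normal(size=n)           # only x1 actually drives y

alone = sm.OLS(y, sm.add_constant(x2)).fit()
both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(alone.pvalues)   # x2 is highly significant by itself
print(both.pvalues)    # x2's p-value becomes large once x1 is included
```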

Also, a low R-squared does not mean a bad model, just that there is variance left unaccounted for; similarly, some bad models can have large R-squared values. You need to judge these in the context of the science behind the data.

Greg Snow

Try the linear regression on X1, X2, X3, X7, X8, X10 and X11 only, to see whether the fit improves or stays the same. You should not fit the model with that many variables; fewer parameters could do the same job. Nevertheless, the model equation as it stands is not adequate. An indicator of the non-significant parameters is a confidence interval that crosses zero: those parameters could well be zero.
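A minimal sketch of that refit, assuming the same hypothetical data.csv and column names used in the sketches above:

```python
# Sketch: refit on the significant predictors only and inspect the
# confidence intervals (an interval crossing zero is consistent with a
# zero coefficient).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data.csv")
reduced = smf.ols("Y ~ X1 + X2 + X3 + X7 + X8 + X10 + X11", data=df).fit()
print(reduced.rsquared_adj)   # compare against the full model's adjusted R-squared
print(reduced.conf_int())     # 95% intervals for the coefficients
```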

Rockbar