
I am running a regression model, and I found that one of the independent variables is dispersed over a very narrow range.

Here are the summary statistics of the variables; the variable age has some outliers and is not normally distributed.

      size             age              dc              dt             price      
 Min.   : 69.30   Min.   : 2.20   Min.   : 0.30   Min.   : 3.10   Min.   : 311.6  
 1st Qu.: 86.10   1st Qu.: 4.30   1st Qu.: 9.80   1st Qu.:18.85   1st Qu.: 486.7  
 Median :101.60   Median : 8.80   Median :13.90   Median :31.20   Median : 589.9  
 Mean   : 99.06   Mean   :11.94   Mean   :14.14   Mean   :34.02   Mean   : 600.0  
 3rd Qu.:111.00   3rd Qu.:16.85   3rd Qu.:18.00   3rd Qu.:49.30   3rd Qu.: 691.4  
 Max.   :134.80   Max.   :49.70   Max.   :29.00   Max.   :74.20   Max.   :1005.5  

This is the model I am running.

fit <- lm(price ~ size + age + dc + dt, data=property)

Call:
lm(formula = price ~ size + age + dc + dt, data = property)

Residuals:
    Min      1Q  Median      3Q     Max 
-200.82  -79.53    5.38   63.98  244.10 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 158.8981   111.9334   1.420   0.1597    
size          5.9175     0.8531   6.937 1.04e-09 ***
age          -2.3803     1.2460  -1.910   0.0598 .  
dc            0.6866     2.2066   0.311   0.7565    
dt           -3.7139     0.6255  -5.938 7.58e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 103.5 on 78 degrees of freedom
Multiple R-squared:  0.553, Adjusted R-squared:  0.5301 
F-statistic: 24.13 on 4 and 78 DF,  p-value: 5.182e-13

Here I can see that the model is not fitting well, since the R-squared value is small. If I re-fit the model without the variable age, the fit is even worse.

Call:
lm(formula = price ~ size + dc + dt, data = property)

Residuals:
     Min       1Q   Median       3Q      Max 
-190.807  -87.045    6.956   59.852  264.729 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 113.6972   111.2233   1.022    0.310    
size          5.9812     0.8666   6.902 1.15e-09 ***
dc            1.4765     2.2036   0.670    0.505    
dt           -3.7343     0.6358  -5.874 9.61e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 105.2 on 79 degrees of freedom
Multiple R-squared:  0.5321,    Adjusted R-squared:  0.5143 
F-statistic: 29.95 on 3 and 79 DF,  p-value: 4.933e-13

The R-squared value is smaller than in the model with age. I removed some outliers in the age variable and ran the regression again, but there was no big difference. I also tried transformations of the other variables and of the dependent variable, but the adjusted R-squared improves only very slightly.

The data are all numeric variables, so I guess adding interaction terms will not help in this model. Should I convert the variable age to a categorical variable? When I scatterplot age, it seems that almost all observations fall in a certain range. So I am wondering whether I can code a certain range of age as 1 and the rest as 0, making it a categorical variable, to see if I can detect some possible interaction with the other variables. I do not know which step to start with in order to fit the best model.

I also found something interesting: the model without an intercept increases the adjusted R-squared value dramatically!

Call:
lm(formula = price ~ size + age + dc + dt - 1, data = property)

Residuals:
    Min      1Q  Median      3Q     Max 
-201.57  -78.66   14.37   69.96  258.56 

Coefficients:
     Estimate Std. Error t value Pr(>|t|)    
size   7.0281     0.3423  20.531  < 2e-16 ***
age   -2.0064     1.2257  -1.637   0.1056    
dc     3.0443     1.4622   2.082   0.0406 *  
dt    -3.4364     0.5979  -5.747 1.63e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 104.1 on 79 degrees of freedom
Multiple R-squared:  0.973, Adjusted R-squared:  0.9717 
F-statistic: 712.5 on 4 and 79 DF,  p-value: < 2.2e-16

The R-squared value increased to 0.97! Should I remove the intercept? I am also unsure whether there is any other way to improve this model. Can someone give me some tips? How should I deal with the skewed variable age?

kjetil b halvorsen
Allie Kim
  • To compare nested models use anova and not a descriptive comparison of $R^2$s. $R^2$ is not meaningful without the intercept. – utobi Nov 29 '16 at 14:06
  • As @utobi mentioned, removing the intercept is problematic. For reference, see [this answer](http://stats.stackexchange.com/questions/26176/removal-of-statistically-significant-intercept-term-increases-r2-in-linear-mo/26205#26205). – Jason Morgan Nov 29 '16 at 14:45
  • Many people would be very happy to see your summary table from the model you reject with size and dt as substantial predictors. – mdewey Nov 29 '16 at 18:26
  • I think the question is missing what you want to learn from the model. Do you want to interpret the coefficients? Or do you want to get good predictions? Something else? And what about the narrow independent variable from the question? Somehow it's never mentioned again after the first paragraph :) – ndevln Dec 04 '19 at 12:36

1 Answer


Okay, so let's back up a bit from the concrete problem you are facing to a theoretical framework so that you can make good concrete choices.

First, your problem is called a model selection problem. You do not know which model to choose, but you have variables that you think are of interest. You have decided to use linear regression to model the problem. This implies that you believe there really is a straight-line relationship between each factor and the dependent variable. You should plot each independent variable against the dependent variable to verify that this is true.
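A minimal sketch of that check in base R, using the variable names and the property data frame from your output:

par(mfrow = c(2, 2))                      # four scatterplots on one device
for (v in c("size", "age", "dc", "dt")) {
  plot(property[[v]], property$price, xlab = v, ylab = "price")
}

If any panel shows obvious curvature, a straight-line term for that variable is suspect.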

To begin with, ordinary least squares (OLS) does not care about the distribution of the independent variables, provided they come from a distribution with a defined variance. If an independent variable comes from a narrow range, this is no problem. OLS is not at all impacted by restrictions on the range of the independent variables.

On the other hand, if the dependent variable were restricted to a range, a special set of techniques usually called LimDep models would be used. LimDep just means limited dependent variable models.

Second, you should not do a transformation of an independent variable except to bring it into a linear relationship with the dependent variable. Even then, you have to be careful, because the regression may no longer mean what you intended it to mean.
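As a rough illustration of that check, using age since it is the variable that worries you (a log transform is safe here because the minimum age is 2.20):

par(mfrow = c(1, 2))
plot(price ~ age, data = property, main = "price vs age")           # raw predictor
plot(price ~ log(age), data = property, main = "price vs log(age)") # transformed predictor

Only keep the transformation if the second panel is clearly closer to a straight line than the first.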

Third, any variable that should be present due to a theoretical reason or concern MUST be in the equation. Removing a variable that should be present by theory is always wrong, even if it makes everything work better.

Fourth, you are not to be worried about the statistical distribution of the independent variables. Rather you are to be concerned with the distribution of the residuals. They should tend toward normality as the sample size becomes large. If they do not, then you have a concern as there may be some other thing going on in your MODEL that you do not realize is going on.
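A quick way to look at the residuals from your first fit is a normal Q-Q plot and a residuals-versus-fitted plot (the model is re-fit here only so the snippet is self-contained):

fit <- lm(price ~ size + age + dc + dt, data = property)
par(mfrow = c(1, 2))
qqnorm(resid(fit)); qqline(resid(fit))   # points should hug the line
plot(fitted(fit), resid(fit), xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 2)                   # look for curvature or a funnel shape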

Fifth, there are two primary ways to do model selection (there are a few minor ones we will ignore). One of them is simple but has problems; the other is very complicated and requires a lot of skill. The simple one is called step-wise regression. The complicated one is called Bayesian model selection. I will explain the simple way.

Step-wise model selection either starts with no variables and adds one at a time (forward selection), or starts with all of your variables and removes one at a time (backward selection). There is also a method that goes in both directions.

Since you are using R, I will assume you are using R Commander. If you are not using R Commander, load it. It will make your life simpler. Pull all of your variables into a data frame or this will not work.

On the ribbon at the top of R Commander select "Models." When the menu pops up, select "Stepwise Model Selection." This will give you several choices. In your case, I would choose "Backward," because you seem to believe that most of these variables are what matters. It will give you a choice of "AIC" or "BIC." The AIC is a bit older; the BIC, which stands for Bayesian Information Criterion, has the sounder theoretical justification for this kind of selection, so use the BIC.
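If you would rather stay at the console than use the R Commander menus, the equivalent backward search is a single call to step(); setting k = log(n) makes it penalize by the BIC instead of the default AIC:

fit <- lm(price ~ size + age + dc + dt, data = property)
step(fit, direction = "backward", k = log(nrow(property)))   # BIC-penalized backward selection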

You want the model the procedure settles on, which should just be the last model listed in the output (in R's output this is the model with the lowest BIC value). The BIC is not without issues. It is an approximation to Bayesian model selection, but with added assumptions that make it not quite the same. Additionally, you could argue that the degrees of freedom are wrong in stepwise regression, because you should really be allowing a degree of freedom for each model that is considered as well.

Additionally, if you have variables that theoretically must be present, then you should choose the best-scoring model that also includes those variables and ignore any that do not.

Finally, you would want to use a stricter criterion for the F statistic than, say, 5%. Each model that is run is really a test, and although you could make manual adjustments with the Holm-Bonferroni method to control the family-wise error rate, most people do not realize they should do that. I do not know of an R routine that automates this across a stepwise search; I suspect there is no easy built-in solution.
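For what it is worth, base R's p.adjust() will apply the Holm correction to a fixed set of p-values, for example the coefficient p-values from your first fit, but it does not by itself account for all the models a stepwise search visits:

pvals <- c(size = 1.04e-09, age = 0.0598, dc = 0.7565, dt = 7.58e-08)  # from the first summary
p.adjust(pvals, method = "holm")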

This makes it quite possible that your model, although the best, may not really be significant.

Sixth, $R^2$ is not a valid tool to select which model is best. It is very common for the best model to have a lower $R^2$ than other choices of models. If your model is the valid model then the $R^2$ will give you an interpretation of what percentage of variability is due to the variables that were chosen, but only if you pick the valid model. It will not help you find it.
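As the comments on the question point out, the right way to compare two nested models is an F test rather than eyeballing $R^2$; with your two fits that is:

fit_full    <- lm(price ~ size + age + dc + dt, data = property)
fit_reduced <- lm(price ~ size + dc + dt, data = property)
anova(fit_reduced, fit_full)   # small p-value: the data favour keeping age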

Finally, seventh, setting the intercept to zero is a very strong theoretical statement. There are cases where it logically MUST be zero. Because this is such a strong statement it drove up your $R^2$. You are saying that a particular value MUST hold. You are defining a relationship and removing any uncertainty as to its value. That will have all kinds of impacts because you are saying you know this to be true with 100% certainty.
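The mechanical reason the number jumps is worth knowing (this is how R's summary.lm computes the statistic, not something specific to your data): with an intercept, $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$, but when you drop the intercept R replaces $\bar{y}$ with zero, so the denominator becomes $\sum_i y_i^2$. Because your prices are all far from zero, that denominator is enormous and the ratio looks close to 1, even though the residual standard error barely changed (103.5 with the intercept versus 104.1 without). The two $R^2$ values are simply not comparable.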

Whenever you tell a statistical model that something must be true with 100% certainty, then the rest of the model will adjust around that as if it were a fact. You want to be sure something is a fact before you say it is a fact. The very fact that you wanted to see what would happen if you removed it says to me that you did not believe this to be a fact. Therefore you have to keep it because you have no theoretical reason to believe it MUST be zero. Based on what you have shown, I have no reason to believe that either because I do not know what your other variables are.

Your intercept is the garbage pail of statistics. It contains all the stuff you forgot to put into your model. By removing it, you have moved that garbage into the slopes. Who knows what kind of garbage you have moved into your slope estimates?

Dave Harris
  • Thank you for the comment! It is very helpful. But I want to know: what is the OLS equation exactly? It says OLS is a weighted linear regression model; what is the weight and what does it imply? How should I interpret the relationship between the independent and dependent variables? I was thinking of only using the lm code in R, but found out that there are ols and glm also. I do not think the independent and dependent variables have a linear relationship. But I do not know which code to use and how to interpret the model. And also, is it possible to create dummy variables from numeric variables? – Allie Kim Nov 29 '16 at 22:01
  • That is a very different set of questions from the first. I only have a few hundred characters. OLS is weighted by the covariances. You do not interpret the weights. If you do not know what you are doing, then you do not use OLS or GLM, or any other regression. The OLS equation is $\mathbf{Y}=\mathbf{X}'\boldsymbol{\beta}+\boldsymbol{\epsilon}$. You should interpret them as a slope equation with an intercept and nothing more. If a particular slope is not significant, then you should interpret it as 0. You should not create dummy variables without a sound reason, as you lose information. – Dave Harris Nov 29 '16 at 22:48
  • Oops, that should have been $\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}$. – Dave Harris Nov 29 '16 at 22:49
  • "If a particular slope is not significant, then you should interpret it as 0." I beg to differ. That's not how significance tests work. – Roland Nov 30 '16 at 08:46
  • I am taking that from Pearson and Neyman's decision theory. If you are not using that, then I agree with you, should not treat it as zero. – Dave Harris Nov 30 '16 at 14:31