You are describing a situation in which the fitted model has one more parameter (the intercept) than the generating model, so it is mildly overparameterized. You are asking two questions: 1) what happens to the model fit when you add the extra parameter, and 2) does the overparameterized model correctly infer that the intercept is non-significant?
I will answer with simulations. First, let's simulate 500 datasets from the scenario you describe (a linear model with intercept = 0), and for each one fit both the overparameterized regression model (slope and intercept) and a model with just a slope and no intercept. We'll extract the p-value for the intercept in the full model, along with the log likelihoods and R-squared values for both models.
# Simulate 500 datasets
set.seed(23)
p <- r1 <- r2 <- lh1 <- lh2 <- numeric(500)
for (i in 1:500) {
  x <- runif(20)
  y <- x + rnorm(20, sd = 0.1)
  # Fit model with intercept
  fit1 <- lm(y ~ x)
  # Fit model without intercept
  fit2 <- lm(y ~ -1 + x)
  # Store the intercept p-value, log likelihoods, and R-squared values
  p[i]   <- summary(fit1)$coefficients[1, 4]
  lh1[i] <- logLik(fit1)
  lh2[i] <- logLik(fit2)
  r1[i]  <- summary(fit1)$r.squared
  r2[i]  <- summary(fit2)$r.squared
}
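As an aside, if you want to see exactly where those stored quantities come from, here is a minimal sketch for a single simulated dataset (the seed here is arbitrary and not part of the loop above, so the numbers will differ):
# One simulated dataset, to show where the stored quantities come from
set.seed(1)                 # arbitrary seed, for illustration only
x <- runif(20)
y <- x + rnorm(20, sd = 0.1)
fit <- lm(y ~ x)
summary(fit)$coefficients   # row 1 is the intercept; column 4 holds its p-value
logLik(fit)                 # the value stored as lh1[i] above
summary(fit)$r.squared      # the value stored as r1[i] above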
Let's address your second question first, by looking at Type I error rates (i.e., false positives) for the intercept term in the overparameterized model.
# Type I error rate for intercept
sum(p < 0.05)/500
## 0.05
The Type I error rate is 0.05: in 5% of the simulations the overparameterized model incorrectly declared the intercept significant, and in the other 95% it correctly inferred that the intercept is not significant. That matches the nominal 5% level, so the intercept test behaves exactly as it should despite the extra parameter.
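If you want a check that goes beyond the 0.05 cutoff (this is not part of the original comparison), the full distribution of those p-values should be roughly uniform on [0, 1] when the true intercept is zero:
# Under the null (true intercept = 0), the intercept p-values should be
# approximately uniform on [0, 1]; 5% below 0.05 is just one slice of that
hist(p, breaks = 20, main = "Intercept p-values", xlab = "p-value")
ks.test(p, "punif")   # approximate formal check of uniformity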
Now what about the model fits? Let's compare the log likelihoods for the two models.
# Difference between log likelihoods
lhdiff <- lh1 - lh2
# Mean difference
mean(lhdiff)
## 0.5877251
# Number of times simpler model has a higher log likelihood
sum(lhdiff < 0)
## 0
You can see that on average the log likelihood is higher for the full model, and there are 0 cases in which the log likelihood is higher for the simpler model (even though the simpler model is the correct one). This is the overfitting problem: for nested models, adding parameters can never decrease the in-sample fit, so the raw log likelihood will always favor the extra parameter.
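This is also why model comparison usually relies on a penalized criterion rather than the raw log likelihood. I did not include this in the simulations above, but a sketch of the same loop using AIC (which charges 2 units per extra parameter) would look like the following; the exact proportion you get depends on the seed:
# Same simulation, but compare the models with AIC, which penalizes the extra parameter
set.seed(23)
aic1 <- aic2 <- numeric(500)
for (i in 1:500) {
  x <- runif(20)
  y <- x + rnorm(20, sd = 0.1)
  aic1[i] <- AIC(lm(y ~ x))        # slope + intercept
  aic2[i] <- AIC(lm(y ~ -1 + x))   # slope only (the generating model)
}
mean(aic2 < aic1)   # proportion of datasets in which AIC prefers the simpler model
Now, you will notice something odd when you compare the R-squared values.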
# Mean R-squared
mean(r1)
## 0.8941026
mean(r2)
## 0.9710714
The mean R-squared is larger for the simpler model. What is going on? Shouldn't R-squared be higher for the model with more parameters? A good explanation for why this occurs can be found here. In short, R-squared is computed differently for models with and without an intercept: with an intercept, the total sum of squares is taken about the mean of y, whereas without an intercept it is taken about zero. The two R-squared values are therefore measured against different baselines and are not directly comparable, and the no-intercept version is typically inflated (a hand computation below shows the two baselines). This is not what usually happens when you add parameters, though: adding parameters other than the intercept (e.g., higher-order terms or additional predictors/slopes) will generally increase the R-squared, and for nested models it can never decrease it.
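If you want to verify the two baselines directly, here is a minimal sketch for a single dataset (again, the seed is arbitrary and the objects are regenerated here rather than taken from the loop above):
# Hand-compute the two R-squared definitions to see the different baselines
set.seed(42)                                     # arbitrary seed, for illustration only
x <- runif(20); y <- x + rnorm(20, sd = 0.1)
fit1 <- lm(y ~ x); fit2 <- lm(y ~ -1 + x)
# With an intercept, the total sum of squares is taken about mean(y)
1 - sum(resid(fit1)^2) / sum((y - mean(y))^2)    # equals summary(fit1)$r.squared
# Without an intercept, the total sum of squares is taken about zero
1 - sum(resid(fit2)^2) / sum(y^2)                # equals summary(fit2)$r.squared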