5

I have a dataset with around 15 independent variables. I am using a multiple regression model to fit the dataset. For model selection, I am using a backward elimination procedure based on the p-values. The adjusted R^2 for the model with all predictors is exactly 1. At this point, I concluded that the model might also be picking up noise. Based on the model selection I then removed 5 predictor variables, and the adjusted R^2 is still 1. I am not sure if this is correct or if I am just modeling noise. Can someone comment on this?

Christoph Hanck
Sahil Gupta

2 Answers

6

Dan and Michael point out the relevant issues. Just for completeness, the relationship between adjusted $R^2$ and $R^2$ is given by (see, e.g., here)

$$ R^2_{adjusted}=1-(1-R^2)\frac{n-1}{n-K}, $$ (with $K$ the number of regressors, including the constant). This shows that $R^2_{adjusted}=1$ if $R^2=1$, unless (see below) $K=n$.
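As a quick numerical check of this relationship, here is a minimal sketch with made-up data (the variable names are just for illustration); the manually computed value matches what summary() reports:

set.seed(1)
n <- 50
X <- matrix(rnorm(n * 3), ncol = 3)             # three slope regressors, unrelated to y
y <- rnorm(n)
fit <- summary(lm(y ~ X))

K <- 3 + 1                                      # three slopes plus the constant
1 - (1 - fit$r.squared) * (n - 1) / (n - K)     # manual adjusted R^2
fit$adj.r.squared                               # same value from summary()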

$R^2=1$ occurs when all residuals $\hat u_i=y_i-\hat y_i$ are zero, as $$ R^2=1-\frac{\hat{u}'\hat{u}/n}{\tilde{y}'\tilde{y}/n}. $$ Here, $\hat u$ denotes the vector of residuals and $\tilde y$ the vector of demeaned observations on the dependent variable.
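Continuing the sketch above, the residual-based formula gives the same number (the $1/n$ factors cancel, so sums suffice):

u_hat <- residuals(lm(y ~ X))         # vector of residuals
y_tilde <- y - mean(y)                # demeaned dependent variable
1 - sum(u_hat^2) / sum(y_tilde^2)     # equals fit$r.squared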

Dan discusses one reason to get an $R^2$ of 1. Another is to have as many regressors as observations, i.e., $K=n$.

Technically, this is because the $n\times K$ regressor matrix $X$ is then square. The OLS estimator $\hat\beta=(X'X)^{-1}X'y$ can then be written as (assuming no exact multicollinearity) $$ \hat\beta=(X'X)^{-1}X'y=X^{-1}{X'}^{-1}X'y=X^{-1}y, $$ so that the fitted values $\hat y=X\hat\beta$ are just $\hat y=XX^{-1}y=y$ and all residuals are zero.
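As a quick numerical check of this algebra, here is a minimal sketch in which the constant enters as a column of ones, so that $X$ is a square $n\times n$ matrix:

set.seed(2)
n <- 5
X <- cbind(1, matrix(rnorm(n * (n - 1)), ncol = n - 1))  # square regressor matrix, constant included
y <- rnorm(n)

beta_hat <- solve(X) %*% y                                        # X^{-1} y
all.equal(as.numeric(coef(lm(y ~ X - 1))), as.numeric(beta_hat))  # same as OLS (constant already in X)
range(y - X %*% beta_hat)                                         # residuals are numerically zero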

Here is an illustration using artificial data (code below), in which regressors are generated totally independently of $y$, and yet we achieve an $R^2$ of 1 once we have as many of them as we have observations.

Code:

n <- 15
regressors <- n-1 # n-1 slope regressors are enough, as we'll also fit a constant
y <- rnorm(n)     # dependent variable, generated independently of the regressors
X <- matrix(rnorm(regressors*n),ncol=regressors)

collectionR2s <- rep(NA,regressors)
for (i in 1:regressors){
  # fit y on the first i regressors (plus a constant) and store the R^2
  collectionR2s[i] <- summary(lm(y~X[,1:i]))$r.squared
}
plot(1:regressors,collectionR2s,col="purple",pch=19,type="b",lwd=2)
abline(h=1, lty=2) # R^2 reaches 1 once K = n

When $K=n$, however, R correctly does not report an adjusted $R^2$:

> summary(lm(y~X))

Call:
lm(formula = y ~ X)

Residuals:
ALL 15 residuals are 0: no residual degrees of freedom!

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.36296         NA      NA       NA
X1          -1.09003         NA      NA       NA
X2           0.39177         NA      NA       NA
X3           0.19273         NA      NA       NA
X4           0.51528         NA      NA       NA
X5          -0.04530         NA      NA       NA
X6          -1.28539         NA      NA       NA
X7          -0.72770         NA      NA       NA
X8          -0.14604         NA      NA       NA
X9           0.34385         NA      NA       NA
X10         -0.93811         NA      NA       NA
X11          2.23064         NA      NA       NA
X12          0.06744         NA      NA       NA
X13          0.21220         NA      NA       NA
X14         -2.29134         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 14 and 0 DF,  p-value: NA
Christoph Hanck
  • Do you want to expand on what u hat and y tilde are? I wonder how many cases the OP has as too few cases can easily lead to perfect prediction. – mdewey Jan 31 '19 at 15:37
  • Right, I should have spelled out all my notation in the first place. I agree it would be helpful information to know the sample size OP has access to. – Christoph Hanck Jan 31 '19 at 15:42
  • By quoting an unnecessarily limited formula for the adjusted $R^2$ (which applies only to the ordinary regression situation but not to multiple regression) you arrive at an incorrect conclusion: the adjusted $R^2$ isn't even defined when there are as many regressors as observations. It's certainly not equal to unity in that case! – whuber Feb 01 '19 at 17:54
  • Ouch, I should have seen that this was not the right expression to connect the adjusted and standard $R^2$. I hope that my edit fixes this. – Christoph Hanck Feb 01 '19 at 19:39
4

An adjusted R squared equal to one implies perfect prediction and is an indication of a problem in your model. Adjusted R squared is a penalised version of R squared, which describes the proportion of the total sum of squares explained by the model (one minus the ratio of the residual sum of squares to the total sum of squares) - as it approaches 1, the implication is that there is essentially no deviation of the observations from your fitted model.
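To see the penalty at work, here is a minimal sketch with made-up data (not the OP's): adding pure-noise predictors can only increase R squared, while adjusted R squared is held back by the penalty.

set.seed(3)
n <- 30
y <- rnorm(n)
x_real <- y + rnorm(n, sd = 0.5)                    # one genuinely related predictor
x_noise <- matrix(rnorm(n * 10), ncol = 10)         # ten pure-noise predictors

fit_small <- summary(lm(y ~ x_real))
fit_big <- summary(lm(y ~ x_real + x_noise))

c(fit_small$r.squared, fit_big$r.squared)           # R squared goes up
c(fit_small$adj.r.squared, fit_big$adj.r.squared)   # adjusted R squared rises far less, if at all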

I would suggest you begin by looking at a correlation matrix, or by putting each predictor into your model individually, to see which predictor is causing the issue.

In R, you will get a warning when you call summary() on such a fit: "essentially perfect fit: summary may be unreliable".

In the (single-predictor) examples below you will see that the adjusted R squared can still be less than 1 even when the correlation between y and x is greater than 0.99.

# create a data frame with some strongly correlated variables
myData <- data.frame(y = rnorm(n = 1000, mean = 0, sd = 1))
myData$x1 <- myData$y
myData$x2 <- jitter(myData$x1, factor = 10)
myData$x3 <- jitter(myData$x1, factor = 1000)

# fit models
myModel1 <- lm(y ~ x1, data = myData)
myModel2 <- lm(y ~ x2, data = myData)
myModel3 <- lm(y ~ x3, data = myData)

# output
summary(myModel1)
#> Warning in summary.lm(myModel1): essentially perfect fit: summary may be
#> unreliable
#> 
#> Call:
#> lm(formula = y ~ x1, data = myData)
#> 
#> Residuals:
#>        Min         1Q     Median         3Q        Max 
#> -4.551e-15 -1.200e-17  6.000e-18  2.090e-17  3.455e-16 
#> 
#> Coefficients:
#>              Estimate Std. Error   t value Pr(>|t|)    
#> (Intercept) 1.404e-17  4.924e-18 2.852e+00  0.00444 ** 
#> x1          1.000e+00  5.085e-18 1.966e+17  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.553e-16 on 998 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 3.867e+34 on 1 and 998 DF,  p-value: < 2.2e-16
summary(myModel2)
#> 
#> Call:
#> lm(formula = y ~ x2, data = myData)
#> 
#> Residuals:
#>        Min         1Q     Median         3Q        Max 
#> -1.996e-03 -9.643e-04 -1.996e-05  1.009e-03  2.034e-03 
#> 
#> Coefficients:
#>              Estimate Std. Error  t value Pr(>|t|)    
#> (Intercept) 3.278e-06  3.647e-05     0.09    0.928    
#> x2          1.000e+00  3.766e-05 26550.25   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00115 on 998 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 7.049e+08 on 1 and 998 DF,  p-value: < 2.2e-16
summary(myModel3)
#> 
#> Call:
#> lm(formula = y ~ x3, data = myData)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.214135 -0.097828 -0.003721  0.099000  0.226453 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -0.001598   0.003602  -0.444    0.657    
#> x3           0.983900   0.003685 266.982   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.1136 on 998 degrees of freedom
#> Multiple R-squared:  0.9862, Adjusted R-squared:  0.9862 
#> F-statistic: 7.128e+04 on 1 and 998 DF,  p-value: < 2.2e-16

cor(myData$x1, myData$x2)
#> [1] 0.9999993
cor(myData$x1, myData$x3)
#> [1] 0.9930721
danCloney