
I am doing stepwise regression as follows:

fit1 = lm (y_train ~ ., data = dat)
step = stepAIC(fit1, direction = "forward")
Error in stepAIC(fit1, direction = "forward") : 
  AIC is -infinity for this model, so 'stepAIC' cannot proceed

> length(y_train)
[1] 132
> dim(x_train)
[1]  132 1501

I searched on Google, but did not find a solution to this problem.

Jill Clover
  • Did you get any warnings from your lm fit? – mdewey Oct 20 '16 at 17:13
  • No, I did not. I also checked the data set. It is ok. – Jill Clover Oct 20 '16 at 17:36
  • You should be happy about it since it prohibits you from using a model selection method that leads to bad, overfitted models: http://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856 – Tim Oct 20 '16 at 17:44
  • Since $\text{AIC} = - 2 \log [ \mathcal{L} (\hat{\theta}) ] + 2p$ it would seem that $\mathcal{L} (\hat{\theta}) = \infty$ which suggests a perfect fit. Also, why are you beginning with all the variables and doing forward selection? – dsaxton Oct 20 '16 at 18:57
  • I also just noticed you have 1,500 variables and only 130 observations, which is clearly the reason for the error. – dsaxton Oct 20 '16 at 22:07
  • @dsaxton Can you explain it a little bit? Appreciated. – Jill Clover Oct 20 '16 at 22:08
  • I know backward is impossible but forward is ok for $p > n$ – Jill Clover Oct 20 '16 at 22:09
  • You have way more parameters than observations so you have a perfect fit. You need to start with far fewer variables, and also "forward" does not make sense when you already have a fully saturated model. – dsaxton Oct 21 '16 at 03:37

1 Answer


An AIC of negative infinity indicates a severely overfitted model, so it is actually fortunate that your stepAIC stopped. A working demo on a small data set is given further below.
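To see where the minus infinity comes from, here is a minimal sketch that reproduces the error with simulated data of the same shape (132 x 1501), since the original x_train/y_train are not available. With more coefficients than observations, lm() interpolates y_train exactly, the residual sum of squares is zero, and the AIC is $-\infty$:

library(MASS)

set.seed(1)
n <- 132; p <- 1501
x_train <- matrix(rnorm(n * p), nrow = n)   # simulated stand-in for the real predictors
y_train <- rnorm(n)                         # simulated stand-in for the real response
dat <- data.frame(y_train, x_train)

fit1 <- lm(y_train ~ ., data = dat)
sum(residuals(fit1)^2)                      # 0: the fit reproduces y_train exactly
AIC(fit1)                                   # -Inf
try(stepAIC(fit1, direction = "forward"))   # the same "AIC is -infinity" error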

First, validate the quality of your original data. Model selection schemes such as forward stepwise have a so-called breakdown point, beyond which the selected model can no longer be trusted.

This thesis finds that when p > n, there is a breakdown point for standard model selection schemes, such that model selection only works well below a certain critical complexity level depending on n/p. This notion is applied to some standard model selection algorithms (Classical Forward Stepwise, Forward Stepwise with False Discovery Rate thresholding, Lasso, LARS, and Stagewise Orthogonal Pursuit) in the case where p > n. (Source)


Around page 58, the thesis explains in more detail how factors such as sparsity and the noise level affect the model selection breakdown point.

The model selection breakdown point is worrying with roughly 1,500 variables and only 130 observations. As mentioned by dsaxton:

I also just noticed you have 1,500 variables and only 130 observations, which is clearly the reason for the error.

In other words, $p/n = \frac{1500}{130} \approx 11.5$, i.e. more than eleven variables per observation, which is far from rules of thumb such as having at least two times more observations than variables.

A working demo on a small data set:

> library(MASS)
> dat<-USJudgeRatings[,1]; 
> y_train<-USJudgeRatings[,2]; 
> fit1 = lm (y_train ~ dat)
> fit1

Call:
lm(formula = y_train ~ dat)

Coefficients:
(Intercept)          dat  
      8.832       -0.109  

> stepAIC(fit1, direction="forward")
Start:  AIC=-20.24
y_train ~ dat


Call:
lm(formula = y_train ~ dat)

Coefficients:
(Intercept)          dat  
      8.832       -0.109  
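As an aside, in the demo above direction = "forward" has nothing to do: without a scope argument the search space is just the starting model, so stepAIC returns it unchanged. A forward search is normally started from a small model (for example intercept-only) with an explicit upper scope. Here is a hedged sketch on the same USJudgeRatings data; the response RTEN and the candidate predictors are my own illustration, not part of the original answer:

library(MASS)

# Forward selection the usual way: start from the intercept-only model and
# let stepAIC add terms one at a time from an explicit candidate set.
null_fit <- lm(RTEN ~ 1, data = USJudgeRatings)
forward_fit <- stepAIC(null_fit,
                       scope = list(lower = ~ 1,
                                    upper = ~ CONT + INTG + DMNR + DILG + CFMG + DECI),
                       direction = "forward")
summary(forward_fit)

Even then, the breakdown-point argument above still applies: with roughly 1,500 candidate variables and only about 130 observations, any stepwise search is very likely to overfit.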

Related questions

  1. What to do if number of features is much larger than number of observations?

  2. Stanford doctoral thesis about model selection and the breakdown point for some simple models, such as regression, here

hhh