
I would like to fit a linear regression model in R for predicting motorbike prices. My dataset has 13 variables, including number of kilometers driven, colour, month of the first registration, etc. Some variables have empty cells (missing values).

Which variables should I use for the linear regression model? Should I use all 13 variables, or should I select the relevant ones by stepwise ANOVA, deleting the least significant variable at each step, or should I use the Akaike information criterion? What is the best way to arrive at the right model?

  • You should read some books on regression, imputation and machine learning. – Roland Dec 23 '15 at 10:53
  • @Roland, that might be a little too much to do for someone with a concrete problem, although I agree that you pointed out the right keywords. – Richard Hardy Dec 23 '15 at 13:41

1 Answer


A good solution to your problem is the LASSO (Least Absolute Shrinkage and Selection Operator). As the name suggests, this method estimates the coefficients and selects the model simultaneously: it shrinks some coefficients exactly to zero, which removes the corresponding variables from the model. The model is the usual linear regression, but the LASSO adds a penalty term to the least-squares objective, so the estimator is
$$ \hat{\beta} = \arg\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|, $$
where $\beta = (\beta_1, \dots, \beta_p)$ is the vector of $p$ coefficients and $\lambda \geq 0$ is called the tuning parameter. Minimizing over $\beta$ for a fixed $\lambda$ forces some coefficients to be exactly zero. There is a huge number of papers and books about the LASSO; just google the keyword "LASSO regression" and you will see plenty of results. Finally, there is an R package called lars with which you can try the LASSO on your dataset (the glmnet package is another common choice).
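For concreteness, here is a minimal sketch of a LASSO fit using the glmnet package. The data frame name `bikes` and the response column `price` are placeholders for your own data, and the sketch assumes the data contain no missing values (see the PS below):

```r
# Minimal sketch: LASSO with cross-validated tuning parameter.
# `bikes` and `price` are placeholder names for your data.
library(glmnet)

# glmnet needs a numeric design matrix; model.matrix dummy-codes
# factors such as colour, and [, -1] drops the intercept column.
x <- model.matrix(price ~ ., data = bikes)[, -1]
y <- bikes$price

# Cross-validation chooses the tuning parameter lambda automatically;
# alpha = 1 gives the lasso penalty.
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Coefficients at the lambda with minimal CV error; predictors
# shrunk exactly to zero have been removed from the model.
coef(cv_fit, s = "lambda.min")
```

The cross-validation step matters because the amount of shrinkage, and hence which variables survive, depends entirely on the choice of $\lambda$.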

PS: if you have missing data in your dataset, you should first deal with the missing values and then apply the LASSO.
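Two simple options for that first step are sketched below; complete-case analysis is the quickest, while imputation (here via the mice package, one possible choice among several) keeps more of your data:

```r
# Option 1: complete-case analysis - drop rows with any missing cell.
bikes_cc <- na.omit(bikes)

# Option 2: impute missing values with the mice package.
library(mice)
imp <- mice(bikes, m = 5, printFlag = FALSE)  # 5 imputed datasets
bikes_imp <- complete(imp, 1)                 # take the first one for a quick fit
```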
