I am having trouble building and interpreting a multiple linear regression model.
I have a data set called Cars93 with 26 variables (numeric and non-numeric) and 93 observations. The data set is available in the MASS package for R.
I want to build a linear regression model for predicting the price of a car. Then I have to do variable selection (forward and backward stepwise) using AIC and BIC in R. I have very little experience with R, which is why I am running into problems. I really hope you can help me!
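For reference, forward and backward selection with AIC and BIC is commonly set up with the base R function step(). The following is only a minimal sketch under assumptions: the predictor subset in vars is an arbitrary illustration chosen because those columns have no missing values, and the names cars, null_fit and full_fit are made up for this example. In step(), k = 2 corresponds to AIC and k = log(n) to BIC.

```r
library(MASS)  # provides the Cars93 data set

# Assumption: an illustrative subset of columns, not a recommended model
vars <- c("Price", "Horsepower", "EngineSize", "MPG.city", "Weight",
          "Wheelbase", "Width", "AirBags", "Origin")
cars <- na.omit(Cars93[, vars])  # all candidate models must use the same rows
n <- nrow(cars)

null_fit <- lm(Price ~ 1, data = cars)  # intercept-only model
full_fit <- lm(Price ~ ., data = cars)  # all selected predictors

# Backward elimination: k = 2 gives AIC, k = log(n) gives BIC
back_aic <- step(full_fit, direction = "backward", k = 2, trace = 0)
back_bic <- step(full_fit, direction = "backward", k = log(n), trace = 0)

# Forward selection starts from the null model and needs an upper scope
scope <- list(lower = ~ 1, upper = formula(full_fit))
fwd_aic <- step(null_fit, scope = scope, direction = "forward",
                k = 2, trace = 0)
fwd_bic <- step(null_fit, scope = scope, direction = "forward",
                k = log(n), trace = 0)

formula(back_bic)  # predictors kept by backward selection under BIC
```

Setting trace = 1 instead prints every step, which helps when the selection has to be explained.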
1) The data set has some missing values.
I dealt with this problem like this:
Cars93[!complete.cases(Cars93), ]
Cars93new <- na.omit(Cars93)
Cars93 <- Cars93new
I think some information gets lost this way. Is there another way to handle the missing values?
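To see how much is lost, it may help to count the missing values per column first. The sketch below uses only base R; the names na_counts, incomplete_cols and cars_all_rows are made up for this example. Leaving out the incomplete columns rather than the incomplete rows is one possible alternative, not necessarily the right one for every analysis.

```r
library(MASS)  # provides the Cars93 data set

# Number of missing values in each column; show only columns that have any
na_counts <- colSums(is.na(Cars93))
na_counts[na_counts > 0]

# Number of rows lost when every incomplete row is dropped
nrow(Cars93) - nrow(na.omit(Cars93))

# Alternative: keep all rows and leave out the incomplete columns instead
incomplete_cols <- names(na_counts)[na_counts > 0]
cars_all_rows <- Cars93[, !(names(Cars93) %in% incomplete_cols)]
nrow(cars_all_rows)
```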
2) Some variables in the data set are not numeric. I tried to convert them into numeric values like this:
Cars93$AirBags <- factor(Cars93$AirBags, labels = c(2, 1, 0))
Cars93$AirBags
[1] 0 2 1 2 1 1 1 1 1 1 2 0 1 2 0 1 2 2 1 0 1 1 1 1 0 2 0 0 0 1 1 1 1 0 1 1 2 2
[39] 0 0 0 0 1 1 2 2 2 0 0 1 1 2 1 0 0 1 1 1 1 0 1 1 0 0 0 2 0 2 1 1 0 0 1 0 1 1
[77] 1 0 0 0 1 2
Levels: 2 1 0
I did the same with the other non-numeric variables.
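Note that relabeling a factor this way does not make it numeric: the result is still a factor whose levels happen to look like numbers (hence the "Levels: 2 1 0" line in the output). A minimal sketch of the difference, working on copies so that Cars93 itself is not changed; airbags and airbag_count are made-up names:

```r
library(MASS)  # provides the Cars93 data set

# Relabel the three AirBags levels as 2, 1, 0 (same idea as above)
airbags <- factor(Cars93$AirBags, labels = c(2, 1, 0))
is.numeric(airbags)        # FALSE: still a factor

# Convert through the labels to get real numbers
airbag_count <- as.numeric(as.character(airbags))
is.numeric(airbag_count)   # TRUE

# Cross-check which original level was mapped to which number
table(Cars93$AirBags, airbag_count)
```

Whether a numeric coding is sensible is a modelling decision: it forces equal steps between the levels, whereas leaving the variable as a factor lets lm() estimate a separate effect per level.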
Afterwards I tried to fit a linear regression model with all the variables:
Modell <- lm(Price ~ Horsepower + EngineSize + MPG.city + MPG.highway + Rev.per.mile + Man.trans.avail + Fuel.tank.capacity + Passengers + Length + Wheelbase + Width + Turn.circle + Weight + Rear.seat.room + Luggage.room + Origin + AirBags + Type + Cylinders + Weight + RPM)
summary(Modell)
But the output does not make any sense to me:
Call:
lm(formula = Price ~ Horsepower + EngineSize + MPG.city + MPG.highway +
    Rev.per.mile + Man.trans.avail + Fuel.tank.capacity + Passengers +
    Length + Wheelbase + Width + Turn.circle + Weight + Rear.seat.room +
    Luggage.room + Origin + AirBags + Type + Cylinders + Weight +
    RPM)

Residuals:
    Min      1Q  Median      3Q     Max
-9.4893 -2.3664 -0.0062  2.1180 18.1112

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        81.335018  37.993697   2.141 0.036826 *
Horsepower          0.123535   0.049355   2.503 0.015372 *
EngineSize         -0.615828   3.047223  -0.202 0.840602
MPG.city           -0.392888   0.470385  -0.835 0.407259
MPG.highway         0.013646   0.428978   0.032 0.974740
Rev.per.mile        0.001498   0.002511   0.597 0.553206
Man.trans.availYes -1.600967   2.480497  -0.645 0.521387
Fuel.tank.capacity  0.462731   0.572169   0.809 0.422219
Passengers          0.615593   1.823089   0.338 0.736925
Length              0.074875   0.130511   0.574 0.568547
Wheelbase           0.740146   0.343760   2.153 0.035796 *
Width              -1.745792   0.571082  -3.057 0.003473 **
Turn.circle        -0.695287   0.415708  -1.673 0.100203
Weight             -0.004068   0.006255  -0.650 0.518171
Rear.seat.room      0.101150   0.420050   0.241 0.810619
Luggage.room        0.176183   0.367199   0.480 0.633306
Originnon-USA       1.881047   1.762845   1.067 0.290696
AirBagsDriver only -3.294049   1.888346  -1.744 0.086777 .
AirBagsNone        -8.535307   2.289737  -3.728 0.000464 ***
TypeLarge          -1.692122   3.999146  -0.423 0.673887
TypeMidsize         2.684947   2.639047   1.017 0.313504
TypeSmall           1.913341   2.896592   0.661 0.511710
TypeSporty          4.686129   3.268426   1.434 0.157407
Cylinders4         -3.126727   4.554852  -0.686 0.495360
Cylinders5         -4.732933   7.498898  -0.631 0.530605
Cylinders6          0.224795   5.695793   0.039 0.968664
Cylinders8          4.020677   7.255406   0.554 0.581755
RPM                -0.002778   0.002450  -1.134 0.261805
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.009 on 54 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8313,  Adjusted R-squared:  0.747
F-statistic: 9.859 on 27 and 54 DF,  p-value: 1.014e-12
I get four rows each for "Cylinders" and "Type" and two rows for "AirBags" in the summary, and I don't know why. Also, only 4 variables are significant in the model. Can somebody tell me where the mistake in my model is?
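The repeated rows come from the way lm() handles factors: a factor with k levels is expanded into k - 1 indicator (dummy) columns, each compared against the first level as the baseline, so every non-baseline level gets its own row in the summary. A minimal sketch that makes this visible; small_fit and its predictors are an arbitrary illustration, not a proposed model:

```r
library(MASS)  # provides the Cars93 data set

# AirBags has three levels; the first one acts as the baseline
levels(Cars93$AirBags)

# The design matrix holds an intercept plus two 0/1 indicator columns
head(model.matrix(~ AirBags, data = Cars93))

# To judge a factor as a whole rather than level by level,
# drop1() with an F test gives one row per term
small_fit <- lm(Price ~ Horsepower + AirBags + Type, data = Cars93)
drop1(small_fit, test = "F")
```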
I would also like to know how to check the assumptions of a multiple linear regression model in R.
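As a starting point, base R already offers the usual residual diagnostics. The sketch below is only an illustration under assumptions: fit uses an arbitrary handful of predictors, and the choice of checks (diagnostic plots, Shapiro-Wilk test, pairwise correlations) is one common selection rather than a complete list.

```r
library(MASS)  # provides the Cars93 data set

# Assumption: a small illustrative model, not the final one
fit <- lm(Price ~ Horsepower + Wheelbase + Width + AirBags, data = Cars93)

# Four standard diagnostic plots: residuals vs fitted (linearity),
# normal Q-Q (normality), scale-location (constant variance),
# residuals vs leverage (influential points)
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Shapiro-Wilk test of normality applied to the residuals
shapiro.test(residuals(fit))

# Rough multicollinearity check: correlations between numeric predictors
round(cor(Cars93[, c("Horsepower", "Wheelbase", "Width")]), 2)
```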