
I am having trouble building and interpreting a multiple linear regression.

I have a data set called Cars93 with 26 variables (numeric and non-numeric) and 93 observations. The data set is available in the MASS R package. I want to build a linear regression model for predicting the price of a car. Then I have to do variable selection (forward and backward stepwise) using AIC and BIC in R. My knowledge of R is limited, which is why I am having problems. I really hope you can help me!

1) The data set has some missing values

I just solved this problem like this:

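 library(MASS)                        # provides Cars93
 Cars93[!complete.cases(Cars93), ]    # show the rows that contain missing values
 Cars93new <- na.omit(Cars93)         # drop every row with at least one NA
 Cars93 <- Cars93new 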

I think some information is lost this way. Is there another way to handle the missing values?

2) Some variables in the data set are not numeric. I tried to convert them into numerical values like this:

Cars93$AirBags <- factor(Cars93$AirBags, labels = c(2, 1, 0)) 
Cars93$AirBags 
  [1] 0 2 1 2 1 1 1 1 1 1 2 0 1 2 0 1 2 2 1 0 1 1 1 1 0 2 0 0 0 1 1 1 1 0 1 1 2 2 
[39] 0 0 0 0 1 1 2 2 2 0 0 1 1 2 1 0 0 1 1 1 1 0 1 1 0 0 0 2 0 2 1 1 0 0 1 0 1 1 
[77] 1 0 0 0 1 2 
Levels: 2 1 0 

I did the same with the other non-numeric variables.

Afterwards I tried to fit a linear regression model with all variables:

Modell <- lm(Price ~ Horsepower + EngineSize + MPG.city + MPG.highway +
    Rev.per.mile + Man.trans.avail + Fuel.tank.capacity + Passengers +
    Length + Wheelbase + Width + Turn.circle + Weight + Rear.seat.room +
    Luggage.room + Origin + AirBags + Type + Cylinders + Weight + RPM)
summary(Modell)

But the output does not make any sense:

Call:
lm(formula = Price ~ Horsepower + EngineSize + MPG.city + MPG.highway + 
    Rev.per.mile + Man.trans.avail + Fuel.tank.capacity + Passengers + 
    Length + Wheelbase + Width + Turn.circle + Weight + Rear.seat.room + 
    Luggage.room + Origin + AirBags + Type + Cylinders + Weight + 
    RPM)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.4893 -2.3664 -0.0062  2.1180 18.1112 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        81.335018  37.993697   2.141 0.036826 *  
Horsepower          0.123535   0.049355   2.503 0.015372 *  
EngineSize         -0.615828   3.047223  -0.202 0.840602    
MPG.city           -0.392888   0.470385  -0.835 0.407259    
MPG.highway         0.013646   0.428978   0.032 0.974740    
Rev.per.mile        0.001498   0.002511   0.597 0.553206    
Man.trans.availYes -1.600967   2.480497  -0.645 0.521387    
Fuel.tank.capacity  0.462731   0.572169   0.809 0.422219    
Passengers          0.615593   1.823089   0.338 0.736925    
Length              0.074875   0.130511   0.574 0.568547    
Wheelbase           0.740146   0.343760   2.153 0.035796 *  
Width              -1.745792   0.571082  -3.057 0.003473 ** 
Turn.circle        -0.695287   0.415708  -1.673 0.100203    
Weight             -0.004068   0.006255  -0.650 0.518171    
Rear.seat.room      0.101150   0.420050   0.241 0.810619    
Luggage.room        0.176183   0.367199   0.480 0.633306    
Originnon-USA       1.881047   1.762845   1.067 0.290696    
AirBagsDriver only -3.294049   1.888346  -1.744 0.086777 .  
AirBagsNone        -8.535307   2.289737  -3.728 0.000464 ***
TypeLarge          -1.692122   3.999146  -0.423 0.673887    
TypeMidsize         2.684947   2.639047   1.017 0.313504    
TypeSmall           1.913341   2.896592   0.661 0.511710    
TypeSporty          4.686129   3.268426   1.434 0.157407    
Cylinders4         -3.126727   4.554852  -0.686 0.495360    
Cylinders5         -4.732933   7.498898  -0.631 0.530605    
Cylinders6          0.224795   5.695793   0.039 0.968664    
Cylinders8          4.020677   7.255406   0.554 0.581755    
RPM                -0.002778   0.002450  -1.134 0.261805    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.009 on 54 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8313,    Adjusted R-squared:  0.747 
F-statistic: 9.859 on 27 and 54 DF,  p-value: 1.014e-12

"Cylinders" and "Type" each appear four times and "AirBags" appears twice in the summary, and I don't know why. Also, only 4 variables are significant in the model. Can somebody tell me where the mistake in my model is?

I would also like to know how to check the other assumptions of a multiple linear model in R.

Sycorax
angersan
  • It's treating cylinders as a categorical variable (should it be?). Same thing with the other variables. –  Aug 18 '14 at 21:19
  • Tell what assumptions you want to test and we can better help you understand what you need to explore to test the assumptions. –  Aug 18 '14 at 21:20
  • This is too broad, & there are too many questions here for a simple Q&A site like CV. What you need is to take a regression course, or read a good regression textbook. If you can focus the question, we may be able to help you w/ specific issues. Regarding the results for non-numeric variables, read some of our threads on [tag:categorical-data]. For the missing data, try our threads on [tag:missing-data]. Also, you should not use stepwise selection, see [here](http://stats.stackexchange.com/a/20856/7290). – gung - Reinstate Monica Aug 18 '14 at 21:21
  • Yes, the variables (cylinders, origin, airbags, type, drivetrain, etc.) are categorical. Can I use these variables as they are (numeric), or should I convert them in R? Can you tell me how (which R command)? I have to build a multiple linear model with these variables... – angersan Aug 19 '14 at 18:17

1 Answer


The biggest impediment is unfamiliarity with R. The Internet is overflowing with R tutorials in flavors for every academic discipline. Stack Overflow has thousands of questions about how to do things in R; searching the archives, using keywords from this answer, will set you on the right track.

1) The technical name for your approach to missing values is casewise deletion. This can create problems when the missingness of an observation depends on its value. The classic example is a survey question about a person's income. People with very high or very low incomes might be less likely to answer, and simply omitting these people can bias estimates that depend on income as a predictor.

Handling missing data is a huge topic in statistics. One approach is called imputation, which attempts to fill in the missing values by using observed values as a guide. But there are many others. Search the missing data tag for more information.
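As a rough illustration, assuming the `Cars93` data from the MASS package, the sketch below counts the missing values per variable and contrasts casewise deletion with a crude median fill. The median fill only stands in for imputation in general and is not a recommended method; the object names `cars_complete` and `cars_imputed` are made up for this example.

```r
library(MASS)  # provides the Cars93 data frame

# Count missing values in each column and show only the affected columns
na_counts <- colSums(is.na(Cars93))
na_counts[na_counts > 0]

# Casewise deletion: drop every row that has at least one NA
cars_complete <- na.omit(Cars93)
nrow(Cars93) - nrow(cars_complete)  # number of rows lost

# Crude single imputation (illustration only): replace NAs in numeric
# columns with the column median, so that no rows are dropped
cars_imputed <- Cars93
for (v in names(cars_imputed)) {
  col <- cars_imputed[[v]]
  if (is.numeric(col) && anyNA(col)) {
    cars_imputed[[v]][is.na(col)] <- median(col, na.rm = TRUE)
  }
}
nrow(cars_imputed)  # all rows are kept
```

Filling in a single value like this understates the uncertainty about the missing entries, which is one reason more careful approaches such as multiple imputation exist.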

2) You didn't actually convert airbags to a numerical value, you just made it a factor with labels "0", "1", and "2". If you converted it to a factor and labeled it "A", "B" and "C" or "Tom", "Dick", and "Harry", R would treat it the same way, because factors are regarded as categories in R: political parties or religions or brands of cars are all logically represented as factors; numbers are not.

You'll have to coerce those values to numeric quantities (reals or integers) in order to treat them as numbers.
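A minimal sketch of the difference, using a small made-up vector rather than the real data (the names `airbags` and `relabelled` are hypothetical). Note that calling `as.numeric()` directly on a factor returns its internal level codes, not the labels, so the labels have to go through `as.character()` first.

```r
# Toy factor with the same three categories as AirBags
airbags <- factor(
  c("None", "Driver only", "Driver & Passenger", "None"),
  levels = c("Driver & Passenger", "Driver only", "None")
)

# Relabelling keeps it a factor; the labels are just names that look like numbers
relabelled <- factor(airbags, labels = c(2, 1, 0))
is.numeric(relabelled)                # FALSE
levels(relabelled)                    # "2" "1" "0"

# as.numeric() on a factor gives the internal codes (position of the level)
as.numeric(relabelled)                # 3 2 1 3

# Converting via character gives the numbers written in the labels
as.numeric(as.character(relabelled))  # 0 1 2 0
```

Whether a category should be treated as a number at all is a modelling decision: a numeric airbag count assumes each additional airbag shifts the price by the same amount.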

3) Again, this has to do with data types. For a factor with $k$ levels, R estimates coefficients for only $k-1$ of them, because the $k$ indicator columns together sum to the intercept column and would be perfectly collinear with it. Each level $j$ is represented by a binary vector that is $1$ when the observation belongs to category $j$ and $0$ otherwise. The first level of the factor (by default, the first alphabetically) is used as the reference category and is absorbed into the intercept. The remaining levels are reported in the model summary.
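The dummy coding can be inspected directly with `model.matrix()`. The sketch below uses a small made-up factor (the names `type` and `type2` are hypothetical) and also shows how `relevel()` changes the reference category.

```r
# Toy factor with k = 3 levels
type <- factor(c("Compact", "Large", "Small", "Compact", "Small"))
levels(type)             # "Compact" "Large" "Small"

# Design matrix: intercept plus k - 1 = 2 indicator columns;
# "Compact" is the reference level and has no column of its own
X <- model.matrix(~ type)
colnames(X)              # "(Intercept)" "typeLarge" "typeSmall"
X

# Choose a different reference category
type2 <- relevel(type, ref = "Small")
colnames(model.matrix(~ type2))  # "(Intercept)" "type2Compact" "type2Large"
```

Each reported coefficient is then the estimated difference between that level and the reference level, holding the other predictors fixed.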

4) Your last questions are far too broad to be answered here. The literal answer to why your model has only a few "significant" features is that most coefficients are small relative to their standard errors, so they are plausibly zero. More detail can be found in a standard regression textbook, or in the CV archives.
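To make "small relative to their standard errors" concrete, the sketch below recomputes the t value and p-value for `EngineSize` from the numbers printed in the question's summary (estimate, standard error, and 54 residual degrees of freedom). The variable names are made up for the example.

```r
# Values copied from the EngineSize row of the summary in the question
estimate  <- -0.615828
std_error <-  3.047223
df_resid  <- 54

# t value: the estimate measured in units of its standard error
t_value <- estimate / std_error
round(t_value, 3)        # -0.202, as in the summary table

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)
round(p_value, 4)

# 95% confidence interval; it comfortably contains zero
ci <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error
ci
```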

Another answer is that you either need a better model, more data, or both. If this model was selected with stepwise AIC, it can be shown that the selection procedure is roughly the same as only keeping features with p-values $\le 0.16$. It's widely acknowledged in the statistical community that stepwise feature selection is a minimum logic estimator, rather than a useful tool to pick out meaningful relationships in data.
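The 0.16 figure can be checked with a short calculation, under the assumption that two nested models differ by a single parameter and that the likelihood-ratio statistic is approximately chi-squared with one degree of freedom. AIC prefers the larger model when that statistic exceeds 2, and BIC when it exceeds $\log(n)$. The value $n = 82$ below is the number of complete cases in the question.

```r
# AIC penalty per parameter is 2: the implied p-value cut-off
p_aic <- pchisq(2, df = 1, lower.tail = FALSE)
round(p_aic, 4)   # 0.1573

# BIC penalty per parameter is log(n): a stricter cut-off for n = 82
n <- 82
p_bic <- pchisq(log(n), df = 1, lower.tail = FALSE)
p_bic
```

In `step()`, the same penalty is set through the `k` argument: the default `k = 2` corresponds to AIC and `k = log(n)` to BIC.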

My recommendation for improving your model is to take a step back and really think about what features might increase or decrease a car's price. Do you really think that many people decide to buy a car based on its gas tank volume? Or length? On the other hand, for the same reason that a designer purse commands an exorbitant price, a car's brand may be important, but this doesn't appear to figure into your model.

The final question, how to check your assumptions, can only reasonably be answered by reading up on regression. One resource is here. But I can't overstate the value of reading a good textbook, especially when getting started.
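As a starting point only, here is a sketch of some common residual checks in base R. The model formula is an arbitrary small example on `Cars93`, not a recommended specification, and the checks shown are a selection rather than a complete list.

```r
library(MASS)  # provides the Cars93 data frame

# Small illustrative model (the choice of predictors is arbitrary)
fit <- lm(Price ~ Horsepower + Weight + AirBags, data = Cars93)
res <- residuals(fit)

# Residuals vs. fitted values: look for curvature (nonlinearity)
# or a funnel shape (non-constant variance)
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals: points far from the line
# suggest non-normal errors or outliers
qqnorm(res)
qqline(res)

# Formal normality test, best read alongside the plots
shapiro.test(res)
```

Calling `plot(fit)` produces R's standard set of diagnostic plots for an `lm` object in one go.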

Sycorax
  • +1 This is a good answer to a nearly unanswerable question. But I'm pretty sure you meant to open by stating that *unfamiliarity* with `R` is an impediment, not familiarity with it! – whuber Aug 18 '14 at 21:51
  • Glad to help! And good luck! – Sycorax Aug 19 '14 at 18:28
  • 2) For example: I have this data set in an Excel file where all these categorical variables are numeric (e.g. "origin" = 0 or 1, "drive-train" = 0, 1, or 2, etc.). Can I build a multiple linear model with these variables, or should I transform them in some way before I build such a model? – angersan Aug 19 '14 at 18:30
  • And if that's the case, how should I convert the categorical variables in R? I find a lot of information on the internet, but not the right R command... – angersan Aug 19 '14 at 18:32
  • You'll want to get comfortable with building up a set of references and resources to look up these kinds of questions. Places to start are R's documentation, Stack Overflow, web tutorials, and textbooks about programming in R. All of these offer different types of insights into how to program in R. – Sycorax Aug 19 '14 at 18:36