5

I am trying to recreate analysis of diamond data in R and compare it to minitab output cited in http://www.amstat.org/publications/jse/v9n2/datasets.chu.html

I'm almost certain its the same data set yet my results are different. I suspect I'm modeling it differently. I also noticed that I'm missing colourD and some others. Did the the minitab model somehow pick different dummys to exclude?

R OUTPUT

require("Ecdat")

data(Diamond)

Diamond$ln_price <- log(Diamond$price)

summary(lm(ln_price~carat+colour+clarity+certification, data=Diamond))


Call:
lm(formula = ln_price ~ carat + colour + clarity + certification, 
    data = Diamond)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.31236 -0.11520  0.01613  0.10833  0.36339 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       6.8011915  0.0496111 137.090  < 2e-16 ***
carat             2.8550130  0.0369676  77.230  < 2e-16 ***
colourE          -0.0295093  0.0404611  -0.729  0.46638    
colourF          -0.1063582  0.0379807  -2.800  0.00544 ** 
colourG          -0.2063494  0.0389848  -5.293 2.35e-07 ***
colourH          -0.2878756  0.0394752  -7.293 2.81e-12 ***
colourI          -0.4165565  0.0413818 -10.066  < 2e-16 ***
clarityVS1       -0.2019313  0.0310634  -6.501 3.41e-10 ***
clarityVS2       -0.2985406  0.0333027  -8.964  < 2e-16 ***
clarityVVS1      -0.0007058  0.0311121  -0.023  0.98192    
clarityVVS2      -0.0966174  0.0289396  -3.339  0.00095 ***
certificationHRD -0.0088557  0.0208641  -0.424  0.67155    
certificationIGI -0.1827107  0.0249516  -7.323 2.33e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.1382 on 295 degrees of freedom
Multiple R-squared: 0.9723,     Adjusted R-squared: 0.9712 
F-statistic: 863.6 on 12 and 295 DF,  p-value: < 2.2e-16 

MINITAB OUTPUT

ln_price = 6.08 + 2.86 Carat + 0.417 D + 0.387 E + 0.310 F + 0.210 G + 0.129 H
          + 0.299 IF + 0.298 VVS1 + 0.202 VVS2 + 0.0966 VS1 + 0.0089 GIA
           - 0.174 IGI

Predictor        Coef       StDev          T        P       
Constant      6.07724     0.04809     126.37    0.000
Carat         2.85501     0.03697      77.23    0.000      
D             0.41656     0.04138      10.07    0.000       
E             0.38705     0.03082      12.56    0.000       
F             0.31020     0.02748      11.29    0.000       
G             0.21021     0.02836       7.41    0.000       
H             0.12868     0.02852       4.51    0.000       
IF            0.29854     0.03330       8.96    0.000       
VVS1          0.29783     0.02810      10.60    0.000       
VVS2          0.20192     0.02534       7.97    0.000       
VS1           0.09661     0.02492       3.88    0.000       
GIA           0.00886     0.02086       0.42    0.672       
IGI          -0.17385     0.02867      -6.06    0.000       

S = 0.1382      R-Sq = 97.2%     R-Sq(adj) = 97.1%

Analysis of Variance

Source            DF          SS          MS         F        P
Regression        12     197.939      16.495    863.64    0.000
Residual Error   295       5.634       0.019
Total            307     203.574

UPDATE Glen_b's solution produces the correct results, but the output is not desirable.

                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           6.3846350  0.0411520 155.148  < 2e-16 ***
carat                 2.8550130  0.0369676  77.230  < 2e-16 ***
C(colour, base = 6)1  0.4165565  0.0413818  10.066  < 2e-16 ***
C(colour, base = 6)2  0.3870472  0.0308241  12.557  < 2e-16 ***
C(colour, base = 6)3  0.3101983  0.0274791  11.288  < 2e-16 ***
C(colour, base = 6)4  0.2102072  0.0283593   7.412 1.32e-12 ***
C(colour, base = 6)5  0.1286809  0.0285231   4.511 9.31e-06 ***

Is there a way to get the original factor character displayed instead of the level?

UPDATE

Using relevel to set the base to the same bases as the minitab produces the same results and is still readable.

  Diamond$colour <- relevel(Diamond$colour, ref="I")

  Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
  (Intercept)       6.3846350  0.0411520 155.148  < 2e-16 ***
  carat             2.8550130  0.0369676  77.230  < 2e-16 ***
  colourD           0.4165565  0.0413818  10.066  < 2e-16 ***
  colourE           0.3870472  0.0308241  12.557  < 2e-16 ***
  colourF           0.3101983  0.0274791  11.288  < 2e-16 ***
  colourG           0.2102072  0.0283593   7.412 1.32e-12 ***
  colourH           0.1286809  0.0285231   4.511 9.31e-06 ***
Glen_b
  • 257,508
  • 32
  • 553
  • 939
mikeb
  • 53
  • 5

1 Answers1

6

The model (the specific parameters and X-matrix) is not exactly the same, clearly. Note that the baseline color (the reference group) in R is "D" while in Minitab it's "I" (see there's no coefficient for D in the R output and none for I in the Minitab output?).

Looking through, it looks like there are similar differences in the other categorical variables.

Did the the minitab model somehow pick different dummys to exclude?

Yes.

You can make either one work the same as the other.

For example, see the checked answer to this question for how to change the reference category of a factor in R.

(The question asks about logistic regression but the solution works quite generally.)

Glen_b
  • 257,508
  • 32
  • 553
  • 939
  • 1
    This lead me to relevel which is what I ended up going with – mikeb Dec 02 '12 at 16:36
  • +1. Note that the same is true for clarity (inclusions in the stone), ie, `IF` is missing from R's output, & `VS2` is missing from minitab's output. It appears that minitab uses the *last* category as the reference level, whereas R uses the 1st by default. Because the ref level gets included in the intercept, that value differs, and how the other categories differ from the intercept differ as well. – gung - Reinstate Monica Dec 02 '12 at 18:00