I am trying to recreate analysis of diamond data in R and compare it to minitab output cited in http://www.amstat.org/publications/jse/v9n2/datasets.chu.html
I'm almost certain its the same data set yet my results are different. I suspect I'm modeling it differently. I also noticed that I'm missing colourD and some others. Did the the minitab model somehow pick different dummys to exclude?
R OUTPUT
require("Ecdat")
data(Diamond)
Diamond$ln_price <- log(Diamond$price)
summary(lm(ln_price~carat+colour+clarity+certification, data=Diamond))
Call:
lm(formula = ln_price ~ carat + colour + clarity + certification,
data = Diamond)
Residuals:
Min 1Q Median 3Q Max
-0.31236 -0.11520 0.01613 0.10833 0.36339
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.8011915 0.0496111 137.090 < 2e-16 ***
carat 2.8550130 0.0369676 77.230 < 2e-16 ***
colourE -0.0295093 0.0404611 -0.729 0.46638
colourF -0.1063582 0.0379807 -2.800 0.00544 **
colourG -0.2063494 0.0389848 -5.293 2.35e-07 ***
colourH -0.2878756 0.0394752 -7.293 2.81e-12 ***
colourI -0.4165565 0.0413818 -10.066 < 2e-16 ***
clarityVS1 -0.2019313 0.0310634 -6.501 3.41e-10 ***
clarityVS2 -0.2985406 0.0333027 -8.964 < 2e-16 ***
clarityVVS1 -0.0007058 0.0311121 -0.023 0.98192
clarityVVS2 -0.0966174 0.0289396 -3.339 0.00095 ***
certificationHRD -0.0088557 0.0208641 -0.424 0.67155
certificationIGI -0.1827107 0.0249516 -7.323 2.33e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1382 on 295 degrees of freedom
Multiple R-squared: 0.9723, Adjusted R-squared: 0.9712
F-statistic: 863.6 on 12 and 295 DF, p-value: < 2.2e-16
MINITAB OUTPUT
ln_price = 6.08 + 2.86 Carat + 0.417 D + 0.387 E + 0.310 F + 0.210 G + 0.129 H
+ 0.299 IF + 0.298 VVS1 + 0.202 VVS2 + 0.0966 VS1 + 0.0089 GIA
- 0.174 IGI
Predictor Coef StDev T P
Constant 6.07724 0.04809 126.37 0.000
Carat 2.85501 0.03697 77.23 0.000
D 0.41656 0.04138 10.07 0.000
E 0.38705 0.03082 12.56 0.000
F 0.31020 0.02748 11.29 0.000
G 0.21021 0.02836 7.41 0.000
H 0.12868 0.02852 4.51 0.000
IF 0.29854 0.03330 8.96 0.000
VVS1 0.29783 0.02810 10.60 0.000
VVS2 0.20192 0.02534 7.97 0.000
VS1 0.09661 0.02492 3.88 0.000
GIA 0.00886 0.02086 0.42 0.672
IGI -0.17385 0.02867 -6.06 0.000
S = 0.1382 R-Sq = 97.2% R-Sq(adj) = 97.1%
Analysis of Variance
Source DF SS MS F P
Regression 12 197.939 16.495 863.64 0.000
Residual Error 295 5.634 0.019
Total 307 203.574
UPDATE Glen_b's solution produces the correct results, but the output is not desirable.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.3846350 0.0411520 155.148 < 2e-16 ***
carat 2.8550130 0.0369676 77.230 < 2e-16 ***
C(colour, base = 6)1 0.4165565 0.0413818 10.066 < 2e-16 ***
C(colour, base = 6)2 0.3870472 0.0308241 12.557 < 2e-16 ***
C(colour, base = 6)3 0.3101983 0.0274791 11.288 < 2e-16 ***
C(colour, base = 6)4 0.2102072 0.0283593 7.412 1.32e-12 ***
C(colour, base = 6)5 0.1286809 0.0285231 4.511 9.31e-06 ***
Is there a way to get the original factor character displayed instead of the level?
UPDATE
Using relevel to set the base to the same bases as the minitab produces the same results and is still readable.
Diamond$colour <- relevel(Diamond$colour, ref="I")
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.3846350 0.0411520 155.148 < 2e-16 ***
carat 2.8550130 0.0369676 77.230 < 2e-16 ***
colourD 0.4165565 0.0413818 10.066 < 2e-16 ***
colourE 0.3870472 0.0308241 12.557 < 2e-16 ***
colourF 0.3101983 0.0274791 11.288 < 2e-16 ***
colourG 0.2102072 0.0283593 7.412 1.32e-12 ***
colourH 0.1286809 0.0285231 4.511 9.31e-06 ***