I am having trouble with a set of very simple linear regressions. I cannot get the skewness/kurtosis and homoscedasticity assumptions to hold, even after removing outliers, adding polynomial terms, and applying log and Box-Cox transformations.
I have two datasets (sample1 and sample2), both with the columns:
- someX: results of a measurement that's different between sample1 and sample2
- database1: Dollar amounts (in millions) from one database
- database2: Dollar amounts (in millions) from another database
The goal is to do four simple regressions:
- Regression 1: database1 ~ someX (sample1)
- Regression 2: database2 ~ someX (sample1)
- Regression 3: database1 ~ someX (sample2)
- Regression 4: database2 ~ someX (sample2)
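For reference, the four fits could be set up roughly as follows. This is only a minimal sketch on simulated stand-in data (the real data are in the repository linked at the end of this post); `make_sample`, `samples`, `grid` and `fits` are illustrative names, not taken from my actual scripts.

```r
# Simulated stand-ins for sample1 and sample2 (NOT the real data).
set.seed(1)
make_sample <- function(n) {
  someX <- exp(rnorm(n, mean = 5, sd = 1))
  data.frame(
    someX     = someX,
    database1 = exp(0.5 + 0.9 * log(someX) + rnorm(n, sd = 0.3)),
    database2 = exp(0.2 + 1.0 * log(someX) + rnorm(n, sd = 0.3))
  )
}
samples <- list(sample1 = make_sample(200), sample2 = make_sample(200))

# One row per regression; the response varies fastest, so the rows
# follow the order Regression 1, 2, 3, 4 listed above.
grid <- expand.grid(response = c("database1", "database2"),
                    sample   = names(samples),
                    stringsAsFactors = FALSE)

# Fit response ~ someX on the matching sample for each row of the grid.
fits <- lapply(seq_len(nrow(grid)), function(i) {
  lm(reformulate("someX", response = grid$response[i]),
     data = samples[[grid$sample[i]]])
})
names(fits) <- paste0("Regression ", seq_along(fits))
```

Each element of `fits` is an ordinary `lm` object, so the same diagnostics can then be applied to all four in a loop.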
I have tried several combinations for each of the four: removing outliers/influential points, adding polynomial terms, applying log and Box-Cox transformations, and using per capita values (as someX represents people). However, I can only meet all assumptions for the (log-log) Regression 1:
lm(formula = log(database1) ~ log(someX) + I(log(someX)^2), data = sample1,
    subset = -c(160, 100, 132))
Coefficients:
    (Intercept)       log(someX)  I(log(someX)^2)
        0.56298          0.93676         -0.05951
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = mod1)
                    Value p-value                Decision
Global Stat        5.6807 0.22429 Assumptions acceptable.
Skewness           0.3372 0.56144 Assumptions acceptable.
Kurtosis           3.7731 0.05208 Assumptions acceptable.
Link Function      1.4262 0.23239 Assumptions acceptable.
Heteroscedasticity 0.1442 0.70414 Assumptions acceptable.
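A check of this kind can be reproduced along the following lines. This is a hedged sketch on simulated stand-in data, so the numbers it prints will not match the output above; only the model formula, the `subset` argument and the `gvlma(x = mod1)` call are taken from my output, and the data-generating values are arbitrary assumptions.

```r
# Simulated stand-in for sample1 (NOT the real data; coefficients are arbitrary).
set.seed(42)
n <- 200
someX <- exp(rnorm(n, mean = 5, sd = 1))
sample1 <- data.frame(
  someX     = someX,
  database1 = exp(0.5 + 0.9 * log(someX) - 0.05 * log(someX)^2 +
                  rnorm(n, sd = 0.3))
)

# Log-log model with a quadratic term, dropping three rows by index.
mod1 <- lm(log(database1) ~ log(someX) + I(log(someX)^2),
           data = sample1, subset = -c(160, 100, 132))

# Global validation of linear model assumptions, if gvlma is installed.
if (requireNamespace("gvlma", quietly = TRUE)) {
  print(gvlma::gvlma(x = mod1))
}
```

Note that `subset = -c(160, 100, 132)` drops rows by position, so the indices only make sense for the data frame as it was ordered when the outliers were identified.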
For the other three, I cannot meet all assumptions. For example, for the (log-log) Regression 3:
Call:
lm(formula = log(database1) ~ log(someX) + I(log(someX)^2) +
    I(log(someX)^3), data = sample2)
Coefficients:
    (Intercept)       log(someX)  I(log(someX)^2)  I(log(someX)^3)
      -11.90320          7.60609         -1.20549          0.06387
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = mod3)
                      Value   p-value                   Decision
Global Stat        30.29064 4.271e-06 Assumptions NOT satisfied!
Skewness           21.38611 3.755e-06 Assumptions NOT satisfied!
Kurtosis            1.33551 2.478e-01    Assumptions acceptable.
Link Function       0.05342 8.172e-01    Assumptions acceptable.
Heteroscedasticity  7.51559 6.117e-03 Assumptions NOT satisfied!
And with a Box-Cox transformation of the response for Regression 3:
Call:
lm(formula = database1.tran ~ log(someX) + I(log(someX)^2) +
    I(log(someX)^3), data = sample2)
Coefficients:
    (Intercept)       log(someX)  I(log(someX)^2)  I(log(someX)^3)
       -22.2267          13.2958          -2.0871           0.1108
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = mod3)
                      Value  p-value                   Decision
Global Stat        15.26539 0.004181 Assumptions NOT satisfied!
Skewness            1.41446 0.234317    Assumptions acceptable.
Kurtosis            6.35226 0.011723 Assumptions NOT satisfied!
Link Function       0.03864 0.844168    Assumptions acceptable.
Heteroscedasticity  7.46004 0.006308 Assumptions NOT satisfied!
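For readers unfamiliar with the transformation, one common way to build a Box-Cox-transformed response such as `database1.tran` is sketched below with `MASS::boxcox`. This is an assumption about the general approach, shown on simulated stand-in data, and not necessarily the exact code in the repository; `base_fit`, `bc`, `lambda` and `mod3_bc` are illustrative names.

```r
library(MASS)  # provides boxcox()

# Simulated stand-in for sample2 (NOT the real data); response is strictly positive.
set.seed(7)
n <- 200
someX <- exp(rnorm(n, mean = 5, sd = 1))
sample2 <- data.frame(
  someX     = someX,
  database1 = (5 + 0.8 * log(someX) + rnorm(n, sd = 0.4))^2
)

# Fit on the untransformed response, then profile the Box-Cox log-likelihood.
base_fit <- lm(database1 ~ log(someX) + I(log(someX)^2) + I(log(someX)^3),
               data = sample2)
bc <- boxcox(base_fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Pick the lambda on the grid that maximises the profile log-likelihood.
lambda <- bc$x[which.max(bc$y)]

# Apply the transformation; lambda = 0 corresponds to the log transform.
sample2$database1.tran <- if (abs(lambda) < 1e-8) {
  log(sample2$database1)
} else {
  (sample2$database1^lambda - 1) / lambda
}

# Refit with the transformed response.
mod3_bc <- lm(database1.tran ~ log(someX) + I(log(someX)^2) + I(log(someX)^3),
              data = sample2)
```

Because the transformation targets the distribution of the response, it can improve skewness without fixing non-constant variance, which is consistent with what the output above shows.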
I chose these polynomial terms for Regression 3 because they appear to give the best fit.
The reproducible code and all data are here: https://github.com/d-paulus/regression-examples
A binder of the code and data is here: https://mybinder.org/v2/gh/d-paulus/regression-examples/HEAD