I am running into issues with some very simple linear regressions: I cannot get the skewness/kurtosis and homoscedasticity assumptions (as reported by gvlma) to be met, even after removing outliers, adding polynomial terms, and applying log and Box-Cox transformations.

I have two datasets (sample1 and sample2), both with the columns:

  • someX: results of a measurement that differs between sample1 and sample2
  • database1: Dollar amounts (in millions) from one database
  • database2: Dollar amounts (in millions) from another database

The goal is to do four simple regressions (see the lm sketch after this list):

  • Regression 1: database1 ~ someX (sample1)
  • Regression 2: database2 ~ someX (sample1)
  • Regression 3: database1 ~ someX (sample2)
  • Regression 4: database2 ~ someX (sample2)
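
In R, the untransformed baselines are simply (a minimal sketch, assuming both data frames are already loaded):

# The four baseline regressions, before any transformations
reg1 <- lm(database1 ~ someX, data = sample1)  # Regression 1
reg2 <- lm(database2 ~ someX, data = sample1)  # Regression 2
reg3 <- lm(database1 ~ someX, data = sample2)  # Regression 3
reg4 <- lm(database2 ~ someX, data = sample2)  # Regression 4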

I have tried several combinations of remedies for each of the four: removing outliers/influential points, adding polynomial terms, applying log and Box-Cox transformations, and using per-capita values (someX represents numbers of people). But I can only meet all assumptions for (log-log) Regression 1:

Call:
lm(formula = log(database1) ~ log(someX) + I(log(someX)^2), data = sample1, 
    subset = -c(160, 100, 132))

Coefficients:
    (Intercept)       log(someX)  I(log(someX)^2)  
        0.56298          0.93676         -0.05951  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = mod1) 

                    Value p-value                Decision
Global Stat        5.6807 0.22429 Assumptions acceptable.
Skewness           0.3372 0.56144 Assumptions acceptable.
Kurtosis           3.7731 0.05208 Assumptions acceptable.
Link Function      1.4262 0.23239 Assumptions acceptable.
Heteroscedasticity 0.1442 0.70414 Assumptions acceptable.
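
For reference, this model and the check above can be reproduced along these lines (a minimal sketch; the subset indices are the rows I removed as outliers/influential points):

library(gvlma)  # global validation of linear model assumptions

# Regression 1: log-log with a quadratic term, three influential rows removed
mod1 <- lm(log(database1) ~ log(someX) + I(log(someX)^2),
           data = sample1, subset = -c(160, 100, 132))
gvlma(mod1)  # prints the assessment shown above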

For the other three regressions, I cannot meet all assumptions. For example, for (log-log) Regression 3:

Call:
lm(formula = log(database1) ~ log(someX) + I(log(someX)^2) + 
    I(log(someX)^3), data = sample2)

Coefficients:
    (Intercept)       log(someX)  I(log(someX)^2)  I(log(someX)^3)  
      -11.90320          7.60609         -1.20549          0.06387  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = mod3) 

                      Value   p-value                   Decision
Global Stat        30.29064 4.271e-06 Assumptions NOT satisfied!
Skewness           21.38611 3.755e-06 Assumptions NOT satisfied!
Kurtosis            1.33551 2.478e-01    Assumptions acceptable.
Link Function       0.05342 8.172e-01    Assumptions acceptable.
Heteroscedasticity  7.51559 6.117e-03 Assumptions NOT satisfied!

And with a Box-Cox transformation of the response for Regression 3:

Call:
lm(formula = database1.tran ~ log(someX) + I(log(someX)^2) + 
    I(log(someX)^3), data = sample2)

Coefficients:
    (Intercept)       log(someX)  I(log(someX)^2)  I(log(someX)^3)  
       -22.2267          13.2958          -2.0871           0.1108  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = mod3) 

                      Value  p-value                   Decision
Global Stat        15.26539 0.004181 Assumptions NOT satisfied!
Skewness            1.41446 0.234317    Assumptions acceptable.
Kurtosis            6.35226 0.011723 Assumptions NOT satisfied!
Link Function       0.03864 0.844168    Assumptions acceptable.
Heteroscedasticity  7.46004 0.006308 Assumptions NOT satisfied!
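
Here database1.tran is the Box-Cox-transformed response. A minimal sketch of that step, assuming lambda is chosen by maximizing the profile log-likelihood from MASS::boxcox (the exact code is in the linked repository):

library(MASS)  # for boxcox()

# Profile the Box-Cox log-likelihood over lambda for the untransformed model
bc <- boxcox(lm(database1 ~ log(someX) + I(log(someX)^2) + I(log(someX)^3),
                data = sample2), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Transform the response with the chosen lambda
# (if lambda were ~0, log(database1) would be used instead)
sample2$database1.tran <- (sample2$database1^lambda - 1) / lambda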

I chose these polynomial terms for Regression 3 because, visually, they give the best fit:

[Figure: Regression 3 — log(database1) vs. log(someX) with the fitted curve]
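
A more formal alternative to eyeballing the degree would be an AIC comparison across polynomial fits (a sketch, not what I originally ran):

# Compare polynomial degrees of log(someX) by AIC;
# lower is better, with a penalty for each extra term
fits <- lapply(1:4, function(d)
  lm(log(database1) ~ poly(log(someX), d, raw = TRUE), data = sample2))
sapply(fits, AIC)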

The reproducible code and all data are here: https://github.com/d-paulus/regression-examples

A binder of the code and data is here: https://mybinder.org/v2/gh/d-paulus/regression-examples/HEAD

    Re " I cannot get Skewness/Kurtosis ... assumptions to be met:" could you please tell us what those assumptions might be? I am not aware of any assumptions whatsoever about skewness and kurtosis in ordinary linear regression (despite what the `glvma` output might claim). This is mainly because regression is used for so many different purposes that it would be impossible to impose universal assumptions of this nature. For a principled approach to transforming variables in regression, see https://stats.stackexchange.com/a/3530/919 for instance. – whuber Nov 26 '20 at 16:46
  • I got concerned about skewness/kurtosis assumptions because the `gvlma` output mentioned them. I thought they might be an indicator of problems in my data. Can I ignore those assumptions then? Thanks for pointing to the thread. I have experimented before with log and Box-Cox transformations because all my variables are heavily positively skewed. After transformation, the residual plots look better, but the Q-Q plots not so much. Removing outliers/influential points or using per-capita data does not help. And for Regression 3, heteroscedasticity persists even after transformation. – dave Nov 26 '20 at 17:47
  • From the figure above, it seems to me that there is little linear correlation between log(database1) and log(someX). Could you check the correlation coefficient? Is it close to 0? If yes, then what you are trying to do might not be useful. – TrungDung Nov 26 '20 at 18:40
  • The correlation coefficient between log(database1) and log(someX) is 0.2148297, p-value: 3.956e-05. Would this be considered too weak? What would I have to check or try next? – dave Nov 26 '20 at 19:42
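
For reference, the correlation reported in the last comment corresponds to a check along these lines (a sketch; sample2 is an assumption, since the figure shows Regression 3):

# Pearson correlation between the logged variables, with a test of rho = 0
cor.test(log(sample2$database1), log(sample2$someX))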

0 Answers