
I am trying to model sales as a function of several variables (debt, number of employees, competitors, etc.). To do this, I have transformed both the dependent and the independent variables using the natural logarithm.

The problem is the residuals are not normal as indicated by both their plot and the Shapiro-Wilk test.

I imagine that the log transform can also affect the residuals: could this explain their lack of normality?

The other model statistics look good: adjusted R² = 0.92, the F-test is significant, the residual standard error is 0.5, and the mean of the residuals is 0.

Edit:

Size of dataset: N = 4403; 8 variables in the model: 3 continuous, 5 discrete

[Image: plot of the residuals, not reproduced here]

cremorna
  • How big is your data set? Can you show us a Q-Q plot? Rejecting normality is (1) often unimportant and (2) almost inevitable with a big data set. – Ben Bolker Mar 01 '22 at 21:38
  • Edited the original post. Thank you for your comment! – cremorna Mar 01 '22 at 21:52
  • (1) *Don't* try to interpret a Q-Q plot without first examining the "prior" plots for the fit of the mean and for heteroskedasticity (in R, residuals vs. fitted and scale-location at a minimum). The Q-Q plot is only interpretable if the fit and conditional variance assumptions are reasonable. (2) If all that's okay and there are no omitted but potentially important covariates/predictors, you might find that a log-link gamma GLM (with logged x) is a better fit for the conditional distribution. – Glen_b Mar 01 '22 at 23:28

1 Answer


Some thoughts:

  • your residuals are left-skewed (the lower/left-hand tail values are smaller/more negative than expected, the upper/right-hand tail values are also smaller/more negative than expected)
  • this probably means you are "overtransforming" your data, i.e. a log transform takes a right-skewed distribution of residuals and converts it to a left-skewed distribution (rather than to a symmetric distribution)
  • you might try a weaker transformation, e.g. square-root, or run a Box-Cox analysis to compare different transformations (?MASS::boxcox or ?car::bcPower in R; if your original data set includes negative values you may have to try one of the alternatives listed at ?car::bcPower)
  • transforming will also affect the fit of your continuous variables (either improving or worsening the fit, hard to say in advance)
  • the violation of the assumption of Normality of the residuals may not be a huge problem (see here); in particular it won't lead to bias, although it may lead to inefficiency/poorly calibrated models ...
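
The overtransformation idea above can be illustrated on simulated data. The following is only a minimal sketch in plain Python, not the `MASS::boxcox` implementation: the data are made up so that the square root of the response is linear in a single predictor, and all names (`ols_residuals`, `profile_loglik`, `best_lam`, and so on) are hypothetical. It compares the skewness of the residuals under a log and a square-root transform, then scans a grid of Box-Cox exponents using the usual profile log-likelihood.

```python
import math
import random

def ols_residuals(x, y):
    """Residuals from a simple least-squares fit y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def skewness(r):
    """Sample skewness: third central moment over variance^(3/2)."""
    n = len(r)
    m = sum(r) / n
    m2 = sum((v - m) ** 2 for v in r) / n
    m3 = sum((v - m) ** 3 for v in r) / n
    return m3 / m2 ** 1.5

def box_cox(y, lam):
    """Box-Cox transform; lam = 0 is the log transform."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def profile_loglik(x, y, lam):
    """Box-Cox profile log-likelihood (up to a constant) for exponent lam."""
    n = len(y)
    res = ols_residuals(x, box_cox(y, lam))
    rss = sum(r * r for r in res)
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * math.log(rss / n) + (lam - 1.0) * sum(math.log(v) for v in y)

# Simulated data (an assumption for illustration): sqrt(y) is linear in x
# with Normal errors, so the "right" exponent is 0.5 and log is too strong.
rng = random.Random(1)
n = 2000
x = [rng.uniform(0.0, 10.0) for _ in range(n)]
y = [(5.0 + xi + rng.gauss(0.0, 1.0)) ** 2 for xi in x]

skew_log = skewness(ols_residuals(x, [math.log(v) for v in y]))
skew_sqrt = skewness(ols_residuals(x, [math.sqrt(v) for v in y]))

# Scan exponents from -1 to 2 in steps of 0.05 and keep the best one.
lams = [k / 20 for k in range(-20, 41)]
best_lam = max(lams, key=lambda lam: profile_loglik(x, y, lam))

print("skewness of residuals, log(y):  %.3f" % skew_log)
print("skewness of residuals, sqrt(y): %.3f" % skew_sqrt)
print("Box-Cox exponent with highest profile log-likelihood: %.2f" % best_lam)
```

In this made-up setting the log-scale residuals should come out left-skewed while the square-root-scale residuals are roughly symmetric, and the grid search should favour an exponent near 0.5. With real data and several predictors, the R functions mentioned above do the same kind of comparison against the full model.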
Ben Bolker