
I have been trying to fit a linear regression, but every approach I follow seems to lead to the same issue: the homogeneity of variance of the residuals is violated. I have been trying to understand why that is and how to proceed. The dependent variable (maximum speed difference between an observed car and the surrounding traffic) is distributed as follows:

histogram of the dependent variable

The goal is to identify the impact of other variables on this speed difference. My explanatory variables include weather conditions (categorical), traffic conditions (categorical), type of vehicle (categorical), acceleration (continuous) and absolute speed (continuous). I have seen in many topics here that the outcome variable does not necessarily have to follow a normal distribution; the normality assumption refers to the residuals, so I have been trying to apply a linear regression. The normality assumption seems ok based on the Q-Q plot; however, I think the homogeneity of variance of the residuals is violated:

Residuals vs Fitted Values

So, based on this, I have been trying to adjust my model. I checked a subsample without the negative values (it would make sense for my research to try that too), and I applied a log transformation and a squared transformation on this subsample. I even tried some different explanatory variables, but the plots are still similar. In all cases, the normality assumption of the residuals is ok and there is no multicollinearity problem based on the VIF test. I also tested for highly influential observations with Cook's distance and there is no issue there either.
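A visual check of the residuals-vs-fitted plot can be complemented by a numeric one. The following is a minimal sketch on synthetic data (not the data from this question), assuming a single hypothetical predictor `speed` whose noise spread grows with its value. It computes the studentized (Koenker) form of the Breusch-Pagan statistic by regressing the squared residuals on the regressors and taking $n R^2$; the variable names and the hardcoded 5% critical value are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in data (NOT the data from this question): the noise
# standard deviation grows with the predictor, giving a "megaphone" plot.
speed = rng.uniform(10.0, 40.0, n)
y = 2.0 + 0.5 * speed + rng.normal(0.0, 0.1 * speed)


def ols(X, target):
    """Least-squares fit; returns fitted values and residuals."""
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    fitted = X @ beta
    return fitted, target - fitted


# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), speed])
fitted, resid = ols(X, y)

# Auxiliary regression: squared residuals on the same regressors.
sq = resid ** 2
_, aux_resid = ols(X, sq)
r2 = 1.0 - (aux_resid @ aux_resid) / np.sum((sq - sq.mean()) ** 2)

# Koenker's studentized Breusch-Pagan statistic: n * R^2, compared with a
# chi-square distribution with 1 degree of freedom (one non-constant regressor).
lm_stat = n * r2
CHI2_CRIT_95_DF1 = 3.841  # 95th percentile of chi-square with 1 df
reject_homoscedasticity = lm_stat > CHI2_CRIT_95_DF1
print(f"LM statistic: {lm_stat:.1f}, reject at 5%: {reject_homoscedasticity}")
```

With several predictors, the degrees of freedom would equal the number of non-constant columns in `X`, and the critical value would change accordingly.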

My questions are:

  1. Why is this happening? Could it be possible that I should not use simple linear regression in the first place?
  2. What other approaches can I follow to overcome this issue?
Anna
  • Apparently you have sampled somewhere around 2000 observations, which probably was a considerable effort. You should put some more effort into explaining what these data are and, above all, what the purpose of the regression is. With such large numbers your model might be good enough for what it is used for; and if not, who could suggest other approaches when we do not know what the goal is? – Bernhard May 05 '21 at 11:04
  • @Bernhard Thank you for comment. I now updated my question with additional info to make the goal clearer. – Anna May 05 '21 at 13:35
  • I'm not sure I'm following you when you mention simple linear regression: this would be a convenient model with one continuous IV, whereas here you have several factors + several covariates. It would be interesting to make several plots to see what's really going on here depending on the levels of the factors. – Arnaud Mortier May 05 '21 at 13:53
  • "Maximum speed difference" presents two challenges. First, differences often are difficult to analyze and often do not improve under any kind of transformation. You can make progress by analyzing the two speeds as a bivariate response. Second, maxima tend to have skewed distributions with high variability. An alternative choice of statistic to characterize speed differences is likely to help. BTW, you get apparent normality of residuals only because you are mixing together a wide range of distributions: you need to cure the heteroscedasticity before taking such a univariate look at residuals. – whuber May 05 '21 at 14:28
  • Thanks everyone for your comments. It's true: as I noticed from several trials, the difference presents difficulties in the analysis that do not appear when I focus on the speeds alone. – Anna May 07 '21 at 11:11

1 Answer


It seems that your concern with your final model is that the homogeneity of variance of the error term appears to be violated. In this case, with something of a "megaphone" shape in the data, one possible "fix" is to run your regression on the log of the dependent variable. Because you have negative values, you will need to apply a shift before you run the regression, for example $y^\prime = \ln(y+10)$; then run your regression with $y^\prime$ and examine the residuals. This is one approach to address your second question. Your first question is a bit too context-specific for me to provide any advice (i.e., the model depends on the context and theory, not just on the statistical analysis of possible issues with the model).
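As a rough illustration of the mechanics only, here is a sketch on synthetic data (not your data), assuming one hypothetical predictor `x` and the shift of 10 used above. The shift is an arbitrary choice and has to exceed the magnitude of the most negative response. The helper `spread_ratio` is likewise just an illustrative way to compare residual spread between the upper and lower halves of the fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Synthetic stand-in response with multiplicative noise and some negative
# values (NOT the data from the question).
x = rng.uniform(0.0, 1.0, n)
y = np.exp(1.0 + 1.5 * x + rng.normal(0.0, 0.4, n)) - 5.0

shift = 10.0  # arbitrary start value; must make y + shift strictly positive
if not (y + shift > 0).all():
    raise ValueError("shift too small: y + shift must be positive")
y_log = np.log(y + shift)  # y' = ln(y + 10)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x])


def spread_ratio(X, target):
    """Residual std for fitted values above the median, divided by the
    residual std for fitted values at or below the median. Values near 1
    suggest roughly constant spread."""
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    fitted = X @ beta
    resid = target - fitted
    upper = fitted > np.median(fitted)
    return resid[upper].std() / resid[~upper].std()


ratio_raw = spread_ratio(X, y)
ratio_log = spread_ratio(X, y_log)
print(f"spread ratio, raw response:     {ratio_raw:.2f}")
print(f"spread ratio, shifted-log response: {ratio_log:.2f}")
```

On this synthetic example the shifted log narrows the spread ratio without bringing it all the way to 1, so the residual plot still needs to be examined after the transformation.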

Gregg H
  • Adding 10 to cope with the negative responses is *ad hoc.* It rarely works to cure the problems with data that are related to differences. See my comment to the question about better approaches and see https://stats.stackexchange.com/questions/30728 for a discussion of adding a "start" value to logarithms. – whuber May 05 '21 at 18:38