
I have a series of data sets that I've fit with non-linear models (the same model with different parameters for each data set). I'm trying to model the residuals e so that we can simulate the results y. When I plot the residuals (predicted y_hat - observed y) against the model predictions y_hat, it looks like there is constant variance.

I wanted a less subjective approach than just looking at the plots of the residuals, but the tests I've found compare variance across groups, and there are no separate groups in this data set. I suppose I could partition the data across different levels of y_hat, but I tried a different approach instead, and I'm interested in feedback.

Does it make sense to fit a linear model `absolute value(e) = slope*y_hat + intercept` and just check the p-value for the slope parameter? What if my residuals are non-normal?
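Concretely, here's a small sketch of the check I have in mind (the data and names below are made up for illustration; as far as I can tell this amounts to a Glejser-style test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical fitted values, standing in for the model predictions y_hat.
y_hat = np.linspace(1.0, 10.0, 200)

# Case 1: homoscedastic residuals -- constant error SD.
e_const = rng.normal(scale=1.0, size=y_hat.size)
check_const = stats.linregress(y_hat, np.abs(e_const))

# Case 2: heteroscedastic residuals -- error SD grows with y_hat.
e_grow = rng.normal(scale=0.3 * y_hat)
check_grow = stats.linregress(y_hat, np.abs(e_grow))

# A small slope p-value suggests |e| trends with y_hat, i.e. non-constant variance.
print(f"constant variance: slope p-value = {check_const.pvalue:.3f}")
print(f"growing variance:  slope p-value = {check_grow.pvalue:.3g}")
```

In the second case the slope should come out clearly positive and significant, which is the behavior I'd want the check to flag.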

  • How is your title about absolute residuals related to your question, which nowhere mentions absolute values? – whuber Feb 08 '22 at 17:50
  • My bad! The last paragraph should say that I'm fitting `absolute value(e) = slope*y_hat + intercept` The idea is that absolute value will give the quantity of error while ignoring direction. – NomNomNomenclature Feb 08 '22 at 20:58
  • I've updated the last paragraph. – NomNomNomenclature Feb 08 '22 at 21:00
  • Normally people regress log(abs(studentized residuals)) as a function of log(fitted values); see https://www.rdocumentation.org/packages/car/versions/3.0-10/topics/spreadLevelPlot and https://stats.stackexchange.com/questions/74537/log-or-square-root-transformation-for-arima/74594#74594. If you then do a power transformation with power equal to 1 minus the slope of that regression, it will more or less equalize your variance. – Tom Wenseleers Feb 08 '22 at 21:09
  • @TomWenseleers thanks for the suggestion. I'm looking to do a check on homogeneity of variance first though. I think a transformation is probably not necessary. I'm basically just wondering if fitting a line to the absolute value of the residuals and checking the p-value on the slope parameter is a reasonable way to check for homogeneity of variance. – NomNomNomenclature Feb 08 '22 at 21:22
  • Yeah one can use that regression to check homogeneity of variance, but as I said it's typically done using log transformed absolute residuals & the log of the predicted values, since the absolute residuals are strictly positive, and so conform better with a log normal distribution... If the absolute residuals are correlated with the predicted values one can resort to using generalised least squares (gls), where one can use weights=varPower() to make variance a power function of the predicted values & the optimal power coefficient is automatically estimated. – Tom Wenseleers Feb 08 '22 at 22:33
  • Thanks @TomWenseleers ! I think I get it now. – NomNomNomenclature Feb 09 '22 at 13:15
  • Using the logs in this setting has nothing to do with lognormal distributions (or even any related assumption). Furthermore, computing a p-value is inappropriate. This is *exploration,* not formal testing; and it is up to the analyst to decide whether there is enough heteroscedasticity evident in the plot to do something about it. That's a question of assessing *magnitude* (*i.e.*, effect size) rather than detectability (*i.e.*, p-values). – whuber Feb 09 '22 at 14:42
  • @whuber What's the technical reason then that this absolute residual vs predicted value regression is typically done on a log-log scale? Is it just that in that case the slope suggests the appropriate power transformation to fix heteroscedasticity? I can see your point about the effect size of the deviation being more important than statistical significance - but then again p values are typically reported for Levene's tests and ncv tests for heteroscedasticity as well (or for normality, Shapiro-Wilk's W test p values) (if nonsignificant they at least suggest there is no problem)... – Tom Wenseleers Feb 09 '22 at 15:21
  • @Tom Good comment about Levene's Test -- a discussion of its ramifications would have to carry us off into a discussion of how Levene's Test (and other such diagnostics) are appropriately used. The answer to your first question is "yes." For instance, the slope on a spread-vs-level plot in log-log coordinates estimates $1-\lambda$ where $\lambda$ is the Box-Cox parameter. The search for a *nonlinear reexpression* of the response variable to achieve homoscedasticity is the underlying motivation rather than trying to fulfill any parametric distributional assumption. – whuber Feb 09 '22 at 17:00
  • @whuber thanks for your comments. You mention that estimating a p-value (presumably of either the slope of `abs(e)` or `log(abs(e))`) is inappropriate and that it's about magnitude and not detectability. I would think a p-value is useful since it can indicate when a trend of meaningful magnitude is unlikely to be a random manifestation in the data. Perhaps it would be best to look at a p-value *and* evaluate whether the effect size is large enough to matter? If both conditions are satisfied (low p-value, high effect) then proceed to address heteroscedasticity. – NomNomNomenclature Feb 09 '22 at 17:20
  • Yes, you can do that. But you would likely use a different threshold of significance tailored to the different kinds of errors any erroneous decision would create for the overall analysis. In short, use a p-value as a very rough guide to help prevent you from reading too much into variations that might be random rather than structural, but don't be a slave to the 5% (or whatever) level. – whuber Feb 09 '22 at 17:45
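To make the comment suggestions concrete, the spread-level regression can be sketched as follows (simplified to raw rather than studentized residuals, with made-up data whose error SD is proportional to the fitted value, so the true log-log slope is about 1 and the suggested power about 0, i.e. a log transform):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: error SD grows in proportion to the fitted value.
y_hat = np.linspace(1.0, 10.0, 300)
e = rng.normal(scale=0.2 * y_hat)

# Spread-level regression: log|e| on log(y_hat).
fit = stats.linregress(np.log(y_hat), np.log(np.abs(e)))

# Variance-stabilizing power transform suggested by the slope: lambda ~ 1 - slope
# (slope ~ 1 here, so lambda ~ 0, corresponding to a log transform).
lam = 1.0 - fit.slope
print(f"log-log slope = {fit.slope:.2f}, suggested power = {lam:.2f}")
```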

0 Answers