
I recently ran a regression of the following form:

mod <- lm(log(y) ~ log(x))

to examine how y scales as a function of x. I then examined the top 20 residuals (super-linear scaling relative to the predicted trendline) and the bottom 20 residuals (sub-linear scaling relative to the predicted trendline), and noticed that these observations share a startling regularity in a third variable, z: all the positive residuals have very large values of z, and all the negative residuals have very small values of z.
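
Concretely, the inspection looked something like this (a sketch only; `dat` is a stand-in for my actual data frame with columns `y`, `x`, and `z`, and I'm assuming `mod` was fit on `dat` with no rows dropped):

# rank observations by the residuals of mod and compare z at the extremes
ord      <- order(resid(mod))
bottom20 <- dat[head(ord, 20), ]  # most negative residuals (sub-linear)
top20    <- dat[tail(ord, 20), ]  # most positive residuals (super-linear)

summary(bottom20$z)
summary(top20$z)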

I want to be able to demonstrate that this is a meaningful, statistically significant pattern. My intuition was to run

lm(z ~ resid(mod))

but this strikes me as wrong. Is there a way to capture this pattern using residuals, or is this the wrong way of thinking about it altogether?

Re: Heteroskedastic Standard Errors

Here is a QQ plot:

[QQ plot of the residuals]

Residuals vs. Fitted Values:

[residuals vs. fitted values plot]

The second graph in particular is somewhat alarming. Should I be worried about heteroskedasticity in a huge dataset (300,000 observations) if I'm using robust standard errors?
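
For reference, robust standard errors and a formal check can be obtained along these lines (a sketch using the `sandwich` and `lmtest` packages; the HC1 type is just one common choice, not necessarily the one I used):

library(sandwich)  # heteroskedasticity-consistent covariance estimators
library(lmtest)    # coeftest() and bptest()

# coefficient table for mod with robust (HC1) standard errors
coeftest(mod, vcov = vcovHC(mod, type = "HC1"))

# Breusch-Pagan test as a formal check for heteroskedasticity
bptest(mod)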

  • What you're noticing is that your errors are heteroskedastic. Focus your attention in that direction. – ColorStatistics Jan 25 '19 at 14:51
  • Do you mean to model `resid(mod) ~ z`? That would seem more like what you are suggesting. There is no reason not to do what you suggest, but it might be more economical to do multiple regression instead. – mdewey Jan 25 '19 at 15:00
  • @ColorStatistics, I ran a few diagnostic tests to that end. I'm not sure I should be too worried about it (as I use robust standard errors and have a large dataset). Thoughts? – Parseltongue Jan 25 '19 at 15:06
  • @mdewey what would the specification be? I was thinking: `lm(z ~ x + y + x:y)` where all those variables are logged and the last term is an interaction term. – Parseltongue Jan 25 '19 at 15:08
  • Why do you suddenly switch from modelling y (logged) to modelling z? You have changed your scientific question. – mdewey Jan 25 '19 at 15:40
  • Because this mystery variable `z` is very interesting, and I want to see whether the observations with large positive deviations from the regression line (large positive residuals) have higher values of z than those with negative residuals. – Parseltongue Jan 25 '19 at 15:48
  • Residuals are supposed to be random. How can they be a variable? – Aksakal Jan 25 '19 at 16:37
  • If you post a link to the raw data, I can try fitting it to different surface equations (z = f(x, y)) and see what I can find. – James Phillips Jan 25 '19 at 17:27
  • @Parseltongue: You state that you are not so concerned about heteroskedasticity because you used robust standard errors. Yes, you can use robust standard errors, and that will adjust the standard errors upward and the t-statistics downward for the model you ran. However, the presence of such pronounced heteroskedasticity, I believe, is telling you that your model is incomplete/mis-specified, and that is not something robust standard errors can do anything about. – ColorStatistics Jan 25 '19 at 18:40
  • @JamesPhillips, unfortunately the dataset is proprietary. Is there any way you can explain the order of operations so I can do it myself? Thanks! – Parseltongue Jan 25 '19 at 19:04
  • I was going to run the data through the 3D surface "function finder" on my zunzun.com open source Python curve and surface fitting web site to see what equations it suggests. The site uses the Differential Evolution genetic algorithm to determine initial parameter estimates when needed, allowing it to search through both linear and non-linear equations. It has hundreds of known, named equations to search through. – James Phillips Jan 25 '19 at 19:47
  • Check https://stats.stackexchange.com/q/127001/35989 – Tim Jul 12 '19 at 19:08

1 Answer


Well, you want to understand how the residuals could depend on z. So the model you should look at would be

lm( resid(mod) ~ z )

but maybe first (show us) the corresponding plot. It may well be that the spread of the residuals depends on z, not their mean (heteroskedasticity). In that case, try a model like

lm(I(resid(mod)^2) ~ z)

(or replace the square with the absolute value). If this turns out to be the case, maybe try a more complex model, such as a gamlss, which permits simultaneous estimation of the mean and variance functions. You could look around this site for some examples.
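
A minimal sketch of such a gamlss fit (assuming the gamlss package and a data frame `dat` holding y, x, and z; with the normal family the scale parameter sigma is modelled on a log link):

library(gamlss)

# mean of log(y) modelled by log(x); the scale (sigma) modelled by z
# `dat` is a placeholder for the actual data frame
fit <- gamlss(log(y) ~ log(x),
              sigma.formula = ~ z,
              family = NO,
              data = dat)
summary(fit)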

  • Thanks. Is doing what you suggested above fundamentally different from just running a regression like so: `lm(y ~ x + z)`? – Parseltongue Jul 12 '19 at 17:15
  • Depends on your goals. If you know at the outset that you will use `z` as a predictor, do so, but if you want to see if the simpler model is acceptable, the residuals analysis is fine. – kjetil b halvorsen Jul 13 '19 at 20:36