9

I refer to this post, which seems to question the importance of normally distributed residuals and argues that both non-normality and heteroskedasticity could potentially be dealt with by using robust standard errors.

I have tried various transformations - roots, logs, etc. - and none of them completely resolves the issue.

Here is a Q-Q plot of my residuals:

[Figure: normal Q-Q plot of the residuals]

Data

  • Dependent variable: already log-transformed (this fixes outlier issues and a skewness problem in these data)
  • Independent variables: firm age and a number of binary (indicator) variables. (Later I also use some count variables as independent variables in a separate regression.)

The iqr command (Hamilton) in Stata does not flag any severe outliers that would rule out normality, but the Q-Q plot above suggests otherwise, and so does the Shapiro-Wilk test.

Cesare Camestre
  • I would not be worried by such a graph, the deviations seem mild enough. If you want you can add confidence bounds to that graph using the `qenv` package. – Maarten Buis Jul 04 '13 at 08:41
  • I am more worried by the Shapiro-Wilk test. I need to prove the deviations are mild enough, rather than use the word "seem" :-) – Cesare Camestre Jul 04 '13 at 08:42
  • A model is a simplification of reality, so you would not expect your model to be _exactly_ true, nor would it be desirable. A true model would no longer be a simplification, and thus no longer do what it is supposed to be doing. So, counter-intuitive as it may seem, "seem" is much more important when deciding between models than "proof". – Maarten Buis Jul 04 '13 at 08:51
  • I agree with @MaartenBuis that you shouldn't worry too much based on the plot. I would *not* recommend relying on a formal test of normality (e.g. Shapiro-test) of the residuals. In large samples, the test will [almost always reject the hypothesis](http://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless). [Here](http://stats.stackexchange.com/a/36220/21054) is an informative answer from Glen which addresses exactly the question of formal testing of normality of residuals. – COOLSerdash Jul 04 '13 at 08:54
  • Which brings me to the question about what is "large"; my sample is around 500 @COOLSerdash. Is there some literature I can refer to in my writing on this? – Cesare Camestre Jul 04 '13 at 09:09
  • See also [this](http://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless) and [this](http://stats.stackexchange.com/a/1723/805). Note also that as your sample size gets larger, your normal assumptions become less critical. Unless you have a *lot* of predictors, such mild non-normality should be of no consequence at all. The problem isn't just that hypothesis tests will reject when samples are large - they answer the wrong question at other sample sizes as well. – Glen_b Jul 04 '13 at 09:17
  • The $p$-value says that the deviations from normality are larger than one would expect to happen by chance, it does not say that those deviations are large enough to endanger your model. Based on your graph, my judgement call would be that you are fine. – Maarten Buis Jul 04 '13 at 09:18
  • Using the code from [this post](http://stats.stackexchange.com/a/2498/21054) and using a sample size of 500: Even small deviations from normality result in a proportion of about 10.3% *rejected* hypothesis of normality (Shapiro-$p$-value <0.05). This shows that the test is very sensitive to small deviations from normality with an $n=500$. This just reinforces my point that you should not rely on formal testing of normality in residuals. – COOLSerdash Jul 04 '13 at 09:18
  • What matters is *the effect on your inference*. The only form of inference on which such a tiny effect would have any impact at all would be a prediction interval... and even there, I'd likely use it with little compunction, unless I needed a prediction interval far into the tail (say 99% or more). Of more concern would be issues like dependence and bias and mis-specification of the model for the mean or variance. – Glen_b Jul 04 '13 at 09:20
  • With $n=500$ and such a close-to-normal looking distribution, the CLT kicks in very quickly, and the uncertainty in the standard error of the mean starts to become very small. Tests of coefficients and confidence intervals should work perfectly well as is. – Glen_b Jul 04 '13 at 09:26
  • @Glen_b Thanks - apart from Stack Exchange, is there any evidence I can cite for this? – Cesare Camestre Jul 04 '13 at 09:28
  • What is it you want to justify in particular? (The evidence that normality tests are more likely to reject as $n$ gets large is easy to see via general reasoning or via simulation. Similarly with the evidence that inference in regression is insensitive to mild non-normality with $n=500$. As $n$ gets larger still the degree of non-normality that you can tolerate increases; eventually you need only that the mean and variance exist if your other assumptions aren't too badly out - you can rely on CLT and Slutsky's theorem.) – Glen_b Jul 04 '13 at 09:32
  • Both statements would be great. By the way, Glen/COOLSerdash/Maarten, you could have posted these as answers rather than comments, since you are really answering my questions! – Cesare Camestre Jul 04 '13 at 09:34
  • They've been answered before, many times. I could have pointed to another 5 or 6 similar answers, that all pretty much say "look at a display, rather than formally test". – Glen_b Jul 04 '13 at 09:36
  • I'd suggest simulation to show that, for a situation like your particular case, the results are unaffected by worse non-normality than you have. (A sketch along these lines is given after these comments.) – Glen_b Jul 04 '13 at 09:37
  • And do you think using robust standard errors would help as in the post I linked to? Heteroskedasticity is an issue too, but I'll be using robust standard errors. – Cesare Camestre Jul 04 '13 at 09:38
  • You could also use bootstrapped standard errors/confidence intervals. I assume that they won't differ much from the conventional ones. (A small bootstrap sketch also follows these comments.) – COOLSerdash Jul 04 '13 at 09:45
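
Following up on the simulation suggestions in the comments above, here is a minimal R sketch (a made-up data-generating process, not the asker's model) of both points at once: with n = 500 the Shapiro-Wilk test flags even mildly heavy-tailed errors quite often, while the usual 95% confidence interval for a regression slope still has close to nominal coverage.

# Minimal sketch: Shapiro-Wilk rejection rate vs. CI coverage at n = 500
# (hypothetical data-generating process, not the asker's data)
set.seed(42)
n      <- 500
nsim   <- 2000
reject <- logical(nsim)
cover  <- logical(nsim)
for (i in seq_len(nsim)) {
  x   <- rnorm(n)
  e   <- rt(n, df = 5) / sqrt(5 / 3)   # mildly heavy-tailed errors, scaled to unit variance
  y   <- 1 + 0.5 * x + e
  fit <- lm(y ~ x)
  reject[i] <- shapiro.test(residuals(fit))$p.value < 0.05
  ci        <- confint(fit)["x", ]
  cover[i]  <- ci[1] < 0.5 && 0.5 < ci[2]
}
mean(reject)  # Shapiro-Wilk rejection rate (typically well above the nominal 5%)
mean(cover)   # coverage of the 95% CI for the slope (typically close to 0.95)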
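
And on the bootstrap suggestion in the last comment, a minimal R sketch of case-resampling bootstrap intervals for a regression coefficient (again with made-up data; the model and variable names are placeholders, not the asker's):

# Case (pairs) bootstrap for a regression slope, using the boot package
library(boot)
set.seed(7)
dat <- data.frame(x = rnorm(500))
dat$y <- 1 + 0.5 * dat$x + rt(500, df = 5)
coef_fun <- function(d, idx) coef(lm(y ~ x, data = d[idx, ]))
b <- boot(dat, coef_fun, R = 2000)      # resample whole observations
boot.ci(b, type = "perc", index = 2)    # percentile CI for the slope (index 2 = x)

In Stata, the analogous route would be bootstrapped standard errors via the vce(bootstrap) option to regress.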

2 Answers

9

One way you can add a "test-like flavour" to your graph is to add confidence bounds around it. In Stata I would do this like so:

// load an example dataset and fit a regression
sysuse nlsw88, clear
gen lnw = ln(wage)

reg lnw i.race grade c.ttl_exp##c.ttl_exp union

// residuals for the estimation sample
predict resid if e(sample), resid

// simulated envelope for normal quantiles with mean 0 and sd equal to the RMSE
qenvnormal resid, mean(0) sd(`e(rmse)') overall reps(20000) gen(lb ub)

// quantile plot of the residuals with the simulated bounds,
// with the x-axis rescaled to normal quantiles
qplot resid lb ub, ms(oh none ..) c(. l l)     ///
    lc(gs10 ..) legend(off) ytitle("residual") ///
    trscale(`e(rmse)' * invnormal(@))          ///
    xtitle(Normal quantiles)

[Figure: residual Q-Q plot with simulated confidence bounds]
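
For readers working in R rather than Stata, a rough analogue of such a simulated envelope can be built by hand. This is only a sketch: it uses a placeholder model (replace it with your own regression) and gives pointwise rather than overall (simultaneous) bounds, unlike the `overall` option above.

# Placeholder model so the sketch runs on its own; replace with your regression
set.seed(1)
d   <- data.frame(x = rnorm(500))
d$y <- 1 + 0.5 * d$x + rnorm(500)
fit <- lm(y ~ x, data = d)

# simulated pointwise 95% envelope for the ordered standardized residuals
r    <- sort(scale(residuals(fit)))
n    <- length(r)
sims <- replicate(2000, sort(rnorm(n)))      # sorted standard-normal samples
lb   <- apply(sims, 1, quantile, probs = 0.025)
ub   <- apply(sims, 1, quantile, probs = 0.975)
q    <- qnorm(ppoints(n))                    # theoretical normal quantiles
plot(q, r, xlab = "Normal quantiles", ylab = "Standardized residual")
lines(q, lb, col = "grey"); lines(q, ub, col = "grey")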

Maarten Buis
  • Note that Stata users need to install `qenv` (by `ssc install qenv`) first. – Nick Cox Jul 04 '13 at 09:34
  • I'll look at this today and see if I'm able to get the confidence bounds – Cesare Camestre Jul 04 '13 at 09:35
  • Getting an error: qenvnormal resid, mean(0) se(`e(rmse)') overall reps(20000) gen(lb ub) - option se() not allowed – Cesare Camestre Jul 04 '13 at 10:05
  • Sounds like an old version of `qenv`. Try typing in Stata: `ssc install qenv, replace` – Maarten Buis Jul 04 '13 at 10:13
  • Getting all sorts of errors.. I think that se should be sd, but then after taking ages to run now I'm getting . . qplot r lb ub, ms(oh none ..) c(. l l) /// > lc(gs10 ..) legend(off) ytitle("residual") /// > trscale(`e(rmse)' * invnormal(@)) /// > xtitle(Normal quantiles) syntax is qplot plottype varlist ... e.g. qplot scatter mpg ... – Cesare Camestre Jul 04 '13 at 10:21
  • correct, it should have been `sd()`. It is normal (no pun intended) that `qenv` with the `overall` option takes very long. – Maarten Buis Jul 04 '13 at 10:25
  • I also added the word scatter after qplot, and this produced the graph. Not sure if that was necessary. Incidentally, I had to install `qplot` as well. – Cesare Camestre Jul 04 '13 at 10:26
  • it seems like you ran the `qplot` command from the command line without stripping the `///`. Either strip the `///` or run it from a do-file. – Maarten Buis Jul 04 '13 at 10:27
  • I ran it from my dofile.. – Cesare Camestre Jul 04 '13 at 10:27
  • The help for `qenvnormal` does explain that you need to install `qplot`. You are expected to read the help. More importantly, I guess you are using a very old version of `qplot`. Install from package gr42_6 from http://www.stata-journal.com/software/sj12-1 – Nick Cox Jul 04 '13 at 10:51
  • Correct, works fine now. – Cesare Camestre Jul 04 '13 at 10:56
5

One thing to keep in mind when examining these Q-Q plots is that the tails will tend to deviate from the line even if the underlying distribution is truly normal, no matter how big the N is. This is implied in Maarten's answer. The reason is that as N gets larger, the tail points correspond to rarer and rarer events farther and farther out; there is therefore always very little data in the tails, and the tail points are always much more variable. If the bulk of your line is where expected and only the tails deviate, then you can generally ignore them.

One way I help students learn to assess their Q-Q plots for normality is to have them generate random samples from a distribution known to be normal and examine the resulting plots. There are exercises where they generate samples of various sizes to see what happens as N changes, and also ones where they take a real sample distribution and compare it to random samples of the same size. The TeachingDemos package in R has a test for normality that uses a similar kind of technique.

# R example - change the 1000 to whatever N you would like to examine
# run several times
y <- rnorm(1000); qqnorm(y); qqline(y)
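
A sketch of that comparison exercise, assuming your residuals are in a vector `res` (the `rnorm(500)` line below is only a placeholder so the example runs): plot the Q-Q plot of your residuals next to several Q-Q plots of random normal samples of the same size and see whether yours stands out.

# Compare your residuals against 8 normal samples of the same size
res <- rnorm(500)                   # placeholder: replace with your residuals
op  <- par(mfrow = c(3, 3))
qqnorm(res, main = "Your residuals"); qqline(res)
for (i in 1:8) {
  y <- rnorm(length(res))
  qqnorm(y, main = paste("Simulated normal", i)); qqline(y)
}
par(op)
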
John
  • Agreed, but this was one of Maarten's key points in his answer and it's why intervals are used to signal uncertainty. – Nick Cox Jul 04 '13 at 10:53
  • Are you suggesting this answer is redundant? I think that part of this is implicit in Maarten's answer but I don't think it's a key point or complete. Maarten's answer is good. This answer is different but related. – John Jul 04 '13 at 11:10
  • It is not redundant, but a cross-reference to Maarten's answer would be likely to help future readers. – Nick Cox Jul 04 '13 at 11:14
  • To be explicit about the link between this and my answer: if you were to look under the hood of `qenv` you would see that this simulation technique is at the core of how the confidence bands are computed. – Maarten Buis Jul 04 '13 at 11:23
  • added a link... – John Jul 04 '13 at 13:35