I searched online and watched video tutorials, but I'm still not sure. Would you consider the data below normally distributed? I know that in theory the ideal fit has most of the points on the line, but real-world data can differ, so I would like to hear your opinion from a practical point of view. Would it be safe to perform a regression analysis on this dataset?

[Figures: histogram, normal P-P plot, normal Q-Q plot, and residual scatterplot for the regression residuals; images not preserved]

--------------------UPDATED INFORMATION------------------------

Skewness: 0.291. Excess kurtosis: 2.489.

Both the Shapiro-Wilk and Kolmogorov-Smirnov tests are significant at the .000 level as reported (p < .001), so normality is rejected.

Ben
JohnKimble
  • Regression does not assume that your $X$ or $Y$ variables are normally distributed. – Alexis Aug 20 '18 at 21:44
  • Sorry, I should have been clearer: this is the output for the residuals (Y-axis = Zresiduals and X-axis = Zpredictors). I followed this tutorial to check the assumptions on the model: https://youtu.be/liiDHEeEH_I – JohnKimble Aug 20 '18 at 21:51
  • I have added the Q-Q plot in the OP – JohnKimble Aug 21 '18 at 08:55
  • 1. The QQ plot does clearly suggest heavy tails. 2. You have some indication of heteroskedasticity but it's moderate; it looks like it partly accounts for the kurtosis but I believe there would still be excess kurtosis after you adjusted for it. – Glen_b Aug 21 '18 at 09:05
  • Adding a Q-Q plot is helpful (and fixes the title). It's a moot point now whether the P-P plot serves much purpose although an experienced eye would see the systematic curvature indicative of fatter tails. I presume that the kurtosis you cite is so-called excess kurtosis (a scale on which the normal has zero excess kurtosis). It's not kurtosis as originally defined by Pearson. Are these SPSS results (not important to your question, but of interest to me as SPSS conventions are often idiosyncratic)? – Nick Cox Aug 21 '18 at 09:06
  • @John is that kurtosis figure you gave actual kurtosis (average 4th standardized moment) or is it excess kurtosis? – Glen_b Aug 21 '18 at 09:07
  • @Glen_b I believe it's the actual kurtosis. I have added the output for convenience: https://i.imgur.com/zxR0OE0.png – JohnKimble Aug 21 '18 at 09:12
  • This is SPSS you're using? That would generally use excess, I believe. – Glen_b Aug 21 '18 at 09:17
  • Yes, it's from SPSS. This output is generated via the Explore function. From what I have read on the internet, SPSS reports the actual kurtosis. – JohnKimble Aug 21 '18 at 09:22
  • Doesn't SPSS document its own procedures? I really wouldn't trust anything else "on the internet". If nothing else you can fire up a sample of random normal deviates. If the reported kurtosis is about 3, that's kurtosis strict sense. If it is about 0 that is excess kurtosis. – Nick Cox Aug 21 '18 at 09:24
  • I can't find anything in the official SPSS documentation. However, I used these sources: https://stats.stackexchange.com/questions/61740/differences-in-kurtosis-definition-and-their-interpretation and https://www.researchgate.net/post/What_do_I_do_if_my_data_distribution_is_not_Normal. I can confirm, based on my own test, that SPSS reports exactly the same kurtosis value as Excel. – JohnKimble Aug 21 '18 at 09:38
  • @Glen_b and Nick: I correct my answer. I believe it's the excess kurtosis that SPSS reports, since it equals the result of Excel's KURT function. If I enter the values 2, 3, 4, 5 and 6 in SPSS and run the descriptive analysis, it shows a skewness of 0 and a kurtosis of -1.2. – JohnKimble Aug 21 '18 at 09:50
  • Since the Q-Q plot indicates that there are some heavy tails, would it make sense to delete the observed outliers from the plot and then run a regression (as part of a robustness test)? – JohnKimble Aug 21 '18 at 10:09
  • Excel really isn't a standard for statistics calculations but you've confirmed informed guesses from @Glen_b and me that you're showing results for excess kurtosis. A uniform distribution has kurtosis 1.8 and excess kurtosis $-1.2$. Kurtosis must be $\ge 1$. – Nick Cox Aug 21 '18 at 10:50
  • I wouldn't delete these outliers without a substantive reason for them being produced by incorrect data or a data-independent reason for them being irrelevant to your purpose. I see no obvious reason for thinking your regression to be wrong, beyond P-values and confidence intervals being a little off. A more appropriate model might be based on a t-distribution for errors. You might need to use software other than SPSS for that. – Nick Cox Aug 21 '18 at 10:51
  • If you want advice on your regression you'll need most of all to tell us more about your predictors and what checks on linear structure you've carried out. – Nick Cox Aug 21 '18 at 11:01
  • Thanks for the reply. I think I will keep that out of scope from this topic. – JohnKimble Aug 21 '18 at 11:23
  • @JohnKimble: Since the comments confirm that the reported kurtosis statistic is the *excess* kurtosis, I have taken the liberty of editing the question accordingly. – Ben Aug 21 '18 at 13:43
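
The check discussed in the comments above (feed the software a known sample and see which kurtosis convention it reports) can be sketched in plain Python. This is only an illustration: the function names are hypothetical, and the bias-corrected formula is the one documented for Excel's KURT. That SPSS uses the same formula is inferred from the matching value of -1.2 reported in the comments, not from SPSS documentation.

```python
def pearson_kurtosis(xs):
    """Kurtosis in the strict sense: fourth central moment divided by the
    squared (population) variance. A normal distribution has value 3."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2


def sample_excess_kurtosis(xs):
    """Bias-corrected sample excess kurtosis (the formula documented for
    Excel's KURT). A normal distribution has value 0. Requires n >= 4."""
    n = len(xs)
    if n < 4:
        raise ValueError("need at least four observations")
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)   # sample variance
    z4 = sum((x - mean) ** 4 for x in xs) / s2 ** 2   # sum of (x - mean)^4 / s^4
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    shift = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * z4 - shift


values = [2, 3, 4, 5, 6]
print(pearson_kurtosis(values))        # approximately 1.7
print(sample_excess_kurtosis(values))  # approximately -1.2
```

For the five values tried in the comments, the bias-corrected excess kurtosis comes out at about -1.2, while the strict-sense kurtosis is about 1.7 (so about -1.3 after subtracting 3). The two conventions differ by more than a constant shift of 3 in small samples because of the bias correction.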

2 Answers

You should calculate and report the sample skewness and kurtosis of your residual distribution. Even without this, it appears from your histogram that it is probably leptokurtic; it has a higher peak, lower shoulders and fatter tails than the normal distribution. From the histogram it looks quite close to a Pearson Type VI distribution with positive excess kurtosis and possibly some slight positive skew. Fitting the distribution to this family would probably give a reasonable fit.

Deviation from normality of errors is not fatal for a regression model, since many of the results are robust to deviations from this distributional assumption. This deviation from normality means that your underlying error distribution is probably slightly leptokurtic. Your coefficient estimates should still be fine, but you will want to take the excess kurtosis into account if you construct prediction intervals for individual values. The excess kurtosis means that there is a higher probability of large errors in either direction than would be predicted by the normal regression model.
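
To give a rough sense of what "a higher probability of large errors" can mean, here is a minimal sketch that compares two-sided tail probabilities under a normal error model and under a heavy-tailed alternative. It rests on stated assumptions: it swaps in a Student t-distribution (not the Pearson Type VI family mentioned above), chooses the degrees of freedom by matching the reported excess kurtosis of 2.489 via the identity excess kurtosis = 6/(ν − 4) for ν > 4, and rescales the t variable to unit variance. All function names are hypothetical, and moment matching on a noisy sample kurtosis is crude.

```python
import math


def t_df_from_excess_kurtosis(excess):
    """Degrees of freedom of the Student t whose excess kurtosis equals
    `excess`, using excess = 6 / (nu - 4), valid for nu > 4."""
    if excess <= 0:
        raise ValueError("a t-distribution has positive excess kurtosis")
    return 4.0 + 6.0 / excess


def t_pdf(t, nu):
    """Density of the standard Student t with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(t * t / nu))


def two_sided_tail_t(k, nu, steps=2000):
    """P(|X| > k) where X is a t variable rescaled to unit variance.
    Integrates the density on [0, c] with Simpson's rule (steps is even)."""
    c = k * math.sqrt(nu / (nu - 2))   # cut-off on the unscaled t scale
    h = c / steps
    total = t_pdf(0.0, nu) + t_pdf(c, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 1.0 - 2.0 * total * h / 3.0


def two_sided_tail_normal(k):
    """P(|Z| > k) for a standard normal variable."""
    return math.erfc(k / math.sqrt(2))


nu = t_df_from_excess_kurtosis(2.489)   # about 6.41
for k in (2, 3, 5):
    print(k, two_sided_tail_normal(k), two_sided_tail_t(k, nu))
```

Under these assumptions the t model assigns noticeably more probability to errors beyond 3 or 5 standard deviations than the normal model does, which is why normal-based prediction intervals can be too optimistic in the tails.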

Ben
  • I might mention in the paper that the distribution looks quite close to a Pearson Type VI distribution with some positive excess kurtosis, as you mentioned, followed by a statement like "...accordingly this distribution would probably give a reasonable fit...". Is there an academic paper I can reference as additional support for this conclusion? (Sorry, I'm quite new to this.) I did find this paper, http://www.academicjournals.org/article/article1379928288_Lahcene.pdf, but I found it hard to link it to my case without diving too deeply into it. – JohnKimble Aug 21 '18 at 16:07
  • If you want to make this claim in a paper, you need to actually fit the data to that distribution and establish that it is a better fit than the normal (e.g., through goodness-of-fit statistics). You could reference a paper on the Pearson distribution, but obviously there will be no paper that asserts that your data has a reasonable fit to this distribution. (If there were, it would be the most prescient paper ever!) If you just want general references to the distribution, I would start by looking at the references [here](https://en.wikipedia.org/wiki/Pearson_distribution#Sources). – Ben Aug 21 '18 at 22:47

To answer my own question, based on discussion with others: the data look quite close to being normally distributed. No real data are exactly normal; there will always be small deviations. In this case the distribution is quite close to normal and can therefore be treated as normal.

JohnKimble
  • Your residuals will not actually be from a normal distribution ("*all models are wrong*"). However, they do appear reasonably consistent with having come from a normal distribution. Your title has the right question (yes, the residual displays are fairly consistent with approximately normal errors) but your answer says something you can't really support on the evidence ("it is normally distributed"). – Glen_b Aug 20 '18 at 23:39
  • You are right, I will adjust the wording of the conclusion – JohnKimble Aug 20 '18 at 23:58
  • This is distinctly non-normal: this distribution has noticeably higher probability in the center (negative excess kurtosis). Whether it could be treated as Normal depends on what you will be doing with the data. "Regression analysis" is too vague to permit further comment. – whuber Aug 21 '18 at 00:16
  • Basically I want to perform a linear regression. The dependent variable is the cumulative abnormal return (CAR) and the independent variables are among others earnings. – JohnKimble Aug 21 '18 at 00:21
  • There are also somewhat heavier tails. We should look for lack of fit in the mean and for heteroskedasticity (in that order) before trying to interpret these displays too closely; if there's anything going on there we will get a mistaken impression. If these are stock returns, you'd **expect** them to be non-normal (typically: heavy tails, high peaks). Not only are returns heteroskedastic (that's why people use GARCH models and the like), but even the residuals from those models remain peakier and heavier-tailed than normal (t-distributions are sometimes used), and they're generally skewed. – Glen_b Aug 21 '18 at 00:33
  • I'd expect all of those to be present in CAR. I'd also suggest leaning toward using a Q-Q plot rather than a P-P plot (your title says Q-Q but you actually have a P-P). P-P plots "squash" differences in the tail, which you will probably want to avoid making hard to identify. – Glen_b Aug 21 '18 at 00:41
  • @whuber: I think this would actually be excess positive kurtosis, not negative. You have a high centre, low shoulders and fat tails (as indicated by standardised residuals of approximately five standard deviations). It would be best for the OP to calculate and report the higher moments so that we can see, but I would expect it to have positive excess kurtosis. – Ben Aug 21 '18 at 03:40
  • Ben, you're correct; heavier tail (and to some extent higher peak) tend to be associated with positive excess kurtosis. I think that's probably a slip of the fingers on whuber's part. – Glen_b Aug 21 '18 at 04:28
  • I have added a Q-Q plot now and also included the scatterplot for the constant variance test. @Glen_b would you consider based on the scatterplot output that this is heteroskedastic? – JohnKimble Aug 21 '18 at 08:57
  • @Ben Yes, I agree--I misread the plot. The QQ plot that has since been plotted makes the interpretation clear. – whuber Aug 21 '18 at 13:27
  • The comments about kurtosis measuring peaks are misleading. Kurtosis measures heaviness of tails only; virtually nothing about the peak. Further, there is a direct mathematical connection between excess kurtosis and the tails of the normal q-q plot; see here: https://stats.stackexchange.com/a/354076/102879 – BigBendRegion Oct 25 '18 at 12:25
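
The point made in the comments above, that a P-P plot "squashes" differences in the tails while a Q-Q plot shows them, can be illustrated with a small sketch that computes the coordinates of both plots. The function name, the plotting positions (i + 0.5)/n, and the sample values are all assumptions chosen for illustration; they are not the data from the question.

```python
from statistics import NormalDist, mean, stdev


def qq_pp_points(residuals):
    """Return (qq, pp): lists of (theoretical, observed) pairs for a normal
    Q-Q plot and a normal P-P plot of the standardized residuals."""
    std_normal = NormalDist()
    m, s = mean(residuals), stdev(residuals)
    z = sorted((r - m) / s for r in residuals)
    n = len(z)
    # Plotting positions: one common convention among several.
    probs = [(i + 0.5) / n for i in range(n)]
    qq = [(std_normal.inv_cdf(p), zi) for p, zi in zip(probs, z)]
    pp = [(p, std_normal.cdf(zi)) for p, zi in zip(probs, z)]
    return qq, pp


# Hypothetical residuals with one large value in each tail.
sample = [-4.0, -0.9, -0.5, -0.2, 0.0, 0.1, 0.3, 0.6, 0.8, 3.8]
qq, pp = qq_pp_points(sample)

# Gap between observed and theoretical value for the smallest residual.
print("Q-Q gap in lower tail:", abs(qq[0][1] - qq[0][0]))
print("P-P gap in lower tail:", abs(pp[0][1] - pp[0][0]))
```

Because P-P coordinates are probabilities confined to [0, 1], an extreme residual can move its point only slightly, whereas on the Q-Q scale the same residual sits visibly off the reference line.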