How to check if a data set can be modelled by a normal distribution?

Question

I have a set of velocity data (37 data points) and I want to know how to check if the data can be modelled with a normal distribution.

Following a guide on YouTube, I have calculated the CDF, expected values and z-values, and produced a plot of the expected values and the real data against the z-values.

From here I'm lost as to what to do next. How do I determine from this plot how well the data is modelled by a normal distribution? Some of the real data points lie above the line and some below, and it seems to fit reasonably well, but I would like to be more quantitative about it.

Please bear in mind I know only basic statistics and this has already gone way beyond my knowledge.

I'm curious how you calculated a CDF and z-values, because if your data are in anything more than one dimension, then velocities are vectors, by definition. Are you sure you don't have *speeds*? — whuber, Aug 14 '15 at 15:03
It's not clear what the dimension of the normal distribution is. Is it spatial? 3D? — Aksakal, Aug 14 '15 at 15:37
The set of data is essentially the magnitudes of a set of velocities (speeds), so the normal distribution would be 2D, I guess. — Ben Booth, Aug 14 '15 at 15:40
No, the distribution would have to be one-dimensional, since a speed is just a number. In many circumstances a normal distribution would be a poor choice of model for a distribution of speeds. What kind of thing do your speeds measure? — whuber, Aug 14 '15 at 17:10

score 3 · Answer 1 · edited Aug 15 '15 at 00:15


Here is a Wikipedia article on it.

In summary:

You can look at a histogram of the data: does the shape look similar to a normal distribution? You can also look at quantile-quantile plots (Q-Q plots): do the sample quantiles of your data match up with the theoretical quantiles of a normal distribution?
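
For anyone who wants a number to go alongside the Q-Q plot, one option (not something this answer itself proposes) is the correlation between the ordered data and the theoretical normal quantiles. Below is a rough sketch using only the Python standard library. The `speeds` list is made up for illustration, the function names are hypothetical, and the plotting positions `(i - 0.5) / n` are just one of several common conventions.

```python
from statistics import NormalDist, mean


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def normal_qq(data):
    """Return (theoretical quantiles, ordered data, correlation) for a normal Q-Q plot."""
    ordered = sorted(data)
    n = len(ordered)
    # Plotting positions (i - 0.5) / n; other conventions exist.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf(p) for p in probs]
    return theoretical, ordered, pearson_r(theoretical, ordered)


# Made-up speeds for illustration only (not the asker's data).
speeds = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.8, 4.9, 5.5, 3.5, 6.4]
theoretical, ordered, r = normal_qq(speeds)
print(f"Q-Q correlation: {r:.3f}")
```

Plotting `ordered` against `theoretical` gives the Q-Q plot. A correlation close to 1 means the points lie close to a straight line, but it is still worth looking at the plot itself, since systematic curvature can hide behind a high correlation.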

You can also do a hypothesis test to check this formally (the Shapiro-Wilk test, etc.).
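
As a rough illustration of the formal test, and assuming Python with SciPy is available (the question does not say which software is in use), the Shapiro-Wilk test is provided by `scipy.stats.shapiro`. The `speeds` values below are made up and stand in for the real 37 data points.

```python
from scipy import stats

# Made-up speeds for illustration only (not the asker's data).
speeds = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.8, 4.9, 5.5, 3.5, 6.4]

# shapiro returns the W statistic and a p-value for the null
# hypothesis that the sample comes from a normal distribution.
statistic, p_value = stats.shapiro(speeds)
print(f"W = {statistic:.3f}, p = {p_value:.3f}")
```

A small p-value suggests a departure from normality. A large one does not show that the data are normal, only that the test did not detect a departure, which is easy to happen with a few dozen points.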

What software do you use? I can show you specific examples of how to do the above mentioned techniques.

Here is a nice Excel video for Q-Q plots.

answered Aug 14 '15 at 14:59 by bdeonovic · edited Aug 15 '15 at 00:15 by Glen_b


I think the Q-Q plot is likely to be more relevant than doing a formal hypothesis test. – Glen_b Aug 14 '15 at 16:04
I concur with @Glen_b. Q-Q plots are easy to create in Excel, too. – whuber Aug 14 '15 at 17:11
QQ plots certainly are the go-to method. – bdeonovic Aug 14 '15 at 17:47

score 1 · Answer 2 · edited Apr 13 '17 at 12:44

This began as a comment but it has grown much too long, and it is coming closer to being a kind of answer. This doesn't take anything away from bdeonovic's answer which is fine as a direct answer to the question asked. I'm going to focus more on the premises of the question.

I'd agree with whuber that in many situations speeds are unlikely to be particularly close to normally distributed. They're positive, generally right skew, and often have variability that's related to the mean (when mean speeds are low, there's not much "room" for values to vary away from the mean, since they can't be negative).

In some situations Rayleigh distributions can be useful for lengths of 2D vectors (speeds of things moving in 2D), and the Maxwell-Boltzmann is sometimes useful for lengths of 3D vectors. Both these distributions have variance proportional to the square of the mean. In other situations I've seen adequate approximations from Weibull, gamma, or lognormal distributions (and one or two cases where it was "none of the above"). Used in the way these would usually be used for modelling speeds, all of these have variance proportional to the square of the mean as well. In some cases (depending on what you're doing), several of these may lead to very similar conclusions. In general terms, they fit the qualitative things we know about speeds.

If for some reason you really needed to have a variable close to normality, you might consider transformation (e.g. see here in the case of the Rayleigh), but I would not usually recommend this approach as a first choice (e.g. it would not be my first choice if you're considering some kind of regression model); in some situations it may work well enough.

It may not be very important to have a normal distribution in any case. Why do you need your data to be normal? What kinds of things do you want to do with it? For example, if you're trying to model speed as a function of predictors, it will be important to get the form of the relationship of the mean speed to the predictors correct, and the mean-variance relationship close to correct. There are good alternatives to normal models that may work better.
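
To make the mean-variance point above concrete for the 2D (Rayleigh) case, here is a small simulation sketch using only the Python standard library. It assumes, purely for illustration, that the two velocity components are independent normals with mean zero and a common standard deviation `sigma`. The function name, sample size and seed are arbitrary choices.

```python
import math
import random
from statistics import mean, variance


def simulate_speeds(sigma, n, rng):
    """Lengths of 2D vectors whose components are independent N(0, sigma^2)."""
    return [math.hypot(rng.gauss(0, sigma), rng.gauss(0, sigma)) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is repeatable
ratios = {}
for sigma in (0.5, 2.0, 8.0):
    speeds = simulate_speeds(sigma, 20000, rng)
    ratios[sigma] = variance(speeds) / mean(speeds) ** 2
    print(f"sigma={sigma}: mean={mean(speeds):.3f}, var/mean^2={ratios[sigma]:.3f}")

# For a Rayleigh distribution, var / mean^2 = (4 - pi) / pi, whatever sigma is.
theory = (4 - math.pi) / math.pi
print(f"theoretical var/mean^2: {theory:.3f}")
```

The mean changes a great deal across the three values of `sigma`, while the ratio of variance to squared mean stays close to the same constant. That is what "variance proportional to the square of the mean" amounts to in practice, and it is a pattern a normal model with a fixed variance would not capture.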
