
I am not necessarily interested in testing normality, but I would at least like to ensure that:

  • The mean is near 0

  • There is a single mode

  • Tails are thinning out as you go further from the mean

Any ideas?

The Baron

1 Answer


Distributions which satisfy those criteria can include distributions whose behavior is very different from the normal. For example, a $t$ distribution with two degrees of freedom satisfies those criteria, as does an asymmetric Laplace distribution, but each has very different properties from a normal.
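As a rough illustration of that point (a sketch, not part of the original figures), the standard normal density and the $t_2$ density can be drawn together. Both are centred at 0, unimodal, and have tails that thin out, yet the tail weights differ enormously. The grid `x` and the comparison point 5 are arbitrary choices.

```r
# Compare the standard normal density with a t density on 2 df.
# Both satisfy the three criteria in the question.
x <- seq(-10, 10, length.out = 2001)   # arbitrary plotting grid
f_norm <- dnorm(x)
f_t2 <- dt(x, df = 2)

plot(x, f_norm, type = "l", lty = 2, ylab = "density")
lines(x, f_t2)
legend("topright", legend = c("normal", "t, 2 df"), lty = c(2, 1))

# Ratio of the two densities at x = 5: the t_2 tail is far heavier
tail_ratio <- dt(5, df = 2) / dnorm(5)
tail_ratio
```

The ratio shows how much more density the $t_2$ puts in the tail at that point, even though the two curves look broadly similar near the centre.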

I'll assume (though you don't state it) that you're primarily concerned with continuous variates. If you have discrete or even categorical data the advice would differ somewhat.

A common choice for assessing those sorts of things on data is the histogram, but caution is required; even small changes in the bin width and/or bin origin can in some cases lead to a big difference in appearance. Here are two histograms of the same data set:

[Figure: "Skew vs bell", two histograms of the same data set]

It's often advisable to use narrower bins than the common defaults packages offer; we're trying to get a visual idea of shape and can smooth by eye. [The smoothing of the default settings is often quite strong.]
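A minimal sketch of the bin-origin and bin-width effect, on simulated data (not the data behind the figure above): the same sample is drawn with one bin width at two different origins, and then with narrower bins. The sample, the seed, and the bin width of 0.5 are all arbitrary assumptions.

```r
set.seed(1)                  # arbitrary seed, for reproducibility only
x <- rnorm(100)              # hypothetical data

width <- 0.5                 # assumed bin width
lo <- floor(min(x)) - width
hi <- ceiling(max(x)) + width

breaks_a <- seq(lo, hi, by = width)       # first bin origin
breaks_b <- breaks_a + width / 2          # same width, origin shifted by half a bin
breaks_c <- seq(lo, hi, by = width / 4)   # narrower bins, to be smoothed by eye

par(mfrow = c(1, 3))
h_a <- hist(x, breaks = breaks_a, main = "origin A")
h_b <- hist(x, breaks = breaks_b, main = "origin shifted")
h_c <- hist(x, breaks = breaks_c, main = "narrower bins")
par(mfrow = c(1, 1))
```

How much the first two panels differ depends on the sample; with some data sets the change is slight, with others it alters the apparent shape.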

Another alternative which can work well, particularly in large samples, is the kernel density estimate; again, you may want to use a bit less smoothing than the default by choosing a narrower bandwidth. Here's an example using the same data as above, but with half the default bandwidth, because the default (dashed curve) obscures the clumpiness in the original data that produces the inconsistency in the histogram:

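[Figure: kernel density estimates of the same data, with half the default bandwidth and with the default bandwidth (dashed curve)]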

This can sometimes be useful for spotting multiple modes, though small modes can be hard to spot (or tell apart from noise) with any display.
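One way to halve the default bandwidth in R is the `adjust` argument of `density`. The sketch below uses a simulated two-component sample as a stand-in (an assumption; it is not the data in the figure) and overlays the default estimate as a dashed curve.

```r
set.seed(1)                                    # arbitrary seed
x <- c(rnorm(150, 0, 1), rnorm(50, 3, 0.5))    # hypothetical data with a second small clump

d_default <- density(x)                # default bandwidth
d_narrow <- density(x, adjust = 0.5)   # half the default bandwidth

plot(d_narrow, main = "KDE: half default bandwidth (solid), default (dashed)")
lines(d_default, lty = 2)
rug(x)                                 # show the raw observations along the axis
```

The rug makes it easier to judge whether a bump in the narrower estimate corresponds to an actual cluster of observations or to only a few points.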

Another option is the quantile-quantile plot, or Q-Q plot - the normal Q-Q plot is very common - and this can convey a lot of information about shape, tail behavior (particularly if you want to see if tails are heavier than or lighter than those of a normal, say), and symmetry, but it takes some practice to learn to read them.

[Figure: normal Q-Q plot]

In this case both the suggestion of mild right skewness and the odd clumpiness can be seen. Judging symmetry can be aided by also displaying a density estimate for $2M-x$ (where $M$ is some measure of center, perhaps the median for example). You don't have to calculate a new density estimate for that; the original just needs to be plotted against $2M-x$ instead of $x$.
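Both displays can be sketched in a few lines of R. The example uses a simulated, mildly right-skewed sample (an assumption, not the data discussed above) and takes $M$ to be the median; the reflected curve reuses the original density estimate, plotted against $2M-x$.

```r
set.seed(1)                    # arbitrary seed
x <- rgamma(300, shape = 8)    # hypothetical, mildly right-skewed data

par(mfrow = c(1, 2))

# Normal Q-Q plot with a reference line through the quartiles
qq <- qqnorm(x)
qqline(x)

# Density estimate and its mirror image about the median M
M <- median(x)
d <- density(x)
x_reflected <- 2 * M - d$x     # reflect the evaluation grid about M
plot(d, main = "Density (solid) and reflection about M (dashed)")
lines(x_reflected, d$y, lty = 2)
abline(v = M, col = "dimgrey")

par(mfrow = c(1, 1))
```

For a symmetric sample the solid and dashed curves would nearly coincide; with right skewness the Q-Q plot typically bends upward away from the line, and the reflected density separates from the original in the tails.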

Glen_b
  • Is it normal to get really asymptotic values with `rt(n,df=2)`? I'm used to the nice `hist(rnorm(1e4))`, and trying to reproduce your images, I get really messed up histograms: `set.seed(99); n – Antoni Parellada Nov 26 '15 at 01:23
  • @Antoni Note that my mention of the $t_2$ was in respect of its *density* (See `dt`). e.g. try looking at say `dt2=function(x) dt(x,2);curve(dt2,-10,10);abline(h=0,col="dimgrey")` The $t_2$ is pretty heavy tailed (its variance is not finite), so random values from it can be very large. Since a very large outlier can distort the appearance of most displays, you may want to plot say the middle 99% (for a standard $t_2$, try `hist(y,xlim=c(-10,10),n=2000)` which should typically contain just over 99% of the data values and for $10^4$ points looks reasonably like the curve above). – Glen_b Nov 26 '15 at 01:31
  • I'm getting an error message: `Error in hist(x, xlim = c(-10, 10), n = 1000) : object 'x' not found`, but going back for a second to the initial code in my comment, if you don't mind it, I'm surprised that I can't get it to behave even changing it to: `set.seed(99); n – Antoni Parellada Nov 26 '15 at 01:37
  • @Antoni That's because your data were in `y` (mine were in `x`); sorry about that. I have now edited my comment so you can run it straight on your data. You need a lot of breaks (my `n`), since it specifies the number of breaks over the entire range, not the plotted range. Try `breaks=2000` (equiv. `n=2000`) and use a wider range of values. – Glen_b Nov 26 '15 at 01:38
  • Beautiful! Sorry - I was too absorbed in my own lines... it's a problem with the order of magnitude of the breaks needed. If I increase them to $2,000$, it works beautifully: `set.seed(99); n – Antoni Parellada Nov 26 '15 at 01:44
  • Very few of the data values are in the tails, but they're so far away they screw the scale up. An alternative display that's a bit less work is: `library(MASS);truehist(y,xlim=c(-10,10))`. That function doesn't base its binwidth on the range. In fact I think its builtin function uses `nclass.FD` which returns just under 2000 bins on your data set. (Edit: no, the Scott rule is default in `truehist`) – Glen_b Nov 26 '15 at 01:45
  • But how do I output a specific statistic that numerically evaluates those qualities? That is something I am more interested in, rather than visually (graphically) examining the data. – The Baron Nov 26 '15 at 16:56