Evaluation:
- Looking at boxplots is perhaps the most common approach. Strongly skewed data often show far outliers, but even samples from normal and some other symmetric distributions can show outliers.
Example: Out of 100,000 normal samples of size 100, more than half showed at least one boxplot outlier, and there was almost one outlier per sample on average.
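    # number of boxplot outliers in each of 10^5 normal samples of size 100
    out = replicate(10^5,
            length(boxplot.stats(rnorm(100))$out))
    mean(out > 0)   # proportion of samples with at least one outlier
    [1] 0.5204
    mean(out)       # average number of outliers per sample
    [1] 0.92163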
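The simulated rate can be compared with a back-of-envelope calculation. The sketch below assumes the usual boxplot rule (fences at 1.5 IQR beyond the quartiles) and uses population quartiles of the standard normal rather than sample hinges, so it is only a rough check; the names `lower`, `upper`, and `p.out` are made up for illustration.

```r
# Population quartiles and IQR of the standard normal distribution
q1  = qnorm(.25);  q3 = qnorm(.75)
iqr = q3 - q1

# Boxplot fences under the 1.5*IQR rule
lower = q1 - 1.5*iqr
upper = q3 + 1.5*iqr

# Probability that a single normal observation falls beyond a fence
p.out = pnorm(lower) + (1 - pnorm(upper))
p.out          # roughly 0.007

# Expected count beyond the population fences in a sample of size 100
100 * p.out    # roughly 0.7
```

Sample fences vary from sample to sample, which is one plausible reason the simulated average above is somewhat higher than this population-based figure.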
- If the issue is that you'd like normal data, then formal tests of normality are usually not the best approach. For small samples, normality tests can fail to reject for data from skewed distributions; for large samples, these tests can reject for data that are close enough to normal for many purposes. Look at normal probability plots instead.
Examples: The Shapiro-Wilk test is one of the best tests of normality. Even so, out of 100,000 exponential samples of size ten, this test rejected normality at the 5% level for fewer than half.

    # Shapiro-Wilk P-values for 10^5 exponential samples of size 10
    pv = replicate(10^5, shapiro.test(rexp(10))$p.value)
    mean(pv <= .05)   # proportion rejected at the 5% level
    [1] 0.44557
Moreover, out of 100,000 samples of size 1000 from Student's t distribution with 30 degrees of freedom (hardly distinguishable from normal for many practical purposes), about 21% were rejected by the Shapiro-Wilk test at the 5% level.

    # Shapiro-Wilk P-values for 10^5 samples of size 1000 from t(30)
    pv = replicate(10^5, shapiro.test(rt(1000,30))$p.value)
    mean(pv <= .05)   # proportion rejected at the 5% level
    [1] 0.20981
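Normal probability plots are easy to make in base R with `qqnorm` and `qqline`. As a minimal sketch, using freshly simulated data and an arbitrary seed (neither is part of the examples above), one might compare a normal sample with an exponential one:

```r
set.seed(2024)          # arbitrary seed, only for reproducibility
x.norm = rnorm(100)     # sample from a normal population
x.exp  = rexp(100)      # sample from a right-skewed population

par(mfrow = c(1, 2))    # two panels side by side
qqnorm(x.norm, main = "Normal sample");       qqline(x.norm)
qqnorm(x.exp,  main = "Exponential sample");  qqline(x.exp)
par(mfrow = c(1, 1))    # restore single-panel layout
```

Points that fall roughly along the reference line suggest approximate normality; systematic curvature, as is typical for skewed data, suggests otherwise.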
Execution:
- If possible, try to find a method that works well for untransformed data. Transformation, even when necessary, can make it hard to understand and explain the results of tests.
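Example: Suppose we have an exponential sample of size 40 from a population with $\mu = 50.$ Some might rely on the robustness of t methods to get the (approximate) 95% CI $(33.1, 64.1)$ for $\mu.$ However, the relationship $\frac{\bar X}{\mu}\sim\mathsf{Gamma}(40,40)$ (shape 40, rate 40) can be pivoted to give the exact 95% CI $(36.5, 68.0)$: if $g_{.025}$ and $g_{.975}$ are quantiles of that gamma distribution, then $P(g_{.025} < \bar X/\mu < g_{.975}) = 0.95,$ so the interval is $(\bar X/g_{.975},\, \bar X/g_{.025}).$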
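    x = rexp(40, 1/50)   # exponential sample with mean 50 (rate 1/50)
    mean(x)
    [1] 48.60372
    t.test(x)$conf.int   # approximate 95% CI from t methods
    [1] 33.07733 64.13012
    attr(,"conf.level")
    [1] 0.95
    mean(x)/qgamma(c(.975,.025), 40, 40)   # exact 95% CI from the gamma pivot
    [1] 36.46582 68.03293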
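One way to compare the two intervals is to estimate their coverage by simulation. The following is a rough sketch under the same assumptions as the example (exponential data, $n = 40,\ \mu = 50$); the seed, the number of iterations `B`, and the variable names are arbitrary choices made here.

```r
set.seed(1)                 # arbitrary seed, only for reproducibility
mu = 50;  n = 40;  B = 10^4

cover = replicate(B, {
  x    = rexp(n, 1/mu)
  ci.t = t.test(x)$conf.int                    # approximate t interval
  ci.g = mean(x)/qgamma(c(.975, .025), n, n)   # exact gamma-pivot interval
  c(t     = ci.t[1] < mu & mu < ci.t[2],       # does each interval cover mu?
    gamma = ci.g[1] < mu & mu < ci.g[2])
})

rowMeans(cover)   # estimated coverage probability of each interval
```

The gamma-based interval should cover $\mu$ in very nearly 95% of samples, apart from simulation error; how far the t interval falls from the nominal level can be read from the first entry.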
- If transformation is necessary, try to use one that minimizes difficulties of interpretation. In some instances, using a rank transformation works and results are easy to understand. Differences of logs correspond to ratios on the original scale, because $\log a - \log b = \log(a/b).$
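To illustrate both points, here is a small sketch with two hypothetical lognormal groups `a` and `b` (simulated here, not taken from any data above). It checks that a difference of mean logs back-transforms to a ratio of geometric means, and that a rank-based test gives the same result with or without the log transformation.

```r
set.seed(7)                                   # arbitrary seed
a = rlnorm(30, meanlog = 3.0, sdlog = 0.5)    # hypothetical group A
b = rlnorm(30, meanlog = 3.4, sdlog = 0.5)    # hypothetical group B

gm = function(v) exp(mean(log(v)))            # geometric mean

# Difference of mean logs, back-transformed, is a ratio of geometric means
d = mean(log(b)) - mean(log(a))
all.equal(exp(d), gm(b)/gm(a))

# A t interval on the log scale becomes an interval for that ratio
exp(t.test(log(b), log(a))$conf.int)

# Ranks are unchanged by the (monotone) log transformation,
# so the Wilcoxon rank-sum test is the same on either scale
p.raw = wilcox.test(b, a)$p.value
p.log = wilcox.test(log(b), log(a))$p.value
all.equal(p.raw, p.log)
```

Reporting "group B is typically some multiple of group A" is often easier to explain than a difference on the log scale.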