
I am wondering what the usual way is for a data scientist to check whether data are skewed. Is it by plotting a histogram or by computing a skewness/kurtosis value (e.g., using pandas methods)?

What is the correct way, and how is it usually done in real data analysis/machine learning work?

  • Welcome to Cross Validated! For what purpose would you be examining the skewness? – Dave Dec 29 '21 at 16:17
  • Wording such as "correct way" and "normal way" isn't especially helpful. Who knows across all the practitioners and all the projects anywhere which methods are especially common? @BruceET guesses boxplots; I guess histograms are still more common than boxplots; but where are the data? – Nick Cox Dec 29 '21 at 17:23
  • Kurtosis is different from skewness. – Nick Cox Dec 29 '21 at 17:24
  • Skewness can be analyzed in detail by studying the N-letter summary and plotting the mid-letter statistics against the spreads, as described by John Tukey in *EDA.* See https://stats.stackexchange.com/a/96684/919 for an explanation and examples. I believe few data scientists know or even care about this technique, but it is simple, powerful, and useful. Perhaps asking about "correct" or "normal" methods is not going to yield very good answers... . – whuber Dec 29 '21 at 18:55
  • @Dave I just want to make sure that the independent variables are not highly skewed, though I have not read that requirement anywhere. – Hari Upadrasta Jan 03 '22 at 01:18
  • @NickCox yes, kurtosis and skewness are different. I think I was asking about the distribution being normal, though I am not sure why it should be normally distributed, as I have not read any reference document/book that says it should be. But I have seen plenty of models created after making the distribution of an input variable normal. – Hari Upadrasta Jan 03 '22 at 01:46
  • @whuber I am still going through the link. Thanks for the link. – Hari Upadrasta Jan 03 '22 at 01:46
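
Following up on the letter-value diagnostic mentioned in the comments, here is a minimal R sketch (illustrative only, with simulated data assumed for the example): midsummaries of symmetric quantile pairs are plotted against the corresponding spreads; they stay near the median for symmetric data and drift upward for right-skewed data.

# Sketch: midsummaries vs. spreads, in the spirit of Tukey's letter values
x = rexp(500)                   # assumed right-skewed example data
p = 2^-(2:7)                    # tail areas: fourths, eighths, sixteenths, ...
lo = quantile(x, p)
hi = quantile(x, 1 - p)
mid = (lo + hi)/2               # midsummaries: roughly constant if symmetric
spr = hi - lo                   # letter spreads
plot(spr, mid, xlab = "letter spread", ylab = "midsummary")
abline(h = median(x), lty = 2)  # upward drift of midsummaries indicates right skew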

2 Answers


Evaluation:

  • Maybe looking at boxplots is the most common way. Strongly skewed data often show far outliers. But even normal and some other symmetrical distributions can show outliers.

Example: Out of 100,000 normal samples of size 100, over half had outliers; there was almost one outlier per sample on average.

# Count boxplot outliers in each of 100,000 normal samples of size 100
out = replicate(10^5, 
    length(boxplot.stats(rnorm(100))$out))
mean(out > 0)    # proportion of samples with at least one outlier
[1] 0.5204
mean(out)        # average number of outliers per sample
[1] 0.92163
  • If the issue is that you'd like normal data, then using formal tests of normality is not usually the best approach. For small samples, normality tests can fail to reject for data from skewed distributions; for large samples, these tests can reject for data that is close enough to normal for many purposes. Look at normal probability plots instead.

Examples: The Shapiro-Wilk test is one of the best tests of normality. Even so, out of 100,000 exponential samples of size ten, this test found less than half to be non-normal.

# Shapiro-Wilk p-values for 100,000 exponential samples of size 10
pv = replicate(10^5, shapiro.test(rexp(10))$p.value)
mean(pv <= .05)    # proportion rejected at the 5% level
[1] 0.44557

Moreover, out of 100,000 samples of size 1000 from Student's t distribution with 30 degrees of freedom (hardly distinguishable from normal for many practical purposes) about 20% failed the S-W test of normality.

# Shapiro-Wilk p-values for 100,000 samples of size 1000 from t with 30 d.f.
pv = replicate(10^5, shapiro.test(rt(1000, 30))$p.value)
mean(pv <= .05)    # proportion rejected at the 5% level
[1] 0.20981
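
A quick way to follow the advice above about normal probability plots, for any sample x (a minimal sketch, not part of the original computations):

# Sketch: normal probability (Q-Q) plot; curvature away from the line indicates skew
x = rexp(100)   # assumed example data
qqnorm(x)
qqline(x)       # reference line through the quartiles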

Execution:

  • If possible, try to find a method that works well for untransformed data. Transformation, even when necessary, can make it hard to understand and explain the results of tests.

Example: Suppose we have an exponential sample of size 40 from a population with $\mu = 50.$ Some might depend on the robustness of t methods to get the (approximate) 95% CI $(33.1,64.1)$ for $\mu.$ However, the relationship $\frac{\bar X}{\mu}\sim\mathsf{Gamma}(40,40)$ can be pivoted to give the exact 95% CI $(36.5, 68.0).$

# One exponential sample of size 40 with true mean 50
x = rexp(40, 1/50)
mean(x)
[1] 48.60372

# Approximate 95% CI from the t method
t.test(x)$conf.int
[1] 33.07733 64.13012
attr(,"conf.level")
[1] 0.95

# Exact 95% CI from the gamma pivot
mean(x)/qgamma(c(.975,.025), 40, 40)
[1] 36.46582 68.03293
  • If transformation is necessary, try to use one that minimizes difficulties of interpretation. In some instances, using a rank transformation works and results are easy to understand. Differences in logs amount to ratios of original data.
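
For instance (a minimal sketch with made-up lognormal samples, not from the original answer), a confidence interval for a difference of mean logs back-transforms to an interval for a ratio on the original scale:

# Sketch: a difference of logs corresponds to a ratio of original values
x = rlnorm(50, meanlog = 3.0, sdlog = 0.5)   # assumed example data
y = rlnorm(50, meanlog = 3.3, sdlog = 0.5)
ci.log = t.test(log(x), log(y))$conf.int     # CI for the difference of mean logs
exp(ci.log)                                  # CI for the ratio of geometric means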
BruceET
  • Until the very end, this answer seems focused exclusively on what *not* to do, rather than on what *to* do! Valid advice, to be sure--but not terribly responsive to the original question... . – whuber Dec 29 '21 at 19:01
  • @whuber. That's a fair comment on my Answer, and probably also on my views of transformations to avoid skewness. I probably have much to learn about transformations, but my current view is that they are used too often. – BruceET Dec 29 '21 at 19:08
  • I, too, suspect they might be applied to *response variables* too often in small statistical applications, but perhaps not often enough in large, complex cases where some black box algorithm will be applied and "feature engineering" can be important or where transformations of the explanatory variables could allow for simpler or more accurate modeling. – whuber Dec 29 '21 at 20:04

Read up on and consider lognormal distributions. They are very common in many fields of science. A log transform converts a lognormal distribution to a Gaussian one. But this is not just a mathematical trick to be applied whenever a normality test flags the distribution as non-normal. There are reasons why data are lognormal, and it is appropriate to account for this with a log transform of the data before some analyses, and a reverse log transform of some results (to convert differences between logs into ratios of actual values).

For ratio data (only positive values are possible; changes are multiplicative rather than additive), some argue that assuming a lognormal distribution should be the default.
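
As a small illustration (a sketch with simulated data, not from the answer): lognormal data look strongly right-skewed on the original scale but roughly Gaussian after taking logs.

# Sketch: lognormal data before and after a log transform
x = rlnorm(1000, meanlog = 2, sdlog = 0.6)   # assumed example data
hist(x)        # long right tail on the original scale
hist(log(x))   # roughly symmetric and bell-shaped on the log scale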

Harvey Motulsky