The Totalpoints variable of the Decathlon dataset in R has a skewed distribution. How can I transform it to be approximately normal?
library("GDAdata")
data(Decathlon, package = "GDAdata")
qqnorm(Decathlon$Totalpoints)
In a comment, you say you want to transform the variable in order to do ANOVA, and that your residuals are not normally distributed when you use the original variable.
But, rather than making your data fit a model, I'd pick a model that fits your data.
I suggest quantile regression or robust regression.
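As a minimal sketch of both approaches (not your actual model: I don't know your ANOVA factor, so yearEvent is used below purely as a stand-in grouping variable), something like this with the quantreg and MASS packages:

if(!require(quantreg)){install.packages("quantreg")}
if(!require(MASS)){install.packages("MASS")}
library(GDAdata)
data(Decathlon, package = "GDAdata")

# Quantile regression on the median (tau = 0.5)
fit.rq <- rq(Totalpoints ~ factor(yearEvent), tau = 0.5, data = Decathlon)
summary(fit.rq)

# Robust regression (Huber M-estimation)
fit.rlm <- rlm(Totalpoints ~ factor(yearEvent), data = Decathlon)
summary(fit.rlm)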
One method is an inverse normal scores transformation. If you'll forgive me copying the references from my own R function, useful references include:
Conover, 1995, Practical Nonparametric Statistics, 3rd.
Solomon & Sawilowsky, 2009, Impact of rank-based normalizing transformations on the accuracy of test scores.
Beasley and Erickson, 2009, Rank-based inverse normal transformations are increasingly used, but are they merited?
if(!require(GDAdata)){install.packages("GDAdata")}
if(!require(rcompanion)){install.packages("rcompanion")}
library("GDAdata")
data(Decathlon, package = "GDAdata")
qqnorm(Decathlon$Totalpoints)
library(rcompanion)
Blom = blom(Decathlon$Totalpoints)
qqnorm(Blom)
qqline(Blom, col='red')
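If you're curious what blom is doing under the hood, here is a rough base-R equivalent. It assumes the standard Blom normal-scores formula $\Phi^{-1}\big((r - 3/8)/(n + 1/4)\big)$; the rcompanion function may handle ties and offer other methods, so see ?blom for the details.

# Rough base-R equivalent of the Blom normal-scores transform.
# Assumes the usual Blom formula; rcompanion::blom may treat ties
# differently and offers other options (see ?blom).
x <- Decathlon$Totalpoints
r <- rank(x, na.last = "keep")
n <- sum(!is.na(x))
Blom.manual <- qnorm((r - 3/8) / (n + 1/4))
qqnorm(Blom.manual)
qqline(Blom.manual, col = 'red')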
I'm not sure that this will accomplish your ultimate goal, but you can hit the variable with the empirical CDF and then use the results to pick values from a normal distribution. The values resulting from the empirical CDF transformation (the probability integral transform) are probabilities, so you then look up the corresponding quantiles of a normal distribution with whatever mean and variance you want.
set.seed(2020)
x <- rexp(1000,1); hist(x)
ex <- ecdf(x)(x)
qx <- qnorm(ex); hist(qx)
The first line of the code sets the random seed so you will get exactly the same results that I get.
The second line simulates a skewed exponential distribution and plots a histogram of the data to show the pronounced skewness.
The third line applies the probability integral transform (the empirical CDF). A useful theorem says that transforming a continuous random variable by its own CDF gives a uniform distribution on $(0,1)$, and this fact is what makes the next step work. (It is proved in Casella/Berger and probably most other books that cover calculus-based probability.)
The fourth line evaluates the standard normal quantile function at the probabilities given by the probability integral transform. Then it plots a histogram that looks standard normal to me.
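If you want something other than a standard normal, pass your target mean and standard deviation to qnorm; the 100 and 15 below are arbitrary values, purely for illustration.

# Same idea, but mapping onto a normal with mean 100 and sd 15
# (arbitrary target values, just to show the extra arguments)
qx2 <- qnorm(ex, mean = 100, sd = 15)
hist(qx2)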
This method has two pitfalls.
1) The empirical CDF transformation goes goofy when there are lots of tied values. (We don't get a uniform distribution. Try running my code with a bunch of 0s appended to x before the ecdf line; see the sketch after this list.)
2) There can be numerical instability if you have very extreme values.
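Here is pitfall 1 in action, reusing the x simulated above and appending a block of zeros before the ecdf step (the 500 is arbitrary, just enough ties to make the problem obvious):

# Pitfall 1: many tied values (zeros appended to the x simulated above)
x0  <- c(x, rep(0, 500))
ex0 <- ecdf(x0)(x0)
hist(ex0)   # no longer uniform on (0, 1): a spike at the tied values
qx0 <- qnorm(ex0)
hist(qx0)   # and the transformed values are no longer bell-shaped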
I'm usually a fan of transforming data to Normality (see here, here, and here). However, in this case I +1 @Peter Flom's comment: rather than fitting the data to the model, choose an appropriate model for the data.
In particular, the help page ?Decathlon says that the data were filtered for Totalpoints >= 6800, i.e., you observe a sample from a truncated distribution (the right tail), not the full marginal distribution. That means (a) a continuous transform won't be able to deal with this truncation elegantly (you have a built-in discontinuity at $T = 6800$), and (b) without knowing more about what type of ANOVA inference you want to do, most likely a plain-vanilla ANOVA will be severely biased / invalid given the truncated nature of your data.
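A quick sanity check makes the truncation visible (this just confirms the statement in the help page, nothing more):

# The truncation described in ?Decathlon: no scores below 6800
min(Decathlon$Totalpoints)
hist(Decathlon$Totalpoints, breaks = 50)
abline(v = 6800, col = "red", lty = 2)   # truncation point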
FWIW some interesting patterns emerge when you look at how athletes perform over time. This might also be relevant for your analysis.
# 'Rafael Cardoso Pinto' appears twice in 2004. Must be a data error. Using first appearance as unique value.
data_wide = reshape(Decathlon[, c("DecathleteName", "yearEvent", "Totalpoints")],
idvar = "DecathleteName", timevar = "yearEvent",
direction = "wide")
row_names <- data_wide[, 1]
data_wide <- data_wide[, -1]
rownames(data_wide) <- row_names
colnames(data_wide) <- gsub("Totalpoints.", "", colnames(data_wide))
data_wide = as.matrix(data_wide)
> dim(data_wide)
[1] 2709 22
That is, the dataset contains a total of 2,709 athletes across 22 years (1985-2006). A heatmap is useful to visualize any patterns:
library(superheat)
superheat::superheat(t(data_wide), heat.na.col = "white")
And interestingly enough, a couple of common-sense patterns clearly emerge from the heatmap.
This is actually quite an interesting dataset to work with, and lots of questions arise just from looking at this plot (e.g., forecasting the performance of athlete $i$ in years they will compete in the future, or imputing missing values for years an athlete did not compete, perhaps with NMF).
There's no universal way to transform every dataset, but this article gives a couple of options you can try based on the observed skewness. This variable appears to be right-skewed, so you would probably want to start with a log or square root transformation and then check the histogram. Since the values are non-negative, I'd try the square root transform here first.
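A minimal sketch of that first pass, comparing the original variable with its square-root and log transforms side by side:

library(GDAdata)
data(Decathlon, package = "GDAdata")

# Compare the original variable with its square-root and log transforms
par(mfrow = c(1, 3))
hist(Decathlon$Totalpoints,       main = "Original",    xlab = "Totalpoints")
hist(sqrt(Decathlon$Totalpoints), main = "Square root", xlab = "sqrt(Totalpoints)")
hist(log(Decathlon$Totalpoints),  main = "Log",         xlab = "log(Totalpoints)")
par(mfrow = c(1, 1))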