The Totalpoints variable of the Decathlon dataset in R has a skewed distribution. How can I transform it to be approximately normal?
library("GDAdata")
data(Decathlon, package = "GDAdata")
qqnorm(Decathlon$Totalpoints)
In a comment, you say you want to transform the variable in order to do ANOVA, and that your residuals are not normally distributed when you use the original variable.
But, rather than making your data fit a model, I'd pick a model that fits your data.
I suggest quantile regression or robust regression.
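As a minimal sketch of both approaches (not your actual model: I don't know your ANOVA factor, so yearEvent is used below purely as a stand-in grouping variable), something like this with the quantreg and MASS packages:

if(!require(quantreg)){install.packages("quantreg")}
if(!require(MASS)){install.packages("MASS")}
library(GDAdata)
data(Decathlon, package = "GDAdata")

# Quantile regression on the median (tau = 0.5)
fit.rq <- rq(Totalpoints ~ factor(yearEvent), tau = 0.5, data = Decathlon)
summary(fit.rq)

# Robust regression (Huber M-estimation)
fit.rlm <- rlm(Totalpoints ~ factor(yearEvent), data = Decathlon)
summary(fit.rlm)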
One method is an inverse normal scores transformation. If you'll forgive me copying the references from my own R function, useful references include:
Conover, 1995, Practical Nonparametric Statistics, 3rd.
Solomon & Sawilowsky, 2009, Impact of rank-based normalizing transformations on the accuracy of test scores.
Beasley and Erickson, 2009, Rank-based inverse normal transformations are increasingly used, but are they merited?
if(!require(GDAdata)){install.packages("GDAdata")}
if(!require(rcompanion)){install.packages("rcompanion")}
library("GDAdata")
data(Decathlon, package = "GDAdata")
qqnorm(Decathlon$Totalpoints)
library(rcompanion)
Blom = blom(Decathlon$Totalpoints)
qqnorm(Blom)
qqline(Blom, col='red')
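If you're curious what blom is doing under the hood, here is a rough base-R equivalent. It assumes the standard Blom normal-scores formula $\Phi^{-1}\big((r - 3/8)/(n + 1/4)\big)$; the rcompanion function may handle ties and offer other methods, so see ?blom for the details.

# Rough base-R equivalent of the Blom normal-scores transform.
# Assumes the usual Blom formula; rcompanion::blom may treat ties
# differently and offers other options (see ?blom).
x <- Decathlon$Totalpoints
r <- rank(x, na.last = "keep")
n <- sum(!is.na(x))
Blom.manual <- qnorm((r - 3/8) / (n + 1/4))
qqnorm(Blom.manual)
qqline(Blom.manual, col = 'red')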
I'm not sure that this will accomplish your ultimate goal, but you can hit the variable with the empirical CDF and then use the results to pick values from a normal distribution. The values resulting from the empirical CDF transformation (the probability integral transform) are probabilities, so you then look up the corresponding quantiles of a normal distribution with whatever mean and variance you want.
set.seed(2020)
x <- rexp(1000,1); hist(x)
ex <- ecdf(x)(x)
qx <- qnorm(ex); hist(qx)
The first line of the code sets the random seed so you will get exactly the same results that I get.
The second line simulates a skewed exponential distribution and plots a histogram of the data to show the pronounced skewness.
The third line applies the probability integral transform (the empirical CDF). A useful theorem says that transforming a continuous random variable by its own CDF gives a uniform distribution on $(0,1)$, and this fact is what makes the next step work. (It is proved in Casella/Berger and probably most other books that cover calculus-based probability.)
The fourth line evaluates the standard normal quantile function at the probabilities given by the probability integral transform. Then it plots a histogram that looks standard normal to me.
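If you want something other than a standard normal, pass your target mean and standard deviation to qnorm; the 100 and 15 below are arbitrary values, purely for illustration.

# Same idea, but mapping onto a normal with mean 100 and sd 15
# (arbitrary target values, just to show the extra arguments)
qx2 <- qnorm(ex, mean = 100, sd = 15)
hist(qx2)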
This method has two pitfalls.
1) The empirical CDF transformation goes goofy when there are lots of tied values. (We don't get a uniform distribution. Try running my code with a bunch of 0s appended to x before the ecdf line; see the sketch after this list.)
2) There can be numerical instability if you have very extreme values.
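Here is pitfall 1 in action, reusing the x simulated above and appending a block of zeros before the ecdf step (the 500 is arbitrary, just enough ties to make the problem obvious):

# Pitfall 1: many tied values (zeros appended to the x simulated above)
x0  <- c(x, rep(0, 500))
ex0 <- ecdf(x0)(x0)
hist(ex0)   # no longer uniform on (0, 1): a spike at the tied values
qx0 <- qnorm(ex0)
hist(qx0)   # and the transformed values are no longer bell-shaped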
I'm usually a fan of transforming data to Normality (see here, here, and here). However, in this case I +1 @Peter Flom's comment: rather than fitting the data to the model, choose an appropriate model for the data.
In particular, the help page ?Decathlon says that the data were filtered for Totalpoints >= 6800, i.e., you observe a sample from a truncated distribution (the right tail), not the full marginal distribution. That means (a) a continuous transform won't be able to deal with this truncation elegantly (you have a built-in discontinuity at $T = 6800$), and (b) without knowing more about what type of ANOVA inference you want to do, most likely a plain-vanilla ANOVA will be severely biased / invalid given the truncated nature of your data.
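A quick sanity check makes the truncation visible (this just confirms the statement in the help page, nothing more):

# The truncation described in ?Decathlon: no scores below 6800
min(Decathlon$Totalpoints)
hist(Decathlon$Totalpoints, breaks = 50)
abline(v = 6800, col = "red", lty = 2)   # truncation point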
FWIW some interesting patterns emerge when you look at how athletes perform over time. This might also be relevant for your analysis.
# 'Rafael Cardoso Pinto' appears twice in 2004. Must be a data error. Using first appearance as unique value.
data_wide = reshape(Decathlon[, c("DecathleteName", "yearEvent", "Totalpoints")],
idvar = "DecathleteName", timevar = "yearEvent",
direction = "wide")
row_names <- data_wide[, 1]
data_wide <- data_wide[, -1]
rownames(data_wide) <- row_names
colnames(data_wide) <- gsub("Totalpoints.", "", colnames(data_wide))
data_wide = as.matrix(data_wide)
> dim(data_wide)
[1] 2709 22
That is, the dataset contains a total of 2,709 athletes across 22 years (1985-2006). A heatmap is useful to visualize any patterns:
library(superheat)
superheat::superheat(t(data_wide), heat.na.col = "white")
And interestingly enough, a couple of common-sense patterns clearly emerge from the heatmap.
This is actually quite an interesting dataset to work with, and lots of questions arise just from looking at this plot (e.g., forecasting the performance of athlete $i$ in years they will compete in the future, or imputing missing values for years an athlete did not compete, perhaps with NMF).
There's no universal way to transform every dataset, but this article gives a couple of options you can try based on the observed skewness. This variable appears to be right-skewed, so you would probably want to start with a log or square root transformation and then check the histogram. Since the values are non-negative, I'd try the square root transform here first.
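A minimal sketch of that first pass, comparing the original variable with its square-root and log transforms side by side:

library(GDAdata)
data(Decathlon, package = "GDAdata")

# Compare the original variable with its square-root and log transforms
par(mfrow = c(1, 3))
hist(Decathlon$Totalpoints,       main = "Original",    xlab = "Totalpoints")
hist(sqrt(Decathlon$Totalpoints), main = "Square root", xlab = "sqrt(Totalpoints)")
hist(log(Decathlon$Totalpoints),  main = "Log",         xlab = "log(Totalpoints)")
par(mfrow = c(1, 1))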