
At my job I am working on standardizing some data. Right now we are simply using z-scores, and it's causing some problems. For instance, one outlier has had several four- or five-standard-deviation moves this week! That's clearly not right.
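For reference, here's roughly what we do now (a minimal sketch; simulated heavy-tailed data stands in for ours):

set.seed(1)
x <- rt(1000, df = 3)  # simulated heavy-tailed stand-in for our data

# plain z-score standardization (what we currently use)
z <- (x - mean(x)) / sd(x)

# under a Gaussian assumption, |z| > 4 should essentially never happen;
# with heavy tails it fires regularly
sum(abs(z) > 4)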

The empirical distribution (pre-transformation) I'm working with has a kurtosis of 27 and fails every test for normality. I would like to transform it into something closer to a Gaussian distribution. I did some research and found the thread "How to transform leptokurtic distribution to normality?", but it doesn't seem applicable to my case.
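For what it's worth, here's roughly how I'm checking this (a sketch continuing the one above; the moments package is just one way to compute sample kurtosis):

library(moments)  # kurtosis(); e1071 would also work

kurtosis(x)      # our data gives about 27 (the Gaussian reference value is 3)
shapiro.test(x)  # one of the normality tests that rejects decisively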

The top comment recommends using the Lambert W distribution, but the delta parameter seems to be the key, and it isn't working for me. That example uses the value 0.2, while I am using 1/27 (one over the realized kurtosis). The example's 'yy' variable fails the test for normality while my 'yy' variable passes, which invalidates the results. There's also the 'median-subtracted third root' transform, but there's no chance I'll be able to sell that to my boss.
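For concreteness, my understanding is that the 'median-subtracted third root' transform amounts to something like this (a sketch, not the thread's exact code):

# signed cube root of deviations from the median; sign()/abs() keep
# the cube root real for negative deviations
y3 <- sign(x - median(x)) * abs(x - median(x))^(1/3)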

Can someone help me with this transformation? Here's my data: https://www.dropbox.com/s/r31nub55umacf06/data.csv?dl=0

Joel Sinofsky
  • In a great many contexts, forcing a distribution to be Normal just papers over any problems and often hides important information from the analyst. Could you explain *why* you want or need the data to be Normal? – whuber Oct 04 '17 at 21:27
  • This sounds like an [X-Y problem](https://meta.stackexchange.com/a/66378/202463). Can you explain the underlying question you're addressing by using z-scores? – Glen_b Oct 04 '17 at 21:34
  • The lowest I could get the kurtosis down to was 144, from 316. However, outliers can be justifiably removed if there is reason to believe they resulted from measurement error; one might also cynically argue that extreme outliers are themselves evidence of such an error. – faustus Oct 23 '17 at 16:04

1 Answer


You can estimate Lambert W x Gaussian distributions and transformations using IGMM() as follows (applied to your data.csv file).

library(LambertW)

# load the data and run normality diagnostics (tests + plots)
yy <- read.csv("~/Downloads/data.csv")[, "x"]
test_normality(yy)

[Figure: test_normality() diagnostics for the original data]

As you said, the data is clearly non-Gaussian, with huge kurtosis (319) and negative skewness. Thus a natural candidate marginal model for your data is a double heavy-tailed Lambert W x Gaussian distribution (type = "hh"), which estimates heavy tails with separate parameters for the left and right tail.

mod <- IGMM(yy, "hh")
mod
 Parameter estimates:
 mu_x    sigma_x delta_l delta_r 
 0.184   0.052   1.331   0.603 

As expected from the density and QQ plots above, the left tail is much heavier than the right; in fact, not even first-order moments exist ($\hat{\delta}_l > 1$). The back-transformed data can be obtained using

# back-transform the observed data to the latent (Gaussian) input scale
xx <- get_input(mod)
test_normality(xx)

[Figure: test_normality() diagnostics for the back-transformed data]

and has kurtosis $3$ and skewness $0$, since it was obtained via the method of moments (IGMM()). However, it is still clearly not Gaussian: the distribution is multi-modal, with a strong concentration of points around the mean $\hat{\mu}_X = 0.184$. I took a close look at the original data near $\hat{\mu}_X = 0.184$ and found that this spike is present there as well. It would be good to understand where these exact repeated values come from (some sort of default -- constant -- value in your measurements?):

# inspect observations in a small window around the estimated mean
yy.center <- yy[yy > mod$tau["mu_x"] - 0.05 & yy < mod$tau["mu_x"] + 0.05]
test_normality(yy.center)

[Figure: test_normality() diagnostics for the observations near $\hat{\mu}_X$]

Also note that, going from left to right, the density of x -- especially for values $x > \mu_X$ -- decreases; i.e., it seems your 'x' variable depends on some other variable not included in your data (is it ordered by time?).
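One quick way to check such an ordering effect (a sketch, assuming the rows of data.csv are in their original order):

# if the data were i.i.d., neither the level nor the spread should drift
# with the row index
plot(yy, type = "l", xlab = "row index", ylab = "y")
lines(stats::filter(yy, rep(1/50, 50)), col = "red", lwd = 2)  # 50-obs moving average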

To summarize, we found that your data (a) is not only heavy-tailed but also (significantly) skewed, (b) is not i.i.d., and (c) has more than one mode, with some default constant values at $y \approx 0.184$.

None of these findings are visible in the original data; they show up only in the latent, Gaussian-transformed data. Depending on your problem and the questions you aim to answer (see the X-Y problem Glen_b referenced), this may be useful information to have.

I do note that estimating this via MLE_LambertW(yy, distname = "normal", type = "hh") gives degenerate solutions. I am not sure why this occurs, but my guess is the multi-modality (non-Normality) of the latent data, which trips up an MLE that assumes a unimodal Gaussian. An interesting question from a methodology point of view.
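For completeness, the MLE fit I tried was along these lines (a sketch of the call, not a recommended fit):

# ML counterpart of the IGMM fit above; on this data it converges to a
# degenerate solution, presumably because the latent data is multi-modal
mod.mle <- MLE_LambertW(yy, distname = "normal", type = "hh")
summary(mod.mle)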

Georg M. Goerg