
I am a novice in stats and I would like to transform my data (house prices) using a Johnson unbounded (SU) distribution so that it looks more Gaussian. I looked at pandas' transform(), but I can't really understand the Johnson SU parameters well enough to apply them in a lambda. Could someone help me out with this transformation in Python? I have the parameters but don't know which is which in the formula (or whether there is an easier way to do it).

To make things clearer, here is more context. First, I tried to identify the distribution that best fits my data:

```python
import scipy.stats as st

def get_best_distribution(data):
    # note: 'frechet_r' and 'frechet_l' were removed in SciPy >= 1.6
    dist_names = [
        'alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr',
        'cauchy', 'chi', 'chi2', 'cosine', 'dgamma', 'dweibull', 'erlang',
        'expon', 'exponweib', 'exponpow', 'f', 'fatiguelife', 'fisk',
        'foldcauchy', 'foldnorm', 'frechet_r', 'frechet_l', 'genlogistic',
        'genpareto', 'genexpon', 'genextreme', 'gausshyper', 'gamma',
        'gengamma', 'genhalflogistic', 'gilbrat', 'gompertz', 'gumbel_r',
        'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'hypsecant',
        'invgamma', 'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu',
        'ksone', 'kstwobign', 'laplace', 'logistic', 'loggamma', 'loglaplace',
        'lognorm', 'lomax', 'maxwell', 'mielke', 'nakagami', 'ncx2', 'ncf',
        'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm',
        'powernorm', 'rdist', 'reciprocal', 'rayleigh', 'rice',
        'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon',
        'truncnorm', 'tukeylambda', 'uniform', 'vonmises', 'wald',
        'weibull_min', 'weibull_max', 'wrapcauchy']
    dist_results = []
    params = {}
    for dist_name in dist_names:
        dist = getattr(st, dist_name)
        param = dist.fit(data)

        params[dist_name] = param
        # Applying the Kolmogorov-Smirnov test
        D, p = st.kstest(data, dist_name, args=param)
        print("p value for " + dist_name + " = " + str(p))
        dist_results.append((dist_name, p))

    # select the best fitted distribution (highest KS p value)
    best_dist, best_p = max(dist_results, key=lambda item: item[1])

    print("Best fitting distribution: " + str(best_dist))
    print("Best p value: " + str(best_p))
    print("Parameters for the best fit: " + str(params[best_dist]))

    return best_dist, best_p, params[best_dist]
```
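
A minimal usage sketch, with synthetic lognormal data as a hypothetical stand-in for the house prices (note EdM's caveat in the comments below: these KS p values are biased because the parameters were estimated from the same data, and the loop over ~80 distributions is slow):

```python
import numpy as np

rng = np.random.default_rng(42)
Y = rng.lognormal(mean=12, sigma=0.4, size=1000)  # fake "prices"

best_dist, best_p, best_params = get_best_distribution(Y)
```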

That identified my distribution as Johnson SU (unbounded).

What I have from my data is:

```python
import scipy.stats as st

dist_name = 'johnsonsu'
data = Y                       # Y is the house-price series
dist = getattr(st, dist_name)  # equivalent to dist = st.johnsonsu
param = dist.fit(data)
# params[dist_name] = param
# D, p = st.kstest(data, dist_name, args=param)
print(param)
```

Output:

```
(-1.5661340035204014, 1.4899654020936477, 93994.90877721814, 55321.65122078377)
```
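
For reference, `scipy.stats` fit methods return a distribution's shape parameters first, followed by `loc` and `scale`, so for `johnsonsu` the tuple above is `(a, b, loc, scale)`, i.e. the Johnson parameters gamma, delta, xi, and lambda, not moments. Since `a + b*arcsinh((x - loc)/scale)` is standard normal when `x` follows the fitted Johnson SU law, the transformation reduces to one line. A minimal sketch, where `Y` is the house-price array from the question:

```python
import numpy as np
import scipy.stats as st

# shape a (gamma), shape b (delta), loc (xi), scale (lambda)
a, b, loc, scale = st.johnsonsu.fit(Y)

# if X ~ johnsonsu(a, b, loc, scale),
# then a + b*arcsinh((X - loc)/scale) ~ N(0, 1)
Z = a + b * np.arcsinh((Y - loc) / scale)

# equivalent route via the probability integral transform:
# Z = st.norm.ppf(st.johnsonsu.cdf(Y, a, b, loc=loc, scale=scale))
```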
  • `st` is what? and what is the difference between the four elements in `param` – develarist Sep 15 '20 at 19:15
  • oops, sorry: import scipy.stats as st – João Vitor Gomes Sep 15 '20 at 19:31
  • @develarist I think it is mean, variance, skew and kurtosis but I am not sure – João Vitor Gomes Sep 15 '20 at 20:33
  • show the `getattr()` function you wrote and the class it is applied to, and what would be the next step, where is the `param` vector plugged into? – develarist Sep 15 '20 at 20:41
  • getattr() is the same as dist = st.johnsonsu – João Vitor Gomes Sep 15 '20 at 20:49
  • because you set it equal to it. but before that – develarist Sep 15 '20 at 20:50
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/113046/discussion-between-joao-vitor-gomes-and-develarist). – João Vitor Gomes Sep 15 '20 at 21:01
  • There is generally no need to transform a variable to look Gaussian. See [this page](https://stats.stackexchange.com/q/247986/28500) among many others. Also, the usual KS test isn't valid when you compare against a distribution for which you have estimated the parameter values from your data. See [here](https://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test#Test_with_estimated_parameters) – EdM Sep 15 '20 at 21:26
  • @EdM thank you for your response, but what about from a practical standpoint? Why is my R^2 better when I log the Y variable, for example? Also check this out: https://www.kaggle.com/loveall/1-house-prices-solution-top-1 ; this guy transformed his Y target and got into the top 10 best predictions. – João Vitor Gomes Sep 15 '20 at 22:35
  • "It is apparent that SalePrice doesn't follow normal distribution, so before performing regression it has to be transformed. While log transformation does pretty good job, best fit is unbounded Johnson distribution." . @EdM , what about the independent variables do they need to be gaussian like? – João Vitor Gomes Sep 15 '20 at 22:36
  • @EdM also this book (https://statisticsbyjim.com/regression/regression-analysis-intuitive-guide/) on page 244 says "Determining which Variables to Transform ... Only the dependent variable: If your residuals do not follow the normal distribution or do not have a constant variance, transforming the dependent variable might fix the problem. Earlier in this chapter, I went through an example of using a transformation to correct nonconstant variance (heteroscedasticity)...." – João Vitor Gomes Sep 15 '20 at 22:52
  • See [this answer](https://stats.stackexchange.com/a/4833/28500) and the linked and related pages there for discussions of transformations of outcome variables and of predictors. You seldom _start_ by transforming an outcome variable. You might do that at a later step in model building if the residuals between observed and predicted values aren’t well behaved otherwise. But sometimes the problem instead is poor modeling of the predictors, and transforming the outcome will only make things worse. – EdM Sep 16 '20 at 02:54
  • @EdM do you know how to make this transformation? – João Vitor Gomes Sep 16 '20 at 15:00
  • No. This is the first time I heard of this distribution. NO variables, outcome or predictors, need to have any particular distribution. Many machine learning methods work regardless of any monotone transformation. In linear regression, normality of _residuals_ can matter for precise confidence intervals and p-values, but you still get the best linear unbiased estimates under [much weaker conditions](https://en.wikipedia.org/wiki/Gauss–Markov_theorem). To improve residuals a [Box-Cox](https://en.wikipedia.org/wiki/Power_transform#Box–Cox_transformation) or similar transformation might help (a minimal sketch follows these comments). – EdM Sep 16 '20 at 15:28
  • @EdM thank you for all your insights. If you want to format your comments as an answer I will accept since no one else answered. Thanks. – João Vitor Gomes Sep 17 '20 at 01:27
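
To make the Box-Cox suggestion concrete, here is a minimal sketch using scipy.stats.boxcox; the synthetic lognormal sample is a hypothetical stand-in for the real house prices:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# hypothetical stand-in for house prices (strictly positive, right-skewed)
rng = np.random.default_rng(0)
y = rng.lognormal(mean=12, sigma=0.4, size=500)

# boxcox picks the power-transform lambda by maximum likelihood
y_bc, lam = stats.boxcox(y)
print("estimated lambda:", lam)

# fit the regression on y_bc, then map predictions back
# to the original price scale with the inverse transform:
y_back = inv_boxcox(y_bc, lam)  # recovers y up to floating-point error
```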

1 Answer


Enrico Fermi claimed that John von Neumann said:

> With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

On that basis, the 4-parameter unbounded Johnson distribution provides a way to transform an elephant into a standard normal distribution. The statistical question here is whether that's worth doing.* In this case, it's almost certainly not.

Many a "novice in stats" thinks that it's important to start with variables in a normal form:

> I would like to transform my data (house prices) using a Johnson unbounded (SU) distribution so that it looks more Gaussian.

I certainly do recall hearing, in my youth, presentations of linear regression that seemed to be based on assumptions of normality for outcomes or predictors, but that's not the case. The traditional statistical tests assume that the error term (estimated by the residuals) has a normal distribution with mean of 0, but under much weaker conditions a linear regression provides the best linear unbiased estimate (BLUE). Robust methods provide ways to assess statistical significance if the error-normality assumption is untenable.

Variable transformation can be important in regression modeling, but not typically to enforce normality of the variables themselves. Transformations of predictor variables might be important to meet the linearity assumption of the association between predictors and outcome. Restricted cubic splines provide a very flexible way to model a continuous predictor as part of a regression, more useful than anything the Johnson distributions can provide. Alternatively, modeling approaches like tree-based models will work identically regardless of a monotonic transformation of a predictor.
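
For illustration only, a restricted (natural) cubic spline on a predictor can be fit with patsy's cr() term inside a statsmodels formula; the data below are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical example: nonlinear relation between living area and price
rng = np.random.default_rng(0)
df = pd.DataFrame({"area": rng.uniform(50, 400, size=300)})
df["price"] = 1000 * np.sqrt(df["area"]) + rng.normal(0, 500, size=300)

# natural cubic regression spline basis from patsy's cr(), in the formula
fit = smf.ols("price ~ cr(area, df=4)", data=df).fit()
print(fit.summary())
```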

Transformation of an outcome variable might be needed to make residuals well enough behaved so that a BLUE can be obtained. But there's seldom a reason to start by forcing an outcome variable itself to take a normal distribution. Choosing transformations of predictors or outcomes to meet the demands of a particular problem, and knowing when to decide that something other than a linear or generalized linear regression approach is needed, are important parts of the art of modeling.


*The request for implementation in Python is off-topic on this site. This answer focuses on the statistical issues the request raises.

EdM