
I am trying to build a linear regression model.

I have some high-cardinality categorical features to which I want to apply target encoding. However, the distribution of my real-valued target variable is highly right-skewed, so I will apply a transform to reduce the skew.

Which of the following approaches is sensible?

  1. I should transform my target variable first and then apply target encoding to the categorical features based on the transformed target.

  2. I should apply target encoding to the categorical features based on the original target, and only then apply the skew-removing transform to my target variable.
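To make the difference between the two options concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the helper `target_encode`, its `smoothing` parameter, the toy `cities` and `prices` data, and the choice of `log1p` as the skew-reducing transform. It only shows that the two orderings produce different encodings (the mean of the logs is not the log of the mean); it does not settle which one is preferable. In practice the encoding would also be fitted on training folds only, to avoid leaking the target.

```python
import math
from collections import defaultdict


def target_encode(categories, targets, smoothing=0.0):
    """Map each category to a (optionally smoothed) mean of its targets.

    Hypothetical helper: with smoothing=0 this is the plain per-category mean;
    larger values shrink each category mean towards the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }


# Toy data (assumed): one high-variance category "a", one constant category "b".
cities = ["a", "a", "b", "b", "b"]
prices = [1.0, 99.0, 10.0, 10.0, 10.0]

# Option 1: transform the target first, then encode on the transformed scale.
log_prices = [math.log1p(p) for p in prices]
enc_option1 = target_encode(cities, log_prices)

# Option 2: encode on the original scale; the target is transformed afterwards.
enc_option2 = target_encode(cities, prices)

# The two encodings live on different scales, and transforming the option-2
# value afterwards does not recover the option-1 value for category "a".
print(enc_option1["a"], math.log1p(enc_option2["a"]))
```

With option 1 the encoded feature is on the same scale as the target the linear model is actually fitted to, whereas with option 2 the feature stays on the original, skewed scale.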

Thanks in advance.

  • What do you mean by transforming a skewed categorical distribution? – Dave Mar 24 '21 at 17:14
  • My target variable is not categorical. It is real valued. – Sandeep Maurya Mar 24 '21 at 17:17
  • Depending on what you're doing, the transformation might not be so important; we like normal residuals, not a normal pooled distribution of the response variable. However, how does the category to which an observation belongs depend on the transformation? – Dave Mar 24 '21 at 17:20
  • I wish to train a linear regression model using this. As I learnt, if the input features as well as the target variable have Gaussian-like distributions, then linear models tend to perform better. – Sandeep Maurya Mar 24 '21 at 17:27
  • @SandeepMaurya You're likely looking at the histogram of the outcome, which is the marginal distribution of the outcome. The assumption of normality is about the *conditional* distribution. See my answer [here](https://stats.stackexchange.com/questions/476424/what-are-the-worst-commonly-adopted-ideas-principles-in-statistics/476435#476435) and the referenced answer therein. – Demetri Pananos Mar 25 '21 at 02:32
