Should I apply a log transformation to columns with long-tailed distributions before clustering?

Question

I am clustering a data set. When I plotted the distributions of the individual features, I found that many columns show a long-tailed distribution.

Should I apply a log transformation to these columns?

If yes, what is the reason to do this?

I tried feature scaling only, using $\frac{X - X.mean()}{X.std()}$. This centers the values around 0. But I worry that after feature scaling the distribution is still so skewed that it may not be useful for the clustering task. So I'm wondering whether I should apply the log transformation before the feature scaling (standardization).
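The worry above can be checked numerically. The following is a minimal sketch on synthetic data, not the data from the question: `raw` is a hypothetical long-tailed count feature, and `skewness` and `standardize` are helper names introduced here for illustration. It compares the skewness of the raw feature, the standardized feature, and the feature that is log-transformed (with `log1p`, so zeros are allowed) and then standardized.

```python
import math
import random
import statistics


def skewness(values):
    """Skewness as the mean of cubed z-scores (population form)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


random.seed(0)
# Assumption: a synthetic long-tailed integer feature (truncated log-normal draws).
raw = [int(random.lognormvariate(2.0, 1.2)) for _ in range(5000)]

scaled_only = standardize(raw)
logged_then_scaled = standardize([math.log1p(v) for v in raw])

skew_raw = skewness(raw)
skew_scaled = skewness(scaled_only)
skew_logged = skewness(logged_then_scaled)

print(f"raw:                   {skew_raw:.3f}")
print(f"standardized only:     {skew_scaled:.3f}")
print(f"log1p, then standard.: {skew_logged:.3f}")
```

Standardization is a linear map, so it leaves the shape of the distribution (and hence its skewness) unchanged; only a nonlinear transformation such as the log can change the shape. Whether a less skewed feature actually helps depends on the clustering algorithm and on what the feature means.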

---------------------------------------------------

This is what happens when I apply log transformation:

Usually you would standardize your variables so that you eliminate the effects of huge variances. Which is what you've done here, in a sense. So it might work, only one way to find out. — user2974951, Jan 30 '19 at 15:09
@user2974951 I think you meant feature scaling, for which I used $(X-X.mean())/X.std()$. But I worry that after feature scaling the distribution is still so skewed that it may not be useful for the clustering task. So I'm wondering whether I should do the log transformation before the feature scaling (standardization). — CyberPlayerOne, Jan 30 '19 at 15:48
Your transformation (log) is still skewed, but you have decreased the variance quite a bit. But yes, you cannot change the distribution with standardization. You may want to consider a clustering algorithm which accepts varying shapes of clusters, such as Gaussian Mixture Models. — user2974951, Jan 30 '19 at 15:50
This rather depends on the type of clustering algorithm you are using, and whether it is robust to non-normal distributions. — mkt, Jan 30 '19 at 20:53
I do not know much about your intention for clustering, but such a transformation is often useful for numerical reasons, and also for presenting results, outlier detection, etc. It may also help to find a law for the tail behavior. If the clustering is based on distances, for example, such a transformation can also help. — user32038, Jan 30 '19 at 12:18
My objective is to segment the customers into two types using the purchase/behavior data. My initial reason for the log transformation was this: taking the data above as an example, the majority of values fall below 100, while a few values lie between 100 and 700, some of which may be considered outliers. If I apply feature scaling such as normalization or min-max scaling, the majority of values between 0 and 100 will be squashed into a very small range because of the few large values. I'm not sure whether that is a problem for clustering. — CyberPlayerOne, Jan 30 '19 at 13:48
@Tyler十三将士归玉门 Please edit that information into your question — mkt, Jan 30 '19 at 20:53
@mkt Thank you for the comment. Is there any article about the relationship between clustering algorithms and data distributions? Thanks. — CyberPlayerOne, Jan 31 '19 at 05:16
It's a huge field. For a start, take a look at the other questions here tagged `clustering`, such as this one: https://stats.stackexchange.com/questions/3713/choosing-a-clustering-method — mkt, Jan 31 '19 at 08:19
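The min-max concern raised in the comments above can be illustrated with a small sketch. The data here are synthetic and only mimic the description (most values below 100, a few between 100 and 700); the names `bulk`, `tail`, and `min_max` are introduced for illustration.

```python
import math
import random


def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


random.seed(2)
# Assumption: 980 "typical" values below 100 and 20 large values up to 700.
bulk = [random.uniform(0, 100) for _ in range(980)]
tail = [random.uniform(100, 700) for _ in range(20)]
values = bulk + tail

scaled = min_max(values)
scaled_after_log = min_max([math.log1p(v) for v in values])

# Largest scaled value among the bulk: how much of [0, 1] the bulk occupies.
bulk_upper = max(scaled[: len(bulk)])
bulk_upper_log = max(scaled_after_log[: len(bulk)])

print(f"bulk occupies [0, {bulk_upper:.2f}] after min-max scaling")
print(f"bulk occupies [0, {bulk_upper_log:.2f}] after log1p + min-max scaling")
```

With plain min-max scaling the few large values stretch the range, so the bulk ends up in a narrow band near 0. A log transformation first spreads the bulk over more of the range. Whether that changes the clusters depends on the algorithm and its distance measure.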

score 1 · Answer 1 · answered Feb 02 '19 at 09:20

Whether you should do that depends on your task and data - we don't know either. It can be helpful, or it can harm. You need to understand your assignment to decide this.

Judging from your plot, it doesn't seem to help much.

You also seem to have 0 values, and hacked around this by using log(1+x). Why not sqrt(x)? Or pow(x, .12345)?
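To make the point about arbitrary transformations concrete, here is a minimal sketch that applies the candidates named above to the same synthetic feature and reports the resulting skewness. The data and the helper `skewness` are assumptions for illustration, and skewness is only one possible summary; it says nothing about whether a transformation is meaningful for the problem.

```python
import math
import random
import statistics


def skewness(values):
    """Skewness as the mean of cubed z-scores (population form)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


random.seed(1)
# Assumption: a synthetic long-tailed integer feature that contains zeros.
raw = [int(random.lognormvariate(2.0, 1.2)) for _ in range(5000)]

transforms = {
    "identity": lambda x: x,
    "log1p": math.log1p,               # log(1 + x), defined at x = 0
    "sqrt": math.sqrt,
    "pow_0.12345": lambda x: x ** 0.12345,
}

skews = {name: skewness([f(v) for v in raw]) for name, f in transforms.items()}

for name, value in skews.items():
    print(f"{name:>12}: skewness = {value:.3f}")
```

Each transformation gives a different shape, and none of them is privileged by the data alone, which is why the choice should follow from what the numbers mean.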

Instead of trying to find a transformation that makes the data distribution look pretty (which probably won't work, because you seem to have integer input, so it won't become smooth), try to better understand your problem: what do the numbers mean? How can these numbers be used to quantify similarity?
