My data set has 821,000 rows and 18 columns. It describes online clickstream behavior. The variables include the number of shopping baskets, the number of items in the shopping basket, the number of product pages viewed, the number of category pages viewed, existing customer, new customer, buy/cancel shopping basket...

Descriptive statistics show that many of the variables are right-skewed and have different variances, so I z-standardized them. Because the ranges of the variables still differ a lot after standardization, I wonder whether this is a problem when k-means calculates distances. Should the variables also be normalized (min-max normalization) after z-standardization?

summary(Baur_WKA_scale)
     BASKETS_NZ           PIS              PIS_AP             PIS_DV                    PIS_PL           PIS_SDV       
 Min.   :-8.7663   Min.   :-0.7741   Min.   :-0.48168   Min.   :-0.45676   Min.   :-0.3508   Min.   :-0.3565  
 1st Qu.: 0.1139   1st Qu.:-0.5921   1st Qu.:-0.48168   1st Qu.:-0.45676   1st Qu.:-0.3508   1st Qu.:-0.3565  
 Median : 0.1139   Median :-0.3736   Median :-0.48168   Median :-0.45676   Median :-0.3508   Median :-0.3565  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.1139   3rd Qu.: 0.2089   3rd Qu.:-0.02117   3rd Qu.: 0.07832   3rd Qu.:-0.1749   3rd Qu.:-0.1012  
 Max.   : 8.9942   Max.   :17.8668   Max.   :32.21453   Max.   :26.29717   Max.   :24.1894   Max.   :35.9036  
   PIS_SHOPS            PIS_SR           QUANTITY       
 Min.   :-0.43738   Min.   :-0.3764   Min.   :-0.54754  
 1st Qu.:-0.43738   1st Qu.:-0.3764   1st Qu.:-0.54754  
 Median :-0.38040   Median :-0.3764   Median :-0.26601  
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000  
 3rd Qu.:-0.03852   3rd Qu.:-0.1092   3rd Qu.: 0.01552  
 Max.   :22.63957   Max.   :29.2868   Max.   :39.42954  
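
One point that bears on the question: z-standardization is a positive linear rescaling, so applying min-max normalization after it gives the same values as applying min-max to the raw data directly. The following is a minimal sketch on synthetic data, not on the original data set; `x` and the helper `min_max` are illustrative names assumed here.

```r
# Synthetic right-skewed variable (a hypothetical stand-in, not the real data)
set.seed(1)
x <- rexp(1000, rate = 0.5)

# z-standardization: subtract the mean, divide by the standard deviation
z <- (x - mean(x)) / sd(x)

# Min-max normalization to [0, 1] (illustrative helper)
min_max <- function(v) (v - min(v)) / (max(v) - min(v))

mm_raw <- min_max(x)  # min-max on the raw values
mm_z   <- min_max(z)  # min-max applied after z-standardization

# TRUE if both results coincide up to floating-point tolerance
same_result <- isTRUE(all.equal(mm_raw, mm_z))
print(same_result)
```

In other words, the second step would replace the z-standardization rather than refine it: the variables would then share the range [0, 1] but no longer have equal variances.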
Kitty123
  • @ttnphns Thanks for your answer. However, I could not find an answer to the min-max normalization question in this post. Is it necessary to normalize the variables before k-means cluster analysis if the ranges of the variables differ? I have attached the summary of the data after z-standardization. – Kitty123 Apr 24 '20 at 13:04
  • Neither standardization nor normalization, nor any other transformation, is _necessary_ before k-means. If it were "necessary", the program would do it automatically. You need to decide for yourself whether a given transform is reasonable in your specific case, that is, whether it will help reveal the clusters. There is no universal recipe. That said, I'd remark that min-max normalization is not very common before clustering, especially when there are outliers. – ttnphns Apr 24 '20 at 13:08
  • @ttnphns Thanks for your advice. Your answer is helpful for my next steps. – Kitty123 Apr 24 '20 at 13:21
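
The remark about outliers can be illustrated with a small sketch. It uses synthetic data (999 typical values plus one extreme value), and `min_max` is again an assumed helper name. The summary above shows maxima of roughly 9 to 39 standard deviations, so a similar effect is plausible for these variables, although that would need to be checked on the real data.

```r
# Synthetic example: 999 typical values plus a single extreme value
set.seed(2)
x <- c(runif(999, min = 0, max = 10), 1000)

# Min-max normalization to [0, 1] (illustrative helper)
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
mm <- min_max(x)

# Width of the interval that the 999 typical values occupy after scaling
bulk_width <- max(mm[1:999]) - min(mm[1:999])
print(bulk_width)
```

Here the typical values end up squeezed into about 1% of the [0, 1] interval, because the single extreme value determines the maximum. Differences among ordinary observations then contribute very little to the Euclidean distances that k-means uses.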
