Options for Clustering Analysis with Numeric & Nominal Data with Gower Distance

Question

I am working through some cluster analysis (trying to propose new item types for various clusters). I have data that has both numeric and nominal features. After creating dummy variables for all categorical data I am looking at a dataset of dimensions (72k x 100).

I have read about Gower's distance and see where it may be applicable here, but I am curious whether my data is too high-dimensional. One of the nominal features that I have dummified has around 40 levels. (The original variable has about 900 levels, but the top 40 account for about 89% of the total counts for that variable, so I ranked the levels by frequency and grouped the rest into an 'other' category.)
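The level-lumping step described above can be sketched as follows. This is only an illustration in plain Python: the helper name `lump_rare_levels`, its parameters, and the toy data are hypothetical and not taken from the original dataset.

```python
from collections import Counter


def lump_rare_levels(values, top_n, other_label="other"):
    """Keep the top_n most frequent levels; map every other level to other_label.

    Returns the recoded list and the share of rows covered by the kept levels.
    """
    if not values:
        return [], 0.0
    counts = Counter(values)
    keep = {level for level, _ in counts.most_common(top_n)}
    lumped = [v if v in keep else other_label for v in values]
    coverage = sum(counts[level] for level in keep) / len(values)
    return lumped, coverage


# Toy nominal column (hypothetical data): four levels, two of them rare.
raw = ["a"] * 5 + ["b"] * 3 + ["c"] * 1 + ["d"] * 1
lumped, coverage = lump_rare_levels(raw, top_n=2)
print(sorted(set(lumped)))  # ['a', 'b', 'other']
print(coverage)             # 0.8
```

With the real variable, `top_n=40` would correspond to the grouping described in the question, and `coverage` would be the figure quoted there as roughly 89%.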

I am open to using R or Python (currently most of my processing is in python) and am familiar with the gower package in Python. However, when creating the distance matrix I have to tell gower.gower_matrix() (see example here https://www.thinkdatascience.com/post/2019-12-16-introducing-python-package-gower/) which features are categorical. Since my data are already in dummy form, do I treat my binary 0s and 1s in my many columns as categorical columns?

Maybe there are other ways around this, or R is simpler, but I'm trying to devise a good way of handling my dataset with many dummy variables, which originated from mixed data types.
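One practical point when judging whether the dataset is "too big" is the size of the pairwise distance matrix itself, which grows with the square of the number of rows rather than with the number of columns. The arithmetic below is a rough sketch that assumes a dense square matrix with 4 bytes per value (the example output in this question shows `dtype=float32`) and reads "72k" as 72,000 rows. The function name and the 5,000-row subsample size are hypothetical.

```python
def dense_matrix_gib(n_rows, bytes_per_value=4):
    """Memory, in GiB, needed for a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_value / 2**30


full = dense_matrix_gib(72_000)   # all rows against all rows
sample = dense_matrix_gib(5_000)  # a hypothetical 5,000-row subsample

print(f"full matrix:   {full:.1f} GiB")
print(f"5k-row sample: {sample:.3f} GiB")
```

Under these assumptions the full matrix needs on the order of 19 GiB, so clustering a subsample, or using a method that does not need all pairwise distances at once, may be worth considering.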

Github code for Gower package in python: https://github.com/wwwjk366/gower/blob/master/gower/gower_dist.py

Here is some example code:

import pandas as pd
import gower

df: pd.DataFrame = pd.DataFrame({'lag': [1,2,0,0,10],
                                 'age': [44, 50, 38, 40, 60],
                                 'zone_central':[0,0,0,1,0],
                                 'zone_east':[0,1,1,0,1],
                                 'zone_unknown':[0,0,1,0,0],
                                 'zone_west':[1,0,0,0,0]})
print(df.head(20))

# set dummy variables as strings so they aren't interpreted as numeric in the gower package
# note: DataFrame.apply(str) calls str() on each whole column; .astype(str) would convert element-wise
df.iloc[:,2:6] = df.iloc[:,2:6].apply(str)
print(df.dtypes)

   lag  age  zone_central  zone_east  zone_unknown  zone_west
0    1   44             0          0             0          1
1    2   50             0          1             0          0
2    0   38             0          1             1          0
3    0   40             1          0             0          0
4   10   60             0          1             0          0
lag              int64
age              int64
zone_central    object
zone_east       object
zone_unknown    object
zone_west       object
dtype: object

Using the gower package with the last 4 variables flagged as categorical (True), since they are dummy variables and I don't want to compute Euclidean distance between them (the whole point of using Gower):

gower.gower_matrix(df, cat_features=[False,False,True,True,True,True])
array([[0.        , 0.22878788, 0.22878788, 0.21363637, 0.4378788 ],
       [0.22878788, 0.        , 0.12424242, 0.10909091, 0.2090909 ],
       [0.22878788, 0.12424242, 0.        , 0.01515151, 0.33333334],
       [0.21363637, 0.10909091, 0.01515151, 0.        , 0.3181818 ],
       [0.4378788 , 0.2090909 , 0.33333334, 0.3181818 , 0.        ]],
      dtype=float32)

Using the default, with cat_features unspecified, which infers the data types from the inputs:

gower.gower_matrix(df)
array([[0.        , 0.22878788, 0.22878788, 0.21363637, 0.4378788 ],
       [0.22878788, 0.        , 0.12424242, 0.10909091, 0.2090909 ],
       [0.22878788, 0.12424242, 0.        , 0.01515151, 0.33333334],
       [0.21363637, 0.10909091, 0.01515151, 0.        , 0.3181818 ],
       [0.4378788 , 0.2090909 , 0.33333334, 0.3181818 , 0.        ]],
      dtype=float32)

I still get the same results (I assume the Gower function infers that the last 4 columns are categorical, since their dtypes are set as 'object' in pandas).

`Since my data are already in dummy form`. I've not worked with the Gower functions in R or Python, but I'm pretty sure they don't need or expect you to recode your categorical features into sets of dummies. The categorical features you input should be your original nominal ones. — ttnphns, Feb 17 '21 at 15:30
Read about Gower measure here: https://stats.stackexchange.com/a/15313/3277 — ttnphns, Feb 17 '21 at 15:31
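To illustrate why the encoding matters, here is a minimal hand-rolled sketch of an unweighted Gower distance: range-normalised absolute difference for numeric features, simple match/mismatch for categorical ones, averaged over features. It is not the `gower` package's implementation. The single `zone` column is hypothetical (the question only shows the dummies), and `gower_distance` is an illustrative helper.

```python
def gower_distance(a, b, ranges, is_categorical):
    """Unweighted Gower distance between two records.

    Numeric feature: |a - b| / range of that feature over the data set.
    Categorical feature: 0 if the values match, 1 otherwise.
    The result is the mean of the per-feature distances.
    """
    parts = []
    for x, y, rng, cat in zip(a, b, ranges, is_categorical):
        if cat:
            parts.append(0.0 if x == y else 1.0)
        else:
            parts.append(abs(x - y) / rng if rng else 0.0)
    return sum(parts) / len(parts)


lag = [1, 2, 0, 0, 10]
age = [44, 50, 38, 40, 60]
zone = ["west", "east", "east", "central", "east"]  # hypothetical nominal column
levels = ["central", "east", "unknown", "west"]

# Encoding 1: keep the nominal column as a single categorical feature.
nominal_rows = list(zip(lag, age, zone))
# Encoding 2: expand the same column into one 0/1 dummy per level.
dummy_rows = [(l, a, *[int(z == lv) for lv in levels])
              for l, a, z in zip(lag, age, zone)]

num_ranges = [max(lag) - min(lag), max(age) - min(age)]

d_nominal = gower_distance(nominal_rows[0], nominal_rows[1],
                           num_ranges + [None], [False, False, True])
d_dummy = gower_distance(dummy_rows[0], dummy_rows[1],
                         num_ranges + [None] * 4, [False, False] + [True] * 4)

print(round(d_nominal, 4), round(d_dummy, 4))
```

In this sketch the two numeric features make up 2 of 3 feature slots when the nominal column is kept whole, but only 2 of 6 once it is expanded into four dummies, so the same pair of records ends up with a different distance depending on the encoding.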
