I am working through some cluster analysis (trying to propose new item types for various clusters). My data has both numeric and nominal features. After creating dummy variables for all of the categorical data, I have a dataset of about 72k rows by 100 columns.
I have read about Gower's distance and can see where it may be applicable here, but I am curious whether my data is too high dimensional. One of the nominal features that I have dummified has around 40 levels. The original variable has about 900 levels, but the top 40 account for about 89% of the total counts for that variable, so I ranked the levels by frequency and grouped everything else into an 'other' category.
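For concreteness, the grouping step described above could look roughly like the sketch below. The function name lump_rare_levels, the toy values, and the column name item_type are all made up for illustration; only the idea (keep the most frequent levels, relabel the rest as 'other') comes from the description.

```python
import pandas as pd


def lump_rare_levels(s: pd.Series, top_n: int, other_label: str = "other") -> pd.Series:
    """Keep the top_n most frequent levels of s and relabel the rest as other_label."""
    top_levels = s.value_counts().nlargest(top_n).index
    return s.where(s.isin(top_levels), other_label)


# Toy data standing in for the ~900-level variable (values are made up).
raw = pd.Series(["a", "a", "a", "b", "b", "c", "d", "e"], name="item_type")
lumped = lump_rare_levels(raw, top_n=2)

# Share of rows covered by the retained levels.
coverage = (lumped != "other").mean()
print(lumped.tolist())
print(f"coverage of retained levels: {coverage:.1%}")
```

With the real variable, top_n would be 40 and the coverage figure would be the roughly 89% mentioned above.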
I am open to using R or Python (currently most of my processing is in Python) and am familiar with the gower package in Python. However, when creating the distance matrix I have to tell gower.gower_matrix() which features are categorical (see the example here: https://www.thinkdatascience.com/post/2019-12-16-introducing-python-package-gower/). Since my data are already in dummy form, do I treat the binary 0s and 1s in my many dummy columns as categorical columns?
Maybe there are other ways around this, or maybe R is simpler, but I'm trying to devise a good way to handle a dataset with many dummy variables that originated from mixed data types.
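To make the question concrete, here is a hand-rolled sketch of Gower's distance on a toy frame, comparing one nominal column against its dummy-encoded form. This is not the gower package's code: gower_by_hand, the toy data, and the column names are hypothetical, and it assumes the usual unweighted definition (range-scaled absolute difference for numeric features, simple 0/1 mismatch for categorical ones).

```python
import numpy as np
import pandas as pd


def gower_by_hand(df: pd.DataFrame, cat_cols: list[str]) -> np.ndarray:
    """Unweighted Gower distance: mean over features of the per-feature dissimilarity."""
    n = len(df)
    total = np.zeros((n, n))
    for col in df.columns:
        values = df[col].to_numpy()
        if col in cat_cols:
            # Categorical feature: 0 if the two rows match, 1 otherwise.
            d = (values[:, None] != values[None, :]).astype(float)
        else:
            # Numeric feature: absolute difference scaled by the column range.
            values = values.astype(float)
            col_range = values.max() - values.min()
            d = np.abs(values[:, None] - values[None, :])
            d = d / col_range if col_range > 0 else np.zeros((n, n))
        total += d
    return total / df.shape[1]


# Toy data (made up): one numeric feature and one nominal feature.
single = pd.DataFrame({"age": [44, 50, 38], "zone": ["west", "east", "east"]})
dummies = pd.get_dummies(single, columns=["zone"], dtype=int)
zone_cols = [c for c in dummies.columns if c.startswith("zone_")]

d_single = gower_by_hand(single, cat_cols=["zone"])
d_dummy = gower_by_hand(dummies, cat_cols=zone_cols)

print("rows 0 vs 1, single zone column:", round(d_single[0, 1], 4))
print("rows 0 vs 1, dummy zone columns:", round(d_dummy[0, 1], 4))
```

On this toy data, a zone mismatch between two rows shows up in two of the three dummy-encoded features, versus one of two features when zone stays a single column, so the nominal variable's effective weight depends on the encoding.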
GitHub code for the gower package in Python: https://github.com/wwwjk366/gower/blob/master/gower/gower_dist.py
Here is some example code:
import gower
import pandas as pd

df: pd.DataFrame = pd.DataFrame({'lag': [1, 2, 0, 0, 10],
                                 'age': [44, 50, 38, 40, 60],
                                 'zone_central': [0, 0, 0, 1, 0],
                                 'zone_east': [0, 1, 1, 0, 1],
                                 'zone_unknown': [0, 0, 1, 0, 0],
                                 'zone_west': [1, 0, 0, 0, 0]})
print(df.head(20))
# set dummy variables as strings so they aren't interpreted as numeric in the gower package
df.iloc[:, 2:6] = df.iloc[:, 2:6].apply(str)
print(df.dtypes)
   lag  age  zone_central  zone_east  zone_unknown  zone_west
0    1   44             0          0             0          1
1    2   50             0          1             0          0
2    0   38             0          1             1          0
3    0   40             1          0             0          0
4   10   60             0          1             0          0
lag              int64
age              int64
zone_central    object
zone_east       object
zone_unknown    object
zone_west       object
dtype: object
Using the gower package with cat_features set to True for the last 4 variables, since they are dummy variables and I don't want to compute Euclidean distance between them (the whole point of using Gower):
gower.gower_matrix(df, cat_features=[False,False,True,True,True,True])
array([[0. , 0.22878788, 0.22878788, 0.21363637, 0.4378788 ],
[0.22878788, 0. , 0.12424242, 0.10909091, 0.2090909 ],
[0.22878788, 0.12424242, 0. , 0.01515151, 0.33333334],
[0.21363637, 0.10909091, 0.01515151, 0. , 0.3181818 ],
[0.4378788 , 0.2090909 , 0.33333334, 0.3181818 , 0. ]],
dtype=float32)
Using the default, with cat_features unspecified, which infers the data types from the inputs:
gower.gower_matrix(df)
array([[0. , 0.22878788, 0.22878788, 0.21363637, 0.4378788 ],
[0.22878788, 0. , 0.12424242, 0.10909091, 0.2090909 ],
[0.22878788, 0.12424242, 0. , 0.01515151, 0.33333334],
[0.21363637, 0.10909091, 0.01515151, 0. , 0.3181818 ],
[0.4378788 , 0.2090909 , 0.33333334, 0.3181818 , 0. ]],
dtype=float32)
I still get the same results. I assume the Gower function infers that the last 4 columns are categorical, since pandas reports their dtypes as 'object'.
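A quick way to sanity-check that assumption without calling the package is to look at which columns pandas considers non-numeric. The sketch below assumes the inference rule is simply "non-numeric dtype means categorical", which is an assumption about the package rather than something confirmed here. It uses a reduced, illustrative version of the frame, and it converts the dummies with astype(str), which converts element by element.

```python
import pandas as pd

# Reduced, illustrative version of the example frame.
df = pd.DataFrame({
    "lag": [1, 2, 0, 0, 10],
    "age": [44, 50, 38, 40, 60],
    "zone_east": [0, 1, 1, 0, 1],
    "zone_west": [1, 0, 0, 0, 0],
})

dummy_cols = ["zone_east", "zone_west"]
# astype(str) works element-wise, so each 0/1 becomes the string "0"/"1".
df[dummy_cols] = df[dummy_cols].astype(str)

# Assumed inference rule: a column is categorical if its dtype is not numeric.
inferred_cat = [not pd.api.types.is_numeric_dtype(dtype) for dtype in df.dtypes]
print(dict(zip(df.columns, inferred_cat)))
```

If the flags printed here match the cat_features list passed explicitly, identical distance matrices from the two calls would be expected under that assumption.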