
I have been looking into using DBSCAN from scikit-learn in Python, but it doesn't have Gower's distance built in. All the other implementations I have found in this community are in R.

I'm using a dataset with both categorical and continuous features and, as far as I know, PCA + DBSCAN with Gower's distance is a good choice for segmentation.

Does anyone have an example or implementation of clustering with DBSCAN and Gower's distance?

Here is an example from the sklearn documentation that uses the Euclidean distance.

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)

X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

# #############################################################################
# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

ttnphns
Gonzalo Garcia
  • Gower distance/coefficient doesn't seem to be integrated into sklearn yet, but it looks like there's been a lot of work on it and some folks are "using it locally". Perhaps the discussion and references at https://github.com/scikit-learn/scikit-learn/issues/5884 will be helpful for you? – rickhg12hs Jan 13 '20 at 19:40
  • Oh my ... 5 years and the issue is still open! @rickhg12hs – Gonzalo Garcia Jan 13 '20 at 19:45
  • Yeah, that's pretty disheartening, but I'm encouraged that some folks are using it anyway ... somehow. – rickhg12hs Jan 13 '20 at 19:48

2 Answers


While Gower distance hasn't been fully implemented in scikit-learn as a ready-to-use metric, we are lucky that many of the clustering-related estimators (e.g., NearestNeighbors, DBSCAN) can take a precomputed distance matrix instead of the raw data. To do this, you just need to specify metric="precomputed" in the arguments for DBSCAN (see the documentation for an explanation) and pass a square n_samples × n_samples matrix to fit. This means we can use whatever distance metric we want to compute the distance matrix and then give that matrix to DBSCAN, which skips its own internal distance computation and clusters on the distances we supplied.
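As a quick sanity check of the precomputed route, here is a minimal sketch on toy data. The blob centers, eps and min_samples values are illustrative assumptions, not values from the question. It uses plain Euclidean distance only so the result can be compared against DBSCAN's default behaviour; any other square distance matrix is passed in the same way.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

# Toy data: three tight, well-separated blobs (illustrative values only).
X_demo, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5], [0, 5]],
                       cluster_std=0.1, random_state=0)

# Square (n_samples, n_samples) matrix of pairwise distances.
dist_demo = pairwise_distances(X_demo, metric="euclidean")

# Same parameters, once on the raw data and once on the distance matrix.
labels_direct = DBSCAN(eps=1.0, min_samples=5).fit(X_demo).labels_
labels_precomputed = DBSCAN(eps=1.0, min_samples=5,
                            metric="precomputed").fit(dist_demo).labels_

# On this toy data both routes should assign the same labels.
same_labels = np.array_equal(labels_direct, labels_precomputed)
print(same_labels)
```

The matrix grows quadratically with the number of samples, so this approach is probably best suited to datasets where an n × n array of floats fits comfortably in memory.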

Therefore, you could write your own gower_distance function, or you could use a ready-made one such as the one in the handy gower Python package.
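If you go the write-your-own route, a minimal sketch could look like the following. The function name gower_distance_matrix and the toy DataFrame are hypothetical. It assumes equal feature weights and no missing values, treats every non-numeric column as categorical, and is not guaranteed to match the conventions of the gower package or of the scikit-learn pull request.

```python
import numpy as np
import pandas as pd


def gower_distance_matrix(df):
    """Pairwise Gower distances for a DataFrame of mixed-type columns.

    Numeric columns contribute |x_i - x_j| / range; all other columns
    contribute 0 if the values match and 1 otherwise. The result is the
    unweighted mean over columns, so every entry lies in [0, 1].
    """
    n = len(df)
    total = np.zeros((n, n), dtype=float)
    for col in df.columns:
        values = df[col]
        if pd.api.types.is_numeric_dtype(values):
            x = values.to_numpy(dtype=float)
            value_range = x.max() - x.min()
            if value_range == 0:
                # A constant column cannot separate any pair of rows.
                part = np.zeros((n, n))
            else:
                part = np.abs(x[:, None] - x[None, :]) / value_range
        else:
            x = values.to_numpy(dtype=object)
            part = (x[:, None] != x[None, :]).astype(float)
        total += part
    return total / df.shape[1]


# Hypothetical toy data: one numeric and one categorical feature.
toy = pd.DataFrame({"age": [20, 30, 40], "color": ["red", "blue", "red"]})
D = gower_distance_matrix(toy)
# Rows 0 and 1: (|20 - 30| / 20 + 1) / 2 = 0.75
# Rows 0 and 2: (|20 - 40| / 20 + 0) / 2 = 0.5
print(D)
```

This builds one dense n × n array per column, which keeps the code short but would need chunking or a library implementation for large datasets.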

Then clustering and analyzing our data X would be as simple as:

import gower
from sklearn.cluster import DBSCAN

# Pairwise Gower distances between all rows of X
dist_matrix = gower.gower_matrix(X)
db = DBSCAN(eps=0.3, min_samples=10, metric="precomputed").fit(dist_matrix)
labels = db.labels_
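Note that eps=0.3 and min_samples=10 are carried over from the Euclidean example in the question. Gower distances lie in [0, 1], so a suitable eps may well be different. One hedged way to get a feel for the scale is to look at each point's distance to its k-th nearest neighbor, computed from the same precomputed matrix. The sketch below uses hypothetical all-numeric random data, for which Gower distance reduces to the Manhattan distance on min-max-scaled features divided by the number of features.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_num = rng.normal(size=(200, 3))  # hypothetical numeric-only data

# Numeric-only Gower distance: mean over features of |x_i - x_j| / range.
X_scaled = MinMaxScaler().fit_transform(X_num)
dist = pairwise_distances(X_scaled, metric="manhattan") / X_scaled.shape[1]

min_samples = 10
k = min_samples - 1  # DBSCAN's min_samples counts the point itself

# kneighbors() with no argument excludes each point from its own neighbors.
nn = NearestNeighbors(n_neighbors=k, metric="precomputed").fit(dist)
distances, _ = nn.kneighbors()

# Sorted distance to the k-th neighbor; a sharp bend in this curve is a
# commonly used starting guess for eps.
k_dist = np.sort(distances[:, -1])
print(k_dist[[0, len(k_dist) // 2, -1]])
```

With a real mixed-type dataset, dist would instead be the Gower matrix computed above, and the rest of the sketch would stay the same.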

As of right now, scikit-learn does not support Gower's distance.

Pull-request #16834 has a working version. You can either use that fork or copy and paste just the gower_distances function.

Brian Spiering