I am trying to cluster a high-dimensional data set, the Young People Survey data:
https://www.kaggle.com/miroslavsabo/young-people-survey
This is my first pass, and I wanted to try clustering the entire dataset. I understand this is probably not the best approach given how high the number of dimensions is (173), but I want to see an initial set of results. To start, I gave scikit-learn's DBSCAN a shot using the following code:
import numpy as np
import pandas as pd
from scipy.spatial.distance import dice
from sklearn import metrics
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler
file_location = "responsessmall.csv"
encoded_file_location = "responses_encoded.csv"
columns_to_encode = ['Smoking','Alcohol','Punctuality','Lying','Internet usage','Gender','Left - right handed','Education','Only child','Village - town','House - block of flats']
def load_dataset():
    # Load the survey responses and forward-fill missing values
    dataframe = pd.read_csv(file_location)
    #print(dataframe.isnull().any())
    dataframe.fillna(method="ffill", inplace=True)
    return dataframe

def encode_dataset(dataframe):
    # One-hot encode the categorical columns
    encoded_dataframe = pd.get_dummies(dataframe, columns=columns_to_encode)
    return encoded_dataframe

def cluster_data_using_dbscan(dataframe):
    dataframe = StandardScaler().fit_transform(dataframe)
    # Print all pairwise Dice distances to inspect their range
    for row in dataframe:
        for row1 in dataframe:
            distance = dice(row, row1)
            print(distance)
    db = DBSCAN(eps=0.3, min_samples=4, metric=dice)
    db.fit_predict(dataframe)
    print(db.labels_)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    labels = db.labels_
    print("Number of labels: ", len(labels))
    # Number of clusters found, ignoring the noise label (-1) if present
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    print('Estimated number of clusters: %d' % n_clusters_)
    print("Silhouette Coefficient: %0.3f"
          % metrics.silhouette_score(dataframe, labels))

if __name__ == "__main__":
    dataframe = load_dataset()
    print("Size of Dataframe: ", len(dataframe))
    encoded_data_frame = encode_dataset(dataframe)
    encoded_data_frame.to_csv(encoded_file_location)
    cluster_data_using_dbscan(encoded_data_frame)
When I run the code, I get the error pasted below. I am trying to understand what I am doing wrong. I would appreciate any suggestions on how to use a non-parametric method like DBSCAN here, as well as on things that could be improved in my code. I am also open to suggestions for other algorithms/similarity metrics.
Number of labels: 1010
Estimated number of clusters: 1
Traceback (most recent call last):
  File "clusteranalysis.py", line 68, in <module>
    cluster_data_using_dbscan(encoded_data_frame)
  File "clusteranalysis.py", line 57, in cluster_data_using_dbscan
    % metrics.silhouette_score(dataframe, labels))
  File "/usr/local/lib/python3.6/site-packages/sklearn/metrics/cluster/unsupervised.py", line 101, in silhouette_score
    return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
  File "/usr/local/lib/python3.6/site-packages/sklearn/metrics/cluster/unsupervised.py", line 167, in silhouette_samples
    check_number_of_labels(len(le.classes_), X.shape[0])
  File "/usr/local/lib/python3.6/site-packages/sklearn/metrics/cluster/unsupervised.py", line 19, in check_number_of_labels
    "to n_samples - 1 (inclusive)" % n_labels)
ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
Process finished with exit code 1
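From the traceback I can see that silhouette_score raises this ValueError because DBSCAN produced only a single label for my data. One workaround I am considering (a minimal sketch, not yet part of my script) is to guard the silhouette computation so it only runs when there are at least two clusters, since the score is undefined otherwise:

# Sketch: only compute the silhouette score when DBSCAN actually found
# at least 2 clusters (noise points are labeled -1 and are not counted).
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
if n_clusters_ >= 2:
    print("Silhouette Coefficient: %0.3f"
          % metrics.silhouette_score(dataframe, labels))
else:
    print("Silhouette Coefficient is undefined for fewer than 2 clusters")

That would avoid the crash, but it obviously doesn't explain why everything ends up in one cluster in the first place.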
Using the process outlined in
Why are most of my points classified as noise using DBSCAN?
I plotted the data points, and they look like the image below.
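For reference, the plot was produced roughly along these lines (a sketch of what I did; I am assuming a 2-component PCA projection of the encoded, standardized data, which is what the PCA and matplotlib imports above are for):

# Sketch (assumption): project the one-hot encoded, standardized responses
# to 2 dimensions with PCA and scatter-plot them to eyeball the structure.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled = StandardScaler().fit_transform(encoded_data_frame)
reduced = PCA(n_components=2).fit_transform(scaled)
plt.scatter(reduced[:, 0], reduced[:, 1], s=10)
plt.xlabel("PCA component 1")
plt.ylabel("PCA component 2")
plt.title("Young People Survey responses (2D PCA projection)")
plt.show()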