Questions tagged [dbscan]

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm. DBSCAN views clusters as areas of high density separated by areas of low density.

DBSCAN is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature.

-- Wikipedia

Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples).

-- scikit-learn
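
The core-sample idea quoted above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the names `region_query` and `dbscan` are hypothetical, distance is Euclidean, a point counts toward its own neighborhood, and the neighbor search is a brute-force scan.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # points[i] is a core sample: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: near a core sample, not core itself
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels

# Two tight groups of four points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With these toy points, each group forms its own cluster and the isolated point is labeled as noise. No number of clusters is passed in.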

Below is an example of DBSCAN finding clusters based on their density, as opposed to their distance from a centroid (i.e. as k-means would):

[Figure: DBSCAN clustering example]

86 questions
18
votes
3 answers

A routine to choose eps and minPts for DBSCAN

DBSCAN is the most cited clustering algorithm according to some literature, and it can find arbitrarily shaped clusters based on density. It has two parameters: eps (the neighborhood radius) and minPts (the minimum number of neighbors for a point to be considered a core point)…
Mehraban
  • 535
  • 2
  • 7
  • 15
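
One commonly cited heuristic for choosing eps is the sorted k-distance curve: compute each point's distance to its k-th nearest neighbor, sort the values, and look for a sharp bend. The sketch below is a minimal, brute-force illustration. The function name `k_distances` and the toy data are assumptions for this example and are not taken from the question or its answers.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result, reverse=True)

# Four points on a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
curve = k_distances(points, k=2)
# The far-away point produces one large value at the start of the curve,
# followed by four small, equal values for the points on the square.
```

Plotting `curve` and picking eps near the bend is a starting point rather than a rule. The choice of k (often tied to minPts) is itself a judgment call.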
11
votes
4 answers

Choosing the number of clusters in hierarchical agglomerative clustering

I have a set of points that I want to cluster into groups according to a number of computed features. I have a distance matrix containing the distances between all pairs of points. I tried K-Means and DBSCAN first, but since I have no…
Moustafa Alzantot
  • 281
  • 1
  • 2
  • 7
9
votes
2 answers

Does k-means have any advantages over HDBSCAN except for runtime?

I have recently learned about HDBSCAN (a fairly new method for clustering, not yet available in scikit-learn) and am really surprised at how good it is. The following picture illustrates that the predecessor of HDBSCAN - DBSCAN - is already the only…
Thomas
  • 213
  • 3
  • 7
9
votes
1 answer

Why are most of my points classified as noise using DBSCAN?

I'm using several clustering algorithms from sklearn to cluster some data, and can't seem to figure out what's happening with DBSCAN. My data is a document-term matrix from TfidfVectorizer, with a few hundred preprocessed documents. Code: tfv =…
filaments
  • 93
  • 1
  • 5
9
votes
2 answers

What is the interpretation of eps parameter in DBSCAN clustering?

I want to cluster lat-long data such that all clusters formed will have a radius <= 1000 meters. Questions: What is the actual meaning of the eps parameter? Please give an example. Will setting eps=1000 serve my purpose if the distance measure is haversine in…
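
As a rough illustration of the units involved: haversine distance is usually expressed as an angle in radians on a unit sphere, so an eps given in meters has to be divided by the Earth's radius before it is compared with such distances. The sketch below assumes a mean Earth radius of 6,371 km. The helper `haversine_rad` and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in meters (an approximation)

def haversine_rad(p, q):
    """Great-circle distance in radians between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))

eps_m = 1000                        # desired neighborhood radius in meters
eps_rad = eps_m / EARTH_RADIUS_M    # the same radius expressed in radians

a = (0.0, 0.0)
b = (0.0, 0.005)  # roughly 556 m east of a, on the equator
c = (0.0, 0.02)   # roughly 2.2 km east of a

b_is_neighbor = haversine_rad(a, b) <= eps_rad  # within eps of a
c_is_neighbor = haversine_rad(a, c) <= eps_rad  # outside eps of a
```

Note that eps only bounds the distance between a point and its neighbors. Because DBSCAN chains neighborhoods together, a cluster can span much more than eps, so eps alone does not cap the cluster radius.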
8
votes
1 answer

How to compare dbscan clusters / choose epsilon parameter

I am currently trying to make a DBSCAN clustering using scikit learn in python. I would like to compare the different outputs when varying the epsilon parameter in order to choose the right epsilon parameter. I took as an example the iris dataset.…
Scratch
  • 754
  • 2
  • 6
  • 17
7
votes
0 answers

True positive, false negative, true negative, false positive definitions for multiclass-multilabel classification?

I'm trying to apply some evaluation metrics to several clustering methods. I thought that I knew them based on the multiclass confusion matrix, considering the rows as the actual classes and the columns as the predicted clusters: TP would be the…
7
votes
1 answer

Clustering without a distance matrix

I've recently completed a project where I used scikit-learn's DBSCAN module to find clusters in relatively short strings of text. I used a custom string similarity metric to allow for vectorized computation of an $n^2$ similarity matrix. I know that…
Stankalank
  • 141
  • 4
6
votes
1 answer

DBSCAN: What is a Core Point?

I have a question about DBSCAN. The points here are classified as core points, border points or noise. A point p is a core point if at least minPts points are within distance ε of it, and those points are said to be directly reachable from p. No…
user1170330
  • 209
  • 2
  • 7
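
The core/border/noise distinction from the excerpt above can be checked directly on a handful of points. This is a minimal sketch assuming Euclidean distance and assuming that a point counts toward its own neighborhood. The function name `classify` and the sample points are hypothetical.

```python
import math

def classify(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    # Neighborhood of each point: indices within eps, including the point itself.
    neighbors = [[j for j, q in enumerate(points) if math.dist(p, q) <= eps]
                 for p in points]
    core = {i for i, n in enumerate(neighbors) if len(n) >= min_pts}
    kinds = []
    for i, n in enumerate(neighbors):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in n):
            kinds.append("border")  # not core, but within eps of a core point
        else:
            kinds.append("noise")
    return kinds

# A unit square, one point hanging off its corner, and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (9, 9)]
kinds = classify(points, eps=1.0, min_pts=3)
```

Here the four square corners are core points, (2, 1) is a border point because its only other neighbor, (1, 1), is core, and (9, 9) is noise.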
4
votes
1 answer

Anomaly detection in multivariate time series data

I am trying to solve an anomaly detection problem that consists of three variables captured over a span of five years. It is an unsupervised problem, and I believe density-based clustering methods like DBSCAN aren't a good fit for this problem as it…
ajit samudrala
  • 123
  • 1
  • 7
4
votes
1 answer

Principal component analysis and DBScan

My data has 30 dimensions and 150 observations. I want to cluster the data with DBScan. Is there a difference between: 1. Performing a PCA and clustering all 30 principal components or 2. Just clustering the raw data? DBScan works fast on my…
PascalIv
  • 404
  • 4
  • 10
4
votes
1 answer

Are (a) multicollinearity and/or (b) binary variables an issue for DBSCAN? If so, how can one correct for these issues?

I have read some related questions, such as: Why are mixed data a problem for euclidean-based clustering algorithms?, What data structure to use for my cluster analysis or what cluster analysis to use for my data?, K-Means Clustering with Dummy…
3
votes
0 answers

Cluster Algorithm for multidimensional data

My goal is to cluster data (20000 samples with a range from 0.0 to 1.0, and 14 dimensions/features). Since I don't know the number of clusters, I tried using MeanShift and DBSCAN. My problem with these algorithms is that they find one large cluster…
Flitschi
  • 31
  • 1
3
votes
2 answers

Suitable approach to cluster histogram-like dataset using HDBSCAN implementation in python

My dataset below shows product sales per price (link to download dataset csv):

      price   quantity
0     5098.0  20
1     5098.5  40
2     5099.0  10
3     5100.0  90
4     5100.5  20
..    ...     ...
290   5247.0  …
3
votes
1 answer

Clustering data with covariance for each point

I am looking to cluster data points that each have their own covariance (based on some function of the point's neighbourhood, but how I got it is not important). I would like to use the covariance to achieve these properties: points with relatively…