Questions tagged [dbscan]

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm. DBSCAN views clusters as areas of high density separated by areas of low density.

DBSCAN is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature.

-- Wikipedia

Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples).

-- scikit-learn

[Figure: DBSCAN finding clusters based on their density, as opposed to their distance from a centroid (as k-means would).]
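To make the core-sample idea concrete, here is a minimal pure-Python sketch of density-based clustering on two concentric rings plus one distant outlier. It is an illustration under stated assumptions, not scikit-learn's implementation: the names `region_query`, `dbscan`, and `ring`, the toy data, and the values `eps=0.6` and `min_pts=3` are all invented for this example, and a point is assumed to count toward its own neighborhood.

```python
import math

NOISE = -1  # label used here for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE otherwise.

    Brute-force O(n^2) neighbor search; fine for a toy example only.
    """
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point. It may still become a border point later.
            labels[i] = NOISE
            continue
        # points[i] is a core point: start a new cluster and grow it.
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                # Border point: close to a core point but not core itself.
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                # Another core point: its neighborhood joins the cluster.
                queue.extend(j_neighbors)
    return labels


def ring(radius, n):
    """n evenly spaced points on a circle of the given radius around (0, 0)."""
    return [
        (radius * math.cos(2 * math.pi * k / n),
         radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]


# Toy data (assumed for illustration): two rings with the same center,
# plus one point far away from everything else.
points = ring(1.0, 12) + ring(3.0, 36) + [(10.0, 10.0)]
labels = dbscan(points, eps=0.6, min_pts=3)

print("inner ring labels:", sorted(set(labels[:12])))
print("outer ring labels:", sorted(set(labels[12:48])))
print("outlier label:", labels[48])
```

Both rings share the same center, so two centroids could not tell them apart by distance alone, whereas the density-based pass should give each ring its own cluster label and mark the far point as noise (`-1`). With real data, a library implementation with an indexed neighbor search would be the more practical choice.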

86 questions

3 answers

A routine to choose eps and minPts for DBSCAN

DBSCAN is the most cited clustering algorithm according to some literature, and it can find arbitrarily shaped clusters based on density. It has two parameters: eps (the neighborhood radius) and minPts (the minimum number of neighbors for a point to be considered a core point)…

clustering dbscan

asked Mar 05 '14 at 11:25

Mehraban

4 answers

Choosing the number of clusters in hierarchical agglomerative clustering

I have a set of points that I want to cluster into groups according to a number of computed features. I have a distance matrix containing the distances between all pairs of points. I tried K-Means and DBSCAN first, but since I have no…

clustering hierarchical-clustering dbscan hac

asked Jul 31 '13 at 12:34

Moustafa Alzantot

2 answers

Does k-means have any advantages over HDBSCAN except for runtime?

I have recently learned about HDBSCAN (a fairly new method for clustering, not yet available in scikit-learn) and am really surprised at how good it is. The following picture illustrates that DBSCAN, the predecessor of HDBSCAN, is already the only…

clustering k-means hierarchical-clustering dbscan

asked Jul 26 '18 at 09:57

Thomas

1 answer

Why are most of my points classified as noise using DBSCAN?

I'm using several clustering algorithms from sklearn to cluster some data, and can't seem to figure out what's happening with DBSCAN. My data is a document-term matrix from TfidfVectorizer, with a few hundred preprocessed documents. Code: tfv =…

clustering scikit-learn text-mining dbscan

asked Mar 29 '17 at 18:47

filaments

2 answers

What is the interpretation of eps parameter in DBSCAN clustering?

I want to cluster lat-long data such that all clusters formed have a radius <= 1000 meters. Questions: What is the actual meaning of the eps parameter? Please give an example. Will setting eps=1000 serve my purpose if the distance measure is haversine in…

machine-learning clustering spatial hierarchical-clustering dbscan

asked Jul 26 '16 at 08:44

GeorgeOfTheRF

1 answer

How to compare dbscan clusters / choose epsilon parameter

I am currently trying to perform DBSCAN clustering using scikit-learn in Python. I would like to compare the outputs obtained when varying the epsilon parameter in order to choose the right value. I took the iris dataset as an example.…

clustering scikit-learn parameterization dbscan

asked Dec 12 '13 at 15:20

Scratch

0 answers

True positive, false negative, true negative, false positive definitions for multiclass-multilabel classification?

I'm trying to apply some evaluation metrics to several clustering methods. I thought that I knew them based on the multiclass confusion matrix, considering the rows as the actual classes and the columns as the predicted clusters: TP would be the…

machine-learning classification precision-recall multi-class dbscan

asked Mar 18 '16 at 10:45

Emilio Genaro López

1 answer

Clustering without a distance matrix

I've recently completed a project where I used scikit-learn's DBSCAN module to find clusters in relatively short strings of text. I used a custom string similarity metric to allow for vectorized computation of an $n^2$ similarity matrix. I know that…

clustering optimization scikit-learn dbscan

asked Jun 09 '14 at 05:19

Stankalank

1 answer

DBSCAN: What is a Core Point?

I have a question about DBSCAN. The points here are classified as core points, border points or noise. A point p is a core point if at least minPts points are within distance ε of it, and those points are said to be directly reachable from p. No…

clustering data-mining dbscan

asked Feb 09 '16 at 12:16

user1170330

1 answer

Anomaly detection in multivariate time series data

I am trying to solve an anomaly detection problem that consists of three variables captured over a span of five years. It is an unsupervised problem, and I believe density-based clustering methods like DBSCAN aren't a good fit for this problem as it…

time-series anomaly-detection dbscan

asked Nov 21 '18 at 05:30

ajit samudrala

1 answer

Principal component analysis and DBSCAN

My data has 30 dimensions and 150 observations. I want to cluster the data with DBSCAN. Is there a difference between: 1. Performing a PCA and clustering all 30 principal components, or 2. Just clustering the raw data? DBSCAN works fast on my…

pca dimensionality-reduction dbscan

asked Aug 13 '18 at 07:28

PascalIv

1 answer

Are (a) multicollinearity and/or (b) binary variables an issue for DBSCAN? If so, how can one correct for these issues?

I have read some related questions, such as: Why are mixed data a problem for euclidean-based clustering algorithms?, What data structure to use for my cluster analysis or what cluster analysis to use for my data?, K-Means Clustering with Dummy…

clustering binary-data multicollinearity mixed-type-data dbscan

asked Feb 22 '18 at 22:23

Mark White

0 answers

Cluster Algorithm for multidimensional data

My goal is to cluster data (20000 samples with a range from 0.0 to 1.0, and 14 dimensions/features). Since I don't know the number of clusters, I tried using MeanShift and DBSCAN. My problem with these algorithms is that they find one large cluster…

clustering scikit-learn k-means dbscan mean-shift

asked Dec 29 '21 at 11:29

Flitschi

2 answers

Suitable approach to cluster histogram-like dataset using HDBSCAN implementation in python

My dataset below shows product sales per price (link to download dataset csv):

       price  quantity
0     5098.0        20
1     5098.5        40
2     5099.0        10
3     5100.0        90
4     5100.5        20
..       ...       ...
290   5247.0         …

machine-learning python scikit-learn kernel-smoothing dbscan

asked Aug 03 '21 at 17:32

Eduardo Gomes

1 answer

Clustering data with covariance for each point

I am looking to cluster data points that each have a covariance around itself (based on some function of its neighbourhood, but how I got it is not important). I would like to use the covariance to achieve these properties: points with relatively…

clustering hierarchical-clustering high-dimensional metric dbscan

asked May 09 '19 at 22:31

LemonPi
