
My dataset below shows product sales per price:

     price      quantity
0    5098.0        20
1    5098.5        40
2    5099.0        10
3    5100.0        90
4    5100.5        20
..      ...       ...
290  5247.0       150
291  5247.5        30
292  5248.0       150
293  5248.5        20
294  5249.0        55

[295 rows x 2 columns]

The image below illustrates my question. I added the blue line using a KernelDensity fit (KernelDensity(kernel='gaussian', bandwidth=1.5).fit(price, sample_weight=quantity)) for illustration purposes.

[Image: problem illustration]
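For completeness, a minimal runnable sketch of how that blue line was produced (assuming data_set.csv holds the price and quantity columns shown above):

import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity

data = pd.read_csv('data_set.csv')

# KernelDensity expects a 2-D array of shape (n_samples, n_features)
prices = data['price'].to_numpy().reshape(-1, 1)

# Weight each price by its traded quantity
kde = KernelDensity(kernel='gaussian', bandwidth=1.5)
kde.fit(prices, sample_weight=data['quantity'].to_numpy())

# Evaluate the weighted density on a fine price grid (the blue line)
grid = np.linspace(prices.min(), prices.max(), 1000).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))  # score_samples returns log-density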

So what I'm trying to achieve:

  • Cluster the dense regions, which are the red, green, dark blue and pink rectangles. This is the most important thing to achieve in this question.
  • Get the price boundaries of each region, where the probability is low (shown with the yellow arrows on the bottom part). The issue here is that each region will have a different density.
  • Get the peaks within each region (up arrows). The issue here is to detect 1-3 peaks of each region, not 1-3 peaks of the entire dataset (orange rectangle). A rough sketch of this peak/boundary detection follows the list.
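Continuing the KDE sketch above, here is a rough illustration of how peaks and low-density boundaries might be read off the fitted curve (using scipy.signal.find_peaks; the prominence value is a made-up placeholder):

from scipy.signal import find_peaks

# Peaks of the density curve (the up arrows)
peak_idx, _ = find_peaks(density, prominence=0.001)  # placeholder threshold

# Local minima of the density = low-density boundaries (the yellow arrows);
# minima of f are peaks of -f
min_idx, _ = find_peaks(-density)

peak_prices = grid[peak_idx].ravel()
boundary_prices = grid[min_idx].ravel()

This still finds peaks of the entire curve rather than per region, so the clustering step remains the core of the question.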

The algorithm that seems best suited to my problem is HDBSCAN, which takes a hierarchical clustering approach and handles the noise present in my dataset well:

import hdbscan
import plotly.express as px
import pandas as pd

data = pd.read_csv('data_set.csv')

clusterer = hdbscan.HDBSCAN(min_cluster_size=4, min_samples=8)
clusterer.fit(data)  # fits on both columns: price and quantity
data['cluster'] = clusterer.labels_  # label -1 marks noise points

fig = px.bar(data,x='price',y='quantity',color='cluster',orientation='v')
fig.show()

[Image: HDBSCAN result, bar chart colored by cluster label]

The result clearly shows my newbie skills: it clusters the amplitudes, not the regions (amplitude combined with price range). I also tried normalizing the data (subtracting each column's mean and dividing by its standard deviation), but without success.
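The normalization attempt looked roughly like this (a sketch using scikit-learn's StandardScaler, which z-scores each column):

from sklearn.preprocessing import StandardScaler

# Subtract each column's mean and divide by its standard deviation
scaled = StandardScaler().fit_transform(data[['price', 'quantity']])

clusterer = hdbscan.HDBSCAN(min_cluster_size=4, min_samples=8)
clusterer.fit(scaled)
data['cluster'] = clusterer.labels_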

Maybe it's just a transformation I have to apply to the input dataset so that HDBSCAN clusters the data properly. I'm going with a machine learning approach because the dataset will vary in shape, so the parameters of the clustering method will have to adapt and eventually be trained. The number of clusters will also vary depending on the data, which has the characteristic that, as time goes on, small regions tend to be visually grouped into a bigger region (like the blue and pink rectangles, which are almost forming one big region).

Finally, maybe plain DBSCAN (the best known), GaussianMixture or KernelDensity would suffice. I don't know, and I'd really appreciate some help here. I tried HDBSCAN because its density-based, hierarchical approach really fits my data (many small dense regions that can be clustered into a bigger dense region with peaks).

Although it's a simple question, I'm still new to these algorithms. Thanks in advance!

  • The concepts and code at https://stats.stackexchange.com/a/428083/919 might help. Although it stops short of your goal by only identifying the peaks (cluster centers), once you select a set of peaks, it's simple enough to post-apply many other simple procedures (such as k-means, or even something *ad hoc* such as minimizing density between peaks) to find reasonable clusters around those peaks. – whuber Dec 22 '21 at 17:50

2 Answers


Don't ask about the algorithm: focus on solving your problem.

The peak-finding solution I posted at https://stats.stackexchange.com/a/428083/919 will help you analyze the situation and decide how many peaks to identify using its mode trace plot:

[Image: mode trace plot showing candidate peak locations]

Either four or five peaks looks like a reasonable number to use; this plot shows the best locations of five peaks, as requested in the question.

Following that up with a suitable, simple clustering algorithm to find clusters around these peaks ought to work. I used K-means (with R's kmeans function) for this solution with your data:

[Image: clusters found with K-means around the five peaks]
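I worked in R, but the same idea in Python might look like the following sketch (the peak prices here are made-up placeholders, not real output; scikit-learn's KMeans accepts an explicit init array and per-sample weights):

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv('data_set.csv')
prices = data['price'].to_numpy().reshape(-1, 1)

# Hypothetical peak prices from the mode-trace step (placeholders)
peaks = np.array([[5105.0], [5140.0], [5180.0], [5215.0], [5245.0]])

# Seed K-means at the peaks and weight each price by its quantity
km = KMeans(n_clusters=len(peaks), init=peaks, n_init=1)
labels = km.fit_predict(prices, sample_weight=data['quantity'].to_numpy())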

You could easily use any number of other such ad hoc approaches, too, such as splitting the total quantities between each pair of successive peaks into two equal halves. The options here are numerous and ought to be selected according to your objectives and understanding of the data.
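For instance, the equal-halves idea can be sketched as follows: between each pair of adjacent peaks, place the boundary at the price where the cumulative quantity reaches half of the total quantity between them (the peak indices are assumed to come from a prior peak-finding step):

import numpy as np

def halfway_boundaries(prices, quantities, peak_idx):
    # Between each pair of adjacent peaks, put the boundary where the
    # cumulative quantity reaches half of the total between the two peaks
    boundaries = []
    for a, b in zip(peak_idx[:-1], peak_idx[1:]):
        cum = np.cumsum(quantities[a:b + 1])
        half = np.searchsorted(cum, cum[-1] / 2)
        boundaries.append(prices[a + half])
    return boundaries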

I had to adapt the code to use the quantities as weights in each call to density. A quick and dirty way to do this in your case is to replicate each price as many times as required by the corresponding quantity (using rep). Although the resulting dataset has 70,265 observations, that's still quite manageable.
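In R the replication is a single rep call; the Python analogue would be numpy.repeat:

import numpy as np
import pandas as pd

data = pd.read_csv('data_set.csv')

# Replicate each price as many times as its quantity
expanded = np.repeat(data['price'].to_numpy(),
                     data['quantity'].to_numpy().astype(int))
# len(expanded) is 70,265 for this dataset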

– whuber

I ran into the same question as you, and after some investigation I found some useful information. To understand how DBSCAN works with histograms, see this nice article: https://pberba.github.io/stats/2020/01/17/hdbscan/. The takeaway is that DBSCAN is not applicable to histograms in the way we want.

Another good article, https://benjamindoran.github.io/motif-paper/, describes algorithms for clustering dense histogram regions, called UniDip and SkinnyDip. The UniDip code is at https://github.com/BenjaminDoran/unidip (referenced at http://www.kdd.org/kdd2016/subtopic/view/skinny-dip-clustering-in-a-sea-of-noise).
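If I read that repository's README correctly, usage is roughly as follows (the UniDip class, its alpha argument and the run() method are taken from the README; treat this as an unverified sketch):

import numpy as np
import pandas as pd
from unidip import UniDip

data = pd.read_csv('data_set.csv')

# UniDip works on a sorted 1-D sample, so expand the histogram first:
# repeat each price by its quantity, then sort
sample = np.sort(np.repeat(data['price'].to_numpy(),
                           data['quantity'].to_numpy().astype(int)))

# alpha is the significance level of the underlying dip test
intervals = UniDip(sample, alpha=0.05).run()  # (start, end) index pairs
print(intervals)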

I'll try to apply UniDip to my data and report back with feedback.