I have a lot of 1D data distributed around several different mean points. I'm looking for a general way of identifying these small clusters and somehow separating them.

I've implemented KDE fits for this. My first idea was to look for the local maxima of the KDE, but that would not be very robust, because the result depends on the bandwidth chosen for the estimate.
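The bandwidth dependence can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the three cluster centers, the sample sizes, the bandwidth factors, and the helper name `count_kde_modes`. It uses `scipy.stats.gaussian_kde`, whose scalar `bw_method` is a factor multiplied by the data's standard deviation, and counts the local maxima of the density on a grid.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

# Assumption: synthetic 1D data with three well-separated clusters.
rng = np.random.default_rng(0)
data = np.concatenate([
    rng.normal(10, 1, 300),
    rng.normal(20, 1, 300),
    rng.normal(30, 1, 300),
])
grid = np.linspace(data.min(), data.max(), 2000)


def count_kde_modes(data, grid, bw):
    """Count local maxima of a Gaussian KDE evaluated on `grid`.

    `bw` is passed as a scalar `bw_method`, i.e. a factor that scales
    the standard deviation of the data.
    """
    density = gaussian_kde(data, bw_method=bw)(grid)
    peaks, _ = find_peaks(density)
    return len(peaks)


# The number of detected modes changes with the bandwidth factor.
for bw in (0.02, 0.2, 2.0):
    print(f"bw factor {bw}: {count_kde_modes(data, grid, bw)} local maxima")
```

A small factor tends to pick up spurious bumps, while a large one merges the clusters into a single mode, so the count of maxima is not a stable estimate of the number of clusters.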

Example

Here is one example.

I'm looking for a general method that would give me, in this example, the mean and standard deviation of each of these three clusters (though there could be more or fewer than three), each computed from its own points only, maybe using scipy.stats or sklearn.

Thank you !

Update: Gaussian Mixture might work really well for me; the main problem is just the shape of the fitting data, which is a matter of implementation.

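```python
from sklearn.mixture import GaussianMixture
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

mean = 34
std = 10

# Grid of x values and the normal pdf evaluated on it
xpdf = np.linspace(20, 50, 1000).reshape(-1, 1)
y = norm.pdf(xpdf, mean, std)
y = np.array(y).reshape(-1, 1)

# Fit a one-component mixture to y, then evaluate its density on the x grid
model = GaussianMixture(1).fit(y)
ypdf = np.exp(model.score_samples(xpdf))

plt.hist(y, bins=100, density=True)
plt.plot(xpdf, ypdf, '-r')
plt.show()
```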
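One likely cause of the all-zero output is that the snippet fits the mixture to the pdf values `y` rather than to observations. `GaussianMixture.fit` expects the samples themselves, shaped `(n_samples, 1)` for 1D data. A minimal sketch of that variant follows. It assumes synthetic samples drawn from a normal distribution with the same `mean` and `std` as a stand-in for the real measurements, and the sample size, seed, and grid range are arbitrary choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

mean, std = 34, 10

# Assumption: synthetic samples stand in for the real 1D measurements.
rng = np.random.default_rng(0)
samples = rng.normal(mean, std, size=5000).reshape(-1, 1)  # shape (n_samples, 1)

# Fit on the samples themselves, not on density values.
model = GaussianMixture(n_components=1, random_state=0).fit(samples)

# score_samples returns the log-density, so exponentiate to get the pdf.
xpdf = np.linspace(0, 70, 1000).reshape(-1, 1)
ypdf = np.exp(model.score_samples(xpdf))

# With the default 'full' covariance type, covariances_ has shape
# (n_components, n_features, n_features).
fitted_mean = model.means_[0, 0]
fitted_std = np.sqrt(model.covariances_[0, 0, 0])
print(f"fitted mean = {fitted_mean:.2f}, fitted std = {fitted_std:.2f}")
```

For plotting, the histogram would then be drawn from `samples` (for example `plt.hist(samples.ravel(), bins=100, density=True)`) with `ypdf` overlaid on the same x axis.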
Lucas Tonon
  • Have you tried GaussianMixture in sklearn? – Georg M. Goerg Jun 09 '20 at 12:06
  • You're suggesting fitting a GaussianMixture and then finding the local maxima? Why would that be any different from doing it with a kernel density estimate (KDE)? – Lucas Tonon Jun 09 '20 at 12:13
  • The GaussianMixture gives you the three clusters as part of the fit, no need to do the 2nd step of finding local maxima. The mixture distribution component means _are_ the local modes and the scales are the spreads you are looking for. – Georg M. Goerg Jun 09 '20 at 12:23
  • What are the x and y axes? It looks like 2D data rather than 1D? – Tim Jun 09 '20 at 12:53
  • Sorry, I did not explain it well. The x axis is just an index and doesn't mean anything; y is the value I'm interested in. That is, I could represent it as a histogram as well. I plotted it this way because, surprisingly, it is easier to read than a histogram. I will use Gaussian Mixture as Georg M. proposed and come back with the results. Thanks very much! – Lucas Tonon Jun 10 '20 at 07:28
  • Update: Gaussian Mixture might work really well for me; the main problem is just the shape of the fitting data, which is a matter of implementation. The code I tried is in the question. The model returns an array of zeros. – Lucas Tonon Jun 10 '20 at 09:01
  • @LucasTonon I marked it as a duplicate of another thread that discusses 1D clustering. As for the code, you can Google for ["sklearn GaussianMixture 1d"](https://www.google.com/search?q=sklearn+GaussianMixture+1d), e.g. https://www.astroml.org/book_figures/chapter4/fig_GMM_1D.html – Tim Jun 10 '20 at 09:11
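
As a rough illustration of the approach discussed in the comments, the sketch below fits mixtures with different numbers of components and reads the per-cluster mean and standard deviation off the fitted model. The data is synthetic, and selecting the number of components by the lowest BIC is an assumption added here to handle the "more or fewer than three" case; it is one common heuristic, not something proposed in the thread.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Assumption: synthetic 1D data with three clusters, shaped (n_samples, 1).
rng = np.random.default_rng(0)
data = np.concatenate([
    rng.normal(10, 1, 300),
    rng.normal(20, 1, 300),
    rng.normal(30, 1, 300),
]).reshape(-1, 1)

# Fit one model per candidate number of components and keep the lowest BIC.
candidates = range(1, 7)
models = [
    GaussianMixture(n_components=k, n_init=3, random_state=0).fit(data)
    for k in candidates
]
bics = [m.bic(data) for m in models]
best = models[int(np.argmin(bics))]

# Sort components by mean so the output order is stable.
order = np.argsort(best.means_.ravel())
means = best.means_.ravel()[order]
stds = np.sqrt(best.covariances_.reshape(-1))[order]  # 1D data: one variance each
weights = best.weights_[order]

for m, s, w in zip(means, stds, weights):
    print(f"mean = {m:.2f}, std = {s:.2f}, weight = {w:.2f}")
```

Each component's mean and standard deviation describe one cluster on its own, and `best.predict(data)` would assign every point to a cluster if hard labels are needed.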
