Cluster Data based on Distribution

Question

I have a list of diseases for my research. For each disease, I have a list of the ages at which it occurred. "Breast Carcinoma" may be a list of [1,2,2,3,4,5,5,5,5,5], while "Adrenal Cortex Neoplasms" may be a different list with a thousand elements, BUT with the same general shape in a bincount (a high number of 5s, a few 2s). I would like to stratify these diseases based on the shape of their bincount distributions. I am quite new to machine learning, and I honestly have no clue how to begin. However, if you can give me a general approach that I can research in more detail, then I could code a Python algorithm for it. By stratification, I mean clustering the diseases by their distributions.

Thank you for taking the time to help a beginner!

How do you expect these ages to vary? Is there a maximum age? For example, would it be more appropriate to model the age distribution for each disease as a multinomial or a Poisson? My first thought would be to use some kind of Dirichlet process clustering on the diseases with respect to the age distribution, but that may be too involved if this is supposed to be a quick task. — , Jun 11 '14 at 19:51

score 1 · Accepted Answer · edited Apr 13 '17 at 12:44

I would use a finite mixture model to cluster the histograms. I do not know of any implementations of finite mixture modeling in Python, but there must be some. Alternatively, you could do this in R using the mclust package. The setup is to first bin the data by age group. Then you feed the binned data, one row per disease, into the Mclust() function. This function fits a series of finite mixture models with varying numbers of clusters and with varying constraints on the covariance matrix (which in turn governs the orientation, shape, and size of the clusters, and how these vary across clusters). It computes the BIC for each of these models.

Because of the way Mclust computes BIC, higher values of BIC are preferable, so you will look for the model with the highest BIC. Also be sure to check the BIC profile to see whether the model with the highest BIC is far enough ahead (say, a difference of at least two in BIC from the next strongest model) to make all other models ignorable, or whether you should consider other models as well. Also make sure that the model with the highest BIC doesn't win just because it sits at the boundary of the range of cluster counts considered.

I also recommend plotting the data by cluster. Here is an example from my own research, where we clustered Normalized Difference Vegetation Index data by geographic-pixel-year.
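For readers who want to stay in Python, the same workflow can be sketched with scikit-learn's GaussianMixture. This is a rough analogue of the steps above, not the Mclust procedure itself: it only tries one covariance structure ("diag") instead of the family Mclust searches over, the age data are synthetic and made up for illustration, and scikit-learn defines BIC with the opposite sign convention, so here the lowest BIC is preferred.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic, purely illustrative data: 40 "diseases", 4 age bins each.
# The first 20 peak in the youngest bin, the last 20 in the oldest bin.
young = rng.multinomial(200, [0.5, 0.3, 0.1, 0.1], size=20)
old = rng.multinomial(200, [0.1, 0.2, 0.2, 0.5], size=20)
counts = np.vstack([young, old])

# One row per disease; convert counts to proportions so that only the
# shape of the histogram matters, not the number of cases.
X = counts / counts.sum(axis=1, keepdims=True)

# Fit mixtures with 1..5 components and record the BIC of each.
models, bic = {}, {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, covariance_type="diag",
                         n_init=5, random_state=0).fit(X)
    models[k] = gm
    bic[k] = gm.bic(X)

# scikit-learn's BIC is "lower is better" (opposite sign to Mclust).
best_k = min(bic, key=bic.get)
labels = models[best_k].predict(X)

for k in sorted(bic):
    print(k, round(bic[k], 1))
print("selected number of clusters:", best_k)
```

As in the advice above, it is worth inspecting the whole BIC profile rather than just the winner, and widening the range of k if the selected value sits at its upper end.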

Ankur Chakravarthy · Answer 2 · 2014-06-16T01:04:32.990

Non-negative matrix factorisation (NNMF) seems suited to this kind of problem. Alternatively, give Random Forests a go, though I generally use R and am not sure whether the Python implementations do pattern discovery (clustering).

The matrix would have tumour type along one dimension and the percentage of cases that fall into a particular age group along the other. If a particular age bin isn't represented in a tumour type, you need to set the value of that bin to 0.
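A minimal sketch of how such a matrix could be built from per-disease age lists, using only the standard library. The function name age_bin_matrix, the 10-year bin width, and the sample ages are all assumptions made for illustration; it also assumes every disease has at least one recorded age.

```python
from collections import Counter

def age_bin_matrix(ages_by_disease, bin_width=10):
    """Return (bin_labels, diseases, matrix).

    matrix has one row per age bin and one column per disease; each entry
    is the percentage of that disease's cases falling in the bin.
    Bins with no cases for a disease are filled with 0.
    """
    # Count cases per bin (keyed by the bin's lower edge) for every disease.
    counts = {
        disease: Counter((age // bin_width) * bin_width for age in ages)
        for disease, ages in ages_by_disease.items()
    }
    # Every bin that appears for at least one disease.
    all_bins = sorted(set().union(*counts.values()))
    diseases = list(ages_by_disease)
    matrix = []
    for b in all_bins:
        row = []
        for d in diseases:
            total = sum(counts[d].values())
            row.append(100.0 * counts[d].get(b, 0) / total)
        matrix.append(row)
    bin_labels = [f"{b}-{b + bin_width}" for b in all_bins]
    return bin_labels, diseases, matrix

# Hypothetical ages, not taken from any real dataset.
ages = {"BRCA": [35, 42, 47, 55], "ACC": [44, 51, 58, 59]}
bin_labels, diseases, matrix = age_bin_matrix(ages)
for label, row in zip(bin_labels, matrix):
    print(label, row)
```

Using percentages rather than raw counts keeps a disease with a thousand cases comparable to one with ten, which is what the question asks for.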

For example:

Age_Bin (Features)   BRCA   ACC   Colorectal Cancer   Kidney Cancer
30-40                   8     0                  24              10
40-50                  54    20                  35              30
50-60                  24    35                  35              40

and so on for the remaining bins. NNMF will then attempt to split the tumour types into clusters based on the features defined by the age bins. To apply the method, a good starting point is to identify all bins represented in the dataset, set the counts to 0 for tumour types that have no cases in a bin, and then run the algorithm to find clusters, as has been done here, for example:

http://gdac.broadinstitute.org/runs/analyses__2014_04_16/reports/cancer/HNSC/miRseq_Mature_Clustering_CNMF/nozzle.html

The methods section in that report explains how to select the optimal number of clusters, among other details.

Additionally, here is a Python package for NNMF, should you be interested: http://nimfa.biolab.si/
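To make the idea concrete, here is a small NumPy sketch of NNMF using the standard multiplicative update rules for the squared-error objective. It does not use the nimfa API; the nmf function, the toy matrix V, and the rule of assigning each tumour type to the component with the largest loading are assumptions chosen for illustration.

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Factorise non-negative V (bins x tumours) as W @ H with k components."""
    rng = np.random.default_rng(seed)
    n_bins, n_tumours = V.shape
    W = rng.random((n_bins, k)) + eps      # age profile of each component
    H = rng.random((k, n_tumours)) + eps   # loading of each tumour on each component
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    # Rescale so every column of W sums to 1, making rows of H comparable.
    scale = W.sum(axis=0)
    return W / scale, H * scale[:, None]

# Made-up percentages: rows are four age bins, columns are four tumour types.
# Columns 0 and 1 peak in the youngest bin, columns 2 and 3 in the oldest.
V = np.array([
    [60.0, 55.0,  0.0,  0.0],
    [30.0, 35.0, 10.0, 10.0],
    [10.0, 10.0, 30.0, 35.0],
    [ 0.0,  0.0, 60.0, 55.0],
])

W, H = nmf(V, k=2)

# Assign each tumour type to the component it loads on most heavily.
labels = H.argmax(axis=0)
print(labels)
```

The component numbering is arbitrary, so only which tumour types share a label is meaningful. Choosing k is a separate question; the report linked above describes one way to do it.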

Could you flesh out some of the essential details? In particular, to what matrix do you refer and how would it be computed from the kinds of data described in the question? — whuber, Jun 11 '14 at 19:24
I agree. I am a little confused as to how I would go about doing this from my limited knowledge of clustering — indiaash524, Jun 11 '14 at 19:34
