"Uniform" Clustering

Asked Apr 28 '17 at 17:38

Active Apr 28 '17 at 18:15


I have the data $\{(y_i,\omega_i), i=1,\dots,n\}$, where $y_i \in \mathbb{R}_+$ is the response and $\omega_i>0$ is a weight.

For a fixed integer $K>0$, I want to determine $y_1^*,\dots,y_{K-1}^*$ such that $$ \frac{\sum_{i: y_i \in ]y_{k-1}^*,y_k^*]} \omega_i}{\sum_{i=1}^n \omega_i} \approx \frac{1}{K}, \ \forall \ k \in \{1,\dots,K\}, $$ where $y_0^*=0$ and $y_K^*=\infty$.

Clearly the solution is not unique and depends on the precision chosen.
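To make the criterion concrete, here is a minimal Python sketch that computes the weight share of each interval $]y_{k-1}^*,y_k^*]$ for a candidate set of cut points, so the shares can be compared against $1/K$. The function name `bin_shares` and its argument names are hypothetical, chosen only for illustration; the sketch also assumes the cut points are supplied in ascending order.

```python
from bisect import bisect_left


def bin_shares(y, w, cuts):
    """Share of the total weight falling in each interval ]c_{k-1}, c_k].

    `cuts` holds the K-1 interior cut points in ascending order (an
    assumption of this sketch); y_0^* = 0 and y_K^* = infinity are implicit.
    """
    totals = [0.0] * (len(cuts) + 1)
    for yi, wi in zip(y, w):
        # bisect_left counts the cut points strictly below yi, so an
        # observation equal to a cut point lands in the interval that is
        # closed on the right at that cut point.
        totals[bisect_left(cuts, yi)] += wi
    grand_total = sum(w)
    return [t / grand_total for t in totals]


# Four equally weighted observations split at y = 2.
print(bin_shares([1, 2, 3, 4], [1, 1, 1, 1], [2]))  # [0.5, 0.5]
```

How far each share may deviate from $1/K$ is exactly the precision question raised above; the sketch only reports the shares and does not choose a tolerance.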

One solution is to use a greedy algorithm, but I would like to know whether there are better solutions, or something available out of the box in R or SAS.

edited Apr 28 '17 at 18:15

asked Apr 28 '17 at 17:38

Madara

Are your data naturally ordered or would breaking them into subsets of equal weight do? – nth Apr 28 '17 at 17:49
I have to maintain their order; in fact, I want to obtain the intervals $]y_{k-1}^*,y_k^*]$. – Madara Apr 28 '17 at 17:50
Is this really different from [weighted quantiles](https://stats.stackexchange.com/questions/13169/defining-quantiles-over-a-weighted-sample)? – GeoMatt22 Apr 28 '17 at 18:08

I don't understand why there is anything at all to be solved here: assuming you want the $y_k^{*}$ to be in ascending order, you simply scan across the $y_i$ from smallest to largest, accumulating the weights as you go, and identifying the points where the sum crosses multiples of $1/K$. There may be some ambiguity right at those crossing points, but resolving it depends on information not provided here: namely, how to measure the sense of closeness implied by the $\approx$ operator. – whuber Apr 28 '17 at 19:22
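The scan described in the comment above can be sketched in a few lines of Python. This is only one possible reading: the function name `weighted_cutpoints` is hypothetical, and the tie rule (take the first $y_i$ at which the cumulative weight reaches or exceeds $k/K$ of the total) is an assumption, since the question does not say how closeness under $\approx$ should be measured.

```python
def weighted_cutpoints(y, w, K):
    """Return K-1 cut points so each interval holds roughly 1/K of the weight.

    Assumed tie rule (not specified in the question): a cut is placed at the
    first y_i where the cumulative weight reaches or exceeds k/K of the total.
    """
    pairs = sorted(zip(y, w))  # scan from the smallest y to the largest
    total = sum(w)
    cuts = []
    cumulative = 0.0
    k = 1  # the next target is k/K of the total weight
    for yi, wi in pairs:
        cumulative += wi
        # A single heavy observation can cross several targets at once;
        # the same y value is then recorded more than once.
        while k < K and cumulative >= k * total / K:
            cuts.append(yi)
            k += 1
    return cuts


# Four equally weighted observations, split into K = 2 groups.
print(weighted_cutpoints([3, 1, 4, 2], [1, 1, 1, 1], 2))  # [2]
```

Repeated cut points in the output signal that some observation carries more than $1/K$ of the total weight, in which case no choice of intervals can balance the shares.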
