
Let $X\sim p(X)$ be a random vector with $n$ continuous and $m$ categorical features, and let $S=\{X_i\}_{i=1}^N$ be a set of realizations of $X$. I want to cluster $S$ so that data points with similar values of the continuous features end up in the same cluster, while at the same time maximizing the heterogeneity of the categorical features within each cluster. Example with $n=2, m=1, N=4$:

$$ S=\{(0.2, 0.4, \text{dog}),(-0.2, -0.4, \text{cat}),(-0.2, 0.4, \text{dog}),(0.2, -0.4, \text{cat})\}$$

If we didn't have the categorical variable, $\{X_1,X_3\}$ and $\{X_2,X_4\}$ would be "natural" clusters because they're the closest pairs. However, since we also want "diverse" clusters in terms of the categorical variable, we settle for $\{X_1,X_4\}$ and $\{X_2,X_3\}$. $\{X_1,X_2\}$ and $\{X_3,X_4\}$ would also be "diverse" clusters, but the points would be farther apart, so they seem to me a worse choice. Of course, this is just a toy example: a clustering task with 4 points and 2 clusters doesn't make a lot of sense.

Which algorithms could I use?

DeltaIV
  • Cool question! You could perhaps use an existing clustering algorithm with a tweak: negate the distance between the categorical features, so that distance(cat, dog) is low while distance(cat, cat) or distance(dog, dog) is high. The idea is that you only need a custom distance metric; after that you can use a traditional clustering algorithm. – Richard Hardy Jan 29 '21 at 10:16

1 Answer


I will offer a simple example for numerical data. The idea behind it can be extended to the case of categorical data without major conceptual difficulty (as far as I can see).

Suppose you have two variables, $X$ and $Y$. You want to cluster the data so that points with similar $X$ values and dissimilar $Y$ values end up in the same cluster, while points with dissimilar $X$ values and similar $Y$ values end up in different clusters. If you only cared about $X$, you could use some traditional clustering algorithm; the problem is $Y$, for which you want the opposite of what traditional clustering does.

Note that a clever definition of distance* can fix this. If you define the distance so that dissimilar $Y$s get a low score but similar $Y$s get a high score, you are back to traditional clustering. One way of doing that is to reverse the sign of the $Y$ component in some traditional distance metric. E.g. the traditional Manhattan distance between points $i$ and $j$ is $$ d_M(i,j)=\mid x_i-x_j \mid + \mid y_i-y_j \mid $$ but the modified one would be $$ d_M'(i,j)=\mid x_i-x_j \mid - \mid y_i-y_j \mid. $$ If you base clustering on the latter, you get the desired result.
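For concreteness, here is a minimal sketch of $d_M'$ (Python; the function name and toy points are my own illustration, not part of the answer):

```python
def manhattan_pseudodistance(p, q):
    """Modified Manhattan 'distance' d_M': similarity in x is
    penalized as usual, but dissimilarity in y is rewarded by
    subtracting |y_i - y_j|.  The result can be negative."""
    return abs(p[0] - q[0]) - abs(p[1] - q[1])

# Toy points: similar x / dissimilar y scores lower (more
# attractive) than dissimilar x / similar y.
print(manhattan_pseudodistance((0.2, 0.4), (0.2, -0.4)))   # -0.8
print(manhattan_pseudodistance((0.2, 0.4), (-0.2, 0.4)))   #  0.4
```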

Another example: Euclidean distance. The traditional one is $$ d_E(i,j)=\sqrt{ (x_i-x_j)^2 + (y_i-y_j)^2 } $$ but the modified one would be $$ d_E'(i,j)=\text{sign}\left((x_i-x_j)^2 - (y_i-y_j)^2\right) \sqrt{\mid (x_i-x_j)^2 - (y_i-y_j)^2\mid }. $$
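And a sketch that feeds $d_E'$ into an off-the-shelf hierarchical clustering routine (assuming SciPy is available; shifting the matrix by its minimum, which makes all entries non-negative without changing their ordering, is my own workaround for routines that expect non-negative distances, not something prescribed by the answer):

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def euclidean_pseudodistance(p, q):
    """Signed-root version of d_E': keep the sign of
    (x_i - x_j)^2 - (y_i - y_j)^2 when taking the square root."""
    s = (p[0] - q[0]) ** 2 - (p[1] - q[1]) ** 2
    return np.sign(s) * np.sqrt(abs(s))

# Toy data: we want clusters with similar x and dissimilar y.
pts = [(0.2, 0.4), (-0.2, -0.4), (-0.2, 0.4), (0.2, -0.4)]
n = len(pts)
D = np.array([[euclidean_pseudodistance(pts[i], pts[j])
               for j in range(n)] for i in range(n)])
D -= D.min()              # shift so all "distances" are >= 0
np.fill_diagonal(D, 0.0)  # restore zero self-distance
Z = linkage(squareform(D), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))  # e.g. [1 2 2 1]
```

Since the shift adds the same constant to every pairwise score, the merge order under average linkage is unaffected.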

It should be possible to extend this idea to distance measures for categorical variables by taking the negative of the distance along the $Y$ coordinate where needed; see the sketch below.
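As far as I can see, that could look like the following (Python; the simple-matching count on the categorical coordinates and all names are my own illustration):

```python
def mixed_pseudodistance(p, q, n_cont):
    """Manhattan distance on the first n_cont (continuous)
    coordinates, minus a simple-matching count on the remaining
    (categorical) ones, so categorical diversity lowers the score."""
    cont = sum(abs(a - b) for a, b in zip(p[:n_cont], q[:n_cont]))
    cat = sum(a != b for a, b in zip(p[n_cont:], q[n_cont:]))
    return cont - cat

S = [(0.2, 0.4, "dog"), (-0.2, -0.4, "cat"),
     (-0.2, 0.4, "dog"), (0.2, -0.4, "cat")]
print(mixed_pseudodistance(S[0], S[3], n_cont=2))  # 0.8 - 1 ≈ -0.2
print(mixed_pseudodistance(S[0], S[2], n_cont=2))  # 0.4 - 0 =  0.4
```

On the question's four points, the "diverse" pair $\{X_1, X_4\}$ gets a lower score than the same-category pair $\{X_1, X_3\}$, as desired.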


*The cleverly defined distance should probably not be called a distance anymore, as it violates some of the metric axioms (e.g. non-negativity) and is not intuitively a distance. But it is used in place of a distance in a clustering algorithm, so I have kept the term.

Richard Hardy
  • Practical consideration: add an `alpha` parameter to the pseudodistance - it will allow you to control diversity. Regarding distance on categorical features: encode them e.g. with one-hot encoding and use PCA. – hans Feb 02 '21 at 22:51
  • @hans, thanks, good idea. – Richard Hardy Feb 03 '21 at 06:14
  • @hans how would you include $\alpha$? For example, how would you modify the above formulas for the $L_1$ and $L_2$ pseudodistances? – DeltaIV Feb 04 '21 at 22:18
  • @DeltaIV, I suppose you premultiply the distance component corresponding to a particular coordinate, e.g. use $\alpha(y_i-y_j)^2$ instead of $(y_i-y_j)^2$ inside $d'_E$, and similarly in $d'_M$. This is the same as scaling some dimensions (variables, features) before calculating the distance. – Richard Hardy Feb 05 '21 at 07:02
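
Putting that last comment into code (a minimal sketch; the function name is my own, and `alpha` follows the weighting suggested in the comment):

```python
import numpy as np

def weighted_euclidean_pseudodistance(p, q, alpha=1.0):
    """d_E' with a diversity weight: larger alpha puts more
    emphasis on rewarding dissimilarity in y."""
    s = (p[0] - q[0]) ** 2 - alpha * (p[1] - q[1]) ** 2
    return np.sign(s) * np.sqrt(abs(s))
```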