Is there any classification algorithm that assigns a new test vector to the cluster whose points have the minimum average distance to it?
Let me write it better: imagine we have $K$ clusters with $T_k$ points each. For each cluster $k$, I calculate the average of all the distances between the test vector $x_0$ and the points $x_i$ belonging to cluster $k$.
The test point is then assigned to the cluster with the minimum such average distance.
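In formula form (this is just restating the rule above, writing $x_i^{(k)}$ for the $i$-th point of cluster $k$ and using Euclidean distance, as in my code below):

$$\hat{k}(x_0) = \arg\min_{k \in \{1,\dots,K\}} \; \frac{1}{T_k} \sum_{i=1}^{T_k} \left\lVert x_0 - x_i^{(k)} \right\rVert$$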
Do you think this is a valid classification algorithm? In theory, if the clusters are "well-formed", as you get after a linear Fisher discriminant mapping, we should be able to achieve good classification accuracy.
What do you think of this algorithm? I've tried it, but the result is that the classification is strongly biased towards the cluster with the largest number of points. Here is my implementation:
import numpy as np
import sklearn.metrics.pairwise
import sklearn as sk

def classify_avg_y_space(logging, y_train, y_tests, labels_indices):
    # labels_indices maps each label to the indices of its points in y_train
    my_labels = []
    distances = dict()
    avg_dist = dict()
    # Distances from every test point to every training point of each cluster,
    # averaged over that cluster's points
    for key, value in labels_indices.items():
        distances[key] = sk.metrics.pairwise.euclidean_distances(y_tests, y_train[value])
        avg_dist[key] = np.average(distances[key], axis=1)
    # Assign each test point to the label with the smallest average distance
    for index, _ in enumerate(y_tests):
        average_distances_test_cluster = {key: avg_dist[key][index] for key in labels_indices.keys()}
        my_labels.append(min(average_distances_test_cluster, key=average_distances_test_cluster.get))
    return my_labels
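For completeness, a minimal usage sketch of how I call it. The toy data, the class sizes, and the way labels_indices is built from the training labels are just illustrative assumptions to show the expected shapes; I pass None for logging since the function never uses it:

import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian clusters, the second one deliberately much larger
y_train = np.vstack([rng.normal(0, 1, (50, 2)),     # class "a"
                     rng.normal(5, 1, (200, 2))])   # class "b"
train_labels = np.array(["a"] * 50 + ["b"] * 200)
# Map each label to the row indices of its points in y_train
labels_indices = {lab: np.where(train_labels == lab)[0]
                  for lab in np.unique(train_labels)}

y_tests = np.array([[0.2, 0.1], [4.8, 5.3]])
print(classify_avg_y_space(None, y_train, y_tests, labels_indices))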