Difference between dimensionality reduction and clustering

Question

A common practice for clustering is to apply some form of linear or non-linear dimensionality reduction first, especially when the number of features (say n) is high. With a linear technique like PCA, the objective is to find orthogonal principal components (say m) that explain most of the variance in the data, with m << n when n is high.

But for non-linear dimensionality reduction techniques like auto-encoders, can the reduced dimensions themselves be clusters that indicate, for example, different modes of operation of industrial components? Am I missing something here, or is my understanding of non-linear dimensionality reduction wrong? Any help is appreciated.

This question might be too basic for some, so please don't be extremely critical of the question if you don't want to answer it.

@fk128 shared his interpretation of my question, which may be clearer and easier to follow than what I wrote above.

"General practice" says who? I disagree. I avoid this whenever possible. — Has QUIT--Anony-Mousse, Apr 29 '18 at 06:37
Most research papers, and even package authors (for example, hdbscan), recommend dimensionality reduction before clustering, especially if the number of dimensions is more than 50. How do you cluster very high-dimensional data without dimensionality reduction? — RTM, Apr 29 '18 at 06:40
If you have such data, the axes will likely have very different meaning. You cannot reliably weight them. So whatever you do (PCA or not) it is statistically pretty much "black magic" aka nonsense. The result will not have any mathematical properties. — Has QUIT--Anony-Mousse, Apr 29 '18 at 06:44
Not everything that people do is correct or sensible. It's often just to get "some" output because their boss wants some output, and when nobody can verify the quality, it cannot be wrong. But to be of scientific value, you need to be able to test and falsify things... and you can't do this anymore. — Has QUIT--Anony-Mousse, Apr 29 '18 at 06:47

Has QUIT--Anony-Mousse · Answer 1 · 2018-04-29T06:42:34.720

1

The components of an autoencoder are supposedly even less reliable than your usual clustering.

Why don't you just try it: train autoencoders on some data sets, and visualize the "clusters" you get from the components?

While this great answer on tSNE for clustering is specific to tSNE, I believe the results for other such encoders will be similar: they will produce fake clusters by emphasizing random fluctuations in the data.

edited Apr 29 '18 at 06:42

answered Apr 29 '18 at 06:39

Can you clarify what you mean by reliable in unsupervised learning? – RTM Apr 29 '18 at 06:41
Clustering is never very reliable. – Has QUIT--Anony-Mousse Apr 29 '18 at 06:42
Thanks for sharing the link to that great answer. I can understand the problem with tSNE, as it preserves neither density nor distance from the original space. Do all dimensionality reduction techniques have this problem? UMAP is a recent dimensionality reduction technique that is hypothesised to be a good precursor to density-based clustering. Have you checked that out yet? – RTM Apr 29 '18 at 06:50
PCA (but without using the eigenvalues for scaling!) and random projections (cf. the Johnson-Lindenstrauss lemma) are supposed to preserve density optimally. But I do not see any benefit of using them. Never heard of UMAP, so I cannot comment on it. Beware that these drawbacks of tSNE appear to be relatively new insights, not in the original publication. So are there any independent evaluations of UMAP wrt. this feature? – Has QUIT--Anony-Mousse Apr 29 '18 at 09:36
I am currently using UMAP on my dataset, but I cannot tell how well it performed, since I am doing unsupervised learning on large data sets. A few events happened during the timeline of the data. I am analysing the anomalies and clusters formed before each event to understand how well the stack of methods I applied works together as a whole. – RTM May 01 '18 at 16:40
Also, I have gone through some of your answers on other clustering questions and picked up some techniques/skills just by reading them. Thanks for that. From one of your answers, I found out about Charu C. Aggarwal's book on _outlier analysis_ and have gone through the first two chapters. It is one of the best books I have read so far on outlier analysis, with all the good stuff in one place. – RTM May 01 '18 at 16:41
I don't like that book that much. A bit too egocentric. – Has QUIT--Anony-Mousse May 01 '18 at 22:57

score 1 · Answer 2 · answered Apr 29 '18 at 07:04

There are two types of clustering: hard and soft. Hard is when you assign a specific data point to a single cluster/category. So if there are $k$ clusters, a data point $x$ can only be assigned to one cluster in $\{1,..., k\}$. Soft/fuzzy is when a data point is assigned to multiple clusters, and the assignment is represented with membership weights, such that all the weights add up to 1. So for example if there are 3 clusters, a datapoint $x$ can have weights $w = [0.1,0.6,0.3]$ such that $0.1 + 0.6 + 0.3 = 1$. You can also use the weights representation in the hard clustering case, so you'd only have $w = [1, 0, 0]$, where one component is 1 and the rest are zero (this is also referred to as one-hot encoding).
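As a rough illustration of the hard/soft distinction, here is a minimal sketch using scikit-learn. The synthetic blob data, the choice of KMeans for hard assignment and of a Gaussian mixture for soft assignment are all assumptions made for the example, not something the question specifies.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with 3 groups (an assumption, purely for illustration).
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

# Hard clustering: each point gets exactly one label in {0, 1, 2}.
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# The same hard assignment written as one-hot weight vectors, e.g. [1, 0, 0].
hard_weights = np.eye(3)[hard_labels]

# Soft clustering: each point gets one weight per cluster, and each row sums to 1.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_weights = gmm.predict_proba(X)

print(hard_weights[:3])
print(soft_weights[:3].round(3))
```

Both `hard_weights` and `soft_weights` have one row per data point and one column per cluster; the only difference is whether a row is allowed to spread its weight over several clusters.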

Any sort of dimensionality reduction, linear or non-linear, reduces the dimensions of the input features from $n$ to say $m < n$. So if you have a data point $x$ with dimension $n$, it is transformed into a data point $x'$ with dimension $m < n$.

Assuming I understand your question correctly, your question is whether with an auto-encoder the components of $x'$ can represent cluster membership weights, i.e. $x' = w$. However, this is extremely unlikely to be the case.
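One way to check this yourself is to train a small autoencoder and look at the latent vectors. The sketch below is only an approximation of that idea: it uses scikit-learn's `MLPRegressor` trained to reproduce its own input as a stand-in for an autoencoder, with a 3-unit tanh bottleneck and synthetic blob data, all of which are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic 10-dimensional data with 3 groups (illustrative assumption).
X, _ = make_blobs(n_samples=300, centers=3, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

# A tiny autoencoder stand-in: the network learns to reproduce its input
# through a 3-unit bottleneck (one unit per hypothetical cluster).
ae = MLPRegressor(hidden_layer_sizes=(3,), activation="tanh",
                  max_iter=2000, random_state=0)
ae.fit(X, X)

# Encoder half: recompute the hidden-layer activations by hand.
latent = np.tanh(X @ ae.coefs_[0] + ae.intercepts_[0])

# Membership weights would be non-negative and sum to 1 in every row.
row_sums = latent.sum(axis=1)
print("rows sum to 1:", np.allclose(row_sums, 1.0))
print("any negative entries:", bool((latent < 0).any()))
```

Nothing in the reconstruction loss asks the latent coordinates to be non-negative or to sum to 1, so there is no reason to expect them to behave like membership weights $w$.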

Typically, to obtain clusters, you can later run any clustering algorithm on the dimensionally reduced data, regardless whether it was obtained from a linear or non-linear dimensionality reduction technique.
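A minimal sketch of that two-step workflow, assuming scikit-learn, PCA as the reduction step and KMeans as the clustering step (any other reducer or clusterer could be swapped in; the data and the numbers of components and clusters are made up for the example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: n = 50 features, 4 groups (illustrative assumption).
X, _ = make_blobs(n_samples=500, centers=4, n_features=50, random_state=0)

# Step 1: reduce from n = 50 to m = 5 dimensions. Step 2: cluster in the reduced space.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=5, random_state=0),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

# The reduced representation x' (all steps except the final clusterer).
X_reduced = pipeline[:-1].transform(X)
print(X.shape, "->", X_reduced.shape, "with", len(set(labels)), "clusters")
```

The reduction step outputs coordinates, and only the clustering step outputs cluster assignments, which is the separation described above.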

Your interpretation is correct and explains my question much better than what I have written above, so I have edited the question to point to your response. Thanks for the answer. Is there any intuitive or theoretical understanding of why these results have the potential to be different? Since neural networks (NNs) have the potential to be good universal approximators, shouldn't they ideally give the same result if the latent space embedding is really good? — RTM, Apr 29 '18 at 07:23
Clustering and dimensionality reduction are two different things. By analogy, you can think of a supervised learning task like classification as 'supervised clustering' as the goal is to predict a category for each input. This task is really difficult to do in an unsupervised way, which is why the state-of-the-art methods are all supervised. However, there are some NN methods that could be thought of as doing clustering, e.g. Self-Organising Maps, see https://www.sciencedirect.com/science/article/pii/S089360800900207X — fk64, Apr 29 '18 at 11:08

score 0 · Answer 3 · answered Dec 27 '21 at 17:16

In my opinion these are two distinct questions. Please add any citations that might clarify. I find the question quite interesting.

1) Difference between dimensionality reduction and clustering eg in PCA

The core difference between the 2 is:

a. Clustering = grouping rows together (often with useful properties, e.g. I want the elements of group X to be similar to each other). So for a dataset of size N with dimensionality D, at the end you will have M groups (M < N), still with dimensionality D. -> reduces the number of rows (data points).

b. Dimensionality reduction -> reduces the number of dimensions (= columns). Columns are not the same as rows, i.e. a vector with 2 or 3 dimensions and a vector with 1 dimension are both valid data points.
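The rows-versus-columns distinction can be seen directly in array shapes. This is a small sketch assuming scikit-learn, random Gaussian data, KMeans centroids as the "M representatives" and PCA as the reducer; the sizes N, D, M and m are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
N, D = 200, 6                      # N data points (rows), D dimensions (columns)
X = rng.normal(size=(N, D))

# Clustering: N rows are summarised by M representatives, still in D dimensions.
M = 4
centroids = KMeans(n_clusters=M, n_init=10, random_state=0).fit(X).cluster_centers_

# Dimensionality reduction: still N rows, but only m < D columns.
m = 2
X_reduced = PCA(n_components=m).fit_transform(X)

print("original:  ", X.shape)          # (200, 6)
print("clustering:", centroids.shape)  # (4, 6)
print("reduction: ", X_reduced.shape)  # (200, 2)
```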

2) But for non-linear dimensionality reduction techniques like auto-encoders, can the reduced dimensions, itself be clusters that indicate different modes of operation example for industrial components.

The autoencoder neural network architecture performs dimensionality reduction in its latent space (assuming fewer neurons than in the input layer), BUT one could add further constraints to the traditional MSE loss function. Such a constraint could force the latent space to be a one-hot vector, pushing one neuron close to the scalar value 1 and the rest to 0, which can be considered a form of clustering.
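One possible way to write down such a constrained objective is sketched here. It is an assumption for illustration, not a specific published method: the encoder $f_\theta$, decoder $g_\phi$, number of latent units $k$ and weight $\lambda$ are hypothetical names, and the entropy penalty is just one of several ways to push a latent vector towards one-hot.

$$
z = \mathrm{softmax}\big(f_\theta(x)\big), \qquad
\mathcal{L}(x) = \lVert x - g_\phi(z) \rVert_2^2 + \lambda \, H(z), \qquad
H(z) = -\sum_{j=1}^{k} z_j \log z_j
$$

The softmax makes $z$ non-negative with components summing to 1, so it has the form of the membership weights $w$ from soft clustering. The entropy $H(z)$ is zero exactly when $z$ is one-hot, so a larger $\lambda$ pushes the latent code towards a hard cluster assignment, at the cost of reconstruction quality.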

Further thoughts: dimensionality reduction and clustering are problems, and algorithms can be modified to solve many problems. The AE in particular has been used as a component of more complex architectures that solve many different problems (and with a single linear layer, an AE performs a linear mapping), so keep an open mind before putting it into "boxes".
