Clustering after t-SNE in R

Question

As explained here, t-SNE maps high-dimensional data such as word embeddings into a lower-dimensional space in such a way that the distance between two words roughly describes their similarity. It also tends to produce naturally forming clusters. For example, with the code

if(!"pacman" %in% installed.packages()[,"Package"]) install.packages("pacman")
pacman::p_load(dplyr)
# grab reviews
reviews_all = read.csv("https://raw.githubusercontent.com/rjsaito/Just-R-Things/master/NLP/sample_reviews_venom.csv", stringsAsFactors = F)
# create ID for reviews
review_df <- reviews_all %>%
  mutate(id = row_number())
str(reviews_all)
pacman::p_load(text2vec, tm, ggrepel)
tokens <- space_tokenizer(reviews_all$comments %>%
                          tolower() %>%
                          removePunctuation())
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)

vocab <- prune_vocabulary(vocab, term_count_min = 5L)

# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

glove = GlobalVectors$new(rank = 50, x_max = 10)
glove$fit_transform(tcm, n_iter = 20)

word_vectors = glove$components

# load packages
pacman::p_load(tm, Rtsne, tibble, tidytext, scales, ggplot2)

# create vector of words to keep, before applying tsne (i.e. remove stop words)
keep_words <- setdiff(colnames(word_vectors), stopwords())

# keep words in vector
word_vec <- word_vectors[, keep_words]

# prepare data frame to train
train_df <- data.frame(t(word_vec)) %>%
  rownames_to_column("word")

# train tsne for visualization
tsne <- Rtsne(train_df[,-1], dims = 2, perplexity = 50, verbose=TRUE, max_iter = 500)


# create plot
colors = rainbow(length(unique(train_df$word)))
names(colors) = unique(train_df$word)

plot_df <- data.frame(tsne$Y) %>%
  mutate(
    word = train_df$word,
    col = colors[train_df$word]
  ) %>%
  left_join(vocab, by = c("word" = "term")) %>%
  filter(doc_count >= 20)

p <- ggplot(plot_df, aes(X1, X2)) +
  geom_text(aes(X1, X2, label = word, color = col), size = 3) +
  xlab("") + ylab("") +
  theme(legend.position = "none") 
p

We obtain the following picture.

What can be done after this preliminary analysis? Is it possible to get the word list for each cluster? Or can a clustering algorithm be applied to the points shown in the image (and stored in plot_df)?

See https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne — Aleksejs Fomins, Jan 18 '21 at 10:36

score 3 · Accepted Answer · answered Jan 18 '21 at 09:46

You can, of course, use any numerical clustering (k-means, hierarchical clustering, spectral clustering...) on the projected data (the points). But why should it be better than clustering in the original, high dimensional feature space?
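As a minimal sketch of what clustering the projected points could look like, the snippet below runs base R's kmeans on 2-D coordinates and splits the words by cluster label. The helper name words_by_cluster, the synthetic demo_df standing in for plot_df, and the choice of k = 2 are assumptions made for illustration; none of them come from the question's code.

```r
# Hypothetical helper: cluster 2-D coordinates and return the words in each cluster.
# `coords`: numeric matrix or data frame, one row per word (e.g. plot_df[, c("X1", "X2")])
# `words` : character vector in the same row order (e.g. plot_df$word)
words_by_cluster <- function(coords, words, k, seed = 42) {
  set.seed(seed)                                  # k-means starts from random centers
  km <- kmeans(as.matrix(coords), centers = k, nstart = 25)
  split(words, km$cluster)                        # named list: cluster id -> words
}

# Synthetic stand-in for plot_df (assumption): two well-separated blobs of points
set.seed(1)
demo_df <- data.frame(
  X1 = c(rnorm(10, mean = -5), rnorm(10, mean = 5)),
  X2 = c(rnorm(10, mean = -5), rnorm(10, mean = 5)),
  word = paste0("w", 1:20),
  stringsAsFactors = FALSE
)

clusters <- words_by_cluster(demo_df[, c("X1", "X2")], demo_df$word, k = 2)
print(clusters)
```

With the question's data, the analogous call would presumably be words_by_cluster(plot_df[, c("X1", "X2")], plot_df$word, k = 5), where k = 5 is only a guess and would have to be tuned.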

Igor F.

Thank you! Then should I cluster `train_df` directly? – Mark Jan 19 '21 at 08:48
I don't see a reason against it. – Igor F. Jan 19 '21 at 10:37
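
A rough sketch of clustering in the original feature space, as suggested in these comments, might look like the following. It uses base R's hclust and cutree on a data frame shaped like train_df (a word column followed by numeric embedding columns). The helper name cluster_embedding, the unit-length scaling, the average linkage, the synthetic demo_train and k = 2 are all assumptions for illustration, not something the thread prescribes.

```r
# Hypothetical helper: hierarchical clustering on the full embedding.
# `df` has a "word" column first, then numeric embedding columns (like train_df).
cluster_embedding <- function(df, k) {
  emb <- as.matrix(df[, -1])                 # drop the word column
  # Scale rows to unit length so Euclidean distance becomes a monotone
  # function of cosine similarity (assumes no all-zero rows).
  emb <- emb / sqrt(rowSums(emb^2))
  tree <- hclust(dist(emb), method = "average")
  split(df$word, cutree(tree, k = k))        # named list: cluster id -> words
}

# Synthetic stand-in for train_df (assumption): 12 "words" in 5 dimensions,
# the first six around +3 and the last six around -3 in every dimension.
set.seed(1)
demo_train <- data.frame(
  word = paste0("w", 1:12),
  matrix(c(rnorm(6 * 5, mean = 3), rnorm(6 * 5, mean = -3)),
         nrow = 12, byrow = TRUE),
  stringsAsFactors = FALSE
)

groups <- cluster_embedding(demo_train, k = 2)
print(groups)
```

On the question's data the call would be along the lines of cluster_embedding(train_df, k = 5), again with k chosen arbitrarily here.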
