1

As explained here, t-SNE maps high dimensional data such as word embedding into a lower dimension in such that the distance between two words roughly describe the similarity. It also begins to create naturally forming clusters. For example with the code

if(!"pacman" %in% installed.packages()[,"Package"]) install.packages("pacman")
pacman::p_load(dplyr)
# grab reviews
reviews_all = read.csv("https://raw.githubusercontent.com/rjsaito/Just-R- 
Things/master/NLP/sample_reviews_venom.csv", stringsAsFactors = F)
# create ID for reviews
review_df <- reviews_all %>%
  mutate(id = row_number())
str(reviews_all)
pacman::p_load(text2vec, tm, ggrepel)
tokens <- space_tokenizer(reviews_all$comments %>%
                          tolower() %>%
                          removePunctuation())
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)

vocab <- prune_vocabulary(vocab, term_count_min = 5L)

# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

glove = GlobalVectors$new(rank = 50, x_max = 10)
glove$fit_transform(tcm, n_iter = 20)

word_vectors = glove$components

# load packages
pacman::p_load(tm, Rtsne, tibble, tidytext, scales)

# create vector of words to keep, before applying tsne (i.e. remove stop words)
keep_words <- setdiff(colnames(word_vectors), stopwords())

# keep words in vector
word_vec <- word_vectors[, keep_words]

# prepare data frame to train
train_df <- data.frame(t(word_vec)) %>%
  rownames_to_column("word")

# train tsne for visualization
tsne <- Rtsne(train_df[,-1], dims = 2, perplexity = 50, verbose=TRUE, max_iter = 500)


# create plot
colors = rainbow(length(unique(train_df$word)))
names(colors) = unique(train_df$word)

plot_df <- data.frame(tsne$Y) %>%
  mutate(
    word = train_df$word,
    col = colors[train_df$word]
  ) %>%
  left_join(vocab, by = c("word" = "term")) %>%
  filter(doc_count >= 20)

p <- ggplot(plot_df, aes(X1, X2)) +
  geom_text(aes(X1, X2, label = word, color = col), size = 3) +
  xlab("") + ylab("") +
  theme(legend.position = "none") 
p

We obtain the following picture.

enter image description here

What can be done after this preliminary analysis? is it possible to get the word list for each cluster? Or can a clustering algorithm be applied to the points represented in the image (and stored in plot_df)?

Mark
  • 167
  • 6

1 Answers1

3

You can, of course, use any numerical clustering (k-means, hierarchical clustering, spectral clustering...) on the projected data (the points). But why should it be better than clustering in the original, high dimensional feature space?

Igor F.
  • 6,004
  • 1
  • 16
  • 41