
The following question is about Skipgram, but it would be a plus (though not essential) to answer the question for the CBOW model as well.

Word2Vec uses neural networks, and neural networks learn by doing gradient descent on some objective function. So my questions are:

  • How are the words inputted into a Word2Vec model? In other words, what part of the neural network is used to derive the vector representations of the words?
  • What part of the neural network are the context vectors pulled from?
  • What is the objective function which is being minimized?
wlad
  • Why don’t you read the paper on word2vec? It explains all of this in great detail. Then come and ask questions. – Aksakal Nov 14 '21 at 03:29

2 Answers


How are the words inputted into a Word2Vec model? In other words, what part of the neural network is used to derive the vector representations of the words?

See Input vector representation vs output vector representation in word2vec

What is the objective function which is being minimized?

The original word2vec papers are notoriously unclear on some points pertaining to the training of the neural network (Why do so many publishing venues limit the length of paper submissions?). I advise you to look at {1-4}, which answer this question.


References:

Franck Dernoncourt

[Diagram: a word2vec network over a 5-word vocabulary with a 3-unit hidden layer; the one-hot input is multiplied by the $3\times 5$ matrix $W$, and the hidden layer by the $5\times 3$ matrix $W'$.]

How are the words inputted into a Word2Vec model? In other words, what part of the neural network is used to derive the vector representations of the words?

As the diagram above shows, the input words "Hope" and "Set" are encoded as one-hot vectors (a 1 at each word's index), and multiplying by the $W_{3\times 5}$ matrix yields the vector representation (embedding) of each word.
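For concreteness, here is a minimal NumPy sketch of that lookup, assuming a vocabulary of 5 words and a hidden size of 3 as in the diagram (the real implementation indexes the matrix directly rather than multiplying by a one-hot vector):

```python
import numpy as np

V, N = 5, 3                        # vocabulary size, hidden (embedding) size
rng = np.random.default_rng(0)
W = rng.normal(size=(N, V))        # input weight matrix W (3 x 5)

x = np.zeros(V)
x[1] = 1.0                         # one-hot vector with a 1 at the word's index

h = W @ x                          # hidden layer = the word's embedding
assert np.allclose(h, W[:, 1])     # the multiplication is just a column lookup
```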

What part of the neural network are the context vectors pulled from?

Word embedding vectors are pulled from the $W_{3\times 5}$ matrix, and context vectors are pulled from the $W'_{5\times 3}$ matrix.
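To make that concrete, a small illustrative sketch under the same assumed shapes (the variable names are mine, not word2vec's): the word vector is a column of $W$, the context vector is a row of $W'$, and their dot product is the unnormalised score for that (word, context) pair.

```python
import numpy as np

rng = np.random.default_rng(1)
W  = rng.normal(size=(3, 5))   # input/embedding matrix W   (hidden x vocab)
Wp = rng.normal(size=(5, 3))   # output/context matrix W'   (vocab x hidden)

w, c = 1, 4                    # indices of a centre word and a context word
v_w = W[:, w]                  # word (embedding) vector, a column of W
u_c = Wp[c, :]                 # context vector, a row of W'
score = u_c @ v_w              # unnormalised score that word c occurs near word w
```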

What is the objective function which is being minimized?

The objective function is the cross-entropy between the predicted probability distribution (a softmax over the whole vocabulary) and the actual target word(s).
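A toy sketch of that objective for a single (centre word, context word) pair, under the same assumed shapes (real implementations avoid computing the full softmax; see the speed-ups below):

```python
import numpy as np

def skipgram_loss(W, Wp, centre, context):
    """Cross-entropy loss for predicting `context` given `centre`, full softmax."""
    h = W[:, centre]                      # hidden layer = the centre word's embedding
    scores = Wp @ h                       # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                  # softmax over the whole vocabulary
    return -np.log(probs[context])        # cross-entropy against the one-hot target

rng = np.random.default_rng(2)
W, Wp = rng.normal(size=(3, 5)), rng.normal(size=(5, 3))
print(skipgram_loss(W, Wp, centre=1, context=4))
```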

There are two features in Word2Vec to speed things up:

  1. Skip-gram Negative Sampling (SGNS): it replaces the softmax over the whole vocabulary with a set of binary logistic classifications, one for the true target word and a few for negative words sampled at random, so that each backpropagation pass updates only a small fraction of the weights instead of all of them (a minimal sketch follows this list).

  2. Hierarchical Softmax: only the nodes along the path from the root of the Huffman tree to the target word are evaluated [2].
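Here is the minimal sketch of the SGNS loss from point 1, assuming the negative word indices have already been sampled (in practice they are drawn from a smoothed unigram distribution):

```python
import numpy as np

def sgns_loss(W, Wp, centre, context, negatives):
    """Negative-sampling loss: one binary logistic term for the true context word
    plus one term for each randomly sampled negative word."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    v = W[:, centre]                              # centre word's embedding
    loss = -np.log(sigmoid(Wp[context] @ v))      # push the true pair's score up
    for neg in negatives:
        loss -= np.log(sigmoid(-(Wp[neg] @ v)))   # push the sampled negatives down
    return loss

rng = np.random.default_rng(3)
W, Wp = rng.normal(size=(3, 5)), rng.normal(size=(5, 3))
print(sgns_loss(W, Wp, centre=1, context=4, negatives=[0, 2]))
```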

Lerner Zhang