
Multiple references are clear on how a single word is one-hot encoded in an Embedding layer, but what about sentences?

To illustrate with an example, I will use the following SO reference. Let's suppose my training set consists of these two phrases:

Hope to see you soon

Nice to see you again

We can encode them as the following index sequences:

[0, 1, 2, 3, 4]

[5, 1, 2, 3, 6]
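For concreteness, here is a sketch of one way such a word-to-index mapping could be built (plain Python, not necessarily how the referenced example does it):

sentences = ["Hope to see you soon", "Nice to see you again"]

# Assign each new word the next free index, in order of first appearance.
word_to_index = {}
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index)

encoded = [[word_to_index[w] for w in s.lower().split()] for s in sentences]
print(encoded)  # [[0, 1, 2, 3, 4], [5, 1, 2, 3, 6]]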

Next, we could feed them into the following keras Embedding layer:

Embedding(7, 2, input_length=5)

In the end, the embedding vectors can be mapped as in the following example:

+------------+------------+
|   index    |  Embedding |
+------------+------------+
|     0      | [1.2, 3.1] |
|     1      | [0.1, 4.2] |
|     2      | [1.0, 3.1] |
|     3      | [0.3, 2.1] |
|     4      | [2.2, 1.4] |
|     5      | [0.7, 1.7] |
|     6      | [4.1, 2.0] |
+------------+------------+
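For reference, a minimal Keras sketch of this lookup (the embedding weights are randomly initialized, so the actual values will not match the table above, only the shapes will):

import numpy as np
from tensorflow.keras.layers import Embedding

# The two sentences, already encoded as index sequences of length 5.
sequences = np.array([[0, 1, 2, 3, 4],
                      [5, 1, 2, 3, 6]])

# 7 vocabulary entries, 2-dimensional embedding vectors.
embedding = Embedding(input_dim=7, output_dim=2)
output = embedding(sequences)

print(output.shape)  # (2, 5, 2): 2 sentences, 5 words each, one 2-dim vector per word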

Internally, I understand that the Embedding layer is a densely connected network receiving one-hot encoded words. For instance, the word "soon" has index 4, so its one-hot vector is [0, 0, 0, 0, 1, 0, 0].

Now, here's the question: in the previous example, I actually have a sentence instead of just a single word. It is not clear to me, for example, how the following sentence would be encoded:

Nice to see you again -> [5, 1, 2, 3, 6] -> ?

My first guess is that each sentence would be transformed into a 2D matrix of one-hot vectors, with one row per word, for example:

[[0, 0, 0, 0, 0, 1, 0], #nice
 [0, 1, 0, 0, 0, 0, 0], #to
 [0, 0, 1, 0, 0, 0, 0], #see
 [0, 0, 0, 1, 0, 0, 0], #you
 [0, 0, 0, 0, 0, 0, 1]] #again
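For concreteness, a small NumPy sketch of this first guess (building the 5 × 7 matrix of one-hot rows from the index sequence):

import numpy as np

indices = [5, 1, 2, 3, 6]  # "Nice to see you again"
vocab_size = 7

# Each row of the identity matrix is a one-hot vector; fancy indexing stacks the rows we need.
one_hot_matrix = np.eye(vocab_size, dtype=int)[indices]
print(one_hot_matrix.shape)  # (5, 7)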

However, this guess did not seem to make sense to me, since a 2D matrix is not compatible with the input dimension of a dense layer. My second guess is that all of the words would be one-hot encoded into a single vector, for example:

Nice to see you again -> [5, 1, 2, 3, 6] -> [0, 1, 1, 1, 0, 1, 1]

Even though this seems more compatible with a dense layer's input, the solution does not take the order of the words into account.

I would really appreciate anyone who could make that clear for me!

1 Answer


You're right that the input sentence is encoded as a matrix of one-hot vectors.

Just write out the matrix dimensions. A sentence has $w$ words in it, and there are $v$ total words in the entire vocabulary. The embedding maps each word to a vector of $k$ elements. So the input matrix has dimension $w \times v$ and the embedding matrix has shape $v \times k$. After multiplication, the result is a $w \times k$ matrix: $w$ different vectors of $k$ elements each.
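Here is a small NumPy sketch of that dimension check, using the toy numbers from the question ($w = 5$, $v = 7$, $k = 2$) and the illustrative embedding values from the table:

import numpy as np

v, k = 7, 2                # vocabulary size, embedding dimension
indices = [5, 1, 2, 3, 6]  # "Nice to see you again", w = 5 words

# Embedding matrix (v x k), rows taken from the table in the question.
E = np.array([[1.2, 3.1],
              [0.1, 4.2],
              [1.0, 3.1],
              [0.3, 2.1],
              [2.2, 1.4],
              [0.7, 1.7],
              [4.1, 2.0]])

one_hot = np.eye(v)[indices]  # w x v matrix of one-hot rows
result = one_hot @ E          # (w x v) @ (v x k) -> w x k

print(result.shape)                     # (5, 2)
print(np.allclose(result, E[indices]))  # True: multiplying by one-hot rows just selects rows of E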

I think the key detail here is that the input matrix needs one dimension to have $v$ entries to encode which word in the vocabulary appears at that position.

Sycorax
  • Thanks for the answer! So my first guess was correct? It is still not fully clear yet, but I will try to transform your answer into an example – Fernando Wittmann Jun 01 '21 at 18:54
  • I think your first guess is on the right track. I didn't really understand why, in your question, you objected that the first guess was incorrect, so that's what I set out to explain in my answer. As for creating an example, all you need to do is write down the two matrices and confirm that multiplication with 1-hot vectors is just selecting the desired value from the other matrix. – Sycorax Jun 01 '21 at 19:52