
I need to understand how the 'Embedding' layer in the Keras library works. I execute the following code in Python:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(5, 2, input_length=5))

input_array = np.random.randint(5, size=(1, 5))

model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)

which gives the following output

input_array = [[4 1 3 3 3]]
output_array = 
[[[ 0.03126476  0.00527241]
  [-0.02369716 -0.02856163]
  [ 0.0055749   0.01492429]
  [ 0.0055749   0.01492429]
  [ 0.0055749   0.01492429]]]

I understand that each value in input_array is mapped to a 2-element vector in output_array, so a 1 x 5 input gives a 1 x 5 x 2 output. But how are the mapped values computed?

prashanth

3 Answers


In fact, the output vectors are not computed from the input using any mathematical operation. Instead, each input integer is used as an index to access a table that contains all possible vectors. That is the reason why you need to specify the size of the vocabulary as the first argument (so the table can be initialized).
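A quick way to see this with the snippet from the question (reusing the `model`, `input_array`, and `output_array` variables from above): the output is just rows of the layer's weight matrix, selected by the input integers.

# Fetch the lookup table stored inside the Embedding layer.
weights = model.layers[0].get_weights()[0]   # shape: (input_dim, output_dim) = (5, 2)

# Indexing the table with the input integers reproduces the layer's output.
lookup = weights[input_array]                # shape: (1, 5, 2)

print(np.allclose(lookup, output_array))     # True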

The most common application of this layer is for text processing. Let's see a simple example. Our training set consists only of two phrases:

Hope to see you soon

Nice to see you again

So we can encode these phrases by assigning each word a unique integer (for example, by order of appearance in our training dataset). Then our phrases could be rewritten as:

[0, 1, 2, 3, 4]

[5, 1, 2, 3, 6]

Now imagine we want to train a network whose first layer is an embedding layer. In this case, we should initialize it as follows:

Embedding(7, 2, input_length=5)

The first argument (7) is the number of distinct words in the training set. The second argument (2) indicates the size of the embedding vectors. The input_length argument, of course, determines the size of each input sequence.
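A minimal sketch of this setup (the variable names here are made up, and depending on your Keras version the imports may need to be `tensorflow.keras` instead of `keras`):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# Integer-encoded training phrases, using the encoding from above.
phrases = np.array([
    [0, 1, 2, 3, 4],   # "Hope to see you soon"
    [5, 1, 2, 3, 6],   # "Nice to see you again"
])

model = Sequential()
model.add(Embedding(7, 2, input_length=5))   # vocabulary of 7 words, 2-d embeddings
model.compile('rmsprop', 'mse')

# Each integer is replaced by its (randomly initialized) 2-d vector.
print(model.predict(phrases).shape)          # (2, 5, 2)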

Once the network has been trained, we can get the weights of the embedding layer, which in this case will be of size (7, 2) and can be thought of as the table used to map integers to embedding vectors:

+------------+------------+
|   index    |  Embedding |
+------------+------------+
|     0      | [1.2, 3.1] |
|     1      | [0.1, 4.2] |
|     2      | [1.0, 3.1] |
|     3      | [0.3, 2.1] |
|     4      | [2.2, 1.4] |
|     5      | [0.7, 1.7] |
|     6      | [4.1, 2.0] |
+------------+------------+

So according to these embeddings, our second training phrase will be represented as:

[[0.7, 1.7], [0.1, 4.2], [1.0, 3.1], [0.3, 2.1], [4.1, 2.0]]
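In code, that representation is just NumPy fancy indexing into the (7, 2) weight matrix, using the illustrative values from the table above (these are not real trained weights):

import numpy as np

# The (7, 2) table from the example.
embedding_matrix = np.array([
    [1.2, 3.1],   # 0: Hope
    [0.1, 4.2],   # 1: to
    [1.0, 3.1],   # 2: see
    [0.3, 2.1],   # 3: you
    [2.2, 1.4],   # 4: soon
    [0.7, 1.7],   # 5: Nice
    [4.1, 2.0],   # 6: again
])

# "Nice to see you again" -> [5, 1, 2, 3, 6]
print(embedding_matrix[[5, 1, 2, 3, 6]])
# [[0.7 1.7]
#  [0.1 4.2]
#  [1.  3.1]
#  [0.3 2.1]
#  [4.1 2. ]]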

It might seem counterintuitive at first, but the underlying automatic differentiation engines (e.g., TensorFlow or Theano) manage to optimize these vectors associated with each input integer just like any other parameter of your model.

For an intuition of how this table lookup is implemented as a mathematical operation which can be handled by the automatic differentiation engines, consider the embeddings table from the example as a (7, 2) matrix. Then, for a given word, you create a one-hot vector based on its index and multiply it by the embeddings matrix, effectively replicating the lookup. For instance, for the word "soon" the index is 4, and the one-hot vector is [0, 0, 0, 0, 1, 0, 0]. If you multiply this (1, 7) matrix by the (7, 2) embeddings matrix you get the desired two-dimensional embedding, which in this case is [2.2, 1.4].
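The same equivalence in NumPy, again with the illustrative table values:

import numpy as np

# The (7, 2) table from the example (same illustrative values as above).
embedding_matrix = np.array([[1.2, 3.1], [0.1, 4.2], [1.0, 3.1], [0.3, 2.1],
                             [2.2, 1.4], [0.7, 1.7], [4.1, 2.0]])

one_hot_soon = np.zeros((1, 7))
one_hot_soon[0, 4] = 1.0                 # "soon" has index 4

# (1, 7) @ (7, 2) -> (1, 2): the multiplication simply selects row 4.
print(one_hot_soon @ embedding_matrix)   # [[2.2 1.4]]
print(embedding_matrix[4])               # [2.2 1.4], same result via plain indexing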

It is also interesting to reuse embeddings learned by other methods/people in different domains (see https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html), as done in [1].

[1] López-Sánchez, D., Herrero, J. R., Arrieta, A. G., & Corchado, J. M. Hybridizing metric learning and case-based reasoning for adaptable clickbait detection. Applied Intelligence, 1-16.

Daniel López
  • Thank you for the answer. Just one query: how are the weights of the embedding layer obtained? For example, for index 0, how is [1.2, 3.1] obtained? – prashanth Sep 29 '17 at 15:39
  • The contents of the table that relates indexes to embedding vectors (i.e., the weights of the embedding layer) are initialized at random and then optimized by the training algorithm (e.g., Gradient Descent). – Daniel López Sep 29 '17 at 19:27
  • Thanks. I'm still a bit unclear on what the optimizer optimizes against. What is the "correct answer" that allows it to compute a loss function? Or, said another way, what is it doing for the forward and backward pass? – bwest87 Dec 05 '17 at 00:28
  • That will depend on the problem you are trying to solve. Note that after the embedding layer you can use an arbitrary architecture of layers. So for instance, if you are working on a classification problem, your network will typically end with a softmax layer and you can use any standard loss such as MSE. – Daniel López Dec 05 '17 at 14:43
  • So... the embedding is basically just a subnetwork of the overall architecture which reduces one-hot encoded inputs down to fewer inputs, as far as I can tell. – Mike Campbell Dec 14 '17 at 09:55
  • Since the embedding layer is trainable, how sensitive is it to values missing from the training set? Let's say I've got ten words in the training set and five more in the test set, so my vocabulary length is 15, but the layer is never activated by those five 'test' words during training. Could you please explain this situation? – mikalai Oct 16 '18 at 21:34
  • @mikalai, to the best of my knowledge, only the embeddings corresponding to words that appear in the training set get updated during the training process. If you initialize the embedding layer with additional embeddings for words that are not in the training set, these will remain as initialized during the training process. Also, I have noticed that the Keras implementation raises an exception when you ask the embedding layer to process a sequence with an integer which is not in the table (i.e., a new word). This is something important to keep in mind when pre-processing. – Daniel López Oct 17 '18 at 09:35
  • How is the embedding layer initialized? It looks like it is uniform by default (https://keras.io/layers/embeddings/, `embeddings_initializer='uniform'`). – mrgloom Apr 16 '19 at 01:14
  • Best explanation I have read so far. Thanks. – Biranchi Jul 20 '19 at 08:26
  • @DanielLópez I understand that the loss calculation and the embedding updates depend on the problem. But in the simplest case, like the OP's, what is `mse` calculated against? What are the ground-truth values? – Sergey Bushmanov Nov 23 '19 at 10:04
  • @SergeyBushmanov I am guessing that these are the initialized values before training, as indicated by @mrgloom. – prashanth Nov 25 '19 at 09:22
  • @prashanth Good point! In the OP the model is compiled but never trained. It would be interesting to see the training part (see the sketch after this comment thread). – Sergey Bushmanov Nov 25 '19 at 14:36
  • Beautiful and easy-to-digest explanation. One thing is still not clear to me: why is the embedding vector two-dimensional? – emdi Apr 15 '20 at 08:22
  • @emdi I used 2D embeddings to keep the example short, but you can use any number of dimensions for them. – Daniel López Apr 25 '20 at 11:24
  • @DanielLópez Would you mind explaining this process with an example where the inputs are IDs, not words? I think if someone understands `word2vec` it is easy to understand the Embedding layer. But when you have IDs as input to the embedding layer, what is the ground-truth value? – nad Jun 11 '20 at 15:55
  • What about the full sentence, how would it be embedded? Would it be like `Nice to see you again -> [5, 1, 2, 3, 6] -> [0, 1, 1, 1, 0, 1, 1]`? – Fernando Wittmann Jun 01 '21 at 02:27
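To make the training part discussed in the comments concrete, here is a minimal, hypothetical sketch; the labels, the Flatten/Dense head, and the loss are invented purely for illustration. The labels provide the "ground truth" the loss is computed against, and the embedding table is updated by backpropagation like any other weight:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

# Toy data: the two integer-encoded phrases with made-up binary labels.
x = np.array([[0, 1, 2, 3, 4],
              [5, 1, 2, 3, 6]])
y = np.array([1, 0])                         # invented labels, just to have a target

model = Sequential()
model.add(Embedding(7, 2, input_length=5))   # the lookup table: a (7, 2) weight matrix
model.add(Flatten())                         # (5, 2) -> 10 values per phrase
model.add(Dense(1, activation='sigmoid'))    # toy binary classifier head

model.compile('rmsprop', 'binary_crossentropy')

before = model.layers[0].get_weights()[0].copy()
model.fit(x, y, epochs=10, verbose=0)
after = model.layers[0].get_weights()[0]

# The rows used by the training phrases have moved: gradients flowed into the table.
print(np.abs(after - before).sum() > 0)      # True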

I also had the same question, and after reading a couple of posts and materials I think I figured out what the embedding layer's role is.

I think this post is also helpful for understanding it; however, I really find Daniel's answer the easiest to digest. I also got the idea behind it mainly by understanding word embeddings.

I believe it's inaccurate to say that embedding layers reduce one-hot-encoded input down to fewer inputs. After all, the one-hot vector is one-dimensional data, and it is indeed turned into 2 dimensions in our case. It is better to say that

the embedding layer comes up with a relation between the inputs in another dimension,

whether in 2 dimensions or even higher.

I also find a very interesting similarity between word embeddings and Principal Component Analysis. Although the name might look complicated, the concept is straightforward. What PCA does is describe a set of data based on some general rules (the so-called principal components). So it's like having some data that you want to describe using only 2 components, which in this sense is very similar to word embeddings. They both do a similar job in different contexts. You can find out more here. I hope understanding PCA helps to understand embedding layers through analogy.

To wrap up, the answer to the post's original question, "how are the values computed?", would be:

  • Basically, our neural network captures the underlying structure of the inputs (our sentences) and puts the relations between the words of our vocabulary into another dimension (let's say 2) by optimization.
  • A deeper view is that the frequency with which each word appears alongside the other words of our vocabulary influences these embeddings (in a very naive approach we can compute such co-occurrence counts by hand; see the sketch after this list).
  • The aforementioned frequency could be one of many underlying structures that the NN can capture.
  • You can find the intuition in the YouTube link explaining word embeddings.
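As a very rough, hypothetical illustration of the "compute it by hand" idea above (this is not what the Embedding layer itself computes; it only shows what a co-occurrence structure looks like), assuming the integer encoding from the accepted answer:

import numpy as np

# The two example sentences, integer-encoded as in the accepted answer.
sentences = [[0, 1, 2, 3, 4],    # "Hope to see you soon"
             [5, 1, 2, 3, 6]]    # "Nice to see you again"

vocab_size = 7
cooc = np.zeros((vocab_size, vocab_size), dtype=int)

# Count how often each pair of distinct words appears in the same sentence.
for sentence in sentences:
    for i in sentence:
        for j in sentence:
            if i != j:
                cooc[i, j] += 1

# Words 1, 2, 3 ("to", "see", "you") co-occur with everything, so their rows are densest.
print(cooc)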
Novin Shahroudi
  • Nice point of view. However, I think it is worth noting that while word-embedding techniques such as word2vec try to capture the full meaning of words in the resulting embedding, the embedding layer in a supervised network might not learn such a semantically rich and general representation. For example, if your network is trained to do sentiment classification, it will probably just group/cluster words in the embedding according to their "emotional" load. Nevertheless, based on my experience it is often useful to initialize your embedding layer with weights learned by word2vec on a big corpus. – Daniel López Jan 04 '18 at 10:08
  • A one-hot vector is not one-dimensional data. Its dimension is the size of the vocabulary. – Binu Jasim Jan 07 '18 at 22:42
  • @BinuJasim you're right. The **one-hot vectors** representing a vocabulary are not one-dimensional data. But the information they represent is indeed one-dimensional, and every entity within the vocabulary is one-dimensional data. It's true that we have n*w elements (n = vocabulary size, w = number of bits), but each binary value represents a vector which, again, is a one-dimensional input. – Novin Shahroudi Jan 15 '18 at 14:53
  • @NovinShahroudi Brilliant, thanks for the explanation. – Benyamin Jafari Oct 11 '19 at 12:37

If you're more interested in the "mechanics", the embedding layer is basically a matrix which can be considered a transformation from your discrete and sparse one-hot vector into a continuous and dense latent space. Only to save computation, you don't actually do the matrix multiplication, since it is redundant in the case of one-hot vectors.

So, say you have a vocabulary of size 5000 as your input dimension, and you want a 256-dimensional output representation of it. You will then have a (5000, 256)-shaped matrix, which you "should" multiply by your one-hot vector representation to get the latent vector. Only in practice, instead of multiplying, you just take the index...
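A small NumPy sketch of that equivalence, with random values standing in for the learned matrix (the one-hot product is exactly what a Dense layer with no bias and no activation would compute, as noted below):

import numpy as np

vocab_size, embed_dim = 5000, 256
W = np.random.randn(vocab_size, embed_dim)     # the embedding matrix, shape (5000, 256)

word_index = 42                                # some token id

# The "mathematical" view: one-hot row vector times the matrix.
one_hot = np.zeros((1, vocab_size))
one_hot[0, word_index] = 1.0
via_matmul = one_hot @ W                       # shape (1, 256)

# The "practical" view: just read the row.
via_lookup = W[word_index]                     # shape (256,)

print(np.allclose(via_matmul[0], via_lookup))  # True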

Source: Andrew Ng

(One way that helps me think of it in theory is as a Dense layer, only without bias or activation...)

The weights of this matrix are learned through training. You could train it as in Word2Vec, GloVe, etc., or on the specific task that you are dealing with. Or you can load pre-trained weights (say GloVe) and continue training on your specific task.
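A sketch of the last option, with a random matrix standing in for one whose rows were filled with pre-trained vectors; the `weights`/`trainable` pattern below is the one used in older Keras versions (and in the blog post linked in the accepted answer), while newer versions may prefer setting an initializer instead:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# Stand-in for a matrix whose rows were filled from pre-trained vectors (e.g., GloVe).
vocab_size, embed_dim = 7, 2
embedding_matrix = np.random.randn(vocab_size, embed_dim)

model = Sequential()
model.add(Embedding(vocab_size, embed_dim,
                    weights=[embedding_matrix],  # initialize the table with these values
                    input_length=5,
                    trainable=True))             # set False to freeze the embeddings
# ... add the task-specific layers on top, compile, and train as usual.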

Maverick Meerkat