I have been trying to make a language model that predicts the next word, under the assumption that there can be multiple "correct" answers.

Input: dictionary indices, plus document topic data for the initial states.
Output: a vector as long as the vocab size, giving the probability of each word being the next one (the targets are one-hot).

And I thought it would be cool to be able to show multiple suggestions, e.g. the next word will be word1 with probability 0.5, word2 with probability 0.3, etc.

from tensorflow.keras.layers import Input, Embedding, Dense, LSTM
from tensorflow.keras.models import Model
from tensorflow.keras.initializers import Constant

vocab_size = 3527
categ_count = 21
W_SIZE = 5
network_size = 4096

# Network: a window of word indices feeds an embedding + stacked LSTMs;
# the category vector initializes the first LSTM's hidden and cell states.
word_input = Input(shape=(W_SIZE,), name="Word_Input")
categ_input = Input(shape=(categ_count,), name="Category_Input")

# embedding_matrix is a pre-built (vocab_size, 300) array of word vectors, defined elsewhere
word_embed = Embedding(vocab_size, 300, input_length=W_SIZE, embeddings_initializer=Constant(embedding_matrix), trainable=True, name="Embedding_Layer")(word_input)
dense_h = Dense(network_size, activation="relu", name="Initial_h")(categ_input)
dense_c = Dense(network_size, activation="relu", name="Initial_c")(categ_input)

lstm1 = LSTM(network_size, dropout=0.3, return_sequences=True, name="LSTM_1")(word_embed, initial_state=[dense_h, dense_c])
lstm2 = LSTM(network_size, dropout=0.3, name="LSTM_2")(lstm1)
output = Dense(vocab_size, activation="softmax", name="Output")(lstm2)

model = Model([word_input, categ_input], output)

model.summary()
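
This is roughly how I'd want to read off the suggestions (a sketch only: word_batch and categ_batch stand in for real model inputs shaped like the two Input layers, and index_to_word is a placeholder index-to-word lookup):

import numpy as np

probs = model.predict([word_batch, categ_batch])[0]  # shape: (vocab_size,)

top_k = 5
best = np.argsort(probs)[::-1][:top_k]  # indices of the k most probable next words
for idx in best:
    print(index_to_word[idx], probs[idx])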

However, I discovered that traditional NNs are not good at producing calibrated probability distributions, since training pushes them toward confident predictions by punishing ambiguous predictions harshly. I have tried looking at some modules in TensorFlow Probability, but I cannot figure out how they could be used in my model.

Is there a way to add to or edit my current code so that I can get the probability of each word as an output?

1 Answer

Softmax already gives a probability distribution as an output, so the question as literally asked is trivially answered. But I think what you're really asking is how to limit the model's confidence. Two options come to mind, neither of which involves changing anything about how your network is trained or its architecture (a short sketch of both follows the list).

  1. What you could do is smooth the predictions. Given a vector $v$ with $k$ entries, $v=\left[\frac{1}{k}, \frac{1}{k}, \ldots, \frac{1}{k}\right]$, take a convex combination with the LSTM output $\hat{y}$: $$\lambda \hat{y} + (1 - \lambda) v$$ This makes the predictions less confident by smoothing them toward a vector $v$ that is uniform over the vocabulary, with the amount of smoothing controlled by $0 \le \lambda \le 1$.

    For $\lambda > 0$, this changes the confidence without changing the relative ordering (the largest entry is still the largest after taking the convex combination, likewise the smallest, etc.).

  2. Another alternative is to use a temperature in the softmax. Replace the typical softmax expression with $$p(j) = \frac{\exp(x_j/T)}{\sum_{i=1}^k \exp(x_i/T)}.$$ The constant $T > 0$ controls how confident the predictions are, but does not change the relative ordering of the values. See "What is the role of temperature in Softmax?" for more detail.
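
A minimal NumPy sketch of both options, assuming probs is a softmax output from the model. For the temperature version, note that $\log$ of the softmax output recovers the logits up to an additive constant, and that constant cancels when renormalizing, so the temperature can be applied directly to the probabilities:

import numpy as np

def smooth(probs, lam):
    # Option 1: convex combination with the uniform vector v = [1/k, ..., 1/k]
    k = probs.shape[-1]
    return lam * probs + (1.0 - lam) / k

def with_temperature(probs, T):
    # Option 2: softmax temperature; log(probs) recovers the logits up to
    # an additive constant, which cancels on renormalization
    logits = np.log(probs + 1e-12)  # epsilon guards against log(0)
    scaled = np.exp(logits / T)
    return scaled / scaled.sum(axis=-1, keepdims=True)

p = np.array([0.7, 0.2, 0.1])
print(smooth(p, 0.5))            # less confident, same ordering
print(with_temperature(p, 2.0))  # T > 1 flattens, T < 1 sharpens

Both transformations preserve the argsort of the probabilities, so the ranked suggestion list is unchanged; only the reported confidences move.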

Sycorax
  • thanks for the answer. I understand that the probability values can be tweaked afterwards without changing the structure of the network, but does that also mean the ranking given by the LSTM is still consistent with the actual distribution, despite its tendency to maximize the confidence of a single value? I asked another question here (https://stats.stackexchange.com/questions/532907/multiple-likely-ys-for-one-instance-of-x-word-prediction-with-lstm) – alpaprika39 Jul 01 '21 at 09:51
  • What is the "actual distribution"? Your target vector is a one-hot binary vector, right? So to the extent that the LSTM outputs a value that's close to 1 in the correct position, and 0 elsewhere, it is very consistent with the actual distribution. – Sycorax Jul 01 '21 at 15:46