Questions tagged [language-models]

A statistical language model is a probability distribution over sequences of words.

100 questions
16 votes, 1 answer

What are the pros and cons of applying pointwise mutual information on a word cooccurrence matrix before SVD?

One way to generate word embeddings is as follows (mirror): Get a corpus, e.g. "I enjoy flying. I like NLP. I like deep learning." Build the word cooccurrence matrix $X$ from it. Perform SVD on $X$, and keep the first $k$ columns of $U$. Each row…
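A minimal sketch of that pipeline with PMI applied before the SVD (the window size of 1 and all variable names are illustrative assumptions, not the asker's exact setup):

    # Build a co-occurrence matrix, apply positive PMI, then truncate an SVD
    # to obtain k-dimensional word embeddings.
    import numpy as np

    corpus = [s.split() for s in ["I enjoy flying", "I like NLP", "I like deep learning"]]
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}

    # Symmetric co-occurrence counts within a +/-1 word window.
    X = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - 1), min(len(sent), i + 2)):
                if j != i:
                    X[idx[w], idx[sent[j]]] += 1

    # Positive PMI: max(log(p(w,c) / (p(w) p(c))), 0); zero-count cells stay zero.
    total = X.sum()
    pw = X.sum(axis=1, keepdims=True) / total
    pc = X.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):
        ppmi = np.maximum(np.log((X / total) / (pw * pc)), 0)

    # SVD of the (P)PMI matrix; keep the first k columns of U as the embeddings.
    U, S, Vt = np.linalg.svd(ppmi)
    k = 2
    embeddings = U[:, :k]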
16 votes, 3 answers

In Kneser-Ney smoothing, how are unseen words handled?

From what I have seen, the (second-order) Kneser-Ney smoothing formula is in some way or another given as $ \begin{align} P^2_{KN}(w_n|w_{n-1}) &= \frac{\max \left\{ C\left(w_{n-1}, w_n\right) - D, 0\right\}}{\sum_{w'} C\left(w_{n-1}, w'\right)} +…
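For context on the truncated term: in the commonly cited interpolated form of bigram Kneser-Ney (the standard textbook statement, which may differ in detail from the source the asker is quoting), the estimate continues with a back-off weight times a continuation probability:
$$ P^2_{KN}(w_n \mid w_{n-1}) = \frac{\max\left\{ C(w_{n-1}, w_n) - D,\, 0\right\}}{\sum_{w'} C(w_{n-1}, w')} + \lambda(w_{n-1})\, P_{\text{cont}}(w_n), $$
$$ \lambda(w_{n-1}) = \frac{D}{\sum_{w'} C(w_{n-1}, w')}\, \bigl|\{ w' : C(w_{n-1}, w') > 0 \}\bigr|, \qquad P_{\text{cont}}(w_n) = \frac{\bigl|\{ w' : C(w', w_n) > 0 \}\bigr|}{\bigl|\{ (w', w'') : C(w', w'') > 0 \}\bigr|}. $$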
11 votes, 2 answers

Question about Continuous Bag of Words

I'm having trouble understanding this sentence: The first proposed architecture is similar to the feedforward NNLM, where the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection…
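A minimal sketch of what "the projection layer is shared for all words" amounts to (toy dimensions and random weights, purely illustrative, not Mikolov's implementation): each context word indexes the same embedding matrix, the looked-up vectors are averaged, and the result goes straight to the output softmax with no non-linear hidden layer in between.

    # Toy CBOW forward pass: shared projection, no non-linear hidden layer.
    import numpy as np

    vocab_size, dim = 10, 4
    W_in = np.random.randn(vocab_size, dim) * 0.01   # shared projection (embedding) matrix
    W_out = np.random.randn(dim, vocab_size) * 0.01  # output weights

    def cbow_probs(context_ids):
        h = W_in[context_ids].mean(axis=0)   # average the shared projections of the context words
        scores = h @ W_out
        exp = np.exp(scores - scores.max())
        return exp / exp.sum()               # softmax: probability of each word being the center word

    p = cbow_probs([2, 5, 7, 1])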
10 votes, 3 answers

Regarding using bigram (N-gram) model to build feature vector for text document

A traditional approach to feature construction for text mining is the bag-of-words approach, which can be enhanced using tf-idf to set up the feature vector characterizing a given text document. At present, I am trying to use a bi-gram language model…
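As a concrete illustration of that enhancement (a sketch assuming scikit-learn is acceptable; the documents are made up), switching the vectorizer's n-gram range from unigrams to unigrams-plus-bigrams is enough to add bigram features to the tf-idf vector:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

    unigram_vec = TfidfVectorizer(ngram_range=(1, 1))
    bigram_vec = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams

    X_uni = unigram_vec.fit_transform(docs)
    X_bi = bigram_vec.fit_transform(docs)

    print(X_uni.shape, X_bi.shape)               # the bigram space has more feature columns
    print(bigram_vec.get_feature_names_out())    # includes entries like 'quick brown'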
8 votes, 1 answer

Language modeling: why is adding up to 1 so important?

In many natural language processing applications such as spelling correction, machine translation and speech recognition, we use language models. Language models are usually created by counting how often sequences of words (n-grams) occur in a large…
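A one-line check of why plain counting already yields a proper distribution (maximum-likelihood estimation, before any smoothing is applied): for a fixed history $h$,
$$ \sum_{w} P(w \mid h) = \sum_{w} \frac{C(h, w)}{\sum_{w'} C(h, w')} = \frac{\sum_{w} C(h, w)}{\sum_{w'} C(h, w')} = 1, $$
which is the property the question is asking about.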
7 votes, 2 answers

Is a trigram model guaranteed to perform more accurately than a bigram model?

When implementing an NLP project such as text segmentation or Named Entity Recognition, is a trigram model guaranteed to perform more accurately than a bigram model? $$ \text{Trigram: } p(s_t\mid s_{t-2}, s_{t-1}) $$ $$ \text{Bigram: } p(s_t\mid s_{t-1}) $$ EDIT: I was…
xiaoyao
7 votes, 2 answers

Calculating test-time perplexity for seq2seq (RNN) language models

To compute the perplexity of a language model (LM) on a test sentence $s=w_1,\dots,w_n$ we need to compute all next-word predictions $P(w_1), P(w_2|w_1),\dots,P(w_n|w_1,\dots,w_{n-1})$. My question is: How are these terms computed for a seq2seq…
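For reference, once those conditional probabilities are available (for an RNN/seq2seq decoder they are read off the softmax at each time step while the previous gold words are fed in), the test-sentence perplexity is the standard quantity
$$ \mathrm{PP}(s) = P(w_1, \dots, w_n)^{-1/n} = \exp\!\left( -\frac{1}{n} \sum_{i=1}^{n} \log P(w_i \mid w_1, \dots, w_{i-1}) \right). $$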
6 votes, 2 answers

Neural network language model: predicting the word at the center of the context words, or to their right?

Should a neural network language model predict the word at the center of the context words, or the word to their right? In Bengio's paper, the model predicts the probability of the next word from the preceding n words, e.g. predicting the probabilities of "book", "car", etc., given the n words…
Tom
6 votes, 1 answer

n-gram language model

At the end of the introduction of A Neural Probabilistic Language Model (Bengio et al. 2003), the following example is given: Having seen the sentence The cat is walking in the bedroom in the training corpus should help us generalize to make the…
Antoine
5 votes, 1 answer

Why are Transformers "suboptimal" for language modeling but not for translation?

Language Models with Transformers states: Transformer architectures are suboptimal for language model itself. Neither self-attention nor the positional encoding in the Transformer is able to efficiently incorporate the word-level sequential context…
5 votes, 1 answer

Why can't standard conditional language models be trained left-to-right *and* right-to-left?

From the BERT paper: Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict…
user2740
5 votes, 1 answer

Advantage of character-based language models over word-based

Is there an intuition for why character-based language models are preferred over word-based ones? For example, Karpathy builds his language model by predicting the next character in his blog post (Karpathy Blog). The aspect I am struggling with is that not each…
4 votes, 1 answer

How does one design a custom loss function? What features make a loss function "good"?

I have a custom situation for which I am trying to design a cost function. The idea is that you have a stack of LSTMs doing something slightly unconventional. Each LSTM$_l$ computes a linear transformation of its hidden layer $V_{l-1}h^t_l$ to…
Sam
4 votes, 2 answers

Generating text from language model

I have a trained LSTM language model and want to use it to generate text. The standard approach for this seems to be: apply the softmax function, then take a weighted random choice to determine the next word. This is working reasonably well for me, but it would…
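A minimal sketch of that generation step (illustrative; the logits input stands in for whatever the trained LSTM produces and is not taken from the question), with a temperature knob that is a common tweak on top of plain weighted sampling:

    import numpy as np

    def sample_next(logits, temperature=1.0, rng=np.random.default_rng()):
        # Lower temperature -> sharper distribution (safer text);
        # higher temperature -> flatter distribution (more varied text).
        scaled = np.asarray(logits, dtype=float) / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()                       # softmax
        return rng.choice(len(probs), p=probs)     # weighted random choice over the vocabulary

    next_word_id = sample_next([2.0, 1.0, 0.5, 0.1, -1.0], temperature=0.8)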
4 votes, 1 answer

Skip-gram algorithm confusion

As a newbie to NLP, I am (deeply) confused by the middle step in the following diagram explaining the skip-gram algorithm. The video where this diagram was presented can be found at: https://www.youtube.com/watch?v=ERibwqs9p38 (Highly appreciate…
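Without seeing the exact diagram it is hard to be certain, but the middle step in skip-gram pictures like this is typically the lookup of the center word's vector: a one-hot vector times the input embedding matrix, which reduces to selecting one row. A toy sketch under that assumption (random weights, illustrative only):

    # Skip-gram scoring: the center word's row of W_in is scored against every
    # possible context word via W_out, then normalized with a softmax.
    import numpy as np

    vocab_size, dim = 8, 3
    W_in = np.random.randn(vocab_size, dim) * 0.01    # center-word (input) embeddings
    W_out = np.random.randn(vocab_size, dim) * 0.01   # context-word (output) embeddings

    def skipgram_context_probs(center_id):
        v_c = W_in[center_id]                 # "middle step": one-hot x W_in is just a row lookup
        scores = W_out @ v_c                  # one score per candidate context word
        exp = np.exp(scores - scores.max())
        return exp / exp.sum()                # softmax over the vocabulary

    # Training pairs are (center, context) for each word within the window,
    # e.g. window 1 over word ids [0, 3, 5] gives (0,3), (3,0), (3,5), (5,3).
    p = skipgram_context_probs(3)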