A statistical language model is a probability distribution over sequences of words.
Questions tagged [language-models]
100 questions
16
votes
1 answer
What are the pros and cons of applying pointwise mutual information on a word cooccurrence matrix before SVD?
One way to generate word embeddings is as follows (mirror):
Get a corpus, e.g. "I enjoy flying. I like NLP. I like deep learning."
Build the word cooccurrence matrix $X$ from it:
Perform SVD on $X$, and keep the first $k$ columns of $U$.
Each row…

Franck Dernoncourt
- 42,093
- 30
- 155
- 271
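For context, a minimal sketch of the pipeline this question describes, using the toy corpus from the excerpt and a symmetric window of 1; the positive-PMI (PPMI) variant, which clips negative values to zero, is the common choice before the SVD step:

```python
# A sketch of PPMI + truncated SVD word embeddings (toy corpus, window = 1).
import numpy as np

corpus = ["I enjoy flying .", "I like NLP .", "I like deep learning ."]
tokens = [t for sent in corpus for t in sent.split()]
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence matrix X with a symmetric window of 1 within each sentence.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                X[idx[w], idx[words[j]]] += 1

# Positive PMI: pmi = log( p(w, c) / (p(w) p(c)) ), clipped at 0.
total = X.sum()
pw = X.sum(axis=1, keepdims=True) / total
pc = X.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((X / total) / (pw @ pc))
ppmi = np.maximum(pmi, 0)
ppmi[~np.isfinite(ppmi)] = 0.0   # zero counts give log(0); set those cells to 0

# SVD; keep the first k columns of U as the embeddings (one row per word).
k = 2
U, S, Vt = np.linalg.svd(ppmi)
embeddings = U[:, :k] * S[:k]    # scaling by the singular values is optional
print(dict(zip(vocab, embeddings.round(2))))
```

Whether to clip negative PMI and whether to weight the kept columns of $U$ by the singular values are exactly the kinds of trade-offs the question asks about.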
16
votes
3 answers
In Kneser-Ney smoothing, how are unseen words handled?
From what I have seen, the (second-order) Kneser-Ney smoothing formula is in some way or another given as
$$
P^2_{KN}(w_n \mid w_{n-1}) = \frac{\max \left\{ C\left(w_{n-1}, w_n\right) - D,\, 0\right\}}{\sum_{w'} C\left(w_{n-1}, w'\right)} + \dots
$$

sunside
- 311
- 3
- 10
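For reference, a minimal sketch of interpolated bigram Kneser-Ney with a fixed discount D on a toy corpus; seen contexts get the discounted bigram term plus an interpolated continuation probability, an unseen context here simply falls back to the continuation probability, and a genuinely unseen word would need additional handling, which is what the question is about:

```python
# A sketch of interpolated bigram Kneser-Ney with a fixed discount D.
from collections import Counter

tokens = "the cat sat on the mat the dog sat on the rug".split()
D = 0.75

bigrams = Counter(zip(tokens, tokens[1:]))
context_count = Counter(tokens[:-1])            # C(w_{n-1}, *) for each context
continuations = {w: set() for w in set(tokens)}
for (u, w) in bigrams:
    continuations[w].add(u)                     # distinct left contexts of w

num_bigram_types = len(bigrams)

def p_continuation(w):
    # How many distinct contexts w completes, over all bigram types.
    return len(continuations.get(w, ())) / num_bigram_types

def p_kn(w, prev):
    c_prev = context_count.get(prev, 0)
    if c_prev == 0:
        # Unseen context: no bigram evidence; back off to the continuation prob.
        return p_continuation(w)
    discounted = max(bigrams.get((prev, w), 0) - D, 0) / c_prev
    # lambda(prev): mass freed by discounting, spread via continuation probs.
    lam = (D / c_prev) * len({w2 for (u, w2) in bigrams if u == prev})
    return discounted + lam * p_continuation(w)

print(p_kn("cat", "the"), p_kn("dog", "the"), p_kn("cat", "rug"))
```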
11
votes
2 answers
Question about Continuous Bag of Words
I'm having trouble understanding this sentence:
The first proposed architecture is similar to the feedforward NNLM,
where the non-linear hidden layer is removed and the projection
layer is shared for all words (not just the projection…

user70394
- 263
- 1
- 3
- 8
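The quoted sentence describes the CBOW architecture: the context word vectors share one projection layer (they are simply averaged) and there is no non-linear hidden layer before the output softmax. A minimal numpy sketch of that forward pass, with made-up sizes:

```python
# A sketch of the CBOW forward pass: average the context embeddings
# (shared projection, no non-linear hidden layer), then softmax over the vocab.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4                      # illustrative sizes
W_in = rng.normal(size=(vocab_size, dim))    # input (projection) embeddings
W_out = rng.normal(size=(dim, vocab_size))   # output weights

context_ids = [2, 3, 5, 7]                   # indices of the context words

h = W_in[context_ids].mean(axis=0)           # shared projection: just an average
logits = h @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax over the whole vocabulary

print("predicted center word:", probs.argmax())
```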
10
votes
3 answers
Regarding using a bigram (N-gram) model to build a feature vector for a text document
A traditional approach to feature construction for text mining is the bag-of-words approach, which can be enhanced with tf-idf to set up the feature vector characterizing a given text document. At present, I am trying to use a bi-gram language model…

user3125
- 2,617
- 4
- 25
- 33
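In practice this is often done by letting the vectorizer emit unigram and bigram features directly; a sketch using scikit-learn (the two documents are made up):

```python
# A sketch: tf-idf feature vectors over unigrams + bigrams with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the quick brown fox jumps over the lazy dog",
    "never jump over the lazy dog quickly",
]

# ngram_range=(1, 2) keeps single words and adds adjacent word pairs as features.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(X.shape)                                # (n_documents, n_features)
print(vectorizer.get_feature_names_out()[:10])
```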
8
votes
1 answer
Language modeling: why is adding up to 1 so important?
In many natural language processing applications such as spelling correction, machine translation and speech recognition, we use language models. Language models are usually created by counting how often sequences of words (n-grams) occur in a large…

user9617
- 183
- 4
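The counting the excerpt refers to, in sketch form: maximum-likelihood n-gram estimates are normalized counts, which is exactly why they sum to 1 over the vocabulary for each history:

```python
# A sketch: MLE bigram probabilities from counts; they sum to 1 per history.
from collections import Counter, defaultdict

tokens = "the cat sat on the mat the cat slept".split()
bigrams = Counter(zip(tokens, tokens[1:]))
history = Counter(tokens[:-1])

p = defaultdict(dict)
for (u, w), c in bigrams.items():
    p[u][w] = c / history[u]            # P(w | u) = C(u, w) / C(u, *)

for u in p:
    print(u, sum(p[u].values()))        # each history's distribution sums to 1.0
```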
7
votes
2 answers
Is a trigram model guaranteed to perform more accurately than a bigram model?
When implementing an NLP project such as text segmentation or Named Entity Recognition, is using a trigram model guaranteed to perform more accurately than a bigram model?
$$
\text{Trigram: } p(s_t \mid s_{t-2}, s_{t-1})
$$
$$
\text{Bigram: } p(s_t \mid s_{t-1})
$$
EDIT: I was…

xiaoyao
- 385
- 4
- 10
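In general the answer is no: whether the trigram model wins depends on data sparsity and smoothing. A toy add-one-smoothed comparison on held-out text (corpus and numbers are illustrative only):

```python
# A sketch: compare add-one-smoothed bigram and trigram models by held-out perplexity.
from collections import Counter
import math

train = "the cat sat on the mat the dog sat on the rug".split()
heldout = "the cat sat on the rug".split()
V = len(set(train))                              # vocabulary size for smoothing

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def perplexity(tokens, n):
    num = ngram_counts(train, n)
    den = ngram_counts(train, n - 1)
    logp, count = 0.0, 0
    for i in range(n - 1, len(tokens)):
        hist = tuple(tokens[i - n + 1:i])
        gram = tuple(tokens[i - n + 1:i + 1])
        p = (num[gram] + 1) / (den[hist] + V)    # add-one smoothing
        logp += math.log(p)
        count += 1
    return math.exp(-logp / count)

print("bigram  PPL:", perplexity(heldout, 2))
print("trigram PPL:", perplexity(heldout, 3))
```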
7
votes
2 answers
Calculating test-time perplexity for seq2seq (RNN) language models
To compute the perplexity of a language model (LM) on a test sentence $s=w_1,\dots,w_n$ we need to compute all next-word predictions $P(w_1), P(w_2|w_1),\dots,P(w_n|w_1,\dots,w_{n-1})$.
My question is: How are these terms computed for a seq2seq…

xhi
- 96
- 5
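However the individual terms are obtained, once the per-token next-word probabilities are available the perplexity is just the exponentiated average negative log probability; a small sketch:

```python
# A sketch: perplexity from per-token next-word probabilities
# P(w_1), P(w_2|w_1), ..., P(w_n|w_1..w_{n-1}), however the model computes them.
import math

next_word_probs = [0.2, 0.05, 0.1, 0.4]   # illustrative values for one sentence

n = len(next_word_probs)
log_prob = sum(math.log(p) for p in next_word_probs)
perplexity = math.exp(-log_prob / n)
print(perplexity)
```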
6
votes
2 answers
Neural network language model - prediction for the word at the center or the right of context words
In Bengio's paper, the model predicts a probability distribution for the next word given the preceding n words, e.g. predicting the probabilities of "book", "car", etc., from the n words…

Tom
- 788
- 8
- 16
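For orientation, the model in Bengio et al. (2003) predicts the next word, i.e. the word to the right of the context, not a center word. A minimal numpy sketch of that forward pass, with made-up sizes (the direct input-to-output connections of the original model are omitted):

```python
# A sketch of a Bengio-style NNLM forward pass: embed the n preceding words,
# concatenate, apply a tanh hidden layer, and softmax over the next word.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, hidden, n = 10, 4, 8, 3          # illustrative sizes
C = rng.normal(size=(vocab_size, dim))            # shared word feature matrix
H = rng.normal(size=(n * dim, hidden))
U = rng.normal(size=(hidden, vocab_size))

context_ids = [1, 4, 7]                           # the n words preceding the target

x = C[context_ids].reshape(-1)                    # concatenated context embeddings
h = np.tanh(x @ H)
logits = h @ U
probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # distribution over the NEXT word
print("most likely next word:", probs.argmax())
```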
6
votes
1 answer
n-gram language model
At the end of the introduction of A Neural Probabilistic Language Model (Bengio et al. 2003), the following example is given:
Having seen the sentence The cat is walking in the bedroom in the
training corpus should help us generalize to make the…

Antoine
- 5,740
- 7
- 29
- 53
5
votes
1 answer
Why are Transformers "suboptimal" for language modeling but not for translation?
Language Models with Transformers states:
Transformer architectures are suboptimal for language model itself. Neither self-attention nor the positional encoding in the Transformer is able to efficiently incorporate the word-level sequential context…

MWB
- 1,143
- 9
- 18
5
votes
1 answer
Why can't standard conditional language models be trained left-to-right *and* right-to-left?
From the BERT paper:
Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict…

user2740
- 1,226
- 2
- 12
- 19
5
votes
1 answer
Advantage of character-based language models over word-based ones
Is there an intuition for why character-based language models are preferred over word-based ones? For example, in his blog Karpathy builds his language model by predicting the next character.
The aspect I am struggling with is that not each…

PKuhn
- 201
- 2
- 4
4
votes
1 answer
How does one design a custom loss function? What features make a loss function "good"?
I have a custom situation for which I am trying to design a cost function.
The idea is that you have a stack of LSTMs doing something slightly unconventional. Each LSTM$_l$ computes a linear transformation of its hidden layer $V_{l-1}h^t_l$ to…

Sam
- 153
- 6
4
votes
2 answers
Generating text from language model
I have a trained LSTM language model and want to use it to generate text. The standard approach for this seems to be:
Apply the softmax function
Take a weighted random choice to determine the next word
This is working reasonably well for me, but it would…

Christian Doucette
- 185
- 3
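The two listed steps, in sketch form; the temperature parameter is a common extension beyond what the excerpt mentions, used to trade diversity against likelihood:

```python
# A sketch: turn the LSTM's output logits into a softmax distribution and
# draw the next word with a weighted random choice. Temperature < 1 sharpens
# the distribution, temperature > 1 flattens it.
import numpy as np

def sample_next_word(logits, temperature=1.0, rng=np.random.default_rng()):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                          # numerical stability
    probs = np.exp(z) / np.exp(z).sum()   # softmax
    return rng.choice(len(probs), p=probs)

logits = [2.0, 0.5, -1.0, 0.1]            # illustrative scores over a 4-word vocab
print(sample_next_word(logits, temperature=0.8))
```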
4
votes
1 answer
Skip-gram algorithm confusion
As a newbie to NLP, I am (deeply) confused by the middle step in the following diagram explaining the skip-gram algorithm. The video where this diagram was presented can be found at:
https://www.youtube.com/watch?v=ERibwqs9p38 (Highly appreciate…

MeiNan Zhu
- 327
- 2
- 12
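The "middle step" in such diagrams is usually the construction of (center, context) training pairs from a sliding window; the model is then trained to predict each context word from its center word. A small sketch of that pair-generation step (window size is arbitrary):

```python
# A sketch of skip-gram training-pair generation: for each center word,
# every word within the window becomes a (center, context) example.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("I like deep learning".split(), window=1))
```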