
I came across the following distinction between probability and maximum likelihood estimation (MLE) in this video:

[image from the video contrasting probability and likelihood]

So, basically, in finding probability, we are finding the "probability of different data values" under a given distribution. In finding the MLE, we are asking about the "probability of different distributions (i.e., of their parameters, such as the mean and standard deviation)" that could fit the given data.
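
For instance, here is a minimal numerical sketch of that distinction (using a normal distribution and SciPy purely as an illustration; the specific numbers are made up):

```python
import numpy as np
from scipy.stats import norm

data = np.array([4.2, 5.1, 4.8])  # hypothetical observed data points

# "Probability" view: fix the distribution (mean=5, sd=1) and ask how
# dense different data values are under it.
print(norm.pdf([4.0, 5.0, 6.0], loc=5, scale=1))

# "Likelihood" view: fix the data and ask how well different candidate
# means (i.e., different distributions) explain it; the MLE is the mean
# with the highest value (here, the sample mean 4.7).
for mu in [3.0, 4.7, 6.0]:
    print(mu, norm.pdf(data, loc=mu, scale=1).prod())
```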

Here, one thing is clear: we deal with two things, a distribution and data points. Now I was reading Jurafsky's book. He states the following points:

  • The bigram model approximates the probability of a word given all the previous words, $P(w_n|w_{1:n−1})$, by using only the conditional probability of the preceding word, $P(w_n|w_{n−1})$.
  • We can compute the probability of a complete word sequence as follows: $$P(w_{1:n})\approx \prod_{k=1}^n P(w_k|w_{k-1})$$
  • How do we estimate these bigram probabilities? An intuitive way to estimate probabilities is called maximum likelihood estimation, or MLE. To compute a particular bigram probability of a word $y$ given a previous word $x$, we compute the count of the bigram $C(xy)$ and normalize by the sum of all the bigrams that share the same first word $x$: $$P(w_n|w_{n-1})=\frac{C(w_{n-1}w_n)}{\sum_w C(w_{n-1}w)}=\frac{C(w_{n-1}w_n)}{C(w_{n-1})}$$ (A small code sketch of this count-and-normalize estimate follows this list.)
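
Here is that sketch (plain Python over the tiny "I am Sam" mini-corpus used in the chapter; this is my own illustration, not the book's code):

```python
from collections import Counter

# Toy corpus with sentence-boundary markers <s> and </s>.
corpus = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_mle(prev, word):
    """MLE estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# P("i" | "<s>") = C("<s> i") / C("<s>") = 2/3 in this toy corpus
print(bigram_mle("<s>", "i"))

# Probability of a whole sequence as a product of bigram probabilities
def sequence_prob(tokens):
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_mle(prev, word)
    return p

print(sequence_prob("<s> i am sam </s>".split()))
```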

The book further explains how the above counts as MLE:

In MLE, the resulting parameter set maximizes the likelihood of the training set $T$ given the model $M$ (i.e., $P(T|M)$). For example, suppose the word Chinese occurs 400 times in a corpus of a million words like the Brown corpus. What is the probability that a random word selected from some other text of, say, a million words will be the word Chinese? The MLE of its probability is $\frac{400}{1000000}$ or $0.0004$. Now $0.0004$ is not the best possible estimate of the probability of Chinese occurring in all situations; it might turn out that in some other corpus or context Chinese is a very unlikely word. But it is the probability that makes it most likely that Chinese will occur 400 times in a million-word corpus.
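
One way to see why $400/1{,}000{,}000$ is the maximum likelihood estimate (a short derivation under the simplifying assumption that each of the million word tokens is independently the word Chinese with some unknown probability $\theta$, so that the count of Chinese is binomial):

$$L(\theta)=\binom{10^6}{400}\,\theta^{400}(1-\theta)^{10^6-400},\qquad \frac{d}{d\theta}\log L(\theta)=\frac{400}{\theta}-\frac{10^6-400}{1-\theta}=0\;\Rightarrow\;\hat\theta=\frac{400}{10^6}=0.0004.$$

So the parameter being estimated is simply the word's occurrence probability, and the count-over-total formula is exactly the value of that parameter that maximizes the probability of the observed counts.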

I am still trying to grasp how the book's example qualifies as MLE when it does not seem to involve estimating the parameters of any distribution.

Q1. More generally, is MLE simply the estimation of some value (here, 0.0004) that makes some other data (here, another text of a million words) as likely as possible?

Q2. Is it OK to call something MLE even when it does not involve estimating specific parameters of a specific distribution?

Q3. Is it OK to obtain a value from one data set and call it the MLE for another data set?

Maha
  • You have a model with very many parameters: for instance, each bigram has its own probability, and these probabilities are linked only by the constraint that they sum to 1. This is effectively a non-parametric model, and what it describes *is* MLE. See https://stats.stackexchange.com/questions/112451/maximum-likelihood-estimation-mle-in-layman-terms/112480#112480 – kjetil b halvorsen Aug 26 '21 at 00:03

0 Answers