TL;DR you can, but the result would strongly depend on your choice of prior.
With maximum likelihood, you would be maximizing the likelihood, which in this case is defined in terms of the probability mass function $f$ of the Bernoulli distribution, i.e. a binomial distribution with number of trials $n=1$, parametrized by the probability of success $\theta$
$$
\hat\theta = \underset{\theta}{\operatorname{arg\,max}} \; f(X|\theta)
$$
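If you want to see this numerically rather than taking the closed-form $\hat\theta = x/n$ for granted, here is a minimal sketch (assuming Python with NumPy and SciPy, which are not part of the original question) that maximizes the Bernoulli log-likelihood for your single observation:

```python
# Minimal sketch: numerical MLE for a Bernoulli parameter (assumes NumPy and SciPy).
import numpy as np
from scipy.optimize import minimize_scalar

X = np.array([1])  # a single observed success, as in the question

def neg_log_likelihood(theta):
    # Bernoulli log-likelihood: sum of x*log(theta) + (1-x)*log(1-theta)
    return -np.sum(X * np.log(theta) + (1 - X) * np.log(1 - theta))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(result.x)  # very close to 1.0, i.e. the closed-form estimate x/n = 1/1
```

For $x=1$, $n=1$ the optimum sits right at the boundary, which is exactly the degenerate $\hat\theta=1$ the question is worried about.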
In the Bayesian setting, what changes is that instead of looking for a point estimate of $\theta$, we learn the posterior distribution $\pi(\theta|X)$, starting from a prior distribution $\pi(\theta)$ for $\theta$
$$
\pi(\theta|X) \propto f(X|\theta)\,\pi(\theta)
$$
When calculating the maximum a posteriori point estimate, you would be maximizing the posterior probability
$$
\hat\theta = \underset{\theta}{\operatorname{arg\,max}} \; f(X|\theta) \,\pi(\theta)
$$
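As a minimal sketch (same assumed Python/SciPy setup as above, with a $\mathsf{Beta}(2,2)$ prior picked purely for illustration), the MAP estimate is just the maximizer of log-likelihood plus log-prior:

```python
# Minimal sketch: numerical MAP estimate under an illustrative Beta(2, 2) prior.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import bernoulli, beta

X = np.array([1])               # the single observed success
a_prior, b_prior = 2.0, 2.0     # prior hyperparameters, chosen only for illustration

def neg_log_posterior(theta):
    log_lik = np.sum(bernoulli.logpmf(X, theta))       # log f(X | theta)
    log_prior = beta.logpdf(theta, a_prior, b_prior)   # log pi(theta)
    return -(log_lik + log_prior)

result = minimize_scalar(neg_log_posterior, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(result.x)  # about 2/3, instead of the degenerate MLE of 1
```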
As you can see from the formula above, what changes is that we multiply the likelihood by the prior. In the case of the binomial distribution, if we choose a beta distribution as the prior, then there is a nice, closed-form solution. If as the prior we choose
$$
\theta \sim \mathsf{Beta}(\alpha, \beta)
$$
then the posterior distribution is
$$
\theta|X \sim \mathsf{Beta}(\alpha + x, \beta+ n-x)
$$
where $n=1$ is the number of trials and $x=1$ is the number of successes. So in your case, the mean of the posterior distribution is
$$
E[\theta|X] = \frac{\alpha + 1}{\alpha+1+\beta}
$$
For details, you can check the great What is the intuition behind beta distribution? thread. As you can see, choosing different prior parameters $\alpha$, $\beta$ would lead to different results and would have a significant impact on the final estimate. If you want to assume a priori that the probability is something close to $0.5$, you need to set $\alpha$ and $\beta$ to the same value. For example, setting $\alpha=\beta=1$ would give you $E[\theta|X] \approx 0.67$, while $\alpha=\beta=0.5$ would lead you to estimate it as $0.75$. This impact diminishes with growing sample size, but with a single sample it is quite profound. So using the Bayesian approach would enable you to estimate something more reasonable than $\hat\theta=\tfrac{1}{1}=1$, but how reasonable the estimate is depends on how reasonable your prior was.
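To make those numbers concrete, here is a minimal sketch (plain Python, no extra libraries) of the conjugate update above; the $\mathsf{Beta}(100,100)$ row is an extra, made-up prior thrown in to show how a strong prior keeps the estimate near $0.5$:

```python
# Minimal sketch: posterior mean of a Beta-Bernoulli model after one observed success.
def posterior_mean(alpha, beta, x, n):
    # Posterior is Beta(alpha + x, beta + n - x); its mean is (alpha + x) / (alpha + beta + n).
    return (alpha + x) / (alpha + beta + n)

x, n = 1, 1  # one trial, one success
for alpha, beta in [(1.0, 1.0), (0.5, 0.5), (100.0, 100.0)]:
    print(f"Beta({alpha:g}, {beta:g}) prior -> posterior mean {posterior_mean(alpha, beta, x, n):.3f}")
# Beta(1, 1)     -> 0.667
# Beta(0.5, 0.5) -> 0.750
# Beta(100, 100) -> 0.502
```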
As a sidenote, your example is not that uncommon. In fact, it is often the case that Bayesian estimators are used for calculating probabilities when we expect to see zero counts. This is commonly done when working with textual data, where we deal with counts of words. Obviously, some words occur very frequently, e.g. "and", "the", while others are pretty rare, e.g. "aardvark". Estimating probabilities for the common words is straightforward, but for rare words we would end up with $\tfrac{0}{n}$ as the estimated probabilities. When using algorithms like Naive Bayes, where we multiply the probabilities by each other, this would lead to zeroing out everything after plugging a single zero into the formula; that is why we use Laplace smoothing.
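For completeness, here is a minimal sketch of add-one (Laplace) smoothing on a made-up toy corpus; it is the multinomial analogue of the posterior-mean estimate above with a uniform prior on each word, so unseen words get a small nonzero probability instead of $0$:

```python
# Minimal sketch: Laplace (add-one) smoothing of word probabilities on a made-up toy corpus.
from collections import Counter

counts = Counter(["the", "the", "and", "aardvark"])   # toy word counts
vocabulary = ["the", "and", "aardvark", "zebra"]      # "zebra" was never observed
n = sum(counts.values())
V = len(vocabulary)

for word in vocabulary:
    unsmoothed = counts[word] / n               # 0/n for unseen words -> zeroes out Naive Bayes
    smoothed = (counts[word] + 1) / (n + V)     # always strictly positive
    print(f"{word:10s} unsmoothed={unsmoothed:.3f} smoothed={smoothed:.3f}")
```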