
I'm reading this page about cross-entropy loss and why it works as a maximum likelihood estimator.

The author says:

Because we usually assume that our samples are independent and identically distributed, the likelihood over all of our examples decomposes into a product over the likelihoods of individual examples:

Then he gives an example: if our NN predicts (0.4, 0.1, 0.5) as the probabilities of the three classes, and the correct value is (1.0, 0.0, 0.0), then he says that the "likelihood" of that single example is just 0.4.

As I understand it, $L(x, \theta) = f(x, \theta)$ for a single observation. How are we supposed to measure $f(x, \theta)$ when its value is itself a distribution? That is, what does it mean to have an outcome of (0.4, 0.1, 0.5) given a distribution of (1.0, 0.0, 0.0)?

Edit: if the question were "what is the probability of outcome 1 given distribution (0.4, 0.1, 0.5)", then there's no confusion. I've been interpreting it as "what is the probability of (0.4, 0.1, 0.5) given (1.0, 0.0, 0.0)" which seems awkward.

monk
  • As this is a neural net, it doesn't really have a likelihood model on the output. What this 0.4 refers to is most likely just the value of the cross-entropy loss for that instance, which is $-\log(\text{predicted probability of the true class})$. – Sam Aug 30 '17 at 17:19
  • This looks perfectly ordinary to me, if we understand "predicts (0.4, 0.1, 0.5)" to mean that the estimated underlying distribution over three mutually exclusive classes assigns those probabilities to them. Since the likelihood--by definition--is the probability of the data and the data consist of the first class (which is what I presume "(1.0, 0.0, 0.0)" is intended to mean), then indeed its probability is $0.4$ according to the chosen model. See https://stats.stackexchange.com/questions/2641. – whuber Aug 30 '17 at 17:43
  • (@Sam) In this case, (0.4, 0.1, 0.5) is the output of a softmax layer (i.e., they are probabilities). We're then using it to calculate cross entropy loss. Does that help? – monk Aug 30 '17 at 17:44
  • @Sam using cross-entropy loss arrives at the maximum likelihood estimator as a special case of the neural net. – AdamO Aug 30 '17 at 17:51
  • @whuber: I'm still confused. The _likelihood_ of a single example, $L(x, \theta)$, should be the same as the _probability_ (density) $p(x, \theta)$ for that example, no? In this context, the _outcome_ is $x = (0.4, 0.1, 0.5)$, isn't it? So we're effectively looking for "the _probability_ of (0.4, 0.1, 0.5) (given the parameters)", aren't we? (Equivalently, the likelihood of those parameters given that $x$.) If we were looking for the probability of (1.0, 0.0, 0.0) (which can be interpreted as "getting outcome 1") given that the correct distribution is (0.4, 0.1, 0.5), then I'd understand. – monk Aug 30 '17 at 18:00
  • The *outcome* (often called $x$) is class 1, if I'm reading your notation correctly. The *probability law* is $(0.4,0.1,0.5)$, often referred to as $\theta$. Thus the chance of this outcome for this particular probability law is $p_\theta(x)=0.4$. That is the likelihood of $\theta$ associated with the observation $x$. – whuber Aug 30 '17 at 18:05
  • @whuber: I think we're narrowing in on the source of my confusion. Since the network is predicting (0.4, 0.1, 0.5), I'm thinking of that as the outcome. I have to think about why that's backwards. – monk Aug 30 '17 at 18:16
  • @monk If you were a Bayesian you could speak of the probability of (0.4, 0.1, 0.5), but that would require a prior. The likelihood requires no prior. The likelihood $L(x = 1, \theta = (0.4, 0.1, 0.5))$ is 0.4. We do *not* know that (0.4, 0.1, 0.5) is the correct multinomial distribution; $\theta = (1, 0, 0)$ is the most likely one, but (0.01, 0.01, 0.98) could in fact be the correct distribution. The class $x$ for the multinomial density is parsed out into a matrix of indicators for evaluating the likelihood ($X = 1$, $X = 2$, $X = 3$). – AdamO Aug 30 '17 at 18:17
  • Yes, I think you might be using some terms differently than intended. If I understand correctly what the network is doing, it is giving you its guess about a discrete distribution on three categories: that's $\theta$. It can be described by three non-negative numbers that sum to unity, as required of any probability distribution. An "outcome" is what we use to model the *data* ("our samples" or "examples" in your quotation). Its value $x$ can be any one of those three categories. You have *encoded* that value using three random variables. Their values are $(1,0,0)$, indicating category 1. – whuber Aug 30 '17 at 18:52

2 Answers


I'm not totally sure what you mean by "its value is itself a distribution," so let me say a few things and see if they help; feel free to ask more questions if not.

The network is predicting a discrete distribution over the three entries. Letting the predictive label be the random variable $\hat Y$ and naming its possible values $a$, $b$, and $c$, it says that $\Pr(\hat Y = a) = 0.4$, $\Pr(\hat Y = b) = 0.1$, and $\Pr(\hat Y = c) = 0.5$. Note that $\hat Y$ is a function of the network's parameters $\theta$ and the feature vector $x$: we can write it as $\hat Y_\theta(x)$ to denote its dependence on $\theta$ and $x$.

Now, we want to see if that predicted distribution is any good. We only have one data point to evaluate this with: the true observed value $y$, which in this case was observed as $a$. Taking a maximum-likelihood approach, we choose to evaluate the quality of a network $\theta$ by its likelihood: the probability of $a$ under the predictive distribution $\Pr(\hat Y_\theta(x) = \cdot)$, which we can evaluate as $\Pr(\hat Y_\theta(x) = a) = 0.4$. (If the labels were continuous, then we'd use the probability density.)

Now, the network actually predicts one of these distributions for each of the possible inputs $x^{(i)}$; our measure of the overall quality of the network as a predictor is the product of the likelihoods of the individual data samples (equivalently, the sum of their log-likelihoods). Because we assume these are iid, the joint likelihood decomposes: $$ \log \Pr\left( \big( \hat Y_\theta(x^{(i)}) \big)_{i=1}^n = \big( y^{(i)} \big)_{i=1}^n \right) = \log \prod_{i=1}^n \Pr\left( \hat Y_\theta(x^{(i)}) = y^{(i)} \right) = \sum_{i=1}^n \log \Pr\left( \hat Y_\theta(x^{(i)}) = y^{(i)} \right) .$$

The log-likelihood of a parameter value $\theta$ under the data $\{(x^{(i)}, y^{(i)}) \}_{i=1}^n$ is then $$ \ell(\theta) = \sum_{i=1}^n \log \Pr(\hat Y_\theta(x^{(i)}) = y^{(i)}) ,$$ which is what we want to maximize.
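
For concreteness, here's a minimal sketch of this computation in plain Python (the predicted distributions and labels here are made up for illustration):

```python
import math

# Predicted class distributions, one per example (e.g. softmax outputs),
# and the observed class index for each example.
predictions = [
    [0.4, 0.1, 0.5],  # the example from the question; its true class is 0
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 2]

# Log-likelihood of theta: the sum over examples of log Pr(Y_hat = y),
# i.e. the negative of the total cross-entropy loss.
log_likelihood = sum(math.log(p[y]) for p, y in zip(predictions, labels))
print(log_likelihood)  # log(0.4) + log(0.7) + log(0.4) ≈ -2.19
```

Maximizing this quantity over $\theta$ is the same as minimizing the cross-entropy loss, which is the sense in which training with cross-entropy performs maximum-likelihood estimation.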


Compare to the case of finding the maximum-likelihood estimator for a series of biased coin flips. There the model is $\mathrm{Bernoulli}(\theta)$, i.e. $\Pr(\hat Y_\theta = H) = \theta$, $\Pr(\hat Y_\theta = T) = 1 - \theta$. The log-likelihood is $$\ell(\theta) = \sum_{i=1}^n \begin{cases}\log(\theta) & y^{(i)} = H \\ \log(1 - \theta) & y^{(i)} = T\end{cases},$$ and we can estimate $\theta$ by maximizing $\ell(\theta)$ given the $y^{(i)}$.
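
Here's a quick sketch of that estimation (the flip sequence is hypothetical, and a grid search stands in for the closed-form answer):

```python
import math

flips = "HTHHTHHH"  # hypothetical observed flips
heads = flips.count("H")
tails = flips.count("T")

def log_likelihood(theta):
    # Each H contributes log(theta); each T contributes log(1 - theta).
    return heads * math.log(theta) + tails * math.log(1 - theta)

# Maximize over a grid of candidate theta values; the analytic
# maximizer is the sample frequency heads / (heads + tails) = 0.75.
thetas = [i / 1000 for i in range(1, 1000)]
theta_hat = max(thetas, key=log_likelihood)
print(theta_hat)  # 0.75
```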

The only difference in this case is that there are also feature vectors $x^{(i)}$, and we're maximizing the likelihood conditional on the $x^{(i)}$.

Danica
  • Thanks! I meant that (as you say) "the network is predicting a discrete distribution." When computing likelihoods, I'd only seen examples where the outcome is a single value. Suppose I have a coin where P(H)=p. We might ask "what is the likelihood of _p_ given 'HTH'?" The question "what is the likelihood of p given 'H'?" is equivalent to "what is the probability of 'H' given p?" Here we have "what is the probability of (0.4, 0.1, 0.5) given (1.0, 0.0, 0.0)?" I don't know how to make sense of that. What you call "the natural way" sounds good to me, but is there theoretical justification? – monk Aug 30 '17 at 17:31
  • Also, when you say "the likelihood of $a$," you mean the _probability_ of $a$, right? Likelihood should be of some parameter? – monk Aug 30 '17 at 17:34
  • Does my edit help? – Danica Aug 30 '17 at 17:41
  • I think the discussion on the question has pointed out where I went wrong: if the *outcome* is (1, 0, 0) and the distribution we're comparing with is (0.4, 0.1, 0.5), then I understand how to interpret the likelihood. But I've been assuming that (0.4, 0.1, 0.5) is the _outcome_. – monk Aug 30 '17 at 18:19
  • Yes, $(0.4, 0.1, 0.5)$ is the _model_ (conditional on $x$): in my notation, it's $\hat Y_\theta(x)$. The _outcome_ is $a$ in my notation, $(1, 0, 0)$ in yours, which refer to the same thing. – Danica Aug 30 '17 at 18:20
  • It's all coming together now: my network is supposed to be outputting a *model* which achieves maximum likelihood given the labeled *outcomes*. That makes so much more sense than whatever I was thinking. Thanks! – monk Aug 30 '17 at 18:26

The question is confused: I was thinking of the predictions of the network as outcomes under a given model (represented by the labels). This is backward; the network is generating a model (parameterized by the network weights and inputs) under which the labels are the outcomes.

Therefore, the likelihood of a single example is the probability of getting the labeled outcome (1.0, 0.0, 0.0) (i.e., class 1) under the predicted distribution (0.4, 0.1, 0.5), not vice versa.
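
In code, with the numbers from the question (a toy check, nothing more):

```python
import math

predicted = [0.4, 0.1, 0.5]  # the model the network outputs for this example
label = [1.0, 0.0, 0.0]      # one-hot encoding of the observed outcome: class 1

# The cross-entropy between the one-hot label and the prediction picks out
# -log of the predicted probability assigned to the true class.
loss = -sum(t * math.log(p) for t, p in zip(label, predicted))
print(loss)  # ≈ 0.916, i.e. -log(0.4); the likelihood itself is 0.4
```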

monk