
How do log likelihoods function in practice? I seem to oscillate between understanding this and not understanding this (which most likely means I've never understood it).

When you take log(P(X | Y)), where P(X | Y) is a probability on the interval [0,1], why is it that you don't end up with log(0) calculations? I'm particularly concerned with numerical calculations.

In numerical practice is 0 just replaced with something very small, like 1e-20? (NOTE: @whuber's comment below answers this part)
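To make the numerical side concrete, here's a minimal sketch (assuming NumPy; the toy array is just mine for illustration) of what each option actually produces:

```python
# Minimal sketch: what happens numerically when a probability of
# exactly 0 reaches the log, versus clamping it to something tiny.
import numpy as np

p = np.array([0.5, 1e-20, 0.0])

with np.errstate(divide="ignore"):  # silence the divide-by-zero warning
    log_p = np.log(p)

print(log_p)  # [-0.693... -46.051... -inf]
# Clamping 0 to 1e-20 yields a large negative number (about -46),
# not -inf -- but whether that is a sensible thing to do depends on
# the model (see the comments below).
```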

EDIT 8/3: Here's my specific case:

I am taking a likelihood P(X = 1 | R = r), where X represents boolean pixel values {0,1} and R represents locations in an image. I build a model from training data by looking at every location and determining the frequency with which location r takes the value 1. The result is a density estimate in which plenty of locations have probability 0.

How can I then take this log-likelihood without certain locations being assigned -Infinity? The next step for me is to add this to another log-likelihood calculated over the image. Adding -Infinity to another log-likelihood screws everything up.
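Here's a toy sketch of the failure mode (made-up training data, assuming NumPy; the 8x8 image size and variable names are just for illustration):

```python
# Toy version of my setup: per-location frequencies estimated from
# binary training images, with one location that is never 1.
import numpy as np

rng = np.random.default_rng(0)
train = rng.integers(0, 2, size=(100, 8, 8))  # 100 binary 8x8 images
train[:, 0, 0] = 0                            # location (0,0) never takes 1

p_one = train.mean(axis=0)   # empirical P(X=1 | R=r); p_one[0,0] == 0.0

test = np.ones((8, 8), dtype=int)  # a test image with a 1 at (0,0)
with np.errstate(divide="ignore"):
    # Bernoulli log-likelihood term at each location
    ll = np.where(test == 1, np.log(p_one), np.log(1.0 - p_one))

print(ll[0, 0])  # -inf: the "impossible" observation
print(ll.sum())  # -inf: one such term drags the whole sum to -inf
```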

marcman
  • When would you bother trying to calculate the log-likelihood for a case that has no chance of happening? – Glen_b Jul 26 '15 at 05:36
  • Well that's what I was wondering about. Do you just ignore it knowing that in the likelihood function it would just zero-out? – marcman Jul 26 '15 at 06:02
  • Can you suggest an example where an observation that *cannot occur under the model* would not immediately cause you to either throw out the data or throw out the model? (One point with likelihood 0 makes the entire likelihood 0. That data set cannot occur with that model!) When rolling a six-sided die, I don't usually concern myself with the likelihood of it coming up 17. If I get a 17, either it's a typo, or I wasn't rolling a normal six-sided die. [The two possibilities could be treated differently, so no, there's no blanket rule.] – Glen_b Jul 26 '15 at 06:30
  • As for your second question, see [this thread](http://stats.stackexchange.com/questions/30728/how-small-a-quantity-should-be-added-to-x-to-avoid-taking-the-log-of-zero): the problem with adding a "very small" quantity to your data is that as it gets smaller, its log gets closer and closer to $-\infty$, so such a transformation introduces outliers into your data. – Tim Jul 26 '15 at 07:14
  • This can occur for a continuous distribution when the data are actually rounded. – Stéphane Laurent Jul 26 '15 at 09:57
  • @Glen_b: Given a conditional distribution where you're processing all data, whether they meet the condition or not. For a simple example, processing a binary image where you only care about pixels with the value 1. Yes, you can just have a statement ignoring 0-value pixels, but if you are doing some sort of convolution, you're going to be processing everything, 0 and 1. – marcman Jul 26 '15 at 15:18
  • Perhaps you should clarify the underlying situation; it sounds like you're adding terms to the likelihood that aren't part of the likelihood. – Glen_b Jul 26 '15 at 22:17
  • @marcman: of course pixels can have the *value* zero. But the probability model you use (and from which you calculate the likelihood) should not assign a zero *probability* for a pixel to have the value zero. – Stephan Kolassa Jul 27 '15 at 05:33
  • @Glen_b Many maximum likelihood calculations are based on the log probability *density* rather than the probability itself. It is perfectly possible for an outcome with zero density actually to occur. Pathologies would exist if a distribution family were such that for at least one possible sample all likelihoods would be zero. Careful formulations of ML include conditions on the distribution families to preclude such things. In practice one *does* run into numerical underflow problems and care is needed. – whuber Jul 27 '15 at 13:57
  • Closely related: http://stats.stackexchange.com/questions/142254. Maybe this even answers your question, marcman? (The answers clearly show that a rule like replacing $0$ with $10^{-20}$ or even $10^{-20000}$ could lead to disaster in real, practical applications.) – whuber Jul 27 '15 at 14:00
  • @Glen_b: I added my specific circumstances to the question – marcman Aug 03 '15 at 17:59
  • That doesn't sound like a likelihood to me, but if you want to treat it as one, 0 is still 0. – Glen_b Aug 03 '15 at 23:18

1 Answer


This has never been a problem for me in practice. The likelihood is the distribution of the data $X_1,\dots,X_n$ conditional on some parameter vector $\mathbf{y}$, i.e.
$$ X_i \stackrel{iid}{\sim} f(X_i \,|\, \mathbf{y}) $$
and therefore
$$ \Pr(X_1,\dots,X_n \,|\, \mathbf{y}) = \prod_{i=1}^n f(X_i \,|\, \mathbf{y}). $$
You would never choose a likelihood whose sample space didn't include all the $X_i$, because that would not make sense. For example, if some of the $X_i$ fall outside the interval $[0,1]$, it would not make sense to write
$$ X_i \stackrel{iid}{\sim} \mathrm{Beta}(X_i \,|\, \alpha, \beta). $$
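To illustrate (a sketch assuming SciPy; the data and parameters are made up), the log-likelihood is just the sum of the log densities, and a single point outside the support of the chosen family makes it $-\infty$:

```python
# Log-likelihood as a sum of log densities under a Beta(alpha, beta)
# model; a point outside [0, 1] has density 0 and log density -inf.
import numpy as np
from scipy import stats

alpha, beta_ = 2.0, 5.0
x_ok = np.array([0.1, 0.4, 0.7])
x_bad = np.array([0.1, 0.4, 1.3])   # 1.3 lies outside [0, 1]

print(stats.beta.logpdf(x_ok, alpha, beta_).sum())   # finite
print(stats.beta.logpdf(x_bad, alpha, beta_).sum())  # -inf
```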

There are cases where the log-likelihood can get very small, such as when you have a lot of data and/or a very "noisy" model. But even if the log-likelihood is -1e6, you can still work with it.
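For instance (a sketch with simulated data, assuming SciPy), a million observations from a Gaussian model give a total log-likelihood on the order of $-10^6$, yet comparing two candidate models is still just a subtraction of finite numbers:

```python
# A "very small" but perfectly workable log-likelihood.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=3.0, size=1_000_000)

ll_true = stats.norm.logpdf(x, loc=0.0, scale=3.0).sum()
ll_alt = stats.norm.logpdf(x, loc=0.5, scale=3.0).sum()

print(ll_true)           # roughly -2.5e6: large in magnitude, but finite
print(ll_true - ll_alt)  # a finite, meaningful log-likelihood ratio
```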

And no: log(0) = -Inf, and you don't replace it with anything. When you do get -Inf from a log-likelihood, it's usually because something is wrong with the way you formulated or coded your model.

Scortchi - Reinstate Monica
Zachary Blumenfeld
  • I'm still not quite understanding. In my case, I am taking a likelihood like P(X = 1 | R = r), where X represents boolean pixel values {0,1}, and R represents locations in an image. I build a model based on training data where I look at every location and determine the frequency with which location r takes the value 1. The result is a density estimate in which there are plenty of locations with probability 0. How do I then convert this to a log-likelihood? – marcman Aug 03 '15 at 17:50
  • You don't. You appear to have confounded the likelihood with your own data-based estimates of the probabilities. Perhaps you might benefit from reviewing some concepts of likelihood, such as the thread at http://stats.stackexchange.com/questions/2641. – whuber Aug 03 '15 at 18:00
  • @whuber: I'll take a look at that now. Could you please specify a bit further where I'm going wrong? I think you're spot on with that though. If anyone wants to delve further, I am trying to implement this paper: http://cseweb.ucsd.edu/~gary/pubs/Zhang-et-al-2008-accepted.pdf. My question refers to calculating equation 8 on page 6. Specific to this paper (and possibly in general to a lot of image processing), the terms probability and likelihood seem to be used interchangeably. – marcman Aug 03 '15 at 18:06
  • At a glance, that equation does not look like it's intended to be used in implementing anything. The authors immediately reduce it to the much simpler equation (11). – whuber Aug 03 '15 at 18:09
  • @whuber: Nah, that's just to demonstrate the efficacy of their system without any prior location information. It's just ignored rather than rendered unnecessary. – marcman Aug 03 '15 at 18:10
  • @whuber: Are you thus saying that my density estimate does not show P(X = 1 | R) ? – marcman Aug 03 '15 at 18:21
  • Although you haven't disclosed any details of your calculations, it certainly is not the case that $P(X=1|R)$ is determined by means of any kind of estimate: it is given by your *model* and as such will be a function of one or more *parameters* whose values you do not know (for otherwise you wouldn't make them parameters in the first place). But here we are discussing what a likelihood is, when there already are fine discussions elsewhere on this site. I refer you to them for more information. – whuber Aug 03 '15 at 18:26
  • What about algorithms where you calculate the likelihood recursively / using dynamic programming? With these methods it is common to have initial states with likelihood zero. – user3494047 Nov 26 '20 at 17:44