
I am thinking I can't be the only one encountering this. I am trying to do maximum likelihood estimation on a probit model, i.e. trying to find the best fit for three parameters in my case by maximizing the following log-likelihood:

$$l\left(\beta; y, X\right) = \sum_{i=1}^N \left[y_i \ln\left(F\left(x_i \beta\right)\right) + \left(1 - y_i\right) \ln \left(1-F\left(x_i \beta\right)\right)\right]$$

$y_i$ can take the value 0 or 1 depending on whether there has been an event or not, and $F\left(x_i \beta\right)$ is the probability inside the $\ln$ function. So in my case I am just computing a large grid of parameters (which is used as input to the $F$ function), and then doing this computation over all $i = 1, \dots, N$ cases.

However, my problem is that some parameter sets (in the grid) will return a probability of exactly 1 or 0. That is unavoidable. And in those $i$-th cases (depending on whether the corresponding $y_i$ is 1 or 0) the result will be $\ln(0)$, which is not something you would want, I imagine.

How does one deal with this problem, or should it be dealt with at all? My first idea was to calculate the $F$ function for all cases $i$ for all parameter sets in the grid, then figure out which specific elements/indices have the value 1 or 0, and remove those indices from all $i$ cases in the grid, so that I wouldn't have a 1 or 0 anywhere and every case would have the same indices removed across the board. But that seems like a weird solution to me, so that's why I am here.
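To make the issue concrete, this is roughly what happens, as a minimal sketch (the argument 10 is just an arbitrary example of a grid point far into the tail):

```python
import numpy as np
from scipy.stats import norm

p = norm.cdf(10.0)       # tail mass is smaller than double precision can hold, so this is exactly 1.0
print(p)                 # 1.0
print(np.log(1.0 - p))   # -inf (log(0)), which then poisons the whole sum over i
```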

Denver Dang
  • You can impose the rule that $0\times\log(0)=0$ and take $1\times\log(0)$ as a very large negative value. – Xi'an Sep 04 '18 at 11:40
  • Why are probabilities of 0 and 1 unavoidable in a probit model? Or are you talking about some problem with finite precision? – The Laconic Sep 04 '18 at 12:06
  • @TheLaconic I don't know if that is general, but in my case it is. Or at least the way I am using MLE on my data. I have a 100x100x100 grid, i.e. 100 values of each of the parameters I am trying to estimate (in a range I know is more probable than others). But obviously some of the upper/lower bounds of the parameters are so far from the true value that they in turn yield a probability of 0 or 1 ($F$ returns a value between 0 and 1 in my case). And if for such a case $y_i$ is 0 and its probability is 1, then I will end up with $\ln(0)$. – Denver Dang Sep 04 '18 at 13:09
  • @Xi'an It just seems weird. Large log-likelihood contributions indicate good agreement with the parameters, i.e. if you have a lot of cases $i$ where the contribution is large, that particular set of parameters will most likely be the "chosen set", since it maximizes the log-likelihood the most when summed over all $i$. And now I would impose $0 \times \log(0) = 0$ for a set of parameters that should not be very likely, i.e. a high value since most contributions in this case are negative, while treating another, equally unlikely set as very unlikely (i.e. a very large negative number). It seems wrong. – Denver Dang Sep 04 '18 at 13:32
  • $F$ is just the normal CDF. It doesn't return 0 or 1 for a finite argument. So what's going on here? Is it returning 0 or 1 to within numerical precision, for certain parameter values? – The Laconic Sep 04 '18 at 14:03
  • Well, in my calculation (I'm using Python's scipy `norm.cdf`) I do in fact get the value 1 or 0. The equations used can be seen in this article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4786007/ at equations 1, 2 and 3. $NTCP$ is the CDF in this case, and the $t$ (used as input to the CDF) can vary a lot. $TD50$ can vary from 1 to 150 (or something similar), $m$ can vary from 0.01 to 3, and the $gEUD$ is perhaps around 10 to 100 (something like that). So $t$ can be quite large depending on the values used. And as far as I can see, $t$ doesn't have to be higher than about 10 in order to get a CDF of 1. – Denver Dang Sep 04 '18 at 14:34
  • "And now I impose either a 0×log(0)=0 for a set of parameters that should not be very likely" I don't understand what you're saying here. If $y_i$ and F agree, then that point will not contribute anything to $l$. ($l$ looks at what didn't happen, and penalizes the model based on how much probability mass it assigned to that possibility). If one is 0 and the other is 1, then the model is a terrible fit (it is saying is in infinitely certain in a case that it is wrong). – Acccumulation Sep 04 '18 at 18:51
  • The show Million Dollar Money Drop is actually a very good analogy for this process: if you put all your money on one option, and you're right, the amount of money you have doesn't change. If you put it all on one option and you're wrong, you lose everything. The final amount of money you're left with is a measure of how good you are at assigning probabilities. Similarly, if a model assigns probability 1 to an outcome and is right, the likelihood doesn't change. If it's wrong, the likelihood drops to zero. – Acccumulation Sep 04 '18 at 18:55

1 Answer


I agree with TheLaconic that this is a numerical precision issue; the easiest work-around would be to use the `scipy.stats.norm.logcdf` method, instead of computing the CDF and then taking the log, which sometimes creates exactly this kind of numerical precision issue.

For the component that depends on $1 - F(\cdot)$, consider that the so-called survival function $s(a)$ is defined as $s(a) = 1 - F(a)$, so we can use `scipy.stats.norm.logsf` to take care of that without even doing algebra.
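Putting the two together, a minimal sketch of the whole log-likelihood might look like the following (the function name and the assumption that `X @ beta` gives the linear predictor are just illustrative):

```python
import numpy as np
from scipy.stats import norm

def probit_loglik(beta, X, y):
    """Probit log-likelihood evaluated entirely in log space."""
    z = X @ beta
    # logcdf(z) = log F(z) and logsf(z) = log(1 - F(z)); both stay finite for
    # any finite z, so no log(0) appears even deep in the tails.
    return np.sum(y * norm.logcdf(z) + (1 - y) * norm.logsf(z))
```

Evaluated over your parameter grid, this stays finite everywhere; grid points that badly contradict the data simply receive very large negative log-likelihoods rather than $-\infty$.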

Sycorax
  • Hmmm, I hadn't noticed that function, and it does seem to work. However, I run into some trouble with the second term of the function, $\left(1 - y_i\right) \ln \left(1-F\left(x_i \beta\right)\right)$. For the first term ($y_i \ln\left(F\left(x_i \beta\right)\right)$) I could easily just use `logcdf`, but in the case that $y_i = 0$ I would in principle be doing $\ln(1-\ln(F))$, which wouldn't be the same as the original. – Denver Dang Sep 04 '18 at 17:13
  • @DenverDang I've updated my answer. – Sycorax Sep 04 '18 at 17:32
  • Ah, that is brilliant. One question though: should that only be used when I know I am in the $1-F(x)$ term, or should it just be used in all cases? For example, I am using `if` statements to say `if y_i == 1`, and then I would usually just do `log(scipy.stats.norm(0, 1).cdf(x))` for that and `log(1 - scipy.stats.norm(0, 1).cdf(x))` for $y_i = 0$. So should that instead be `scipy.stats.norm(0, 1).logcdf(x)` for $y_i = 1$ and `scipy.stats.norm(0, 1).logsf(x)` for $y_i = 0$? – Denver Dang Sep 04 '18 at 17:54
  • The way I would code it would be `y * scipy.stats.norm(0, 1).logcdf(x) + (1 - y) * scipy.stats.norm(0, 1).logsf(x)` – Sycorax Sep 04 '18 at 17:57
  • Thank you so much, I didn't know this existed :) – Denver Dang Sep 04 '18 at 18:08