3

I am a scientist fitting some binomial data, and I have been using maximum likelihood.

My model gives a probability for each datum, $L = P(y|\theta)$. The likelihood does not come from a simple logistic model; rather, the model estimates $P$ over a range of hyperparameters and returns the marginal.

I was very surprised to find that I get different estimates of theta if I maximise the likelihood of outcome 1 occurring

$\hat{\theta}_1=\underset{\theta}{\mathrm{argmax}}\, L_y$

or if I minimise the estimated probability of the opposite outcome occurring

$\hat{\theta}_0=\underset{\theta}{\mathrm{argmin}}\,(1-L_y)$.

And through experimenting I realised this is because in general,

$\underset{\theta}{\mathrm{argmax}} \Big[ \sum_y \log P(y\mid\theta) \Big] \ne \underset{\theta}{\mathrm{argmax}} \Big[ -\sum_y \log\big(1-P(y\mid\theta)\big) \Big]$

I thought my estimate should be symmetrical for the two outcomes. I'm sure it's something simple, so I looked for an explanation online, but I did not know where to start: googling for "minimum unlikelihood" and suchlike did not get me very far!
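To make the discrepancy concrete, here is a minimal sketch that reproduces it with a made-up stand-in model (a simple logistic curve with simulated data and a grid search over $\theta$, not my actual marginal model):

```python
import numpy as np

# Stand-in model (an assumption for illustration): P(y_i = 1 | theta) = sigmoid(theta * x_i)
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = (rng.random(50) < 1 / (1 + np.exp(-1.5 * x))).astype(int)  # simulate with theta = 1.5

def per_datum_lik(theta):
    """L_i = P(y_i | theta): probability the model assigns to each observed outcome."""
    p = 1 / (1 + np.exp(-theta * x))
    return np.where(y == 1, p, 1 - p)

thetas = np.linspace(-5, 5, 2001)
loglik = np.array([np.log(per_datum_lik(t)).sum() for t in thetas])
logunlik = np.array([np.log(1 - per_datum_lik(t)).sum() for t in thetas])

print(thetas[loglik.argmax()])    # argmax_theta sum_i log P(y_i | theta): the MLE
print(thetas[logunlik.argmin()])  # argmin_theta sum_i log(1 - P(y_i | theta)): runs to a grid edge
```

The two printed estimates differ: the second criterion is dominated by driving $1-L_i$ towards zero for a subset of the data, so it runs away from the MLE.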

**Edit**

It seems like $\hat{\theta}_1$ overweights outcome 0, and $\hat{\theta}_0$ overweights outcome 1, is that right?

Sanjay Manohar
  • The binomial likelihood for 0/1 data is the product over *all* the observations of $p(y|\theta)$. There's only one likelihood function for the entire sample, not separate ones for the 1's and the 0's. – Glen_b Jan 12 '17 at 11:12
  • I'm voting to close this question as off-topic because the equivalence between maximising $f(x)$ and minimising $1-f(x)$ does not sound of sufficient interest. – Xi'an Jan 12 '17 at 14:11
  • The likelihood function is NOT a probability function, because it is a function of the parameters for fixed values of the observations. You won't find an "unlikelihood" function because statisticians do not use that terminology. – Michael R. Chernick Jan 12 '17 at 14:36
  • I'm voting to close this question as off-topic because the OP is wrongly interpreting the likelihood function as a probability function. – Michael R. Chernick Jan 12 '17 at 14:37
  • @MichaelChernick I totally disagree... this has **nothing** to do with the question being off-topic. If someone misunderstands something about statistics, *this* is the place to ask the question. – Tim Jan 12 '17 at 15:05
  • @Michael Evidence of confusion in a question is often taken as a solid reason for keeping it open, not closing it! – whuber Jan 12 '17 at 15:05
  • Thanks for the comments, which eventually led me to see my logical error! (posted as additional answer) – Sanjay Manohar Jan 12 '17 at 16:37

2 Answers

7

Expanding on Glen_b's comment: the binomial likelihood is

$$ L(\theta\mid n,k) \propto \theta^k(1-\theta)^{n-k} $$

where $k$ is the number of successes in a sample of size $n$. So if you instead look at the number of failures $r = n-k$ and their probability $\xi = 1-\theta$, then you get exactly the same likelihood function:

$$ L(\xi\mid n,r) \propto \xi^r(1-\xi)^{n-r} = (1-\theta)^{n-k}\theta^k $$

The two parameterisations therefore have the same maximiser, with $\hat{\xi} = 1-\hat{\theta}$: the estimate *is* symmetrical in the two outcomes, provided you swap the counts along with the parameter.
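A quick numeric check of this symmetry (a minimal sketch; the counts $n=20$, $k=14$ and the grid are arbitrary choices):

```python
import numpy as np

n, k = 20, 14                              # e.g. 14 successes in 20 trials
theta = np.linspace(0.001, 0.999, 999)     # grid over the parameter

L_theta = theta**k * (1 - theta)**(n - k)  # L(theta | n, k)
L_xi    = theta**(n - k) * (1 - theta)**k  # L(xi | n, r), r = n - k, evaluated at xi = theta

print(theta[L_theta.argmax()])  # ~0.7 = k/n
print(theta[L_xi.argmax()])     # ~0.3 = r/n, i.e. xi_hat = 1 - theta_hat
```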

Tim
7

Thanks for the comments and answer, which eventually led me to the source of my confusion. I thought I'd share this for everyone, though it's pretty obvious now.

Let's say there are two observations $y_1,y_2$, which can be heads or tails, and I observed $HH$.

$P(HH\mid\theta) = P(y_1=H\mid\theta) \cdot P(y_2=H\mid\theta)$: the probability of observing two heads.

$\hat{\theta}=\underset{\theta}{\mathrm{argmax}}\, P(HH\mid\theta)$

Then I flipped the polarity of each datum and minimised instead, expecting to recover the same estimate. But what I actually computed was

$\underset{\theta}{\mathrm{argmin}}\big[ (1-P(y_1=H\mid\theta)) \cdot (1-P(y_2=H\mid\theta)) \big]$

which is not the same as $\underset{\theta}{\mathrm{argmin}}\,\big(1-P(HH\mid\theta)\big)$.

I thought that I was minimising the probability that I would not observe $HH$.

But actually, I was minimising the probability that I would observe $TT$.

Which is obviously not the same thing, because I forgot to account for the $HT$ and $TH$ possibilities!
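Here is a minimal sketch of the trap with a made-up per-datum model (the Gaussian-bump form of $P(y_i=H\mid\theta)$ is purely illustrative, not my real model):

```python
import numpy as np

# Made-up illustrative model: P(y_i = H | theta) = exp(-(theta - x_i)^2),
# with two data points at x = 0 and x = 3.
x = np.array([0.0, 3.0])
thetas = np.linspace(-1.0, 4.0, 5001)

p = np.exp(-(thetas[:, None] - x) ** 2)  # P(y_i = H | theta) per theta, per datum
P_HH = p.prod(axis=1)                    # P(HH | theta) = p1 * p2
P_TT = (1 - p).prod(axis=1)              # what I actually minimised: (1-p1)(1-p2)

print(thetas[P_HH.argmax()])        # 1.5: maximising P(HH)
print(thetas[(1 - P_HH).argmin()])  # 1.5: minimising 1 - P(HH) agrees, as it must
print(thetas[P_TT.argmin()])        # ~0 (or ~3): minimising P(TT) lands somewhere else
```

Maximising $P(HH\mid\theta)$ and minimising $1-P(HH\mid\theta)$ agree exactly; it is minimising $P(TT\mid\theta)$ that answers a different question.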

Sanjay Manohar
  • And so, in summary, it's actually *true* that the parameter set that maximises the likelihood of $\{y_i\}$ is **not** the same as the one that minimises the likelihood of the "inverse observations" $\{\neg\, y_i\}$ (each datapoint's polarity inverted). That's simply because $\{\neg\, y_i\} \ne \neg\,\{y_i\}$. – Sanjay Manohar Jan 14 '17 at 22:58