
I find I learn best by example, but I can't seem to find any that match this problem; at best, people appear bizarrely unwilling to show where in these abstract equations you actually insert which numbers to do an actual calculation, or they make assumptions, like the domain being Gaussian, that don't seem to apply to what I'm doing.

So I have a table of binary training data, that is, 3 input columns which I'll label x1, x2, and x3, and an output column which I'll call C. All of these simply contain yes/no (or 1/0 if you prefer) answers. Feel free to redefine these names if they collide with your preferred notation.

Here is an example table of data (I suspect this will not format nicely on mobile devices).

+-----+-----+-----+-----+
| x1  | x2  | x3  |  C  |
+-----+-----+-----+-----+
| yes | yes | no  | no  |
| no  | yes | no  | yes |
| yes | yes | no  | no  |
| no  | yes | no  | no  |
| yes | no  | yes | yes |
| no  | yes | no  | yes |
| no  | yes | no  | no  |
| yes | no  | yes | yes |
| yes | yes | no  | yes |
| no  | yes | no  | no  |
| yes | no  | no  | yes |
+-----+-----+-----+-----+

$P(C=y) = 6/11$, $P(C=n) = 5/11$

What should I do with this to produce some sort of MLE? What does that tell me about my data? Am I missing the point?

I have seen this CV.SE question, and I feel like it's not a big step to apply its accepted answer to my data, but I just can't connect it with this data set; I need to see an example of it being used.

Thank you very much. :) If there's any further info you need, do comment and ask; I'm new to CV and fairly new to this sort of thing in general.

ch4rl1e97
  • Welcome to CV! Please do not get discouraged by my comment, I hope it's constructive :) I find some parts of your question confusing. E.g., I believe the features in Naïve Bayes are usually assumed to be just independent, not i.i.d. Also, what exactly do you mean by $P(C=y)=6/11$? Finally, if you want an MLE of parameters, at the very least you need to define what your parameters ARE. – Andris Birkmanis Feb 25 '20 at 04:28
  • Hi, thanks! Whoops, some bad formatting there. I will fix/remove some stuff there. I was under the impression that having all the P(x|C) values was sufficient but I'll generate a table of sample data instead and replace this. I'll file that under me missing the point – ch4rl1e97 Feb 25 '20 at 04:39
  • Hi again, done, sample data added. – ch4rl1e97 Feb 25 '20 at 04:48
  • If you only have 11 data points, your estimates are likely way off. That said, looks like you already did MLE of P(C=y) - you may want to consult the link you referenced before to see which other parameters you are supposed to estimate. Hint: those are conditional probabilities. After that, it's a matter of plugging them all together. – Andris Birkmanis Feb 25 '20 at 05:39
  • Yeah I realise 11 points isn't going to give good results; this is merely a toy problem to understand the methods. (The actual problem only provides 20 points lol) P(c=y/n) is just total yes/no divided by total count. Is that literally all it is? The class has been poorly taught (my opinion, obviously) and MLE never came up, at least not with that exact name. I can calculate stuff like P(x1=yes | C=yes), is that what I'm after? I did that already, and turned those into tables, should I add them into the question? How do I plug those together, or would those tables be sufficient? – ch4rl1e97 Feb 25 '20 at 07:02
  • Apologies for the notification. So I think these tables are what I'm after? One table per input column: https://i.imgur.com/TNUlEy5.png (Note they're for a different, though similar, set of data, so the numbers aren't quite right for the above table.) Is that simply it? Are those the MLE? Obtaining those tables is as far as my classes went, though they weren't really named "MLE" as such, just "ML" was vaguely mentioned. I feel quite stupid :( If not, are you able to tell me the next step? Thanks again Andris. I'll write an answer using the sample table I gave if this is correct. – ch4rl1e97 Feb 25 '20 at 13:03
  • Assuming E, I, and R correspond to x1, x2, and x3, you are on the right track. Just plug your conditional probabilities into the Bayesian equation. You may either normalize or just compare odds, depending on what is the end goal of this exercise. – Andris Birkmanis Feb 26 '20 at 05:41
  • Yes x1->E etc. I've generated this full table with all the numbers shoved back in and then a likelihood calculated for each row. https://i.imgur.com/F8rHJrI.png I assume the last step then is to multiply down that column to get a singular value? Though this feels strange, does this number actually tell us anything? Feels like performing a machine learning task and the evaluating it with the same training data :/ I do have test data however this comes up in an entirely separate section of the assignment and isn't to be used here. – ch4rl1e97 Feb 26 '20 at 16:54
  • You are almost there. It feels strange for two reasons: you are reusing your training data for testing, and you are treating the test label as an input instead of an output. You can avoid the first problem by splitting your data into separate training and testing sets. That would reduce the amount of training data even more, but should be OK for building up understanding. To fix the second problem, calculate likelihoods for both classes of label and pick the one with the highest one as your prediction. For bonus points, calculate the ratio of the two likelihoods for every testing example. – Andris Birkmanis Feb 27 '20 at 06:01
  • I have previously used actual ML software (sklearn etc.) very successfully and am aware of test splits etc.; I'm more familiar with that sort of thing than the raw statistical stuff here. This entire assignment just seems weird. The question was worded roughly as "Provide the maximum likelihood parameter estimate for a Naive Bayes model given the training data in Table 1". Multiplying across all in/outs in the row seems to be how an example from the Prof gets a Maximum Likelihood, so I'll leave it now. I have tested it with the data in a different table also (it performed really badly, haha) – ch4rl1e97 Feb 27 '20 at 20:29
  • OK, you can skip the split, if you know what you are doing and you have no more data (though I still would not do it - sure, your estimates become even less accurate, but you get a better intuition of what is going on). But definitely calculate likelihoods of both labels if you want to make a prediction. If you do not need to make a prediction (as the wording of the task seems to imply), then you do not need even one likelihood - you had the MLE of your parameters at the moment you divided counts. – Andris Birkmanis Feb 28 '20 at 05:35
  • Thanks for your continued patience with me Andris! Glad to know I managed to get what I needed! Ran out of time and had to submit what I had done, so I'm pleased to see I accidentally did what I had to do. Thanks again, friend <3 To be clear I did what you said in a latter question, where I calculated both labels for the test data! – ch4rl1e97 Feb 29 '20 at 16:22

2 Answers


The goal is to obtain the conditional distribution $P(C|x_1, x_2, x_3)$, which we model as \begin{align*} P(C|x_1, x_2, x_3) &= \frac{P(x_1,x_2,x_3|C)P(C)}{P(x_1,x_2,x_3)} && \text{(Bayes rule)} \\[1.2ex] &= \frac{P(x_1|C)P(x_2|C)P(x_3|C)P(C)}{P(x_1,x_2,x_3)} && \text{(the Naive Bayes assumption)} \\[1.2ex] &= k P(x_1|C)P(x_2|C)P(x_3|C)P(C) \end{align*} The constant $k$ can be ignored, and computed later, by noting that $$P(C=\text{yes}|x_1,x_2,x_3) + P(C=\text{no}|x_1,x_2,x_3) = 1.$$


Application to toy data

First we have $$P(C=\text{yes}) = \frac{6}{11} \quad\quad P(C=\text{no}) = \frac{5}{11}.$$

The distribution of each $x_i$, conditional on $C$, can also be obtained from the data; these count ratios are exactly the maximum likelihood estimates of the model's parameters.

$$P(x_1=\text{yes}|C=c) = \begin{cases} \frac{4}{6}, & c=\text{yes} \\[1.1ex] \frac{2}{5}, & c=\text{no} \end{cases}$$

$$P(x_2=\text{yes}|C=c) = \begin{cases} \frac{3}{6}, & c=\text{yes} \\[1.1ex] \frac{5}{5}, & c=\text{no} \end{cases}$$

$$P(x_3=\text{yes}|C=c) = \begin{cases} \frac{2}{6}, & c=\text{yes} \\[1.1ex] \frac{0}{5}, & c=\text{no} \end{cases}$$
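To make the counting concrete, here is a minimal Python sketch of how these numbers fall out of the table (my own illustration, not part of the original answer; `data`, `prior`, and `cond` are hypothetical names):

```python
# Minimal sketch: MLE for Naive Bayes on the toy table is just count ratios.
from collections import Counter

# The 11 rows of the table as (x1, x2, x3, C), with 1 = yes and 0 = no.
data = [
    (1, 1, 0, 0), (0, 1, 0, 1), (1, 1, 0, 0), (0, 1, 0, 0),
    (1, 0, 1, 1), (0, 1, 0, 1), (0, 1, 0, 0), (1, 0, 1, 1),
    (1, 1, 0, 1), (0, 1, 0, 0), (1, 0, 0, 1),
]

n = len(data)
class_counts = Counter(row[3] for row in data)  # {1: 6, 0: 5}

# Prior: P(C=c) = (# rows with C=c) / n
prior = {c: class_counts[c] / n for c in (0, 1)}

# Conditionals: P(x_i=yes | C=c) = (# rows with x_i=yes and C=c) / (# rows with C=c)
cond = {}
for i in range(3):
    for c in (0, 1):
        num = sum(1 for row in data if row[i] == 1 and row[3] == c)
        cond[(i, c)] = num / class_counts[c]

print(prior[1], prior[0])  # 6/11 ≈ 0.545, 5/11 ≈ 0.455
print(cond[(0, 1)])        # P(x1=yes | C=yes) = 4/6 ≈ 0.667
print(cond[(2, 0)])        # P(x3=yes | C=no)  = 0/5 = 0.0
```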

Now you would like to predict the class of a new instance with the data ($x_1=\text{yes}, x_2=\text{yes}, x_3=\text{yes}$). To do this, compute

$$P(C=\text{yes}|x_1=\text{yes}, x_2=\text{yes}, x_3=\text{yes}) = k\frac{4}{6}\frac{3}{6}\frac{2}{6}\frac{6}{11} = k(0.0606)$$ $$P(C=\text{no}|x_1=\text{yes}, x_2=\text{yes}, x_3=\text{yes}) = k\frac{2}{5}\frac{5}{5}\frac{0}{5}\frac{5}{11} = 0$$

This indicates that the class is yes with probability one. The problem is that with such a small data set, you will see some $0$s and $1$s in the conditional distributions $P(x_i|C)$. To account for this, we can use Laplace smoothing, which involves adding $\epsilon$ to the numerator and $d\epsilon$ to the denominator of each conditional distribution (where $d=2$ is the number of values each $x_i$ can take). Setting $\epsilon = 1$ we would obtain

$$P(C=\text{yes}|x_1=\text{yes}, x_2=\text{yes}, x_3=\text{yes}) = k\frac{5}{8}\frac{4}{8}\frac{3}{8}\frac{6}{11} = k(0.0639)$$ $$P(C=\text{no}|x_1=\text{yes}, x_2=\text{yes}, x_3=\text{yes}) = k\frac{3}{7}\frac{6}{7}\frac{1}{7}\frac{5}{11} = k(0.0239)$$

Choosing $k$ so that these sum to $1$, we get that this instance is class "yes" with probability $0.73$.
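Continuing the sketch above (reusing the hypothetical `data`, `prior`, and `class_counts`), the smoothed prediction for $(x_1=\text{yes}, x_2=\text{yes}, x_3=\text{yes})$ can be checked like this:

```python
# Laplace smoothing: add eps to the numerator and d*eps to the denominator.
eps = 1.0
d = 2  # each x_i takes two possible values (yes/no)

def smoothed_cond(i, x, c):
    """Smoothed estimate of P(x_i = x | C = c)."""
    num = sum(1 for row in data if row[i] == x and row[3] == c)
    return (num + eps) / (class_counts[c] + d * eps)

instance = (1, 1, 1)  # x1=yes, x2=yes, x3=yes

# Unnormalized posterior score for each class: P(C=c) * prod_i P(x_i | C=c)
score = {}
for c in (0, 1):
    s = prior[c]
    for i, x in enumerate(instance):
        s *= smoothed_cond(i, x, c)
    score[c] = s

print(score[1])                          # ≈ 0.0639
print(score[0])                          # ≈ 0.0239
print(score[1] / (score[0] + score[1]))  # ≈ 0.73, as above
```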

knrumsey
  • Thanks a lot, great explanation plus detailed worked example, unfortunately I can only pick one answer! +1 though – ch4rl1e97 Feb 29 '20 at 16:26

This answer assumes you only want to estimate the parameters of a model that predicts $C$ from $x_i$.

We are modeling a conditional distribution $P(C|x_1,x_2,x_3)$.

The Naïve Bayes (NB) assumption is a conditional independence of predictors given the result: $P(x_1,x_2,x_3|C) = \prod_i{P(x_i|C)}$.

This can be used to calculate the goal distribution using Bayes theorem: $P(C|x_1,x_2,x_3) = P(x_1,x_2,x_3|C)P(C)/P(x_1,x_2,x_3)$.

Combining the two equations, we get: $P(C|x_1,x_2,x_3) = \frac{P(C)}{P(x_1,x_2,x_3)}\prod_i{P(x_i|C)}$.

This model has these parameters to be estimated: $P(C)$ and $P(x_i|C)$.

Technically, there is also $P(x_1,x_2,x_3)$, which you need in order to normalize and get an actual conditional probability. If you are only interested in prediction, you can get away with comparing non-normalized likelihoods, in which case you do not need this parameter. Another way to see this is to model the joint distribution instead: $P(C,x_1,x_2,x_3) = P(C)\prod_i{P(x_i|C)}$.

The MLE for these parameters is obtained simply by taking the ratio of counts (as you did, at least for some).

TLDR: At the moment you divided counts, you performed MLE, and if that was all required, you are done.

If, however, you want to perform predictions, you can calculate either conditional probabilities (more computation) or likelihoods (and compare them or take their ratio).
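As a sketch of that last point (my own illustration, not the answerer's code; all names are hypothetical), here is the likelihood-comparison route in Python, hard-coding the count-ratio MLEs from the table above:

```python
# Compare non-normalized likelihoods P(C=c) * prod_i P(x_i | C=c) directly,
# using the unsmoothed MLE count ratios from the toy table.
prior = {"yes": 6 / 11, "no": 5 / 11}
cond = {  # cond[c][i] = MLE of P(x_i = yes | C = c)
    "yes": [4 / 6, 3 / 6, 2 / 6],
    "no":  [2 / 5, 5 / 5, 0 / 5],
}

def likelihood(c, instance):
    """Non-normalized P(C=c, x1, x2, x3) under the Naive Bayes factorization."""
    p = prior[c]
    for i, x in enumerate(instance):
        p *= cond[c][i] if x == 1 else 1 - cond[c][i]
    return p

instance = (1, 1, 1)  # x1=yes, x2=yes, x3=yes
l_yes, l_no = likelihood("yes", instance), likelihood("no", instance)
print(l_yes, l_no)  # ≈ 0.0606 vs 0.0 -> "yes" wins (no smoothing applied)
print("prediction:", "yes" if l_yes > l_no else "no")
```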

Andris Birkmanis
  • Thanks a lot! Had two great answers but I'll accept yours given the help you gave me over the last few days – ch4rl1e97 Feb 29 '20 at 16:24