These are great questions, and their answers can be found in the paper *Why the logistic function? A tutorial discussion on probabilities and neural networks* by Michael I. Jordan, 1995. I highly recommend you read it all. However, I will provide a short summary here. Note that some of what I write will not necessarily be found in the paper.
In the context of supervised machine learning, our overall objective is to classify a given signal, whether that is a 1-dimensional (audio, time-series) or 2-dimensional (image) signal, represented by $\mathbf{x}$. Suppose that $\mathbf{x}$ may belong to one of two classes: $c_1$ or $c_2$. We can then classify $\mathbf{x}$ using the following decision rule:
$$
\mathbf{x} \in c_1 \ \ \text{if} \ \ p(c_1|\mathbf{x}) > p(c_2|\mathbf{x}) \\
\mathbf{x} \in c_2 \ \ \text{if} \ \ p(c_1|\mathbf{x}) < p(c_2|\mathbf{x})
$$
This decision rule is also known as the Bayes classifier, which turns out to be the optimal classifier: it maximizes the probability of correctly classifying $\mathbf{x}$. A proof of this statement can be found on the Wikipedia page for the Bayes classifier.
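To make the rule concrete, here is a minimal numerical sketch in Python. The class-conditional densities and priors are entirely made up (two 1-D Gaussians with equal priors); they are just placeholders for whatever $p(\mathbf{x}|c_k)$ and $p(c_k)$ happen to be in a real problem.
```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian; stands in for p(x | c_k) in this toy example."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def posterior_c1(x, prior_c1=0.5, prior_c2=0.5):
    """p(c_1 | x) via Bayes' rule for the assumed two-Gaussian model."""
    lik_c1 = gaussian_pdf(x, mean=-1.0, std=1.0)   # assumed p(x | c_1)
    lik_c2 = gaussian_pdf(x, mean=2.0, std=1.5)    # assumed p(x | c_2)
    num = lik_c1 * prior_c1
    return num / (num + lik_c2 * prior_c2)

def bayes_classify(x):
    """Decision rule: pick c_1 if p(c_1 | x) > p(c_2 | x), else c_2."""
    p1 = posterior_c1(x)
    return "c_1" if p1 > 1.0 - p1 else "c_2"

print(bayes_classify(0.0))   # closer to the c_1 bump -> "c_1"
print(bayes_classify(3.0))   # closer to the c_2 bump -> "c_2"
```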
Therefore, to classify $\mathbf{x}$, our objective now is to compute both $p(c_1|\mathbf{x})$ and $p(c_2|\mathbf{x})$ and compare them. Notice that:
$$
\begin{align}
p(c_1|\mathbf{x})
&= \frac{p(\mathbf{x}|c_1)p(c_1)}{p(\mathbf{x})} \\
&= \frac{p(\mathbf{x}|c_1)p(c_1)}{p(\mathbf{x}|c_1)p(c_1) + p(\mathbf{x}|c_2)p(c_2)}
\end{align}
$$
Dividing the numerator and denominator by $p(\mathbf{x}|c_1)p(c_1)$ yields:
$$
p(c_1|\mathbf{x}) = \frac{1}{1 + \frac{p(\mathbf{x}|c_2)p(c_2)}{p(\mathbf{x}|c_1)p(c_1)}}
$$
Since:
$$
\frac{p(\mathbf{x}|c_2)p(c_2)}{p(\mathbf{x}|c_1)p(c_1)} = \exp\left(\ln\left(\frac{p(\mathbf{x}|c_2)p(c_2)}{p(\mathbf{x}|c_1)p(c_1)}\right)\right)
$$
Then:
$$
\begin{align}
p(c_1|\mathbf{x})
&= \frac{1}{1 + \exp\left(\ln\left(\frac{p(\mathbf{x}|c_2)p(c_2)}{p(\mathbf{x}|c_1)p(c_1)}\right)\right)} \\
&= \frac{1}{1 + \exp\left(-\ln\left(\frac{p(\mathbf{x}|c_1)p(c_1)}{p(\mathbf{x}|c_2)p(c_2)}\right)\right)} \\
&= \sigma\left(\ln\left(\frac{p(\mathbf{x}|c_1)p(c_1)}{p(\mathbf{x}|c_2)p(c_2)}\right)\right)
\end{align}
$$
Where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the logistic function, applied here to the log-odds in favor of $c_1$. Therefore, our new objective is to estimate $p(\mathbf{x}|c_1)$, $p(c_1)$, $p(\mathbf{x}|c_2)$, and $p(c_2)$, and since:
$$
\begin{align}
p(c_1|\mathbf{x}) + p(c_2|\mathbf{x}) &= 1 \\
p(c_2|\mathbf{x}) &= 1 - p(c_1|\mathbf{x})
\end{align}
$$
Then we do not need to estimate any other probabilities to perform Bayesian classification. However, in practice, it is very difficult to accurately estimate $p(\mathbf{x}|c_1)$ and $p(\mathbf{x}|c_2)$. Here's a better idea: why not estimate $p(c_1|\mathbf{x})$ directly? This idea is attributed to Vladimir Vapnik, and it is the basic idea behind discriminative models.
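As a quick sanity check of the derivation above, the sketch below computes $p(c_1|\mathbf{x})$ twice for the same made-up two-Gaussian model used earlier: once directly from Bayes' rule and once as the logistic function of the log-odds. The two numbers agree up to floating-point error.
```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, prior_c1, prior_c2 = 0.5, 0.5, 0.5
lik_c1 = gaussian_pdf(x, -1.0, 1.0)   # assumed p(x | c_1)
lik_c2 = gaussian_pdf(x, 2.0, 1.5)    # assumed p(x | c_2)

# Posterior directly from Bayes' rule.
posterior = lik_c1 * prior_c1 / (lik_c1 * prior_c1 + lik_c2 * prior_c2)

# Posterior as the logistic function of the log-odds in favor of c_1.
log_odds = np.log((lik_c1 * prior_c1) / (lik_c2 * prior_c2))
posterior_via_sigmoid = sigmoid(log_odds)

print(posterior, posterior_via_sigmoid)   # identical up to floating-point error
```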
Notice that in the equation:
$$
p(c_1|\mathbf{x}) = \sigma\left(\ln\left(\frac{p(\mathbf{x}|c_1)p(c_1)}{p(\mathbf{x}|c_2)p(c_2)}\right)\right)
$$
For any given problem, the right-hand side is only a function of $\mathbf{x}$. Therefore, let:
$$
p(c_1|\mathbf{x}) = \sigma(f(\mathbf{x};\theta))
$$
Where $f$ is a function parameterized by $\theta$. Our final objective, then, is to choose the function $f$ and estimate its parameters $\theta$ in order to perform Bayesian classification. There are many possible choices for $f$; one example is a neural network, whose parameters are represented by $\theta$.
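Here is a minimal sketch of this parameterization, using a tiny one-hidden-layer network written in plain NumPy. The architecture, layer sizes, and random weights are arbitrary choices for illustration; the point is only that $\sigma(f(\mathbf{x};\theta))$ produces a number in $(0,1)$ that we interpret as $p(c_1|\mathbf{x})$.
```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters theta of a tiny one-hidden-layer network f(x; theta).
theta = {
    "W1": rng.normal(size=(8, 2)), "b1": np.zeros(8),   # hidden layer
    "W2": rng.normal(size=(1, 8)), "b2": np.zeros(1),   # scalar output (logit)
}

def f(x, theta):
    """f(x; theta): a small MLP mapping an input x to a single logit."""
    h = np.tanh(theta["W1"] @ x + theta["b1"])
    return (theta["W2"] @ h + theta["b2"])[0]

def posterior_c1(x, theta):
    """Model of p(c_1 | x) = sigma(f(x; theta))."""
    return sigmoid(f(x, theta))

x = np.array([0.3, -1.2])          # a 2-D feature vector for illustration
print(posterior_c1(x, theta))      # a number in (0, 1)
```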
Given that $f$ is a neural network, the next step is to estimate its parameters $\theta$. In practice, this is usually done by maximum likelihood estimation, given a training dataset:
$$
\mathcal{D} = \{(\mathbf{x}_1,c_1),(\mathbf{x}_2,c_1),\dots,(\mathbf{x}_N,c_1),(\mathbf{x}_{N+1},c_2),(\mathbf{x}_{N+2},c_2),\dots,(\mathbf{x}_{N+M},c_2)\}
$$
More details on maximum likelihood estimation can be found here.
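To illustrate, the sketch below fits such a model by maximum likelihood, i.e. by minimizing the negative log-likelihood (binary cross-entropy) with plain gradient descent. For brevity, $f(\mathbf{x};\theta)$ is taken to be linear rather than a full neural network, and the training data are synthetic; none of these choices come from the paper.
```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training set: N samples around -1 labelled c_1 (y = 1),
# M samples around +2 labelled c_2 (y = 0).  Purely illustrative.
N, M = 100, 100
X = np.concatenate([rng.normal(-1.0, 1.0, N), rng.normal(2.0, 1.5, M)]).reshape(-1, 1)
y = np.concatenate([np.ones(N), np.zeros(M)])

# Take f(x; theta) = w*x + b (a linear "network") to keep the sketch short.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    p = sigmoid(X[:, 0] * w + b)          # model of p(c_1 | x)
    # Gradient of the negative log-likelihood (binary cross-entropy).
    grad_w = np.mean((p - y) * X[:, 0])
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

p = sigmoid(X[:, 0] * w + b)
nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(w, b, nll)   # parameters that (approximately) maximize the likelihood
```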