These are great questions, and their answers can be found in the paper *Why the logistic function? A tutorial discussion on probabilities and neural networks* by Michael I. Jordan, 1995. I highly recommend you read it all. However, I will provide a short summary here. Note that some of what I write will not necessarily be found in the paper.
In the context of supervised machine learning, our overall objective is to classify a given signal, whether that is a 1-dimensional (audio, time-series) or 2-dimensional (image) signal, represented by $\mathbf{x}$. Suppose that $\mathbf{x}$ may belong to one of two classes: $c_1$ or $c_2$. We can then classify $\mathbf{x}$ using the following decision rule:
$$
\mathbf{x} \in c_1 \ \ \text{if} \ \ p(c_1|\mathbf{x}) > p(c_2|\mathbf{x}) \\
\mathbf{x} \in c_2 \ \ \text{if} \ \ p(c_1|\mathbf{x}) < p(c_2|\mathbf{x})
$$
This decision rule is also known as the Bayes classifier, which turns out to be the optimal classifier: it maximizes the probability of correctly classifying $\mathbf{x}$. A proof of this statement can be found on the Wikipedia page for the Bayes classifier.
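To make the rule concrete, here is a minimal numerical sketch in Python. The class-conditional densities and priors are entirely made up (two 1-D Gaussians with equal priors); they are just placeholders for whatever $p(\mathbf{x}|c_k)$ and $p(c_k)$ happen to be in a real problem.
```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian; stands in for p(x | c_k) in this toy example."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def posterior_c1(x, prior_c1=0.5, prior_c2=0.5):
    """p(c_1 | x) via Bayes' rule for the assumed two-Gaussian model."""
    lik_c1 = gaussian_pdf(x, mean=-1.0, std=1.0)   # assumed p(x | c_1)
    lik_c2 = gaussian_pdf(x, mean=2.0, std=1.5)    # assumed p(x | c_2)
    num = lik_c1 * prior_c1
    return num / (num + lik_c2 * prior_c2)

def bayes_classify(x):
    """Decision rule: pick c_1 if p(c_1 | x) > p(c_2 | x), else c_2."""
    p1 = posterior_c1(x)
    return "c_1" if p1 > 1.0 - p1 else "c_2"

print(bayes_classify(0.0))   # closer to the c_1 bump -> "c_1"
print(bayes_classify(3.0))   # closer to the c_2 bump -> "c_2"
```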
Therefore, to classify $\mathbf{x}$, our objective now is to compute both $p(c_1|\mathbf{x})$ and $p(c_2|\mathbf{x})$ and compare them. Notice that:
$$
\begin{align}
p(c_1|\mathbf{x})
&= \frac{p(\mathbf{x}|c_1)p(c_1)}{p(\mathbf{x})} \\
&= \frac{p(\mathbf{x}|c_1)p(c_1)}{p(\mathbf{x}|c_1)p(c_1) + p(\mathbf{x}|c_2)p(c_2)}
\end{align}
$$
Dividing the numerator and denominator by $p(\mathbf{x}|c_1)p(c_1)$ yields:
$$
p(c_1|\mathbf{x}) = \frac{1}{1 + \frac{p(\mathbf{x}|c_2)p(c_2)}{p(\mathbf{x}|c_1)p(c_1)}}
$$
Since:
$$
\frac{p(\mathbf{x}|c_2)p(c_2)}{p(\mathbf{x}|c_1)p(c_1)} = \exp\left(\ln\left(\frac{p(\mathbf{x}|c_2)p(c_2)}{p(\mathbf{x}|c_1)p(c_1)}\right)\right)
$$
Then:
$$
\begin{align}
p(c_1|\mathbf{x})
&= \frac{1}{1 + \exp\left(\ln\left(\frac{p(\mathbf{x}|c_2)p(c_2)}{p(\mathbf{x}|c_1)p(c_1)}\right)\right)} \\
&= \frac{1}{1 + \exp\left(-\ln\left(\frac{p(\mathbf{x}|c_1)p(c_1)}{p(\mathbf{x}|c_2)p(c_2)}\right)\right)} \\
&= \sigma\left(\ln\left(\frac{p(\mathbf{x}|c_1)p(c_1)}{p(\mathbf{x}|c_2)p(c_2)}\right)\right)
\end{align}
$$
Where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the logistic function, applied here to the log-odds in favor of $c_1$. Therefore, our new objective is to estimate $p(\mathbf{x}|c_1)$, $p(c_1)$, $p(\mathbf{x}|c_2)$, and $p(c_2)$, and since:
$$
\begin{align}
p(c_1|\mathbf{x}) + p(c_2|\mathbf{x}) &= 1 \\
p(c_2|\mathbf{x}) &= 1 - p(c_1|\mathbf{x})
\end{align}
$$
Then we do not need to estimate any other probabilities to perform Bayesian classification. However, in practice, it is very difficult to accurately estimate $p(\mathbf{x}|c_1)$ and $p(\mathbf{x}|c_2)$. Here's a better idea: why not estimate $p(c_1|\mathbf{x})$ directly? This idea is attributed to Vladimir Vapnik, and it is the basic idea behind discriminative models.
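As a quick sanity check of the derivation above, the sketch below computes $p(c_1|\mathbf{x})$ twice for the same made-up two-Gaussian model used earlier: once directly from Bayes' rule and once as the logistic function of the log-odds. The two numbers agree up to floating-point error.
```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, prior_c1, prior_c2 = 0.5, 0.5, 0.5
lik_c1 = gaussian_pdf(x, -1.0, 1.0)   # assumed p(x | c_1)
lik_c2 = gaussian_pdf(x, 2.0, 1.5)    # assumed p(x | c_2)

# Posterior directly from Bayes' rule.
posterior = lik_c1 * prior_c1 / (lik_c1 * prior_c1 + lik_c2 * prior_c2)

# Posterior as the logistic function of the log-odds in favor of c_1.
log_odds = np.log((lik_c1 * prior_c1) / (lik_c2 * prior_c2))
posterior_via_sigmoid = sigmoid(log_odds)

print(posterior, posterior_via_sigmoid)   # identical up to floating-point error
```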
Notice that in the equation:
$$
p(c_1|\mathbf{x}) = \sigma\left(\ln\left(\frac{p(\mathbf{x}|c_1)p(c_1)}{p(\mathbf{x}|c_2)p(c_2)}\right)\right)
$$
For any given problem, the right-hand side is only a function of $\mathbf{x}$. Therefore, let:
$$
p(c_1|\mathbf{x}) = \sigma(f(\mathbf{x};\theta))
$$
Where $f$ is a function parameterized by $\theta$. Our final objective, then, is to choose the function $f$ and estimate its parameters $\theta$ in order to perform Bayesian classification. There are many possible choices for $f$; one example is a neural network, whose parameters are represented by $\theta$.
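Here is a minimal sketch of this parameterization, using a tiny one-hidden-layer network written in plain NumPy. The architecture, layer sizes, and random weights are arbitrary choices for illustration; the point is only that $\sigma(f(\mathbf{x};\theta))$ produces a number in $(0,1)$ that we interpret as $p(c_1|\mathbf{x})$.
```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters theta of a tiny one-hidden-layer network f(x; theta).
theta = {
    "W1": rng.normal(size=(8, 2)), "b1": np.zeros(8),   # hidden layer
    "W2": rng.normal(size=(1, 8)), "b2": np.zeros(1),   # scalar output (logit)
}

def f(x, theta):
    """f(x; theta): a small MLP mapping an input x to a single logit."""
    h = np.tanh(theta["W1"] @ x + theta["b1"])
    return (theta["W2"] @ h + theta["b2"])[0]

def posterior_c1(x, theta):
    """Model of p(c_1 | x) = sigma(f(x; theta))."""
    return sigmoid(f(x, theta))

x = np.array([0.3, -1.2])          # a 2-D feature vector for illustration
print(posterior_c1(x, theta))      # a number in (0, 1)
```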
Given that $f$ is a neural network, the next step is to estimate its parameters $\theta$. In practice, this is usually done by maximum likelihood estimation, given a training dataset:
$$
\mathcal{D} = \{(\mathbf{x}_1,c_1),(\mathbf{x}_2,c_1),\dots,(\mathbf{x}_N,c_1),(\mathbf{x}_{N+1},c_2),(\mathbf{x}_{N+2},c_2),\dots,(\mathbf{x}_{N+M},c_2)\}
$$
More details on maximum likelihood estimation can be found here.
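To illustrate, the sketch below fits such a model by maximum likelihood, i.e. by minimizing the negative log-likelihood (binary cross-entropy) with plain gradient descent. For brevity, $f(\mathbf{x};\theta)$ is taken to be linear rather than a full neural network, and the training data are synthetic; none of these choices come from the paper.
```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training set: N samples around -1 labelled c_1 (y = 1),
# M samples around +2 labelled c_2 (y = 0).  Purely illustrative.
N, M = 100, 100
X = np.concatenate([rng.normal(-1.0, 1.0, N), rng.normal(2.0, 1.5, M)]).reshape(-1, 1)
y = np.concatenate([np.ones(N), np.zeros(M)])

# Take f(x; theta) = w*x + b (a linear "network") to keep the sketch short.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    p = sigmoid(X[:, 0] * w + b)          # model of p(c_1 | x)
    # Gradient of the negative log-likelihood (binary cross-entropy).
    grad_w = np.mean((p - y) * X[:, 0])
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

p = sigmoid(X[:, 0] * w + b)
nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(w, b, nll)   # parameters that (approximately) maximize the likelihood
```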