Minimizing the zero-one loss
$$
L(\theta, \widehat{\theta})
= 1 - \mathbf{1}_{\{\theta\}}(\widehat{\theta})
= \begin{cases}
0 & \text{if $\theta = \widehat{\theta}$} \\
1 & \text{otherwise}
\end{cases}
$$
is the same as maximum likelihood estimation with a family of Dirac distributions: discrete distributions with probability mass function
$$
p(x \mid \theta)
= \mathbf{1}_{\{\theta\}}(x)
= \begin{cases}
1 & \text{if $x = \theta$} \\
0 & \text{otherwise.}
\end{cases}
$$
Here $\mathbf{1}_A$ denotes the indicator function of a set $A$.
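Spelling the equivalence out: since $p(\theta \mid \widehat{\theta}) = \mathbf{1}_{\{\widehat{\theta}\}}(\theta) = \mathbf{1}_{\{\theta\}}(\widehat{\theta})$, the loss can be rewritten as
$$
L(\theta, \widehat{\theta}) = 1 - p(\theta \mid \widehat{\theta}),
$$
so minimizing the zero-one loss over $\widehat{\theta}$ is the same as maximizing the Dirac likelihood of the "observation" $\theta$.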
Regarding your beliefs about zero-one loss:
> MAP is Bayes optimal for zero-one loss
Consider a parametric family of distributions $\{P_\theta : \theta \in \Omega\}$ and a prior $\pi$ on $\Omega$.
The risk of any estimator $\widehat{\theta}$ of $\theta$ is
$$
\begin{aligned}
R(\theta, \widehat{\theta})
&= E_\theta[L(\theta, \widehat{\theta})] \\
&= E_\theta[1 - \mathbf{1}_{\{\theta\}}(\widehat{\theta})] \\
&= P_\theta(\widehat{\theta} \neq \theta),
\end{aligned}
$$
and so the Bayes risk is
$$
\begin{aligned}
R_\pi(\widehat{\theta})
&= \int_\Omega R(\theta, \widehat{\theta}) \, d\pi(\theta) \\
&= \int_\Omega P_\theta(\widehat{\theta} \neq \theta) \, d\pi(\theta).
\end{aligned}
$$
A priori it is not clear to me why a MAP estimator $\widehat{\theta}$ would minimize this Bayes risk in this general setup.
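In the discrete case, though, the claim can be made precise (a sketch, assuming $\Omega$ is countable and writing $\pi(\cdot \mid x)$ for the posterior): conditioning on the data $X = x$, the posterior expected loss of an estimate $\widehat{\theta}(x)$ is
$$
E\big[L(\theta, \widehat{\theta}(x)) \mid X = x\big]
= 1 - \pi\big(\{\widehat{\theta}(x)\} \mid x\big),
$$
which is minimized pointwise in $x$ by taking $\widehat{\theta}(x)$ to be a mode of the posterior, i.e. the MAP estimate. When the posterior is continuous, every singleton has posterior probability zero, so every estimator has Bayes risk $1$ and the statement holds only vacuously.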
> Logistic regression models probabilities directly, and classifies to the largest, but its parameters are optimized based on maximum likelihood (leading to cross-entropy) rather than optimizing zero-one loss directly.
This is correct: the quantity logistic regression minimizes is the negative log-likelihood of a Bernoulli model, which is exactly the cross-entropy between the observed labels and the predicted probabilities.
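As a minimal illustration of that point (a sketch with synthetic data; NumPy only, and all names here are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # synthetic features
w_true = np.array([1.5, -2.0])
p_true = 1.0 / (1.0 + np.exp(-X @ w_true))
y = (rng.random(200) < p_true).astype(float)     # synthetic 0/1 labels

def cross_entropy(w):
    """Negative log-likelihood of the Bernoulli/logistic model,
    i.e. the cross-entropy between labels and predicted probabilities."""
    p = 1.0 / (1.0 + np.exp(-X @ w))             # predicted P(class = 1 | x)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Maximizing the likelihood = minimizing the cross-entropy, which is smooth
# and convex in w, so plain gradient descent works; the gradient has the
# closed form X^T (p - y) / n.
w = np.zeros(2)
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - y) / len(y)
print(w, cross_entropy(w))                       # w roughly recovers w_true
```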
> The performance of logistic regression could be improved by using zero-one loss, or you could ditch logistic regression entirely and use frequency tables to approximate P(class|X), though both of these options are entirely intractable.
Zero-one loss is neither differentiable nor convex, so the traditional optimization methods used to train models like logistic regression don't apply to it.
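A small sketch (same kind of synthetic setup; the helper names are mine) makes the difficulty concrete: the zero-one loss is piecewise constant in the parameters, so its gradient is zero almost everywhere and gradient-based optimizers get no signal to follow.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) > 0).astype(float)   # linearly separable labels

def zero_one_loss(w):
    """Fraction of training points misclassified by 'predict 1 iff x @ w > 0'."""
    return np.mean((X @ w > 0) != y)

# The loss is a step function of w: an infinitesimal perturbation almost
# never flips any prediction, so finite-difference "gradients" are exactly
# zero almost everywhere.
w = np.array([1.0, 1.0])
eps = 1e-6
print((zero_one_loss(w + np.array([eps, 0.0])) - zero_one_loss(w)) / eps)  # 0.0
```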
The problem with frequency tables is that as soon as you encounter a point X that isn't in the training data, the table gives you no estimate of P(class|X) at all; unlike a parametric model, it has no way to generalize to unseen inputs.
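A toy sketch of that failure mode (hypothetical data; the table-based estimator is purely illustrative):

```python
from collections import defaultdict

# Hypothetical training pairs (feature vector, class). A frequency table
# just memorizes per-value counts; it carries no structure beyond them.
train = [((0, 1), 1), ((0, 1), 0), ((1, 1), 1), ((1, 0), 0)]
counts = defaultdict(lambda: [0, 0])   # x -> [class-0 count, class-1 count]
for x, c in train:
    counts[x][c] += 1

def p_class_given_x(x):
    """Table estimate of P(class = 1 | X = x), if x was ever observed."""
    n0, n1 = counts[x]
    if n0 + n1 == 0:
        raise ValueError(f"x = {x} never seen in training: no estimate at all")
    return n1 / (n0 + n1)

print(p_class_given_x((0, 1)))      # 0.5 -- seen twice in training
print(p_class_given_x((1, 0)))      # 0.0 -- seen once
try:
    p_class_given_x((0, 0))         # unseen X: the table has nothing to say
except ValueError as e:
    print(e)
```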