Your "heuristic" that needing normalization has to do with conditional probability is wrong.
IMHO a better explanation is that some classifiers have built-in scaling, others don't.
Consider LDA:
During LDA, the data are projected so that the within-class covariance ellipsoid becomes a unit sphere. This whitening projection automatically removes issues with different scales between the variates - in fact, it achieves a bit more than scaling the individual variates could, because it also removes the within-class correlations between them (see the sketch below).
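A minimal sketch of this scale invariance, assuming scikit-learn, the iris data, and made-up rescaling factors:

```python
# Rough sketch assuming scikit-learn and the iris data; the rescaling factors
# are arbitrary, made up only to distort the feature scales.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
scale = np.array([1.0, 1000.0, 0.001, 1.0])  # per-feature rescaling

lda_raw = LinearDiscriminantAnalysis().fit(X, y)
lda_scaled = LinearDiscriminantAnalysis().fit(X * scale, y)

# Fraction of identical class assignments; expect ~1.0 up to numerical effects,
# because whitening the within-class covariance absorbs the rescaling.
print(np.mean(lda_raw.predict(X) == lda_scaled.predict(X * scale)))
```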
But it has nothing to do with calculating conditional probabilities.
In fact, you could do $k$-nearest-neighbour classification in the LD score space. Equivalently, you could use the Mahalanobis distance w.r.t. the pooled within-class covariance for your $k$ nearest neighbours, as in the sketch below.
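A minimal sketch of that Mahalanobis variant, again assuming scikit-learn and the iris data:

```python
# Rough sketch assuming scikit-learn and the iris data: k-NN with Mahalanobis
# distance w.r.t. the pooled within-class covariance matrix.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
n, p = X.shape
classes = np.unique(y)

# Pooled within-class covariance: weighted average of the per-class covariances.
S_W = np.zeros((p, p))
for c in classes:
    Xc = X[y == c]
    S_W += (len(Xc) - 1) * np.cov(Xc, rowvar=False)
S_W /= n - len(classes)

knn = KNeighborsClassifier(
    n_neighbors=5,
    metric="mahalanobis",
    metric_params={"VI": np.linalg.inv(S_W)},  # inverse covariance for the metric
    algorithm="brute",
)
knn.fit(X, y)
print(knn.score(X, y))  # resubstitution accuracy, just to show the classifier runs
```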
Naive Bayes classifiers all have the built-in behaviour that each variable is treated individually - so if scaling turns out not to be needed, it is for totally different reasons than with the built-in projection of an LDA (illustration below).
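For illustration (same assumptions: scikit-learn, iris data), per-feature standardization leaves a Gaussian naive Bayes classifier essentially unchanged - but only because each feature's class-conditional distribution is modelled separately, not because of any whitening:

```python
# Rough sketch assuming scikit-learn and the iris data: Gaussian naive Bayes
# models each feature's class-conditional distribution separately, so
# per-feature standardization leaves the predictions essentially unchanged.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pred_raw = GaussianNB().fit(X, y).predict(X)
pred_std = GaussianNB().fit(X_std, y).predict(X_std)

# Fraction of identical class assignments; expect ~1.0 up to smoothing/numerical effects.
print(np.mean(pred_raw == pred_std))
```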
The convergence behaviour of other classifiers' training algorithms (e.g. SVM, neural networks) may be sensitive not only to the relative but even to the absolute scale of the features, for purely numeric reasons - see the demo below.
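A quick, informal demonstration of that sensitivity for an RBF-kernel SVM (scikit-learn, iris data, arbitrary distortion factors); the same standardize-first advice applies to gradient-based neural network training:

```python
# Rough sketch assuming scikit-learn and the iris data: an RBF-kernel SVM with
# deliberately distorted feature scales vs. the same SVM with standardization.
# The distortion factors are arbitrary, chosen only to make the effect visible.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_bad = X * np.array([1.0, 1e4, 1.0, 1e-4])  # distorted feature scales
Xtr, Xte, ytr, yte = train_test_split(X_bad, y, random_state=0)

svm_raw = SVC(kernel="rbf").fit(Xtr, ytr)
svm_std = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(Xtr, ytr)

print("no scaling:  ", svm_raw.score(Xte, yte))   # typically much worse
print("standardized:", svm_std.score(Xte, yte))
```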