In the case of supervised classification, we wish to predict the label of an unseen observation $x\in\mathcal{X}$ by assigning it some label $y \in \mathcal{Y}$. Specifically, we want to find the label $y^*$ as follows: $$\begin{split} y^* &= \underset{y\in\mathcal{Y}}{\text{arg max}\;} \Pr(Y=y|X=x)\\ &= \underset{y\in\mathcal{Y}}{\text{arg max}\;} \frac{\Pr(Y=y)\Pr(X=x|Y=y)}{\Pr(X=x)} \quad\text{(by Bayes' theorem)}\\ &= \underset{y\in\mathcal{Y}}{\text{arg max}\;} \Pr(Y=y)\Pr(X=x|Y=y) \quad\text{(since $\Pr(X=x)$ does not depend on $y$)}\\ \end{split}$$
Of course, if the distribution of labels is uniform, then $\Pr(Y=y) = 1/|\mathcal{Y}|$ is constant for all $y \in \mathcal{Y}$. In that case, we can drop $\Pr(Y=y)$ as well and simplify the above into: $$\begin{split} y^* &= \underset{y\in\mathcal{Y}}{\text{arg max}\;} \Pr(X=x|Y=y)\\ \end{split}$$
And that's essentially what any supervised classification learning algorithm aims to find. E.g., SVM, naive Bayes, etc. essentially fit models that imply some definition of those probabilities.
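To make the two decision rules concrete, here is a minimal sketch in Python; the prior and likelihood values are hypothetical, chosen only so that the two rules disagree:

```python
# Minimal sketch: MAP vs. ML decision rules on a two-label problem.
# All probabilities below are hypothetical, for illustration only.

labels = ["guilty", "not guilty"]

# Pr(Y = y): a non-uniform class prior.
prior = {"guilty": 0.6, "not guilty": 0.4}

# Pr(X = x | Y = y): likelihood of the observed evidence x under each label.
likelihood = {"guilty": 0.3, "not guilty": 0.4}

# MAP rule: arg max_y  Pr(Y = y) * Pr(X = x | Y = y)
y_map = max(labels, key=lambda y: prior[y] * likelihood[y])

# ML rule (prior dropped, i.e. assumed uniform): arg max_y  Pr(X = x | Y = y)
y_ml = max(labels, key=lambda y: likelihood[y])

print(y_map)  # "guilty":     0.6 * 0.3 = 0.18  beats  0.4 * 0.4 = 0.16
print(y_ml)   # "not guilty": 0.4 beats 0.3 once the prior is ignored
```

The same evidence convicts under the first rule and acquits under the second; the only difference is whether the prior $\Pr(Y=y)$ is used.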
Now, my question is: suppose a suspect $x$ is to be classified as either guilty or not guilty, and suppose that $\Pr(Y=\text{guilty}) = 0.6$. Should we use this knowledge when judging suspects? Or should we instead ignore this prior and assume that $\Pr(Y=\text{guilty}) = \Pr(Y=\text{not guilty}) = 0.5$?
My attempt:
I would imagine that dropping $\Pr(Y=y)$ is recommended in legal systems, such as courts. For example, if "theft" is a highly common crime, e.g. $\Pr(Y=\text{theft}) = 0.8$, then we must not rule that suspect $x$ is a thief simply because others tend to be thieves. In other words, we should assume that $\Pr(Y=y)=1/|\mathcal{Y}|$ for every crime $y$, so that all judgements against suspect $x$ are based solely on maximizing $\Pr(X=x|Y=y)$.
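To see concretely how a strong prior can dominate the evidence (reducing the problem to theft vs. not theft, with hypothetical likelihoods): take $\Pr(X=x \mid Y=\text{theft}) = 0.1$ and $\Pr(X=x \mid Y=\text{not theft}) = 0.35$. The MAP rule compares $0.8 \times 0.1 = 0.08$ against $0.2 \times 0.35 = 0.07$ and convicts, even though the evidence itself favors innocence ($0.35 > 0.1$); dropping the prior reverses the verdict.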
Any thoughts?