Generally, neural networks are not used to model complete probability densities; their focus is typically on modeling the mean of a distribution (or, in a deterministic setting, simply a non-linear function). Nevertheless, it is entirely possible to model complete probability densities with neural networks.
One straightforward way to do this, in the Gaussian case, is to emit the mean from one output of the network and the variance from another, and then minimize $-\log N(y \mid x; \mu, \sigma)$ during training instead of the usual squared error. This is the maximum likelihood procedure for a neural network.
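As a rough illustration, here is a minimal sketch of such a two-headed network, assuming PyTorch (the architecture, the softplus parameterization of $\sigma$, and all names are my own illustrative choices, not prescribed by anything above):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianNet(nn.Module):
    """Emits mu and sigma of a conditional Gaussian p(y | x)."""
    def __init__(self, in_dim=1, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.mu_head = nn.Linear(hidden, 1)     # unrestricted output
        self.sigma_head = nn.Linear(hidden, 1)  # mapped to positive values below

    def forward(self, x):
        h = self.body(x)
        mu = self.mu_head(h)
        sigma = F.softplus(self.sigma_head(h)) + 1e-6  # keep sigma strictly > 0
        return mu, sigma

def gaussian_nll(y, mu, sigma):
    # -log N(y | mu, sigma), up to the additive constant 0.5 * log(2 pi)
    return (0.5 * ((y - mu) / sigma) ** 2 + torch.log(sigma)).mean()
```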
Once you train this network, every time you plug an $x$ value in as input it will give you $\mu$ and $\sigma$; you can then plug the whole triplet $(y, \mu, \sigma)$ into the density $f(y \mid x) = N(y \mid \mu, \sigma)$ to obtain the density value for any $y$ you like. At this stage you can choose which $y$ value to use based on a real domain loss function. One thing to keep in mind is that the output activation for $\mu$ should be unrestricted, so that it can emit anything from $-\infty$ to $+\infty$, while $\sigma$ needs a positive-only activation (such as an exponential or softplus).
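Continuing the same hypothetical sketch, a training step and a density query could look like this (the toy data and query point are made up for illustration):

```python
net = GaussianNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# one maximum-likelihood training step on a toy batch (x, y)
x = torch.randn(64, 1)
y = 2.0 * x + 0.5 * torch.randn(64, 1)
loss = gaussian_nll(y, *net(x))
opt.zero_grad(); loss.backward(); opt.step()

# after training: evaluate the density f(y | x) for any y you like
with torch.no_grad():
    mu, sigma = net(torch.tensor([[0.3]]))
    y_query = torch.linspace(-3.0, 3.0, 5).view(-1, 1)
    density = torch.exp(-0.5 * ((y_query - mu) / sigma) ** 2) / (
        sigma * math.sqrt(2 * math.pi))
```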
In general, unless it is a deterministic function that we are after, the standard squared-error training used in neural networks is pretty much the same procedure described above. Under the hood, a Gaussian distribution is implicitly assumed without caring about $\sigma$, and if you examine $-\log N(y \mid x; \mu, \sigma)$ carefully it yields an expression for the squared loss (the loss function of the Gaussian maximum likelihood estimator). In that scenario, however, instead of a $y$ value of your liking, you are stuck with emitting $\mu$ every time you are given a new $x$ value.
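To see this explicitly, write out the negative log-likelihood:

$$-\log N(y \mid x; \mu, \sigma) = \frac{(y - \mu)^2}{2\sigma^2} + \log \sigma + \frac{1}{2}\log 2\pi.$$

With $\sigma$ held fixed (say $\sigma = 1$), the only term that depends on the network output $\mu$ is the squared error $(y - \mu)^2$, up to a constant factor, so minimizing the negative log-likelihood and minimizing the squared loss give the same $\mu$.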
For classification the output will be a Bernoulli distribution instead of a Gaussian, which has a single parameter to emit. As specified in the other answer, this parameter lies between $0$ and $1$, so the output activation should be chosen accordingly; it can be a logistic function or something else that achieves the same purpose.
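In code, the Bernoulli version of the earlier sketch might look like this (same caveats: the names and layer sizes are hypothetical, and the negative log-likelihood is simply the familiar binary cross-entropy):

```python
class BernoulliNet(nn.Module):
    """Emits p = P(y = 1 | x), the single Bernoulli parameter."""
    def __init__(self, in_dim=1, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid())  # squash output into (0, 1)

    def forward(self, x):
        return self.net(x)

def bernoulli_nll(y, p, eps=1e-7):
    # -log Bernoulli(y | p), i.e. binary cross-entropy
    p = p.clamp(eps, 1 - eps)
    return -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
```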
A more sophisticated approach is Bishop's Mixture Density Networks. You can read about them in this frequently referenced paper:
https://publications.aston.ac.uk/373/1/NCRG_94_004.pdf
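Very roughly, an MDN head emits the mixture weights, means, and scales of $K$ Gaussians, and training again minimizes the negative log-likelihood. A hypothetical sketch in the same style as above, reusing its imports (the paper's own formulation is of course the authoritative one):

```python
class MDN(nn.Module):
    """Mixture density network: p(y | x) = sum_k pi_k * N(y | mu_k, sigma_k)."""
    def __init__(self, in_dim=1, hidden=32, k=5):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, k)     # mixture weights (via softmax)
        self.mu = nn.Linear(hidden, k)     # component means
        self.sigma = nn.Linear(hidden, k)  # component scales (via softplus)

    def forward(self, x):
        h = self.body(x)
        return (F.softmax(self.pi(h), dim=-1),
                self.mu(h),
                F.softplus(self.sigma(h)) + 1e-6)

def mdn_nll(y, pi, mu, sigma):
    # -log sum_k pi_k N(y | mu_k, sigma_k), computed stably in log space
    log_comp = (-0.5 * ((y - mu) / sigma) ** 2
                - torch.log(sigma) - 0.5 * math.log(2 * math.pi))
    return -torch.logsumexp(torch.log(pi) + log_comp, dim=-1).mean()
```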