Generally, neural networks are not used to model complete probability densities; their focus is typically on modeling the mean of a distribution (or, in a deterministic setting, simply a non-linear function). Nevertheless, it is entirely possible to model complete probability densities with neural networks.
One straightforward way to do this, in the Gaussian case, is to emit the mean from one output of the network and the variance from another, and then minimize $-\log N(y \mid x; \mu, \sigma)$ during training instead of the usual squared error. This is the maximum likelihood procedure for a neural network.
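As a rough illustration, here is a minimal sketch of such a two-headed network, assuming PyTorch (the architecture, the softplus parameterization of $\sigma$, and all names are my own illustrative choices, not prescribed by anything above):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianNet(nn.Module):
    """Emits mu and sigma of a conditional Gaussian p(y | x)."""
    def __init__(self, in_dim=1, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.mu_head = nn.Linear(hidden, 1)     # unrestricted output
        self.sigma_head = nn.Linear(hidden, 1)  # mapped to positive values below

    def forward(self, x):
        h = self.body(x)
        mu = self.mu_head(h)
        sigma = F.softplus(self.sigma_head(h)) + 1e-6  # keep sigma strictly > 0
        return mu, sigma

def gaussian_nll(y, mu, sigma):
    # -log N(y | mu, sigma), up to the additive constant 0.5 * log(2 pi)
    return (0.5 * ((y - mu) / sigma) ** 2 + torch.log(sigma)).mean()
```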
Once you train this network, every time you plug an $x$ value in as input it will give you $\mu$ and $\sigma$; you can then plug the whole triplet $(y, \mu, \sigma)$ into the density $f(y \mid x) = N(y \mid \mu, \sigma)$ to obtain the density value for any $y$ you like. At this stage you can choose which $y$ value to use based on a real domain loss function. One thing to keep in mind is that the output activation for $\mu$ should be unrestricted, so that it can emit anything from $-\infty$ to $+\infty$, while $\sigma$ needs a positive-only activation (such as an exponential or softplus).
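Continuing the same hypothetical sketch, a training step and a density query could look like this (the toy data and query point are made up for illustration):

```python
net = GaussianNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# one maximum-likelihood training step on a toy batch (x, y)
x = torch.randn(64, 1)
y = 2.0 * x + 0.5 * torch.randn(64, 1)
loss = gaussian_nll(y, *net(x))
opt.zero_grad(); loss.backward(); opt.step()

# after training: evaluate the density f(y | x) for any y you like
with torch.no_grad():
    mu, sigma = net(torch.tensor([[0.3]]))
    y_query = torch.linspace(-3.0, 3.0, 5).view(-1, 1)
    density = torch.exp(-0.5 * ((y_query - mu) / sigma) ** 2) / (
        sigma * math.sqrt(2 * math.pi))
```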
In general, unless it is a deterministic function that we are after, the standard squared-error training used in neural networks is pretty much the same procedure described above. Under the hood, a Gaussian distribution is implicitly assumed without caring about $\sigma$, and if you examine $-\log N(y \mid x; \mu, \sigma)$ carefully it yields an expression for the squared loss (the loss function of the Gaussian maximum likelihood estimator). In that scenario, however, instead of a $y$ value of your liking, you are stuck with emitting $\mu$ every time you are given a new $x$ value.
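To see this explicitly, write out the negative log-likelihood:

$$-\log N(y \mid x; \mu, \sigma) = \frac{(y - \mu)^2}{2\sigma^2} + \log \sigma + \frac{1}{2}\log 2\pi.$$

With $\sigma$ held fixed (say $\sigma = 1$), the only term that depends on the network output $\mu$ is the squared error $(y - \mu)^2$, up to a constant factor, so minimizing the negative log-likelihood and minimizing the squared loss give the same $\mu$.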
For classification the output will be a Bernoulli distribution instead of a Gaussian, which has a single parameter to emit. As specified in the other answer, this parameter lies between $0$ and $1$, so the output activation should be chosen accordingly; it can be a logistic function or something else that achieves the same purpose.
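In code, the Bernoulli version of the earlier sketch might look like this (same caveats: the names and layer sizes are hypothetical, and the negative log-likelihood is simply the familiar binary cross-entropy):

```python
class BernoulliNet(nn.Module):
    """Emits p = P(y = 1 | x), the single Bernoulli parameter."""
    def __init__(self, in_dim=1, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid())  # squash output into (0, 1)

    def forward(self, x):
        return self.net(x)

def bernoulli_nll(y, p, eps=1e-7):
    # -log Bernoulli(y | p), i.e. binary cross-entropy
    p = p.clamp(eps, 1 - eps)
    return -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
```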
A more sophisticated approach is Bishop's Mixture Density Networks. You can read about them in this frequently referenced paper:
https://publications.aston.ac.uk/373/1/NCRG_94_004.pdf
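Very roughly, an MDN head emits the mixture weights, means, and scales of $K$ Gaussians, and training again minimizes the negative log-likelihood. A hypothetical sketch in the same style as above, reusing its imports (the paper's own formulation is of course the authoritative one):

```python
class MDN(nn.Module):
    """Mixture density network: p(y | x) = sum_k pi_k * N(y | mu_k, sigma_k)."""
    def __init__(self, in_dim=1, hidden=32, k=5):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, k)     # mixture weights (via softmax)
        self.mu = nn.Linear(hidden, k)     # component means
        self.sigma = nn.Linear(hidden, k)  # component scales (via softplus)

    def forward(self, x):
        h = self.body(x)
        return (F.softmax(self.pi(h), dim=-1),
                self.mu(h),
                F.softplus(self.sigma(h)) + 1e-6)

def mdn_nll(y, pi, mu, sigma):
    # -log sum_k pi_k N(y | mu_k, sigma_k), computed stably in log space
    log_comp = (-0.5 * ((y - mu) / sigma) ** 2
                - torch.log(sigma) - 0.5 * math.log(2 * math.pi))
    return -torch.logsumexp(torch.log(pi) + log_comp, dim=-1).mean()
```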