
Assume that the conditional density of $ y \mid x $ is a Beta distribution for every value of $x$. Can a Beta distribution whose parameters are computed by a neural net, i.e. $\mathrm{Beta}(\hat{\alpha}, \hat{\beta})$ with $\begin{bmatrix} \hat{\alpha} & \hat{\beta} \end{bmatrix} = f(x; \theta)$, where $ f(x;\theta) $ is a neural net parameterised by $ \theta $, approximate $ y \mid x $ asymptotically? In other words, is the following proposition true if $ f(x; \theta) $ is a neural net that can approximate any continuous function: $$ \exists \theta \ \text{s.t.} \ \big[ \ p(y \mid x) = \mathrm{Beta}\big(y; f(x; \theta)\big), \ \forall x, y \ \big] $$ ?

1 Answer


If I understand you correctly, you have $y_i$ values that, conditional on $x_i$, follow a beta distribution, and the relationship can be described as

$$ (\alpha_i, \beta_i) = f(x_i; \theta) \\ y_i \sim \mathsf{Beta}(\alpha_i, \beta_i) $$

where $f(\cdot\,; \theta)$ is an unknown function to be estimated. Can neural networks do this? Sure. Beta regression does exactly this, just with a simpler model for approximating the function $f$. Train the network by maximizing the likelihood, defined in terms of the beta distribution, of the $y_i$ targets, and use whatever neural network architecture you like to approximate $f$; a minimal sketch is given below. One thing you could do to simplify the model is to re-parametrize the beta distribution in terms of location and precision, as beta regression does.
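For concreteness, here is a minimal sketch of that idea, assuming PyTorch; the network name `BetaNet`, the toy data, and the hyperparameters are all hypothetical and only illustrate training by maximum likelihood with a beta observation model (using the raw $(\alpha, \beta)$ parameterization rather than location/precision).

```python
# Minimal sketch (PyTorch assumed): a network mapping x to (alpha, beta),
# trained by maximizing the beta log-likelihood of the targets y.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaNet(nn.Module):
    def __init__(self, in_dim=1, hidden=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, 2)
        )

    def forward(self, x):
        # softplus (plus a small constant) keeps both shape parameters positive
        out = F.softplus(self.body(x)) + 1e-6
        return out[..., 0], out[..., 1]  # alpha, beta

def neg_log_lik(model, x, y):
    alpha, beta = model(x)
    return -torch.distributions.Beta(alpha, beta).log_prob(y).mean()

# Hypothetical toy data: y | x ~ Beta with parameters that depend on x
x = torch.rand(512, 1)
y = torch.distributions.Beta(2 + 3 * x.squeeze(), 5 - 3 * x.squeeze()).sample()
y = y.clamp(1e-4, 1 - 1e-4)  # keep targets strictly inside (0, 1)

model = BetaNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    neg_log_lik(model, x, y).backward()
    opt.step()
```

Softplus is just one convenient way to keep $\hat{\alpha}, \hat{\beta} > 0$; with the location/precision parameterization you would instead pass one output through a sigmoid (location in $(0,1)$) and the other through softplus (precision).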

If you would like to learn the distribution of the $y_i$'s in a Bayesian way, you would need to switch to Bayesian ground: use something like the KL divergence and move from a standard neural-network framework to a probabilistic-programming one, such as PyMC3, Pyro, or TensorFlow Probability. How to do that is a longer discussion.
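As a rough illustration of that direction (not the full neural-network version), here is a hedged sketch assuming PyMC3, where $\alpha$ and $\beta$ are simple log-linear functions of $x$ with priors on the coefficients; the variable names, priors, and toy data are hypothetical.

```python
# Hedged sketch (PyMC3 assumed): the simplest Bayesian analogue of the model
# above, with alpha and beta log-linear in x; a Bayesian neural network would
# follow the same pattern with more parameters.
import numpy as np
import pymc3 as pm

x = np.random.rand(200)
y = np.random.beta(2 + 3 * x, 5 - 3 * x)  # hypothetical toy data

with pm.Model():
    a0 = pm.Normal("a0", 0.0, 1.0)
    a1 = pm.Normal("a1", 0.0, 1.0)
    b0 = pm.Normal("b0", 0.0, 1.0)
    b1 = pm.Normal("b1", 0.0, 1.0)
    alpha = pm.math.exp(a0 + a1 * x)  # exp keeps the shape parameters positive
    beta = pm.math.exp(b0 + b1 * x)
    pm.Beta("y_obs", alpha=alpha, beta=beta, observed=y)
    trace = pm.sample(1000, tune=1000)
```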

Tim
  • Thanks for the reply. "[…] and the relationship can be described as […]": I am not assuming that the relationship can be described that way. My question is whether it can be described that way by a neural net asymptotically; the only information I have is that the quantity $y$ conditioned on some value of the quantity $x$ follows a beta distribution. I do not intend to train a neural net to do this; my question is purely theoretical. –  Jul 30 '21 at 15:29
  • @HaziqMuhammad if you want to build a mathematical model, then you need to describe the relationship as some kind of function. The function can be complicated and unknown; a neural network can still learn such a function from the data, the same way it learns all the other complicated and unknown functions. – Tim Jul 30 '21 at 15:32
  • AFAICS Tim's reference to beta regression is correct. You can add essentially any noise model you like to a neural network; most applications stick to Gaussian or Bernoulli, but there is nothing to stop you using more complicated models. For example, P.M. Williams used a hybrid Bernoulli/Gamma network for modelling rainfall (I've also implemented it, and it works really well): https://proceedings.neurips.cc/paper/1997/file/56352739f59643540a3a6e16985f62c7-Paper.pdf – Dikran Marsupial Jul 30 '21 at 15:34
  • @Tim Are you saying that without assuming the existence of a continuous function with those properties, we can make no guarantees about whether a neural net with universal approximation capabilities can be used to describe the conditional density of $y$ given $x$? Would it not suffice to assume only that $y$ given $x$ is always a beta distribution in order to prove that? –  Jul 30 '21 at 15:42
  • @HaziqMuhammad but prove what, exactly? To define the problem you need to mention a function. In mathematical terms, what is the relationship between $x$ and $y$ if not a function? – Tim Jul 30 '21 at 15:53
  • @Tim Assume $ f(x; \theta) $ is a neural net that outputs a 2d vector and has universal approximation capabilities and is parameterised by $ \theta $. Is there a value of $ \theta $ such that $$ p(y|x) = Beta( y; \alpha = f(x; \theta)_1, \beta = f(x; \theta)_2), \ \forall{x,y} $$ given that we only know $$ [ \ \exists{a, b} \ s.t. \ p(y|x) = Beta(y; \alpha = a, \beta = b) \ ], \ \forall{x} \ $$? –  Jul 30 '21 at 16:14
  • @HaziqMuhammad I don't understand what you are trying to say. – Tim Jul 30 '21 at 16:16
  • @Tim We know that the second proposition, i.e. that the conditional density of $y$ given $x$ is always some beta distribution, is true. Does the first proposition follow from this? The first proposition is $$ \exists{\theta} \ \text{s.t.} \ [ \ p(y|x) = Beta(y; \alpha = f(x; \theta)_1, \beta = f(x; \theta)_2), \ \forall{x,y} \ ] $$ where $ f(x; \theta) $ is a neural net parameterised by $ \theta $ that can approximate any function of $ x $ that outputs a real 2d vector. –  Jul 30 '21 at 16:26
  • @HaziqMuhammad you use a neural network to *approximate* the unknown functional relationship between $x$ and $y$. The relationship itself is not a neural network, in the same way that the MNIST images were not created by a neural network; they are just scans of hand-written digits whose distribution we can approximate using a neural network. – Tim Jul 30 '21 at 16:31
  • @Tim If you have an arbitrary function $ g(x) $ and a neural net with infinite depth and width parameterised by $ \theta $, $f(x; \theta)$, then there is a $\theta$, i.e. weights and biases, such that $f(x; \theta) = g(x)$, right? Or have I misunderstood the whole universal approximation theorem business? :) –  Jul 30 '21 at 16:49
  • @HaziqMuhammad you seem to be missing the *approximation* part; they are not equal. An NN can approximate a function. – Tim Jul 30 '21 at 17:21