Predicting Y based on distribution of X

Question

Suppose I have two random variables Y and X, where Y is given as a single point while X is given as a distribution. I am trying to predict Y based on X, but I cannot put the whole distribution of X in a column because I do not have a single value. I could estimate some statistic(s) of the distribution and use only those, but I lose too much information this way. Are there options for such cases that let me include more (ideally all) of the information about the distribution in a standard tabular form, for use in standard statistical modeling?

As an example, suppose I am trying to predict the weather tomorrow (Y), and my X is a distribution of possible values obtained through simulations. How could I include as much information as possible about X while still keeping the number of columns to a minimum, so as to avoid high-dimensional data?

Imagine that a simulation is run each day and produces a fixed number of samples for X, let's say 1000, which depict a probabilistic outcome of Y for tomorrow. I am able to run such a simulation because I partly understand the process that generates Y. The simulation results (the whole distribution of X) are important, since they carry a lot of information about the range, shape, modes, etc. of the possible outcomes, and I postulate that a better prediction of Y can be obtained using information about the whole distribution rather than just one estimate.

Note1: the distributions of X are arbitrary and do not follow any standard ones.
Note2: the values change over time and are not stationary.

Interesting question. So-called [tag:functional-data] looks at it the other way around, predicting entire (density or other) functions. One wonders whether the simplest way would be to include different functionals of $X$ in a model for $Y$, e.g., the mean, higher moments, quantiles etc. Including *all* information for arbitrary distributions of $X$ is probably not realistic. — Stephan Kolassa, Dec 15 '21 at 12:38

score 1 · Answer 1 · answered Dec 15 '21 at 16:40

Given a probability density/mass function for X, the most likely value is simply the peak of the distribution. If you need to predict one value Y, this is the choice that will be correct most often. In the question, you have concerns about losing too much information if you estimate summary statistics and use only those, but note that this is exactly the problem you're trying to solve: given a distribution, you need to return one single value. No matter how you compute it, this is a "summary statistic" of the distribution. You do lose information, but that's inevitable when mapping a distribution X to one single number Y.

Suppose you run a series of weather simulations to predict the amount of rain tomorrow, and get 1000 results. The distribution of values peaks at 1 cm, indicating that 1 cm is the most likely amount of rain tomorrow. This means 1 cm is more likely than any other value, and is therefore your best guess if you need to pick one value. Note that the range or shape of the distribution is irrelevant given the distribution's peak. It doesn't matter if some simulations predicted 0 cm or 1000 cm of rain, since neither of those is as likely as 1 cm. Encoding the distribution of predicted rainfalls in some way and then predicting one value from the distribution is no different from computing a summary statistic from the distribution in the first place: in either case, you need to map a distribution to one single value.
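
For illustration only, here is a minimal sketch of how the peak could be read off 1000 simulated values. The gamma-distributed draws are a made-up stand-in for real simulation output, and the choice of 30 histogram bins is an assumption, not something prescribed above.

```python
import numpy as np

# Hypothetical stand-in for one day's 1000 simulated rainfall amounts (cm).
rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=0.5, size=1000)

# Histogram the draws; the number of bins (30) is an arbitrary choice.
counts, edges = np.histogram(samples, bins=30)

# Take the midpoint of the fullest bin as a rough estimate of the peak.
peak_bin = np.argmax(counts)
mode_estimate = 0.5 * (edges[peak_bin] + edges[peak_bin + 1])

print(f"estimated peak: {mode_estimate:.2f} cm")
```

With continuous samples the estimated peak depends on the bin width, so a different binning (or a kernel density estimate) may give a somewhat different value.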

score 1 · Answer 2 · answered Dec 17 '21 at 02:02

We could really use some more information: for instance, what does $Y$ represent, and how do you run the simulations producing $X$? The answer by @Nuclear Hoagie assumes the simulations give a true predictive distribution for $Y$. If that is so (and you assume squared error loss for the prediction), the mean of the $X$'s is the correct answer.
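
As a small numerical check of the squared-error claim, the sketch below compares the average squared error of a few candidate point predictions against a set of simulated draws. The gamma draws are hypothetical placeholder data, and the check only says something about the real problem if the draws really do come from the predictive distribution of $Y$.

```python
import numpy as np

# Hypothetical stand-in for the simulated draws of X (assumed here to be
# draws from the predictive distribution of Y).
rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=0.5, size=1000)

def mean_squared_error(prediction, draws):
    """Average squared error of a single point prediction over the draws."""
    return np.mean((draws - prediction) ** 2)

# Compare two common point summaries.
candidates = {"mean": samples.mean(), "median": np.median(samples)}
losses = {name: mean_squared_error(value, samples)
          for name, value in candidates.items()}

# Also scan a grid of possible predictions to see where the loss is lowest.
grid = np.linspace(samples.min(), samples.max(), 501)
grid_losses = np.array([mean_squared_error(g, samples) for g in grid])
best_on_grid = grid[np.argmin(grid_losses)]

print(losses)
print("best grid point:", best_on_grid, "sample mean:", samples.mean())
```

Under a different loss the preferred summary changes; for absolute error, for example, the median would be the minimizer instead.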

But if not (and how can you know?), maybe try something else. One idea is to bin $X$ into a histogram and use the bin counts as predictors. Another, probably better, is to compute various descriptors from $X$ (maybe the mean, the median, some quantiles, ...) and use those as predictors. The problem looks somewhat similar to ABC (approximate Bayesian computation), so maybe have a look at "ABC. How can it avoid the likelihood function?" or "ABC with Lotka-Volterra (or any dynamical system)".
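
Both ideas can be sketched in a few lines. The helper `summarize_samples`, the quantile levels, the bin edges, and the gamma draws below are all illustrative assumptions; they are not taken from the question and would need to be chosen for the real data.

```python
import numpy as np

def summarize_samples(samples, bin_edges,
                      quantile_levels=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Turn one day's simulated draws into a fixed-length row of predictors."""
    samples = np.asarray(samples, dtype=float)
    features = {
        "mean": float(samples.mean()),
        "std": float(samples.std(ddof=1)),
    }
    # Descriptor features: a handful of quantiles.
    for q, value in zip(quantile_levels, np.quantile(samples, quantile_levels)):
        features[f"q{round(q * 100):02d}"] = float(value)
    # Histogram features: fraction of draws in each fixed bin.
    # Draws outside bin_edges are not counted, so fractions may sum to < 1.
    counts, _ = np.histogram(samples, bins=bin_edges)
    for i, count in enumerate(counts):
        features[f"bin{i}_frac"] = float(count) / samples.size
    return features

# Hypothetical stand-in for one day's 1000 simulated values.
rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=0.5, size=1000)

# The same bin edges must be used for every day so columns stay comparable.
bin_edges = np.linspace(0.0, 5.0, 6)
features = summarize_samples(samples, bin_edges)
print(features)
```

Calling the function once per day gives one row per day with the same columns, which can then be stacked into an ordinary table and passed to any standard regression model alongside the observed $Y$.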
