Density forecasts are more universal than point forecasts; they provide information on the whole predicted distribution of a random variable rather than on a concrete function thereof (such as the predicted mean, median, or a quantile). The availability of a density forecast allows different users to pick out the elements (point forecasts) that are of interest to them. Some users will focus on the predicted mean, others on the predicted median, etc., depending on the loss function by which the forecast is evaluated (and which may differ from user to user). Given a density forecast, every user's needs will be satisfied regardless of the loss function, because the density forecast contains all probabilistic information about the random variable.

However, if we have a concrete user in mind and know his/her loss function, then

  • Does the density forecast provide any added value over a point forecast tailored to the loss function?
  • If the answer is No in general, what are the conditions to make it a Yes?

P.S. @hejseb draws an interesting parallel between a point forecast tailored to the loss function and a sufficient statistic; perhaps this can inspire an answer.

Richard Hardy
  • Thanks to @StephanKolassa's answer including counterexamples to my initial thoughts, I have expanded the question (added the second part). – Richard Hardy Jun 18 '18 at 11:41
  • How are you generating the density forecast? – Glen_b Jun 18 '18 at 12:29
  • @Glen_b, I have not thought of that. I hope the question makes sense for generic point and density forecasts. If not, could you suggest any concretization? – Richard Hardy Jun 18 '18 at 12:55
  • Well, for example, if the density forecast is obtained by basing it on the loss function (say where $L = -\log(f)$ for some $f$, specified up to a scaling constant), then that would be quite different than if it was derived in some other way. – Glen_b Jun 18 '18 at 13:20
  • @Glen_b, Different in which way, with effects on what? I suppose this has to do with what precision/accuracy one can expect from a density forecast vs. a point forecast. Conceptually, this is not the most interesting point for me (though I might be missing its importance). I could be willing to assume that the density forecast is of approximately equal quality or inferior (due to its complexity) to point forecast, in the sense of user's loss due to forecast imprecision/inaccuracy. – Richard Hardy Jun 18 '18 at 13:42
  • I mean it would result in a different answer to the question – Glen_b Jun 18 '18 at 17:14
  • @Glen_b, sorry, I still do not understand. You seem to just state the fact but not explain how it arises. – Richard Hardy Jun 19 '18 at 08:03
  • minimizing the loss function would maximize the likelihood; anything else would not. – Glen_b Jun 19 '18 at 09:51
  • @Glen_b, OK, but the user does not care about the likelihood, the user cares about the loss he/she is about to incur. I was trying to get at it in [this](https://stats.stackexchange.com/questions/311417/estimator-that-is-optimal-under-all-sensible-loss-evaluation-functions) thread, and it was pointed out to me that an estimator that is optimal under all loss functions does not exist. Hence, we need to tailor estimators to loss functions. I presume the same holds for forecasts in place of estimators. – Richard Hardy Jun 19 '18 at 10:26
  • .... and we have cycled all the way back to "how are you generating the distribution for your forecasts?" .... – Glen_b Jun 19 '18 at 13:03
  • @Glen_b, I think now I see what you are getting at. But since we need to tailor forecasts to loss functions, I am conjecturing that a density forecast is irrelevant given a point forecast if the latter is tailored to the loss function, regardless of what the density forecast is tailored to. Does that make sense? – Richard Hardy Jun 19 '18 at 13:09
  • Your discussion appears to mirror the notion of a sufficient statistic to a large degree. I think it makes a lot of sense that you don’t need the density (which collects all information) but just a (sufficient) summary of it if that’s your aim. And just like sufficient statistics are different for different parameters precisely what summary of the density you need will vary. – hejseb Jun 20 '18 at 15:43
  • @hejseb, thank you for a very interesting parallel. So, any ideas for an answer? – Richard Hardy Jun 20 '18 at 16:13
  • @Glen_b, regarding how I would generate the density forecast: I think I would tailor it to the loss function, that is only logical. – Richard Hardy Jun 21 '18 at 11:20
  • Okay, but *how*? I had planned to try to suggest some potential answers to your question but it's difficult to be explicit when the question is so vague with the details. – Glen_b Jun 21 '18 at 18:20
  • @Glen_b, The question is intentionally generic, though I understand some level of specificity is needed to make it answerable. The sources I have read (e.g. Elliott & Timmermann, 2016a) suggest that density forecasts are superfluous in presence of a known loss function, but they do not give the details. I presume this is correct and so the details should not be too difficult to work out. If you have an answer in mind, could you just lay out your assumptions and the answer itself? It would be easier to discuss then and it could help me formulate the question better if there is a need for that. – Richard Hardy Jun 21 '18 at 19:11
  • Surely it depends on the purposes of the forecasting. If I am an insurer forecasting liabilities, I'll be forecasting the mean (required in some jurisdictions -- so presumably by their lights I should use a quadratic loss in that case), but to make sure I have sufficient capital - perhaps I use calculations of VaR and TailVaR; that I have reinsurance arrangements that actually serve to improve my long term probability of ruin; or to find out the probability that I'll be able to pay a dividend to shareholders, then I need to know more than the mean. ... ctd – Glen_b Jun 22 '18 at 00:59
  • ctd... Assuming I want all those purposes to be consistent (i.e. I don't want to use different sets of predictions that may not be consistent in what they tell me so that I don't end up concluding I have a high risk of failure, requiring more capital and stronger reinsurance arrangements *and* concluding I can afford to pay a fat dividend.), then I'm interested in the distribution of future liability (spread, tail heaviness etc), not just the mean. Without context for the claim, the question seems impossibly broad. I don't see how it could possibly be a reasonable claim in general. – Glen_b Jun 22 '18 at 00:59
  • @Glen_b, All of that just defines your loss function. Once it is defined, you can find a point forecast that will suggest the same action as a density forecast. The remaining question is, can you solve for the point forecast without knowing the form of the density (this holds, for example, under square loss), or do you have to model the density first. For example, under square loss a density forecast that puts a point mass at the mean will be optimal, and of course it coincides with the optimal point forecast. Hence, there is no need for a density forecast there, given a point forecast. – Richard Hardy Jun 22 '18 at 06:07
  • How can a single point forecast simultaneously tell me my mean reserve, the 99% VaR, the 99.5% TailVaR, and the effect of my reinsurance on the probability of my future solvency? – Glen_b Jun 22 '18 at 06:27
  • @Glen_b, They cannot, but why would you need that? All of these things are combined into one via the loss function (which is scalar valued), so you only need to know the combination to determine which action (or equivalently, which point forecast) will bring you the smallest loss. I think I have figured it out. The question can be split into two parts: (1) Does a perfect density forecast have added value over a perfect point forecast under particular loss? and (2) What is the effect of forecasts being imperfect? I think the answer to (1) (which seems to be the more confusing part) is "No". – Richard Hardy Jun 22 '18 at 06:50
  • ...ctd: (At least in "well-behaved" situations.) The answer to (2) has been indicated in my answer below and suggests the advantage of point forecasts over density forecasts when it comes to coming up with actual (imperfect) forecasts. – Richard Hardy Jun 22 '18 at 06:53
  • I must be missing something -- I'm afraid I don't follow the first part of your response above at all. The issue I was getting at is that in practice forecasts have multiple uses and a distribution of future values can address all of them, but I don't see how a scalar forecast does so. Of course if you're allowing a vector of quantities of interest to be forecast, you might well ask "why forecast the whole distribution when you're just using it to produce these quantities?" – Glen_b Jun 22 '18 at 06:59
  • @Glen_b, I will think more and try to explain better. Thank you for your input so far! – Richard Hardy Jun 22 '18 at 07:27

2 Answers

I can think of one-and-a-half more or less realistic situations where a full density is better than a point forecast, even if the loss function is known.

  • The nitpicky situation is the one where the user's loss function depends not only on the point forecast, but on a two-sided prediction interval, or even the entire density, i.e., the loss function is a scoring rule.

    Yes, a loss function is typically defined to depend on a single point forecast, so I'm loose with nomenclature here. Nevertheless situations like these do occur, e.g., in financial volatility forecasting. Or where I work, in retail replenishment forecasting: we may want to achieve a 95% service level, so on the face of it, we may only be interested in that (point) quantile forecast. However, a 95% quantile forecast may be 4, while we may be constrained to replenish in pack sizes of 8. In such a situation, it can be valuable to know what percentage 8 units correspond to.

  • The more relevant situation is one where we are interested in functions of predictive densities. Again, consider retail forecasting: because of the delivery schedule, our replenishment order may need to cover three days, Tuesday to Thursday. However, we forecast on daily granularity. So we may be interested in the 95% quantile forecast of the sum of the demands, and for the convolution, we need the full densities. (We could also try to forecast on three-day bucket granularity, but that becomes problematic if, say, a promotion starts in the middle of the bucket.)
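The convolution point can be sketched as follows. This is only an illustration under assumed numbers: the three daily density forecasts are taken to be independent Poisson distributions with made-up means, and the helper names `poisson_pmf` and `quantile_from_pmf` are hypothetical. It compares the 95% quantile of the three-day sum (which needs the full daily densities) with what one would get from adding the three daily 95% quantile forecasts.

```python
from math import exp, factorial

import numpy as np


def poisson_pmf(mean, size):
    """Poisson probabilities for the values 0, 1, ..., size - 1."""
    return np.array([exp(-mean) * mean**k / factorial(k) for k in range(size)])


def quantile_from_pmf(pmf, level):
    """Smallest value whose cumulative probability reaches `level`."""
    return int(np.searchsorted(np.cumsum(pmf), level))


# Assumed daily density forecasts (Tuesday, Wednesday, Thursday), independent.
daily_means = [3.0, 5.0, 2.0]
daily_pmfs = [poisson_pmf(m, 40) for m in daily_means]  # support truncated at 39

# Density of the three-day sum: convolve the daily densities.
sum_pmf = daily_pmfs[0]
for pmf in daily_pmfs[1:]:
    sum_pmf = np.convolve(sum_pmf, pmf)

q_sum = quantile_from_pmf(sum_pmf, 0.95)
q_daily_added = sum(quantile_from_pmf(p, 0.95) for p in daily_pmfs)

print("95% quantile of the three-day sum:", q_sum)
print("sum of the daily 95% quantiles:   ", q_daily_added)
```

Under these assumptions, adding the daily quantile forecasts overstates the three-day requirement, which is why the daily point forecasts alone do not suffice here.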

Stephan Kolassa
  • Thank you for your answer. I am thinking in decision-theoretic terms as follows. A user chooses an action to maximize expected utility (negative expected loss). The choice is based on the forecast. Given a density forecast, a user can calculate the expected utility of a particular action by integrating utility of that action over the predicted density of the outcome. Then he/she chooses the action (among all possible ones) that maximizes this expected utility. If the utility function has a unique maximum (loss function has a unique minimum), the optimal action is unique. – Richard Hardy Jun 18 '18 at 11:27
  • Crucially, there exists a point in the outcome distribution that yields exactly the same expected utility as above, and that point defines the target of the "relevant" point forecast. Hence, the user will get exactly the same maximized (over all possible actions) expected utility regardless of whether the forecast he gets is a density forecast or the "relevant" point forecast, provided the quality of the two forecasts is "equally good". Does any of your examples violate this? – Richard Hardy Jun 18 '18 at 11:28
  • Perhaps your example of [an action based on] a prediction interval is a valid counterexample, though I doubt it (based on the decision-theoretic viewpoint above)... I wonder then how I could qualify the statement in the OP (add some condition to it) to make the answer "a density forecast has no added value" correct. Of course, this is not to invalidate your answer, but just for my own sake to understand the situation better. Regarding scoring rules, it would also be interesting to get an example where a scoring rule makes intuitive sense as a loss function for a particular user. (+1) – Richard Hardy Jun 18 '18 at 11:43
  • Actually, a similar argument applies to interval forecasts as to density forecasts. The argumentation in my answer suggests that for a given loss function, an interval forecast will not have added value beyond a relevant point forecast. Regarding loss functions depending on forecasts, this is possible when a user tailors his/her actions to the forecast, but an interval or density forecast can be replaced by a relevant point forecast, and the loss function would be formulated on the point forecast. – Richard Hardy Jun 20 '18 at 08:57
  • I mean the loss inevitably depends on the outcome but not necessarily on an interval or density forecast if the latter can be replaced by a point forecast. The outcome is a fundamental argument to the loss function, but the type of forecast is not if the user is allowed to choose between different types of forecasts (point vs. interval vs. density). Hence your first example is not a valid counterexample. Your second example does not seem to be valid either since as you say, we could also try to forecast on three-day bucket granularity. – Richard Hardy Jun 20 '18 at 09:03
  • Regarding my first example: I agree that we can simply forecast the profit or loss from each (here: discrete) potential action (replenish 0 packs of 8, replenish 1 pack, replenish 2 packs etc.). Insofar, you are correct. – Stephan Kolassa Jun 29 '18 at 09:40

Background (may be skipped)

I will be thinking in decision-theoretic terms as follows. A user must choose an action $a$ among a set of possibilities $A$. The action will bring him/her some "utility" (a notion commonly used in economics) $u(a;s)$ depending on the state of nature $s$ that will be realized in the future, where $s \in S$, a set of all possible states. (Utility is basically the negative of loss, and what follows could be reformulated equivalently either in terms of utility or loss.) The user aims at maximizing the expected utility (or equivalently, minimizing the expected loss) w.r.t. the action, $$ \max_{a \in A} \mathbb{E}_{S} u(a;s). $$

The choice of action is based on the forecast of the state of nature to be realized. Given a density forecast $\hat f_S(\cdot)$, a user can calculate the expected utility of a particular action by integrating the utility of that action over the predicted distribution of the states of nature, $$ \mathbb{E}_{\hat S} u(a;s) = \int u(a;s) \hat f_S(s) ds. $$ Then he/she chooses the action (among all possible ones) that maximizes this expected utility, $\hat a^* := \arg\max_{a \in A} \mathbb{E}_{\hat S} u(a;s)$. The expected value of utility at this action, for this density forecast, is $\hat u^* := \mathbb{E}_{\hat S} u(\hat a^*;s)$.

If the utility function has a unique maximum (loss function has a unique minimum), the optimal action is unique. If the state of nature is a continuous random variable, there exists a point in the distribution (a state of nature) that yields exactly $\hat u^*$. That point defines the target of the "relevant" point forecast. Hence, the user will get exactly the same maximized (over all possible actions) expected utility regardless of whether the forecast he/she gets is a density forecast or the "relevant" point forecast (a unit probability mass on a certain state of nature), provided the quality of the two forecasts is "equally good" (the easiest way to understand the latter intuitively is to consider the case where both the point and the density forecast are perfect).
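The procedure above can be sketched numerically. Everything specific here is an assumption made for illustration: the density forecast is a normal density on a grid, the utility is a hypothetical asymmetric piecewise-linear function (under-forecasting costs 3 per unit, over-forecasting 1 per unit), and the integral is approximated by a simple Riemann sum. For this particular utility, the optimal action is known to be the 0.75 quantile of the forecast distribution, so that quantile is the "relevant" point forecast.

```python
import numpy as np

# Grid of states of nature s and an assumed density forecast f_hat:
# normal with mean 2 and standard deviation 1.5.
s = np.linspace(-10.0, 14.0, 4801)
ds = s[1] - s[0]
f_hat = np.exp(-0.5 * ((s - 2.0) / 1.5) ** 2) / (1.5 * np.sqrt(2.0 * np.pi))


def utility(a, s):
    """Hypothetical utility: a shortfall costs 3 per unit, an excess 1 per unit."""
    return -np.where(s > a, 3.0 * (s - a), 1.0 * (a - s))


# Expected utility of each candidate action, integrated over the density forecast.
actions = np.linspace(-2.0, 8.0, 1001)
expected_utility = np.array([np.sum(utility(a, s) * f_hat) * ds for a in actions])

a_star = actions[np.argmax(expected_utility)]  # optimal action given the density
u_star = expected_utility.max()                # maximized expected utility

# The "relevant" point forecast for this utility: the 3 / (3 + 1) = 0.75 quantile.
cdf = np.cumsum(f_hat) * ds
q75 = s[np.searchsorted(cdf, 0.75)]

print("optimal action from the density forecast:", a_star)
print("0.75 quantile of the density forecast:   ", q75)
```

Up to grid error, the two numbers coincide: a user handed only the 0.75 quantile would take the same action as a user handed the whole density.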

Main part (see background for more details)

I think it is reasonable to assume that the usefulness of a forecast is fully reflected by the loss it causes a given user. Then the aim of a user is to choose a forecast that minimizes the expected loss. Hence, given a predicted distribution, the user will take a concrete function thereof (e.g. the predicted mean) that minimizes the expected loss. The rest of the predicted density will not have any added value to the user.

If the expected loss has a unique minimum, the function will be single-valued, and that value will be the point forecast relevant for the user. For example, if the user's loss function is quadratic (so that the expected loss has a unique minimum at the mean of the true distribution), he/she will only care about the forecast of the mean. If another user is facing absolute loss (for which the expected loss is minimized at the median of the true distribution), he/she will only care about the forecast of the median. Providing a density forecast for either of these users in addition to forecasts of mean and median, respectively, will be of zero added value to them.
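A short derivation of these two examples, in notation that is not in the text above: let $Y$ be the outcome with mean $\mu$ and continuous cdf $F$, and let $a$ be the point forecast. Under quadratic loss, $$ \mathbb{E}(Y-a)^2 = \mathbb{E}(Y-\mu)^2 + (\mu-a)^2, $$ which is uniquely minimized at $a=\mu$, since the first term does not depend on $a$. Under absolute loss, $$ \frac{d}{da}\,\mathbb{E}|Y-a| = F(a) - \bigl(1-F(a)\bigr) = 2F(a)-1, $$ which is zero when $F(a)=1/2$, i.e. at the median (unique if $F$ is strictly increasing there). In both cases the minimizer depends on the distribution only through a single functional of it.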

Elliott and Timmermann (2016a) write on pp. 423-424 (regarding evaluation of density forecasts):

One way to [evaluate a density forecast] would be to convert the density forecast into a point forecast and use the methods for point forecast evaluation. This simple approach to evaluating density forecasts might be appropriate for a number of reasons. <...> [D]ensity forecasts can be justified on the grounds that there are multiple users with different loss functions. Any one of these users might examine the performance of a density forecast with reference to the specific loss function deemed appropriate for their problem. The relevant measure of forecast performance is the average loss calculated from each user’s specific loss function.

Moreover, given a known loss function, a density forecast may even be inferior to a relevant point forecast, for the following two reasons. First, density forecasts are typically more difficult to produce than point forecasts. Second, they might trade off precision/accuracy at a particular point (say, mean or median) for precision/accuracy across the whole distribution that is being predicted. That is, if one is predicting the whole density, one might have to sacrifice some precision/accuracy for the forecast of the mean so as to get greater precision/accuracy elsewhere. As Elliott and Timmermann (2016b) write,

[T]he relationships between the scoring rules popular in the literature and the underlying loss functions for individual users is not clear. Thus, it could well be that the scoring rule used provides a poor estimate of the feature of the conditional distribution that some users wish to construct.

A similar quote can be found in Elliott and Timmermann (2016a), pp. 277-278:

It would seem that provision of a predictive density is superior to reporting a point forecast since it both (a) can be combined with a loss function to produce any point forecast; and (b) is independent of the loss function. In classical estimation of the predictive density, neither of these points really holds up in practice. <...> [I]n the classical setting the estimated predictive distributions depend on the loss function. All parameters of the predictive density need to be estimated and these estimates require some loss function, so loss functions are thrown back into the mix. The catch here is that the loss functions that are often employed in density estimation do not line up with those employed for point forecasting which can lead to inferior point forecasts. <...> Moreover, conditional distributions are difficult to estimate well, and so point forecasts based on estimates of the conditional density may be highly suboptimal from an estimation perspective.

Hence, when a loss function is given, it might make sense to focus on forecasting the particular point tailored to the loss function rather than attempt to forecast the whole distribution. This might be easier to do and/or more accurate.
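A small simulation can illustrate how going through an estimated density may hurt the point forecast. It is a sketch under strong assumptions, not a general result: the data are simulated exponential outcomes, the user is assumed to face a quantile (pinball) loss at level 0.95, and the density route is deliberately misspecified (a normal density fitted by maximum likelihood). With a well-specified density model the gap could vanish or reverse.

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.95  # assumed quantile level of the user's loss

# Simulated skewed outcomes: one sample to build forecasts, one to evaluate them.
train = rng.exponential(scale=1.0, size=200_000)
test = rng.exponential(scale=1.0, size=200_000)


def pinball_loss(y, q, tau):
    """Average quantile (pinball) loss of the point forecast q at level tau."""
    diff = y - q
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))


# Route 1: estimate a density first (normal, fitted by maximum likelihood),
# then read the 0.95 quantile off the fitted density.
z95 = 1.6448536269514722  # 0.95 quantile of the standard normal
q_via_density = train.mean() + z95 * train.std()

# Route 2: target the point directly with the sample 0.95 quantile,
# which minimizes the in-sample pinball loss.
q_direct = np.quantile(train, tau)

loss_via_density = pinball_loss(test, q_via_density, tau)
loss_direct = pinball_loss(test, q_direct, tau)

print("via fitted density:", q_via_density, loss_via_density)
print("direct quantile:   ", q_direct, loss_direct)
```

The fitted density is tuned to the likelihood rather than to the user's loss, so the quantile derived from it is evaluated under a criterion it was not estimated for.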

A critical question to myself: could it be that the "relevant" point forecast cannot be expressed as a single function of the unknown density, but is instead a different function (not merely a different value) for different densities? Then a density forecast would be needed to find out which point forecast one is interested in, making a density forecast an inevitable step in the point forecasting process.

References:

Richard Hardy
  • Another relevant reference is Granger and Machina (2006). According to Hall & Mitchell's chapter 5 on density forecasting in Patterson & Mills (editors) "Palgrave Handbook of Econometrics" (2009), they *show that, under conditions on the second derivative of the loss function, there is always some point forecast which leads to the same loss as if the decision maker had minimized loss given the density forecast.* Given more time, this could be incorporated in the answer. – Richard Hardy Sep 27 '21 at 09:53