Because the main problem concerns applying a fully general and abstract formula to a somewhat complicated model (regression), let's address it by examining a simple concrete case. Ordinary regression is a good choice because it is well known, well understood, and serves as the archetype of all more complex regression models. But even this comes in several "flavors." The flavor most relevant for prediction is the one in which the values of $p$ separate regressor ("independent") variables are specified by the experimenter, whose objective is to predict a random response whose distribution depends on those regressors. (As usual, one of these $p$ regressors may take on a constant value.)
The standard notation for this is that vectors of regressor values, $x_1, x_2, \ldots, x_n$ are available (as data). They have been measured precisely along with corresponding responses $y_i$. A model for these responses is that each $y_i$ is an independent realization of a Normal variable with variance $\sigma^2$ and mean $x_i\beta$. (Each $x_i$ is a $p$-covector and $\beta=(\beta_1,\ldots,\beta_p)^\prime$ is a $p$-vector.)
Let's review: the values of the $x_i$ are known and not modeled as random variables; the values of the $y_i$ are modeled as realizations of random variables (which we could roll into an $n$-vector $y=(y_1,\ldots,y_n)^\prime$); and the value of the parameter $\theta=(\beta_1,\ldots,\beta_p,\sigma)$ is unknown.
Suppose the objective is to predict the mean response $y_0 = x_0\beta$ at a regressor covector $x_0$. One standard method is to predict it as $$\hat y_0 = x_0\hat\beta$$ where $$\hat\beta = (X^\prime X)^{-}X^\prime y \tag{1}$$ and I have let $X$ be the "model matrix" obtained by stacking all $n$ of the covectors $x_i$ into an $n\times p$ matrix.
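As a minimal numerical sketch of how equation $(1)$ and the prediction $\hat y_0$ could be computed, here is some Python (all of the specific numbers are illustrative choices of mine, not anything given in the question; `numpy.linalg.pinv` plays the role of the generalized inverse $(X^\prime X)^{-}$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: n = 50 observations, p = 3 regressors,
# with the first regressor held constant at 1.
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # the n x p model matrix
beta = np.array([2.0, -1.0, 0.5])                               # "true" beta (unknown in practice)
sigma = 0.7
y = X @ beta + rng.normal(scale=sigma, size=n)                  # simulated responses

# Equation (1): beta_hat = (X'X)^- X' y, using a generalized inverse.
beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ y

# The prediction at a new regressor covector x0.
x0 = np.array([1.0, 0.3, -1.2])
y0_hat = x0 @ beta_hat
```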
Let's pause for a moment to observe that the model, the model matrix $X$, and the covector $x_0$ completely determine the distribution of $\hat y_0$. This is because (a) the independence of the $y_i$ gives $y$ an $n$-variate Normal distribution; (b) its mean is given by $X\beta$; and (c) its covariance matrix is $\sigma^2$ times the $n\times n$ identity matrix.
What is not routinely specified is the loss function $L$. It measures the cost to our client when they act as if the correct value of $y_0$ were $\hat y_0$. Because it can depend on both $y_0$ and $\hat y_0$, it is formally written $L(y_0, \hat y_0)$. Often the loss is taken to be the squared difference, $L(u,v)=(u-v)^2$. In general, loss functions might as well be zero when $u=v$ (you can't do any better than that) and increase as $u$ and $v$ get further apart. (In the generic notation of the question, the procedure that guesses $\hat y_0$ from the data is called $\delta$, and "$x$" refers to the data, which in our application are $X$, $x_0$, and $y$.)
If you want to unwrap the preceding formulas, you could expand this out as
$$L(y_0, \hat y_0) = (y_0-\hat y_0)^2 = (x_0\beta - x_0 (X^\prime X)^{-}X^\prime y)^2.$$
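To emphasize that this loss is a function of the data $y$ (with $X$, $x_0$, and $\beta$ held fixed), here is a hedged one-function sketch of the expanded formula, in the same illustrative Python as above:

```python
import numpy as np

def squared_loss(y, X, x0, beta):
    """L(y0, y0_hat) = (x0 beta - x0 (X'X)^- X' y)^2, viewed as a function of y."""
    y0_hat = x0 @ np.linalg.pinv(X.T @ X) @ X.T @ y   # the prediction x0 beta_hat
    return (x0 @ beta - y0_hat) ** 2                  # compare it with the mean response x0 beta
```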
Because we model $y$ as a multivariate Normal vector, this loss is a random variable. Its expectation is taken with respect to the distribution of $y$. The expected loss is the risk of our procedure. It depends on the (unknown) parameter $\theta$ and on the procedure itself. Since we're talking about a definite procedure based on equation $(1)$, it really is just a function of $\theta$:
$$R(\theta) = E\left[\left(x_0\beta - x_0 (X^\prime X)^{-}X^\prime y\right)^2\right].$$
Since the quantity whose expectation is taken is a random variable whose distribution is completely determined by $\theta$ (with $X$ and $x_0$ held fixed), this all makes sense and is well defined. We could even write the risk out explicitly in terms of $X$, $x_0$ (specified constants all), and $\theta$.
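As a sanity check (not something given in the question), here is a minimal Monte Carlo sketch using the same illustrative constants as before and assuming $X^\prime X$ is invertible: the average loss over many simulated $y$ vectors should approach the explicit risk, which for this unbiased procedure is just the variance of $\hat y_0$, namely $\sigma^2 x_0 (X^\prime X)^{-1} x_0^\prime$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative constants (my own choices, not part of the question).
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # model matrix, full column rank
x0 = np.array([1.0, 0.3, -1.2])                                  # new regressor covector
beta = np.array([2.0, -1.0, 0.5])                                # the beta part of theta
sigma = 0.7                                                      # the sigma part of theta

XtX_inv = np.linalg.inv(X.T @ X)

# Monte Carlo estimate of R(theta) = E[(x0 beta - x0 (X'X)^- X' y)^2].
reps = 20_000
losses = np.empty(reps)
for r in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)   # one realization of the response vector
    y0_hat = x0 @ XtX_inv @ X.T @ y                  # the prediction x0 beta_hat
    losses[r] = (x0 @ beta - y0_hat) ** 2            # the realized loss

risk_mc = losses.mean()
risk_explicit = sigma**2 * x0 @ XtX_inv @ x0         # sigma^2 x0 (X'X)^{-1} x0'

print(risk_mc, risk_explicit)   # the two numbers should agree closely
```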
Incidentally, for the "expected prediction error" referenced in the question, where $L(u,v)=v-u$, it's easy to show in this case that the risk is zero.
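For what it's worth, here is a sketch of that computation, assuming $X^\prime X$ is invertible so that $E(\hat\beta)=\beta$:

$$R(\theta) = E\left(\hat y_0 - y_0\right) = E\left(x_0 (X^\prime X)^{-1}X^\prime y\right) - x_0\beta = x_0 (X^\prime X)^{-1}X^\prime X\beta - x_0\beta = 0.$$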