
I have to find an answer to the following question, but I am struggling:

Consider a leaf of a decision tree that consists of object-label pairs $(x_{1}, y_{1}), \dots, (x_{n}, y_{n})$.

The prediction $\hat{y}$ of this leaf is defined to minimize the loss on the training samples.

Find the optimal prediction in the leaf, for a regression tree, i.e. $y_{i} \in \mathbb{R}$, and squared percentage error loss $\mathcal{L}(y, \hat{y}) = \cfrac{\left(y - \hat{y} \right)^{2}}{y^2}$.

I tried simply taking the derivative of the loss function and setting it to 0, which only yields $\hat{y}=y$; this cannot be the result. Intuitively, something like the mean value of the observations present in the leaf should come out, right?

ghxk
  • Hint: yes, if there is only a single $y$ in the terminal leaf, then of course the optimal prediction is equal to that value: $\hat{y}=y$. However, we will usually have *multiple* training samples $y_1, \dots, y_n$ in the terminal leaf, which we want to summarize using a single prediction value $\hat{y}$. Which will minimize the loss? (This will depend on how the loss is summarized over multiple training observations; you can probably assume it's averaged. Also, let's hope very much that none of the $y_i=0$.) – Stephan Kolassa Dec 13 '21 at 07:38
  • Incidentally, that the answer will probably *not* be the mean of the training observations is related to [one issue with the Mean Absolute Percentage Error (MAPE)](https://stats.stackexchange.com/q/299712/1352), which is quite similar to your loss function (just without the squaring). – Stephan Kolassa Dec 13 '21 at 07:52
  • Thx @StephanKolassa. Having read the article, you mean that the problem is that my loss function is not differentiable for $y_i = \hat{y_i}$? – ghxk Dec 13 '21 at 08:20
  • No, your loss is quite nicely differentiable, in contrast to the MAPE. But note that there is no subscript $i$ for your prediction $\hat{y}$. You have *multiple* training instances $y_1, \dots, y_n$ in your leaf and want to summarize them with a *single* prediction $\hat{y}$. Just set up the loss function, *averaging* over the training samples, and differentiate. – Stephan Kolassa Dec 13 '21 at 08:26
  • @Stephan Kolassa: What do you mean by "averaging over the training samples"? If I just differentiate the sum of my loss function with respect to $\hat{y}$, I get exactly $\hat{y} = \sum\limits_{i = 1}^n y_i/n$ – ghxk Dec 13 '21 at 08:32
  • Sorry to contradict you, but no, you don't. Go through your differentiation once more. – Stephan Kolassa Dec 13 '21 at 08:34
  • Sorry for not seeing your point, but what should be wrong in my following derivation? $\frac {\partial \sum\limits_{i = 1}^n \frac{(y_i-\hat{y})^2}{y_i^2}}{\partial \hat{y}} = \sum\limits_{i = 1}^n \frac{-2(y_i-\hat{y})^2}{y_i^2} = 0 $ Multiplying both sides with $-\sum\limits_{i = 1}^n y_{i}^2$ gives $\sum\limits_{i = 1}^n 2(y_i-\hat{y})$. Solving for $\hat{y}$ gives the above mentioned. – ghxk Dec 13 '21 at 08:52
  • Your differentiation under the sum is incorrect: $$\frac{d}{d\hat{y}}\frac{(y_i-\hat{y})^2}{y_i^2} = \frac{-2(y_i-\hat{y})}{y_i^2}.$$ You have a power of $2$ in the numerator that should go away when you differentiate. Can you take it from there? – Stephan Kolassa Dec 13 '21 at 08:55
  • Yes, sorry, that was a typo. So the correct differentiation is $\frac {\partial \sum\limits_{i = 1}^n \frac{(y_i-\hat{y})^2}{y_i^2}}{\partial \hat{y}} = \sum\limits_{i = 1}^n \frac{-2(y_i-\hat{y})}{y_i^2} = 0 $. But when I solve this, I still get the mean... as far as I see it ;( – ghxk Dec 13 '21 at 09:12

1 Answer


Let us assume that we have training observations $y_1, \dots, y_n$ in the leaf, all of which are nonzero. Let us further assume that we summarize losses by summing them, which has the same minimizer as averaging them (the two differ only by a constant factor $1/n$): $$ L = \sum_{i=1}^n \frac{(y_i-\hat{y})^2}{y_i^2}. $$

To minimize the loss, we take the derivative with respect to $\hat{y}$ and set it to zero: $$ \frac{d}{d\hat{y}}L = \frac{d}{d\hat{y}}\sum_{i=1}^n \frac{(y_i-\hat{y})^2}{y_i^2} = \sum_{i=1}^n \frac{-2(y_i-\hat{y})}{y_i^2}\stackrel{!}{=}0, $$ or $$ 0 = \sum_{i=1}^n \frac{y_i-\hat{y}}{y_i^2} = \sum_{i=1}^n \frac{y_i}{y_i^2}-\sum_{i=1}^n \frac{\hat{y}}{y_i^2} = \sum_{i=1}^n \frac{1}{y_i}-\hat{y}\sum_{i=1}^n \frac{1}{y_i^2}, $$ resulting in $$ \hat{y}=\frac{\sum_{i=1}^n \frac{1}{y_i}}{\sum_{i=1}^n \frac{1}{y_i^2}}. $$ Since $\frac{d^2}{d\hat{y}^2}L = \sum_{i=1}^n \frac{2}{y_i^2} > 0$, this stationary point is indeed the minimum.
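
In other words, the optimal prediction is a weighted mean of the observations with weights $w_i = 1/y_i^2$: $$ \hat{y} = \frac{\sum_{i=1}^n w_i\, y_i}{\sum_{i=1}^n w_i}, \qquad w_i = \frac{1}{y_i^2}. $$ Observations that are small in absolute value therefore receive very large weights and pull the prediction strongly towards themselves.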

As an example, let us simulate $y_1, \dots, y_n\sim U[0,1]$ and find the optimal $\hat{y}$ both numerically and using our formula:

nn <- 100
set.seed(1)      # for reproducibility
yy <- runif(nn)  # simulate y_1, ..., y_n ~ U[0, 1]

# numerical minimizer of the summed loss over yhat in [0, 1]
yhat_numerical <- optimize(f = function(yhat) sum((yhat - yy)^2 / yy^2), interval = c(0, 1))$minimum
# closed-form solution derived above
yhat_theory <- sum(1/yy) / sum(1/yy^2)

Happily, both agree:

> yhat_numerical 
[1] 0.04473853
> yhat_theory 
[1] 0.04473853
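
Since the closed form is just a weighted mean with weights $1/y_i^2$, another quick cross-check is R's built-in weighted.mean(), applied to the yy vector from above:

weighted.mean(yy, w = 1/yy^2)   # same value as yhat_theory (up to floating-point rounding)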

Also, the optimal prediction is far away from the "intuitive" prediction, which is just the mean of the training samples:

> mean(yy)
[1] 0.5178471
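
To quantify the gap, we can evaluate the summed loss at both candidate predictions, reusing yy and yhat_theory from above (with a small helper function for the loss); since $\hat{y}$ is the unique minimizer of this strictly convex loss, the mean is guaranteed to do worse:

# helper: summed squared percentage error loss on the leaf
loss <- function(yhat) sum((yhat - yy)^2 / yy^2)
loss(yhat_theory)   # loss at the derived optimum
loss(mean(yy))      # loss at the plain mean: strictly larger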

This illustrates that the "best" point forecast depends on the error or accuracy measure (Kolassa, 2020, *International Journal of Forecasting*).

Stephan Kolassa