How to optimize ratiometric loss function with variance term in it?

Question

I'm training a neural network (or any ML model with non-convex gradient-based optimization) to predict a continuous outcome variable. Currently, I use the mean squared error loss function, i.e., if $y$ is the true outcome and $\hat{y}$ is the model prediction, I minimize the expected loss $$\text{E}[(y-\hat{y})^2]$$

However, the metric I really care about maximizing is

$$\frac{\text{E}[y\hat{y}]}{\sqrt{\text{Var}{(y\hat{y})}}}$$

Using this (or its negative) as a loss function presents two problems:

1. the non-differentiability of the ratio at $0$, and
2. $\text{Var}(y\hat{y})$ depends on the second moment of the model predictions and true outcome, so it cannot be computed for a single data point $(x_i, y_i)$ from the training or validation set.

Is there a way to approximate this loss function as a linear combination of moments of $y$ and $\hat{y}$ that I can optimize instead? In general, I am trying to get a better analytical understanding of this custom metric. How does it penalize the bias and variance of the model? Where is it irregular? Is there another way to write this function that is equivalent for optimization but simpler?

Why would you want to use a single training point anyway? You can just use the sample estimates within the framework of stochastic optimization. Non-differentiability should not be a problem in practice, unless your sample consists entirely of the same pair of values. — deasmhumnha, Oct 26 '18 at 19:41
The data is not i.i.d., so a sample estimate is not easy to bootstrap. Plus, I want to be able to update parameters based on the gradient calculated at a single data point — adpbw, Oct 26 '18 at 21:00
But how do you plan on calculating moments from a single point regardless? And what do you hope to gain? A single-point estimate of an expectation is generally very poor. By sample estimate, I meant the batch estimate of the expectation. No bootstrapping required. The non-i.i.d. data isn't a problem. Both the MSE and your custom loss are just functions of the sample data and can be easily optimized in any ML framework. Have you tried using this loss in a toy problem? — deasmhumnha, Oct 27 '18 at 23:36
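
For illustration, the batch estimate suggested in the comments could look roughly like the sketch below. It is not from the thread: the function name `ratio_loss`, the small constant `eps`, and the toy numbers are all assumptions made here, and plain Python is used only to keep the example self-contained. In an autodiff framework the same arithmetic would be written with tensor operations.

```python
import math

def ratio_loss(y, y_hat, eps=1e-12):
    """Negative batch estimate of E[y*y_hat] / sqrt(Var(y*y_hat)).

    `eps` is an assumed stabilizer: it keeps the denominator positive
    when every product in the batch is identical (zero sample variance).
    """
    if len(y) != len(y_hat) or len(y) < 2:
        raise ValueError("need two equal-length batches with at least 2 points")
    prod = [a * b for a, b in zip(y, y_hat)]
    n = len(prod)
    mean = sum(prod) / n
    # Biased (1/n) sample variance of the products over the batch.
    var = sum((p - mean) ** 2 for p in prod) / n
    # Negated so that minimizing the loss maximizes the ratio.
    return -mean / math.sqrt(var + eps)

# Toy batch (made-up numbers, for illustration only).
y = [1.0, -2.0, 0.5, 3.0]
y_hat = [0.8, -1.5, 0.1, 2.0]

loss = ratio_loss(y, y_hat)
# Rescaling the predictions by a positive constant leaves the ratio
# unchanged (up to the tiny effect of eps).
scaled = ratio_loss(y, [10.0 * p for p in y_hat])
```

Because the ratio is unchanged when $\hat{y}$ is multiplied by a positive constant, this loss on its own does not pin down the scale of the predictions, which may be relevant to the question of how it penalizes bias.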

