I'm training a neural network (or any ML model with non-convex gradient-based optimization) to predict a continuous outcome variable. Currently, I use the mean squared error loss function, i.e., if $y$ is the true outcome and $\hat{y}$ is the model prediction, I minimize the expected loss $$\text{E}[(y-\hat{y})^2]$$
However, the metric I actually care about maximizing is
$$\frac{\text{E}[y\hat{y}]}{\sqrt{\text{Var}{(y\hat{y})}}}$$
Using this (or its negative) as a loss function presents two problems:
- the non-differentiability of the ratio at $0$, and
- $\text{Var}(y\hat{y})$ depends on second moments of the model predictions and the true outcome, so it cannot be computed from a single data point $(x_i, y_i)$ of the training or validation set.
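To make the second point concrete, here is a minimal sketch of how the metric could be estimated over a whole batch rather than per data point, using $\text{Var}(y\hat{y}) = \text{E}[y^2\hat{y}^2] - (\text{E}[y\hat{y}])^2$. The function name `batch_metric` and the `eps` stabilizer are assumptions introduced for illustration only; `eps` is just one possible way to keep the ratio finite when the sample variance is zero.

```python
import numpy as np

def batch_metric(y, y_hat, eps=1e-12):
    """Sample estimate of E[y*y_hat] / sqrt(Var(y*y_hat)) over a batch.

    `eps` is an assumed stabilizer (not part of the metric itself) that
    keeps the denominator away from zero.
    """
    # Per-sample products y_i * y_hat_i
    p = np.asarray(y, dtype=float) * np.asarray(y_hat, dtype=float)
    first_moment = p.mean()           # estimate of E[y * y_hat]
    second_moment = (p ** 2).mean()   # estimate of E[y^2 * y_hat^2]
    # Var = E[p^2] - E[p]^2, clipped at 0 to guard against round-off
    var = max(second_moment - first_moment ** 2, 0.0)
    return first_moment / np.sqrt(var + eps)
```

Note that the estimate only makes sense for batches with more than one sample, and that rescaling $\hat{y}$ by a positive constant leaves it (up to `eps`) unchanged.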
Is there a way to approximate this loss function as a linear combination of moments of $y$ and $\hat{y}$ that I can optimize instead? More generally, I am trying to get a better analytical understanding of this custom metric. How does it penalize the bias and variance of the model? Where is it irregular? Is there another way to write this function that is equivalent for optimization purposes but simpler?