
Could it make sense (and if so, under what circumstances) to define a penalized estimator based on one loss function but then select its tuning parameter (say, via cross validation) based on another loss function?
For example, take vanilla LASSO and use MAE (as opposed to MSE) in cross validation to select the optimal tuning parameter; or take a penalized quantile regression at the median and use MSE (as opposed to MAE) in cross validation?
I am trying to imagine a situation where this would be a logical or optimal (in some sense) thing to do. The modelling goal could be prediction, identification of the true data generating process, or yet something else; I am interested in any application that would be sensible.
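
For concreteness, here is a sketch of the first setup I have in mind (the lasso itself is defined through the squared loss, while the tuning parameter is chosen by cross-validated MAE); the data are just a toy illustration:

library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)
y <- drop(x[, 1:3] %*% c(2, -1, 1)) + rnorm(100)

cv_mae <- cv.glmnet(x, y, type.measure = "mae")  # lambda tuned by cross-validated MAE
coef(cv_mae, s = "lambda.min")                   # the lasso fit itself uses squared loss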

What I am not asking about is matching the type of penalty (say, $L_1$ for LASSO, $L_2$ for ridge, etc.) with the type of loss function used to assess model performance (MAE, MSE, etc.).

Richard Hardy

1 Answer


Mismatch is not uncommon

This 'mismatch' occurs whenever a GLM is tuned via cross validation (an example is the R function cv.glm).

  • The GLM is fitted by maximizing the likelihood function.
  • The nuisance (tuning) parameters of the model may be selected by a different function, e.g. $R^2$ or some other measure of performance.

The mismatch occurs whenever the objective used for fitting the coefficients (e.g. the likelihood) differs from the objective used for model selection (e.g. predictive power or some specialized loss function).
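
A minimal sketch of such a mismatch (toy data, only for illustration): the Poisson GLM below is fitted by maximum likelihood, but its cross-validated performance is measured with a different loss, MAE, via the cost argument of cv.glm.

library(boot)

set.seed(1)
x <- rnorm(200)
dat <- data.frame(x = x, y = rpois(200, exp(0.5 + 0.3 * x)))

fit <- glm(y ~ x, family = poisson, data = dat)  # coefficients: maximum likelihood
mae <- function(y, yhat) mean(abs(y - yhat))     # evaluation: mean absolute error
cv.glm(dat, fit, cost = mae, K = 10)$delta[1]    # cross-validated MAE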

Why two different loss functions?

The objective for fitting the coefficients follows the statistical model for the generation of the data and serves to filter the model from the noise (the first cost function weights the noise). The objective for selecting the model may be different and serves to optimize some performance measure (the second cost function weights the bias).

Will the fitted model perform better when we make a fit that is closer to the true data generating model (first loss function), or a fit that is closer to some performance measure (second loss function)?

Whether it is good or bad to have a mismatch between the two loss functions will depend on the relative importance of the deterministic part and the random part of the model.

  1. The deterministic part: How close is the model to the true relationship (the amount of bias)?
  2. The random part: How much noise is present?

Using a different loss function for the fitting than for the model selection emphasizes a better filtering of the model from the noise; this may come at the cost of less reduction of the bias in the model (insofar as this bias can be reduced by choosing different coefficients).
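
As a small illustration of the low-bias case (a toy sketch along the lines of the discussion in the comments below): for normally distributed data the squared-error fit of the centre (the sample mean) is a more efficient estimator than the absolute-error fit (the sample median), so the mean-based predictions should also give the lower out-of-sample mean absolute error.

set.seed(1)
mae_mean <- mae_median <- rep(NA, 1000)
for (i in 1:1000) {
  train <- rnorm(10)     # small training sample
  test  <- rnorm(1e4)    # large test sample on which MAE is measured
  mae_mean[i]   <- mean(abs(test - mean(train)))    # squared-error fit of the centre
  mae_median[i] <- mean(abs(test - median(train)))  # absolute-error fit of the centre
}
c(mean(mae_mean), mean(mae_median))  # the first should be (slightly) smaller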

Reversal of the question

The better question is:

Would it be justified to use the model selection (testing) objective in fitting the model coefficients (training)?

Do we get better performance when we fit the model in training by minimizing the performance objective, instead of fitting it in some more 'natural' way?

For instance, when the data are generated according to some Poisson process, it would be 'natural' to fit them with a Poisson regression. But should we use a least squares fit instead when we are (for whatever reason) measuring performance by the least squares distance?
If little or no bias is expected (i.e. little discrepancy between the true relationship for the mean and the modelled relationship for the mean), then it is better to fit a Poisson regression, even if you wish to minimize the squared error of the predictions.

Two computational examples

The example below uses glmnet in R to compare a penalized Poisson regression with a penalized least squares (Gaussian) regression, both tuned by cross-validated MSE.

The least squares regression 'loses' 65 times out of 100: in those replications the Poisson fit achieves the lower cross-validated squared error.

library(glmnet)

set.seed(1)
n=1000
p=100
nzc=trunc(p/10)

# count how often the Poisson fit beats the Gaussian fit
# (i.e. achieves the lower cross-validated MSE)
Poisson_vs_Gauss = 0


for (i in 1:100) {
  # simulate predictor matrix x and Poisson-distributed response y
  x=matrix(rnorm(n*p),n,p)
  beta=rnorm(nzc)
  fx= x[,seq(nzc)] %*% beta
  mu=exp(fx/10)
  y=rpois(n,mu)

  # penalized fits in two ways: Gaussian (least squares) and Poisson (likelihood),
  # both tuned and evaluated by cross-validated MSE
  cvob1=cv.glmnet(x,y,type.measure="mse")
  cvob4=cv.glmnet(x,y,type.measure="mse",family="poisson")

  # compare the two
  if (min(cvob1$cvm)>min(cvob4$cvm)) {
    Poisson_vs_Gauss = Poisson_vs_Gauss+1
  }
}

Poisson_vs_Gauss   # how often the Poisson fit gave the lower CV MSE

# CV curves of the last replication
plot(log(cvob1$lambda),cvob1$cvm,
     ylim=c(1,1.2),
     xlab = "log(lambda)", ylab = "MSE",
     pch=21,col="black",bg="white",cex=0.7)
points(log(cvob4$lambda),cvob4$cvm,
       pch=21,col="black",bg="gray",cex=0.7)

A problem with this example is that the link functions differ: the Gaussian fit uses an identity link while the Poisson fit uses a log link. I could not find standard libraries that allow fitting a penalized GLM with a custom link function.
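
In principle one could code such a fit by hand. Below is a rough sketch (my own construction, not a standard library routine) of a ridge-penalized Poisson regression with an identity link, obtained by minimizing the penalized negative log-likelihood with optim; the clamp on the mean is a crude way to keep the likelihood finite.

# hand-rolled sketch: ridge-penalized Poisson regression with identity link
pen_pois_identity <- function(x, y, lambda, beta0 = qr.solve(x, y)) {
  obj <- function(beta) {
    mu <- pmax(x %*% beta, 1e-8)   # identity link; clamp to keep mu positive
    -sum(dpois(y, mu, log = TRUE)) + lambda * sum(beta^2)
  }
  optim(beta0, obj, method = "BFGS")$par   # start from the least squares fit
}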

Hence the following example, where the tuning parameter is instead the degree of a polynomial used to model some function, and both fits use the identity link.

Here the least squares regression 'loses' 64 times out of 100 in getting the lowest cross-validated squared error.

library(boot)

set.seed(1)

# count how often the Poisson fit beats the Gaussian fit
# (i.e. achieves the lower cross-validated squared error)
Poisson_vs_Gauss_2 = 0

for (i in 1:100) {

  # create data:
  # the columns x0..x5 are polynomial terms of degree 0 to 5
  # y is Poisson distributed with a mean that follows a sine function
  x <- seq(1,10,length.out=40)
  y <- rpois(length(x),5+4*sin(x/10*pi))
  datat <- data.frame(y=y,
                     x0 = rep(1,length(x)),
                     x1 = x,
                     x2 = x^2,
                     x3 = x^3,
                     x4 = x^4,
                     x5 = x^5)

  # compute cross-validated performance for the two models
  rms1 = rep(NA, 5)   # Gaussian results
  rms2 = rep(NA, 5)   # Poisson results
  degree = 1:5
  for (d in degree) {
    # Gaussian model (fitted by minimizing the squared error)
    fit1 <- glm(y ~ 0 + .,
                data = datat[,1:(2+d)],
                family = gaussian(link="identity"))
    rms1[d] <- cv.glm(data = datat[,1:(2+d)], glmfit = fit1)$delta[1]

    # Poisson model (fitted by maximizing the likelihood),
    # using the same identity link so that only the loss function differs
    fit2 <- glm(y ~ 0 + .,
                data = datat[,1:(2+d)], start = fit1$coefficients,
                family = poisson(link="identity"))
    rms2[d] <- cv.glm(data = datat[,1:(2+d)], glmfit = fit2)$delta[1]
  }
  if (min(rms1)>min(rms2)) {
    Poisson_vs_Gauss_2 = Poisson_vs_Gauss_2+1
  }
}

Poisson_vs_Gauss_2   # how often the Poisson fit gave the lower CV error

# CV error curves of the last replication
plot(degree, rms1,
     ylim=c(min(rms1,rms2),max(rms1,rms2)),
     xlab = "polynomial degree", ylab = "CV MSE",
     pch=21,col="black",bg="white",cex=0.7)
points(degree, rms2,
       pch=21,col="black",bg="gray",cex=0.7)
Sextus Empiricus

  • This is drifting towards another question which I was going to ask but expected the answer to be too broad to be useful, so did not. The question is, for a given model with unknown coefficients, should we estimate the coefficients from the data using the same loss function as will be used in judging the model performance in application(s) ("matched" loss functions) or not ("mismatched" loss functions). An example would be OLS vs. LAD for the location parameter of $X\sim N(\mu,\sigma^2)$ where the model will be used for prediction under $L_1$ vs. $L_2$ loss. – Richard Hardy Jan 08 '19 at 13:14
  • Some thoughts while I am digesting your answer. Interesting that you separate two objectives: one for fitting and another for model use. Any fitting procedure of course has an implicit objective and any use of the model has an explicit one. The existence of the implicit objective in the fitting stage does not automatically mean a researcher can sensibly justify different objectives in the two stages (it seems more intuitive to me for the two to match, or actually to just have one objective: model use, which is targeted jointly in fitting and use of the model). Thus my question. – Richard Hardy Jan 08 '19 at 13:18
  • What is *closer to the true data generation model* depends on the distance metric, so Procedure A "maximizing the likelihood" might systematically yield a model that is *further* from the true model than Procedure B "minimizing some loss function that is not equal to negative likelihood". Which procedure is systematically better depends, among other things, on the distance metric. – Richard Hardy Jan 08 '19 at 13:23
  • You are right that *"closer to the true data generation model"* indeed ultimately depends on the distance metric (which should relate to performance). But that does not mean that you should train the models according to that distance metric. You should in general get better results when you select a sufficient statistic for the fit. E.g. if you fit the mean of a normally distributed population then you are going to do better by minimizing the squared error (the sample mean) rather than optimizing, say, the absolute error (the sample median), even if the performance is measured by $L_1$ instead of $L_2$. – Sextus Empiricus Jan 08 '19 at 16:10
  • I wonder if one could prove a theorem stating that *'under certain conditions the sufficient statistic will outperform any fit based on minimizing residuals according to another distance measure, even when the performance is determined with that other distance measure'*. (Still, if one does prove such a result, it does not mean that the cross validation should always be performed with the sufficient statistic; it depends on whether one can expect bias or not.) – Sextus Empiricus Jan 08 '19 at 16:25
  • Regarding your first comment, I totally agree and this is what I meant by *An example would be OLS vs. LAD for the location parameter of $X\sim N(\mu,\sigma^2)$ where the model will be used for prediction under $L_1$ vs. $L_2$ loss.* You phrased it well. Regarding your second comment, I now recall that I have in fact asked a related question before: ["Estimator that is optimal under all sensible loss (evaluation) functions"](https://stats.stackexchange.com/questions/311417/estimator-that-is-optimal-under-all-sensible-loss-evaluation-functions). I would love to see your take on it there! – Richard Hardy Jan 08 '19 at 17:22
  • Martijn, such a great answer; may I invite you to join our [Operations Research](https://or.stackexchange.com/).SE Beta. – Rob Aug 12 '19 at 22:09
  • Hi @Rob, thank you. I am afraid that I am more like an old fashioned physicist/mathematician/technologist that is not very much into the fancy stuff. The Operations Research site is too broad for me with many questions where I do not even understand many of the words that are used, let alone understand the sentences and concept of the questions (for instance questions like "Formulating a MINLP for CPLEX in PYOMO?"). I suggest, for specific questions on OR with relatively strong links to mathematical statistics, you could send a message to the chat or make a theoretic versions of the questions. – Sextus Empiricus Aug 13 '19 at 14:17
  • @MartijnWeterings I noticed your *tank answer* too and thought you would be interested, but pinged from here. The question you refer to is simply: "[Mixed integer nonlinear programming](https://optimization.mccormick.northwestern.edu/index.php/Branch_and_cut_for_MINLP) using a python program (Pyomo) to send the question to a server offering IBM's CPLEX Solver" - useful for determining where to place a telescope or which ingredients to use and how to route them to your factories. Hopefully it's simply procrastination. We have quite a few doctoral mathematicians presently. Visitors welcome. – Rob Aug 13 '19 at 15:21