
I am using quantile regression (for example via gbm or quantreg in R) - not to model the median but an upper quantile (e.g. the 75th). Coming from a predictive-modeling background, I want to measure how well the model fits a hold-out test set and be able to describe this to a business user. My question is how? In a typical setting with a continuous target I could do the following:

  • Calculate the overall RMSE
  • Decile the data set by the predicted value and compare the average actual to the average predicted in each decile.
  • Etc.

What can be done in this case, where there is no observed "actual" quantile (I don't think, at least) to compare the prediction to?

Here is some example code:

install.packages("quantreg")
library(quantreg)

install.packages("gbm")
library(gbm)

data("barro")

trainIndx<-sample(1:nrow(barro),size=round(nrow(barro)*0.7),replace=FALSE)
train<-barro[trainIndx,]
valid<-barro[-trainIndx,]

modGBM<-gbm(y.net~., # formula
            data=train, # dataset
            distribution=list(name="quantile",alpha=0.75), # see the help for other choices
            n.trees=5000, # number of trees
            shrinkage=0.005, # shrinkage or learning rate,
            # 0.001 to 0.1 usually work
            interaction.depth=5, # 1: additive model, 2: two-way interactions, etc.
            bag.fraction = 0.5, # subsampling fraction, 0.5 is probably best
            train.fraction = 0.5, # fraction of data for training,
            # first train.fraction*N used for training
            n.minobsinnode = 10, # minimum total weight needed in each node
            cv.folds = 5, # do 5-fold cross-validation
            keep.data=TRUE, # keep a copy of the dataset with the object
            verbose=TRUE) # print out progress

best.iter<-gbm.perf(modGBM,method="cv")

pred<-predict(modGBM,valid,n.trees=best.iter)

Now what - since we never observe the true percentile of the conditional distribution?

Add:

I hypothesized several methods, and I would like to know whether they are correct, whether there are better ones, and also how to interpret the first:

  1. Calculate the average value from the loss functions:

    qregLoss <- function(actual, estimate, quantile) {
      # average pinball (check) loss over the test cases
      mean((actual - estimate) * (quantile - ((actual - estimate) < 0)))
    }
    

    This is the loss function for quantile regression - but how do we interpret the value?

  2. Should we expect that, if for example we are estimating the 75th percentile, the predicted value on a test set will be greater than the actual value around 75% of the time?

Are there other methods formal or heuristic to describe how well the model predicts new cases?
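Regarding how to read the value in (1): the raw pinball loss has the units of the target and is hard to interpret on its own, but it becomes readable when compared against a naive constant predictor. A minimal sketch (with simulated stand-in data, not the barro set, and a made-up `skill` name for the resulting score):

```r
# Average pinball loss, as in the question
qregLoss <- function(actual, estimate, quantile) {
  mean((actual - estimate) * (quantile - ((actual - estimate) < 0)))
}

set.seed(1)
actual   <- rnorm(1000, mean = 10, sd = 2)  # stand-in test targets
estimate <- rnorm(1000, mean = 11, sd = 1)  # stand-in model predictions

lossModel    <- qregLoss(actual, estimate, 0.75)
# Baseline: always predict the unconditional 75th quantile
# (in practice, take this quantile from the training set)
lossBaseline <- qregLoss(actual, quantile(actual, 0.75), 0.75)

# Skill score: 1 = perfect, 0 = no better than the constant baseline,
# negative = worse than the baseline; analogous in spirit to R^2
skill <- 1 - lossModel / lossBaseline
```

A business-friendly reading is then "the model reduces the quantile loss by X% relative to always predicting the overall 75th percentile."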

B_Miner

2 Answers


A useful reference may be Haupt, Kagerer, and Schnurbus (2011) discussing the use of quantile-specific measures of predictive accuracy based on cross-validations for various classes of quantile regression models.

Skullduggery

I would use the pinball loss (defined at the start of the second page of https://arxiv.org/pdf/1102.2101.pdf) and interpret it like a mean absolute error (MAE) for the quantile you are modelling. For example, for a loss of 100: "the mean absolute error of our model with respect to the true 75%-quantile in our test data is 100."

Keep in mind this is not comparable to the RMSE, as outliers are much less influential.

To answer your question (2): if you model the 75% quantile, you are fitting the boundary that splits the data mass pointwise in a 75:25 ratio. Approximately 25% of your test observations should therefore lie above your prediction.
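This coverage check is easy to compute. A minimal sketch, using simulated stand-ins for the `actual` test values and the model's `pred` (here a near-ideal predictor by construction, just to show the mechanics):

```r
set.seed(1)
actual <- rnorm(1000)  # stand-in test targets
# Stand-in predictions hovering near the true 75th percentile
pred   <- quantile(actual, 0.75) + rnorm(1000, sd = 0.01)

# Fraction of test cases at or below the prediction;
# for a well-calibrated 75%-quantile model this should be close to 0.75
coverage <- mean(actual <= pred)
```

Large deviations of `coverage` from the target quantile on a test set are a simple, business-explainable sign of miscalibration.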