For context, I'm using a Cox PH model (survival analysis) from the lifelines package to predict when a customer will do something, if that ever happens at all.
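For completeness, the model is fitted roughly like this (a minimal sketch; train_df is a placeholder training frame with the same duration and event columns as the output below):

from lifelines import CoxPHFitter

cph = CoxPHFitter()
# column names match the data shown further down
cph.fit(train_df, duration_col='length_of_arrears', event_col='cured')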
The only metrics built into lifelines are the concordance index and the log-likelihood.
As far as I know, the concordance index is the equivalent of a rank correlation for censored data, or of a ROC AUC (I've seen both interpretations).
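For reference, the same index can also be computed directly via lifelines.utils; a minimal sketch, where df is a placeholder frame with the duration and event columns shown further down:

from lifelines.utils import concordance_index

# fraction of comparable pairs (under censoring) where the predicted
# ordering of survival times agrees with the observed ordering
ci = concordance_index(
    df['length_of_arrears'],     # observed durations
    cph.predict_median(df),      # predicted medians (higher = survives longer)
    df['cured'],                 # event indicator (1 = event observed)
)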
I wanted to know whether there is a metric that would capture both accuracy (in terms of the event ever happening or not) and "precision" (in the sense of how far the predictions deviate from the observed times for the correctly classified cases).
The output of my model is the predicted median number of days until the event happens.
Does such a performance metric exist?
I thought about checking the concordance index by groups (say, those who cured by day 10, between days 10 and 20, days 20 and 30, etc.) to make it a bit more precise... but I'm not sure that's the way to go. Maybe even the MAE in conjunction with accuracy, to get a fuller picture? (A sketch of that combination follows the output at the end.)
Here is a bit of code so you get the idea:
import pandas as pd
from sklearn.metrics import classification_report

### Between 20 and 30 days ###
test_20and30 = imputed_df.loc[(imputed_df['length_of_arrears'] > 20) & (imputed_df['length_of_arrears'] <= 30)]
cph.predict_median(test_20and30)
Out[480]:
17 8.0
30 15.0
40 27.0
49 11.0
55 6.0
67 423.0
88 11.0
126 20.0
146 7.0
148 6.0
150 11.0
169 14.0
186 8.0
190 12.0
204 10.0
215 28.0
242 15.0
282 15.0
287 7.0
299 9.0
308 14.0
325 9.0
357 98.0
364 21.0
Name: 0.5, dtype: float64
test_20and30.cured
Out[481]:
17 1.0
30 0.0
40 0.0
49 0.0
55 0.0
67 0.0
88 1.0
126 1.0
146 0.0
148 1.0
150 0.0
169 1.0
186 0.0
190 1.0
204 1.0
215 0.0
242 1.0
282 0.0
287 0.0
299 0.0
308 1.0
325 0.0
357 0.0
364 1.0
Name: cured, dtype: float64
test_20and30.length_of_arrears
Out[482]:
17 22.0
30 21.0
40 28.0
49 26.0
55 24.0
67 28.0
88 21.0
126 27.0
146 26.0
148 24.0
150 27.0
169 23.0
186 26.0
190 22.0
204 26.0
215 26.0
242 30.0
282 23.0
287 25.0
299 27.0
308 22.0
325 26.0
357 27.0
364 27.0
Name: length_of_arrears, dtype: float64
# concordance index
cph.score(test_20and30, scoring_method='concordance_index')
Out[484]: 0.60431654676259
# predicted medians, computed once and reused
predicted_median = cph.predict_median(test_20and30)
cured_ornot = pd.DataFrame(index=predicted_median.index)
cured_ornot['cured'] = 0
# flag rows whose predicted median falls inside the 20-30 day window
cured_ornot.loc[(predicted_median > 20) & (predicted_median <= 30), 'cured'] = 1
print(classification_report(test_20and30.cured, cured_ornot['cured']))
              precision    recall  f1-score   support

         0.0       0.57      0.86      0.69        14
         1.0       0.33      0.10      0.15        10

    accuracy                           0.54        24
   macro avg       0.45      0.48      0.42        24
weighted avg       0.47      0.54      0.46        24
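To illustrate the MAE-plus-accuracy idea, here is a rough sketch of what I have in mind (the MAE is computed on the uncensored rows only, which naively ignores censoring; accuracy_score and mean_absolute_error are from scikit-learn):

from sklearn.metrics import accuracy_score, mean_absolute_error

# "accuracy": is the prediction in the right bucket?
acc = accuracy_score(test_20and30['cured'], cured_ornot['cured'])

# "precision" in my sense: how many days off are the predicted medians
# for the rows where the event was actually observed (cured == 1)?
observed = test_20and30['cured'] == 1
mae = mean_absolute_error(
    test_20and30.loc[observed, 'length_of_arrears'],
    cph.predict_median(test_20and30).loc[observed],
)
print(f'accuracy: {acc:.2f}, MAE on observed events: {mae:.1f} days')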