
I am using polynomial regression to predict mean occupancy in a hospital unit from average length of stay (LOS) and the arrival rate to the unit. I train the model on different percentages of the data and use 5-fold cross-validation to pick the best model. For each training-set percentage, the model is trained 20 different times, and the final models are then used to make predictions on the test set. However, when I look at the results of the 20 runs, the "average LOS" coefficients are highly variable from one model to the next (both in magnitude and in sign). Is there a way to 'quantify' this coefficient variability? Or is there a way to measure 'model stability'?
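To illustrate the kind of summary I have in mind, here is a rough sketch with placeholder names (unit_data, occupancy, avg_LOS, arrival_rate) rather than my actual data:

    # Rough sketch with placeholder names: refit the model on bootstrap
    # resamples and summarise the spread of the average-LOS coefficient.
    set.seed(1)
    coefs <- replicate(200, {
      idx <- sample(nrow(unit_data), replace = TRUE)
      fit <- lm(occupancy ~ avg_LOS + I(avg_LOS^2) +
                  arrival_rate + I(arrival_rate^2),
                data = unit_data[idx, ])
      coef(fit)["avg_LOS"]                    # linear term for average LOS
    })
    sd(coefs)                                 # spread across refits
    sd(coefs) / abs(mean(coefs))              # coefficient of variation
    mean(sign(coefs) != sign(median(coefs)))  # how often the sign flips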

Statwonder
  • No point doing train/test/validation splits for polynomial regression. This is a "smooth" function (unlike regression trees), so you're better off using all the data to fit your model and making use of model averaging techniques – probabilityislogic Jul 08 '18 at 02:21

2 Answers


Yes, there is. In a recent paper we present a framework for doing exactly what you describe.

The procedure is implemented in the R package stablelearner.

The basic idea is to use a resampling strategy that generates pairs of learning samples that do not overlap too much, plus an evaluation set that does not overlap with either learning sample. The model in question is then fitted to each of the learning samples, and the two fitted models produce a pair of predictions (one from fitted model 1, one from fitted model 2) for the evaluation data. For each prediction pair a similarity measure is calculated and aggregated over all observations in the evaluation set. This is repeated B times, and a summary statistic or the whole "stability distribution" is used to gauge stability. In the paper, bootstrap resampling of 90% of the data with out-of-bag evaluation is recommended.
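To make the procedure concrete, here is a rough hand-rolled sketch of the same idea (my own illustration, not the package's internals): draw two 90% bootstrap samples, fit the model to each, and compare the two prediction vectors on the observations left out by both fits.

    ## Hand-rolled sketch: two 90% bootstrap samples, one model fit per sample,
    ## similarity of the two prediction vectors on the jointly out-of-bag data.
    set.seed(1)
    data(iris)
    form <- Sepal.Length ~ Sepal.Width + I(Sepal.Width^2)
    ccc  <- function(x, y) {   # sample concordance correlation coefficient
      2 * mean((x - mean(x)) * (y - mean(y))) /
        (mean((x - mean(x))^2) + mean((y - mean(y))^2) + (mean(x) - mean(y))^2)
    }
    sim <- replicate(500, {
      n   <- nrow(iris)
      s1  <- sample(n, size = round(0.9 * n), replace = TRUE)
      s2  <- sample(n, size = round(0.9 * n), replace = TRUE)
      oob <- setdiff(seq_len(n), union(s1, s2))  # joint out-of-bag evaluation set
      if (length(oob) < 2) NA_real_ else
        ccc(predict(lm(form, data = iris[s1, ]), iris[oob, ]),
            predict(lm(form, data = iris[s2, ]), iris[oob, ]))
    })
    quantile(sim, c(0.05, 0.25, 0.5, 0.75, 0.95), na.rm = TRUE)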

Polynomial regression is supported by the package (in the form of a linear model). Here's a toy example:

R> library(stablelearner)
R> data(iris)
R> model <- lm(Sepal.Length ~ Sepal.Width + I(Sepal.Width^2), data = iris)
R> set.seed(1)
R> stab <- stability(model, control = stab_control(sampler = bootstrap, v = 0.9))
R> summary(stab)

Call:
lm(formula = Sepal.Length ~ Sepal.Width + I(Sepal.Width^2), data = iris)

Sampler:
B = 500 
Resampling method = Bootstrap sampling with 90.0% data 
Evaluation method = OOB 

Sampling summary:
                           avg min max
Learning sample size   135.000 135 135
Learning overlap        53.080  38  68
Evaluation sample size  24.428  15  35
Evaluation overlap       0.000   0   0

Similarity summary:

, , Concordance correlation coefficient

  5%     25%    50%    75%    95%   
lm 0.0102 0.4355 0.6429 0.8169 0.9591

R> boxplot(stab)

[stability boxplot]
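For reference, the concordance correlation coefficient reported in the similarity summary measures agreement between the two prediction vectors on the evaluation data. With means $\mu_1, \mu_2$, variances $\sigma_1^2, \sigma_2^2$ and covariance $\sigma_{12}$ of the two prediction vectors, it is

$$\rho_c = \frac{2\sigma_{12}}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2},$$

which equals 1 only if the two fitted models make identical predictions.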

Momo

I wanted to comment but don't have enough reputation to do so. Why would I be getting NA values for the Concordance Correlation Coefficient?

I tested it with two models:

    m1 <- glm(count_formula, family="quasipoisson", data=cohort,  
              weights=weights) 

where weights is just a column of 1s, and

    m2 <- glm(count_formula, family="quasipoisson", data=cohort, 
              weights=weights) 

where weights is a column of calculated propensity scores.

Both models return NA values for the CCC when I run

    stab <- stability(m1, control=stab_control(sampler=bootstrap,v=0.9)) 

or

    stab <- stability(m2, control=stab_control(sampler=bootstrap,v=0.9))
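In case it helps with reproducing the issue, here is a self-contained sketch of the same call using simulated data and placeholder names (my actual count_formula and cohort differ):

    ## Self-contained sketch of the setup above: simulated count data and
    ## placeholder names, for checking whether the NAs depend on one's own data.
    library(stablelearner)
    set.seed(1)
    cohort <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
    cohort$count <- rpois(200, lambda = exp(0.5 + 0.3 * cohort$x1))
    cohort$weights <- 1                     # all-ones weights, as in m1 above
    count_formula <- count ~ x1 + x2

    m1 <- glm(count_formula, family = "quasipoisson", data = cohort,
              weights = weights)
    stab <- stability(m1, control = stab_control(sampler = bootstrap, v = 0.9))
    summary(stab)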
kjetil b halvorsen