I have a logistic regression model for the conversion rate after sending a customer a voucher. The dataset consists of all previous occasions where customers were sent a voucher, with the response being whether or not they used it.

One of the predictors in the model is a recency-weighted calculation of their spend on previous occasions where they were sent, and used, a voucher. For example,

$$\text{previous spend} := \sum_d \frac{1}{d} s_d,$$

where the summation is taken over days $d$ before today on which the customer had a nonzero transaction, with $s_d$ equal to how much was spent on that day while using a voucher. As an example, if a customer had 3 previous transactions, at weekly intervals before today, for \$1 each, their previous spend is calculated as:

$$\frac{1}{7} + \frac{1}{14} + \frac{1}{21}$$
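
As a minimal sketch of this calculation (the function name `previous_spend` and the `(date, amount)` tuple layout are assumptions made for illustration, not part of my actual pipeline), the weekly example above can be reproduced like this:

```python
from datetime import date, timedelta


def previous_spend(transactions, today):
    """Recency-weighted spend: sum of s_d / d over past voucher transactions.

    `transactions` is a list of (transaction_date, amount) pairs; this layout
    is hypothetical. d is the number of days between the transaction and today.
    """
    total = 0.0
    for day, amount in transactions:
        d = (today - day).days
        # Only days strictly before today with a nonzero spend contribute.
        if d > 0 and amount != 0:
            total += amount / d
    return total


today = date(2016, 3, 1)
# Three $1 voucher transactions at weekly intervals before today.
txns = [(today - timedelta(days=7 * k), 1.0) for k in (1, 2, 3)]
value = previous_spend(txns, today)
print(value)  # 1/7 + 1/14 + 1/21 = 11/42
```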

I am unsure whether I have done the recency weighting correctly: the variable is significant in the regression, but I do not know whether it will remain so on new data.

Thus I would like to use cross validation on my model, but I am unsure how to calculate this variable in that setting. Should I:

  1. Pre-calculate the values for $\text{previous spend}$, and calculate model parameters and fit via usual cross validation.

  2. Re-calculate the values for $\text{previous spend}$ on the train and test set for each cross validation fold, before model fitting and evaluation via usual cross validation.

  3. Somehow include the historic averaging as a model parameter, so that it can be cross validated directly.
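
To make option 3 concrete, one hypothetical parameterisation (an assumption for illustration, not something I have fitted) would replace the fixed $1/d$ weight with a power-law decay whose exponent $\alpha$ is tuned:

$$\text{previous spend}_\alpha := \sum_d \frac{1}{d^{\alpha}} s_d,$$

where $\alpha = 1$ recovers the weighting above and $\alpha = 0$ gives an unweighted sum of past voucher spend.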

Please throw some light on the advantages/disadvantages of the methods listed above, or any other you think is suitable.

Alex
  • Just to clarify, your data set has multiple records for each customer (each time a voucher is sent), and also multiple customers? Also, do you want to use the model on new customers who have not received a voucher (it's unclear how you would do so with this setup)? – Matthew Drury Mar 01 '16 at 00:28
  • Yes, multiple records for each customer and also multiple customers. The model will not be used on new customers. – Alex Mar 01 '16 at 00:29
  • Thanks. This is a nice question that I don't have an immediate answer for. I'll think about it, but in the meantime hope someone smarter comes along! – Matthew Drury Mar 01 '16 at 00:35

1 Answer

This is an excellent opportunity to bootstrap the regression coefficient. Bootstrapping shows directly how stable the coefficient is under resampling, which is a more direct route to what you want to know than cross-validation.

Think of bootstrapping as the following:

  • Draw a random sample with replacement, of the same size as the original data (this could be a sample of transactions or of people: you should make this judgement call after considering both)
  • Fit the logistic regression on that sample and record the estimated regression coefficient
  • Repeat the two steps above many times

At the end of this process you will have a distribution of beta values. From this you can calculate a confidence interval, for example from the percentiles of that distribution.

Programmatically, this can be accomplished using the boot package in R, or you could simply perform the sampling in a for loop.
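
The for-loop version might look roughly like the sketch below, written in plain Python rather than R so that it has no dependencies. Everything in it is an assumption made for illustration: the data are synthetic, the model has a single predictor, `fit_logistic` and `bootstrap_slopes` are hypothetical helper names, and rows (not customers) are resampled. The predictor values are treated as precomputed, so the sketch takes no position on whether they should be recalculated within each resample.

```python
import math
import random


def fit_logistic(x, y, iters=15):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            z = max(-30.0, min(30.0, b0 + b1 * xi))  # clamp to avoid overflow
            p = 1.0 / (1.0 + math.exp(-z))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                # entries of X'WX (negative Hessian)
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:             # near-singular, e.g. separated sample
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def bootstrap_slopes(x, y, n_boot=200, seed=0):
    """Refit on n_boot resamples drawn with replacement; return sorted slopes."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(fit_logistic([x[i] for i in idx], [y[i] for i in idx])[1])
    return sorted(slopes)


# Synthetic stand-in data: x plays the role of the previous-spend predictor.
data_rng = random.Random(42)
x = [data_rng.random() for _ in range(200)]
y = [1 if data_rng.random() < 1.0 / (1.0 + math.exp(1.0 - 2.0 * xi)) else 0
     for xi in x]

slopes = bootstrap_slopes(x, y)
lower = slopes[round(0.025 * len(slopes))]
upper = slopes[round(0.975 * len(slopes)) - 1]
print(f"95% percentile interval for the slope: [{lower:.2f}, {upper:.2f}]")
```

To resample people instead of transactions, the same loop would draw customer IDs with replacement and take all rows belonging to each drawn customer.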

josiah
  • Thank you for your answer. I am not quite sure how bootstrapping helps to measure the predictive performance of the model, even after reading http://stats.stackexchange.com/questions/18348/differences-between-cross-validation-and-bootstrapping-to-estimate-the-predictio – Alex Mar 01 '16 at 02:15
  • However, your methodology does not cover my fundamental question: when taking random samples (either for bootstrapping or for CV purposes), what value should you use for $\text{previous spend}$? Do you recalculate it based on the sample you have, keep the precalculated values, or something else? – Alex Mar 01 '16 at 02:17