This specific flavor of selection bias is called rejection bias, at least in credit decisions. Searching for that term should find some useful information.
I know of two proposed ways to mitigate the effect:
Passthroughs. Randomly select some of the model-rejects and approve them anyway, so that they eventually get labels and can be included in future retraining. This was mentioned in the comments, and I agree that oversampling/upweighting these passthroughs during model training makes sense: you want to tell the model that the real distribution contains more of "these kinds" of datapoints. However, in a test at my company we found that the optimal weight was much smaller than the one that would restore the passthrough sample to its true population proportion, perhaps because we didn't pass through enough applicants to be fully representative. A rough sketch of the weighting is below.
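A minimal sketch of that upweighting using scikit-learn's `sample_weight` (the data, the assumed 30% reject rate, and the upweight fraction `alpha` are all hypothetical; in practice you'd tune `alpha` on a validation set):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Approved applicants: plentiful, but selected by the old model (hypothetical data).
X_approved = rng.normal(size=(10_000, 5))
y_approved = rng.integers(0, 2, size=10_000)

# Passthroughs: a small random sample of model-rejects that were approved anyway.
X_pass = rng.normal(size=(200, 5))
y_pass = rng.integers(0, 2, size=200)

# If rejects are ~30% of real applicant volume, a "fully representative" weight
# per passthrough would be reject_rate/(1 - reject_rate) * n_approved / n_pass.
# In our test the optimum was well below that, so scale it by alpha < 1.
reject_rate = 0.30
full_weight = (reject_rate / (1 - reject_rate)) * len(y_approved) / len(y_pass)
alpha = 0.25  # fraction of the fully representative weight; tune by validation
w_pass = alpha * full_weight

X = np.vstack([X_approved, X_pass])
y = np.concatenate([y_approved, y_pass])
w = np.concatenate([np.ones(len(y_approved)), np.full(len(y_pass), w_pass)])

model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
```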
Reject inference models. Use your old model (or, if available, some third data source) to score the goodness of the rejected population, and use that score as a proxy for actual performance so the rejects can be included in the retraining dataset. (There are several ways to specify the proxy: depending on the model, you can use the probabilities as soft labels, flip score-weighted coins, or duplicate each sample once with each class and give the copies weights equal to the scores, etc.; one variant is sketched below.) There are a few papers on this, and the benefits are debated. I think the balance to strike is between retaining the good information from the old model while forgetting the bad (or no longer relevant) information. See also What is "reject inferencing" and how can it be used to increase the accuracy of a model?
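For the "duplicate with score weights" variant (sometimes called fuzzy augmentation in the credit-scoring literature), a minimal sketch, assuming an already-fitted `old_model` with a `predict_proba` method and hypothetical arrays `X_approved`, `y_approved`, `X_rejected`:

```python
import numpy as np

def fuzzy_augment(old_model, X_approved, y_approved, X_rejected):
    """Duplicate each reject once per class, weighted by the old model's
    predicted probabilities; approved rows keep their observed labels."""
    # Probability of class 1 ("bad"), assuming classes are coded 0/1.
    p_bad = old_model.predict_proba(X_rejected)[:, 1]

    X = np.vstack([X_approved, X_rejected, X_rejected])
    y = np.concatenate([
        y_approved,
        np.ones(len(X_rejected), dtype=int),   # reject copies labeled 1 ...
        np.zeros(len(X_rejected), dtype=int),  # ... and labeled 0
    ])
    w = np.concatenate([
        np.ones(len(y_approved)),  # observed outcomes get full weight
        p_bad,                     # weight = old model's P(bad)
        1.0 - p_bad,               # weight = old model's P(good)
    ])
    return X, y, w

# Retrain on the augmented set, e.g.:
# X, y, w = fuzzy_augment(old_model, X_approved, y_approved, X_rejected)
# new_model.fit(X, y, sample_weight=w)
```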
Passthroughs obviously cost something to gather. A colleague of mine ran an analysis on a historical chunk of our data from a period when we had opened up approvals quite a bit (so he could apply a more business-as-usual threshold and simulate what a passthrough sample would look like). For us, a reject inference model did improve the retrained model's score, but not nearly as much as the passthroughs did, and the cost of the passthroughs was dwarfed by the value of the improved retrained model. But I expect all of that depends greatly on business specifics.
The answer by Sextus Empiricus feels similar to reject inference: you trust the old model to rank-order the whole population well; in particular, you trust its performance on the lower-quality datapoints while incorporating the new information from the higher-quality, approved datapoints.