Sample Selection Bias in Logistic Regression

Question

I'm working on a classification problem where I expect

$\text{True Positive Rate} = 0.999$

$\text{True Negative Rate} = 0.001$

To model this data, I have created a training set with an equal proportion of true positives and true negatives. I am using this data in a logistic regression model, from which I receive probabilistic classifications. These classification probabilities, however, do not reflect the distribution of the unbiased data. Is there a way to correct this bias without creating an unbiased data set and refitting?

Thank you!

EDIT: After doing some research, I have come across this reference as a starting point: "Sample Selection Bias as a Specification Error" (Heckman 1979).

EDIT2: The above paper and its associated Wikipedia page provide a means of correcting sample selection bias by including a learned model of the selection process as a regressor. The exact implementation, however, assumes that the error terms are jointly normally distributed. I'm not sure if this assumption holds for logistic regression.

EDIT3: The assumption of normality for the error terms does not hold in logistic regression, because there is in fact no error term in logistic regression. For an explanation, see "Logistic Regression - Error Term and its Distribution".

Side note: I'm not sure what the etiquette here is regarding answering your own question, but I suppose I'll do that and mark it as accepted.

@BenReiniger I believe you may be right – bobster345, Oct 10 '19 at 00:08

score 0 · Answer 1 · answered Oct 09 '19 at 14:59

Sample selection bias is a common form of bias that generally arises in one of two ways:

Self-Selection Bias -- For instance, when assessing the average salary of recent college graduates, those with higher salaries are more likely to report.
Analyst Selection Bias -- For instance, requiring that spouses remain married for the duration of a study on the efficacy of fertility treatments.

The problem with sample selection bias is that fitted regression functions will confound the parameters of interest with the parameters of the function causing the selection bias (Heckman 1979). The broad solution to this problem is to explicitly include the parameters of sample selection bias as regressors for the parameters of interest.

Heckman introduced a framework for doing so, known as the Heckman Correction. The Heckman Correction, however, assumes a jointly normal distribution of the error terms between the model of interest and the model of selection bias.
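
To make the normality assumption concrete, here is a sketch of the usual two-equation setup behind the correction. The notation ($x_i$, $z_i$, $\beta$, $\gamma$, $\rho$, $\sigma_\varepsilon$, $\lambda$) is my own shorthand and does not come from the question:

$$y_i = x_i'\beta + \varepsilon_i \quad \text{(outcome, observed only when } s_i = 1\text{)}$$

$$s_i = \mathbf{1}[\,z_i'\gamma + u_i > 0\,] \quad \text{(selection)}$$

If $(\varepsilon_i, u_i)$ are jointly normal with $\mathrm{Var}(\varepsilon_i) = \sigma_\varepsilon^2$, $\mathrm{Var}(u_i) = 1$, and correlation $\rho$, then

$$E[\,y_i \mid x_i, z_i, s_i = 1\,] = x_i'\beta + \rho\,\sigma_\varepsilon\,\lambda(z_i'\gamma), \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}$$

where $\phi$ and $\Phi$ are the standard normal density and distribution function. The term $\lambda(z_i'\gamma)$ is the extra regressor that the correction adds, and its form follows directly from the joint normality of the two error terms.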

Logistic regression has no error terms, so the assumption of jointly normal error terms does not hold.

Instead of correcting for selection bias explicitly, as attempted above, you have other options:

Sampling -- If you know the true distribution of observations, then you can randomly sample from your observed distribution to match the true distribution.
Upsampling -- Oversample the underrepresented class, for example with SMOTE.
Creating a less biased dataset -- If your selection bias arises from analyst selection bias, recreate the dataset while imposing minimal constraints on the form of the data.
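
The sampling option can be sketched in a few lines of plain Python. This is only an illustration, not a definitive implementation: the function name `resample_to_prior`, the 0/1 label encoding, and the reading of 0.999 as the share of positives in the true population are all assumptions on my part.

```python
import random


def resample_to_prior(labels, target_pos_rate, seed=0):
    """Return indices of a subsample whose share of positives (label 1)
    is as close as possible to target_pos_rate.

    Hypothetical helper: samples without replacement and keeps as many
    observations as the target proportion allows.
    """
    if not 0.0 < target_pos_rate < 1.0:
        raise ValueError("target_pos_rate must be strictly between 0 and 1")

    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    # Target odds of positive vs. negative in the true population.
    odds = target_pos_rate / (1.0 - target_pos_rate)

    # Try to keep every positive and thin out the negatives.
    n_neg_needed = round(len(pos) / odds)
    if n_neg_needed <= len(neg):
        n_pos, n_neg = len(pos), n_neg_needed
    else:
        # Not enough negatives: keep every negative and thin out the positives.
        n_pos, n_neg = min(len(pos), round(len(neg) * odds)), len(neg)

    keep = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(keep)
    return keep


# Balanced training set, as described in the question (assumed 0/1 labels).
labels = [1] * 5000 + [0] * 5000
keep = resample_to_prior(labels, target_pos_rate=0.999)
pos_share = sum(labels[i] for i in keep) / len(keep)
```

With classes this skewed, almost all of the minority-class rows are discarded, so the resampled set carries very little information about that class. That trade-off is worth checking before relying on this approach.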
