I want to generate new leads for a small business banking company, and I have decided to use logistic regression with a binary outcome (1 = customer of the bank, 0 = not a customer of the bank). All the records for the 1s (people with small businesses) were provided by the bank from its own database. Initially the dataset contained only categorical variables. We ran a matching algorithm between the bank's records and our own open and public data DB, and then derived some further numerical count information (interval and ratio variables) for the records that matched.

To get the 0s (non-customers), I took a subset of small business owners from our own DB. However, since I have no idea about the population sizes, I decided to use a balanced sample (50% 1s and 50% 0s) for the modelling process.

I would like to know how this affects the lift chart, and whether the model is going to be any good even if the lift chart looks convincing. Model accuracy [(TP+TN)/(TP+TN+FP+FN)] is approximately 78-80% with the current approach, but I highly doubt that number.

gung - Reinstate Monica
pborah

1 Answer

This is not the right approach. Suppose you took 90% 1s: your true positive rate would probably improve simply because there are more true values in the sample. A result obtained this way is therefore misleading.

You need to have an idea of the actual ratio of the two classes. I also doubt whether data mixed from two different sources are comparable.
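If the true share of customers in the target territory is known or can be estimated, scores from a model fit on a balanced sample can be rescaled to that share. The following is a minimal sketch, assuming the sampling ratio of 1s to 0s is the only distortion; the function name `rescale_probability` and the 2% prevalence in the example are illustrative assumptions, not part of the question.

```python
def rescale_probability(p_sample, sample_rate, population_rate):
    """Map a probability estimated on a resampled training set back to
    the population scale by reweighting the odds.

    p_sample        -- model output on the resampled (e.g. 50/50) data
    sample_rate     -- share of 1s in the training sample
    population_rate -- assumed share of 1s in the real population
    """
    w1 = population_rate / sample_rate                   # weight for class 1
    w0 = (1.0 - population_rate) / (1.0 - sample_rate)   # weight for class 0
    return p_sample * w1 / (p_sample * w1 + (1.0 - p_sample) * w0)


# Hypothetical example: balanced training sample, assumed 2% prevalence.
for p in (0.5, 0.8, 0.95):
    print(p, round(rescale_probability(p, 0.5, 0.02), 4))
```

The rescaling is monotone, so it does not change how records are ranked; it only changes the probability scale. A score of 0.5 on the balanced sample maps back to the assumed base rate.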

Arpit Sisodia
  • I know what you're talking about. I just want to know how it affects the lift chart when evaluating the model. E.g., how does the lift chart change if I only change the sampling ratio of 1s to 0s (50-50, 40-60, 10-90, etc.) and touch nothing else? – pborah Feb 08 '17 at 16:34
  • Lift measures the improvement a process gains from using the LR results, so the lift at 50-50 differs from the lift at 10-90. Scenario 1: with a 50-50 split of 1s and 0s, random prediction would be 50% accurate. Here you say 85% accuracy, so the lift is good. Scenario 2: with a 90-10 split of 1s and 0s, you would be 90% accurate just by calling every record a 1. Even a hypothetically perfect model is at most 100% correct, so the lift cannot be as good as in the 50-50 scenario, because 90% accuracy was available without any model. So the lift chart would change, and it may give a wrong result if we ignore the actual population ratio. – Arpit Sisodia Feb 12 '17 at 04:02
  • Very helpful, thank you! The dataset has around 30 attributes, mostly categorical, with over 300 levels in total across all attributes. If I wanted to use all of them, and the population ratio of 1s to 0s was around 2:98, how large a sample would I need for the 1s and 0s? How would I go about the sampling process? The 2:98 ratio is basically the ratio within the real-world geographical territory in which the new leads are required. So if I use the population ratio, all the customers and non-customers within that territory become my training set, and I no longer have a scoring set to predict new leads from? – pborah Feb 13 '17 at 04:42
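The effect of the class ratio on lift discussed in the comments above can be made concrete with a small calculation. This is only a sketch: `max_lift` is a hypothetical helper, and it assumes a perfect ranking model and that lift is defined as the hit rate among the targeted records divided by the overall hit rate.

```python
def max_lift(prevalence, fraction_contacted):
    """Best possible lift when the top `fraction_contacted` of records,
    ranked by a perfect model, is targeted.

    prevalence         -- share of 1s in the evaluation set
    fraction_contacted -- share of records targeted (e.g. 0.10 = top decile)
    """
    # A perfect model fills the targeted group with 1s until it runs out of them.
    hits = min(prevalence, fraction_contacted)
    hit_rate_in_target = hits / fraction_contacted
    return hit_rate_in_target / prevalence


# Ceiling on top-decile lift for several 1:0 ratios in the evaluation set.
for prevalence in (0.9, 0.5, 0.1, 0.02):
    print(prevalence, round(max_lift(prevalence, 0.10), 2))
```

Under these assumptions, the top-decile lift on a 50-50 evaluation set can never exceed 2, while on a 2:98 set it can reach 10, so lift values read off charts built from differently balanced samples are not directly comparable.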