
To reduce the load on my machine, I want to take advantage of an undersampling approach. Here are a few facts about my data:

  1. My data is on the order of 20 million observations, or even more.

  2. The event rate is around 0.8%.

Using undersampling, I want to reduce the number of non-events so that the non-event:event ratio becomes 70:30 or even 50:50. This will help me reduce the load on my machine; a rough sketch of what I mean follows.
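For concreteness, the sampling step might look something like this (a minimal sketch in Python/pandas; `df` and the `event` column name are placeholders):

```python
import pandas as pd

# df is the full dataset with a binary 'event' column (hypothetical names).
events = df[df["event"] == 1]              # keep every event (~0.8% of rows)
non_events = df[df["event"] == 0].sample(
    n=len(events),                         # n=len(events) gives 50:50;
    random_state=42,                       # n=int(7 * len(events) / 3) gives 70:30
)
train = pd.concat([events, non_events]).sample(frac=1, random_state=42)  # shuffle
```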

I require an accurate estimate of the predicted probability itself, not merely the correct rank ordering. Hence, I definitely want to calibrate the probabilities back to the population prevalence.
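The standard way to calibrate back is prior correction (King & Zeng, 2001): when all events are kept and only non-events are sampled, the slopes are unaffected and only the intercept needs shifting by the log of the relative sampling rates. A sketch, using my numbers above:

```python
import numpy as np

tau = 0.008   # event rate in the population
ybar = 0.5    # event rate in the undersampled training data (50:50)

def correct_intercept(b0_sample, tau, ybar):
    """King & Zeng prior correction: shift the fitted intercept back to the
    population prevalence; the slope coefficients are left untouched."""
    return b0_sample - np.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

# Equivalently, if every event is kept and non-events are sampled with
# probability r, the correction is simply b0_sample + np.log(r).
```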

Now, two questions:

  1. Will Firth logistic regression (along with a weight statement) help in this case or not? (A sketch of Firth's method follows below.)

  2. Does undersampling help to build a better model, in the sense of more accurate regression coefficients? I think all it should change is the intercept; the rest should stay the same.
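For reference, Firth's method maximizes the likelihood penalized by the Jeffreys prior, |I(beta)|^(1/2), which amounts to modifying the score by h_i * (1/2 - pi_i), where the h_i are hat values. A bare-bones numpy sketch, not production code (SAS offers the FIRTH option on PROC LOGISTIC's MODEL statement, and R has the logistf package):

```python
import numpy as np

def firth_logistic(X, y, n_iter=50, tol=1e-8):
    """Firth-penalized logistic regression via Newton iterations.
    X: (n, p) design matrix including an intercept column; y: (n,) in {0, 1}."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))
        W = pi * (1.0 - pi)                              # IRLS weights
        info_inv = np.linalg.inv(X.T @ (W[:, None] * X))  # I(beta)^{-1}
        # Hat values of the weighted design: h_i = w_i * x_i' I^{-1} x_i
        h = W * np.einsum("ij,jk,ik->i", X, info_inv, X)
        score = X.T @ (y - pi + h * (0.5 - pi))          # Firth-modified score
        step = info_inv @ score
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```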

  • I think this is answered [here](http://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients/68726). (1) If you want to reduce bias you can use Firth's method, but there's not necessarily a lot of it just because you've under-sampled. (2) The regression coefficients will be *less* precisely estimated in a smaller sample, but still consistent estimators of the population odds ratios; except for, as you say, the intercept. – Scortchi - Reinstate Monica Nov 22 '13 at 09:14
  • So decide how much data your machine can comfortably deal with, take all the event data, & make up the difference with non-event data. Fit the logistic regression model (using Firth's penalization if you want), recalculate the intercept according to the population prevalence, & you're done. – Scortchi - Reinstate Monica Nov 22 '13 at 09:36
  • Thanks for the response! Actually, I have 12 months' worth of data, and I was thinking of taking the complete data for the recent two months and undersampling the data for the previous 10 months. 1. Do you think Firth's method will help me here? Or rather, where exactly should I use Firth's method? – Ashish Dang Nov 22 '13 at 09:56
  • That's a little odd: if you think that recency makes a difference consider modelling it explicitly. As for when to use Firth's method, how much bias there is in the log odds ratio estimates depends on the predictor patterns - you may want to use it if you've some rare classes. – Scortchi - Reinstate Monica Nov 22 '13 at 10:25
  • Okay, I got it! Then another question: undersampling is helpful only to make the computation easier on the CPU resources. If I have to choose between two options: 1. taking the full universe but only for the recent 4 months, versus 2. taking the undersampled universe but for two years, which one should be preferred? I know there is no rule of thumb, but what should the guidelines be for selecting the better of the two? – Ashish Dang Nov 26 '13 at 05:23
  • Ideally speaking, two years of data should help us build a more stable model with seasonality factored in. However, the model based on the recent 4 months of data won't be losing out on any information, since there is no undersampling. – Ashish Dang Nov 26 '13 at 05:26
  • Giving good advice here would need me to know a lot about what you're modelling & how the model's to be applied. (If that's nuclear reactor faults & preventing them, please consult a statistician.) With that caveat, I'd be inclined in most circumstances to keep all events over the two years & sample the non-events, perhaps putting 'time' into the model as a covariate. The model based on four months' data loses out on 5/6 of the minority-class events, & data prior to this would have to be *very* irrelevant to offset the loss. – Scortchi - Reinstate Monica Nov 26 '13 at 09:45
  • Nice explanation, Scortchi! :) Actually, I am predicting sales conversion for an online retailer (definitely not even close to the reactors). I agree with you that recency would have to be very relevant to offset the loss of that many event observations. Adding time as a covariate is a good suggestion - does that mean just adding the week/date as one of the predictor variables, or is there more to it than I am seeing? – Ashish Dang Nov 27 '13 at 08:42
  • There could be more to it (think of time series analysis), but my suggestion was a crude one - just add time as a predictor (say a natural spline with a few knots), then compare that model to the model without time in & see if it makes much difference (a sketch follows these comments). – Scortchi - Reinstate Monica Nov 27 '13 at 09:14
  • Makes sense! I was also thinking of trying out some weight (exponential smoothing or something similar) for the time variable. Thanks again! – Ashish Dang Nov 27 '13 at 10:31
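A sketch of the spline comparison suggested above (Python with statsmodels and a patsy formula; the column names and the knot count are placeholders, and `cr()` is patsy's natural cubic regression spline basis):

```python
import statsmodels.formula.api as smf
from scipy import stats

# 'event', 'time', 'price', 'channel' are hypothetical column names in the
# undersampled training frame; df=4 is an arbitrary choice of spline df.
m_time = smf.logit("event ~ cr(time, df=4) + price + channel", data=train).fit()
m_base = smf.logit("event ~ price + channel", data=train).fit()

# Likelihood-ratio test: does adding time make much difference?
lr = 2 * (m_time.llf - m_base.llf)
p_value = stats.chi2.sf(lr, m_time.df_model - m_base.df_model)
```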

0 Answers