
I am working on a logistic regression model that attempts to predict failure events in a population of devices using the previous 10 days of data (60 features from sensor data). The failure event is rare; I would expect to find approximately 20-30 on a given day, and the total population size is > 10,000.

Here is where I may be mistaken: I have access to years' worth of historical data, so I thought that I would be slick and make the event 'non-rare'. That is, collect 1000 failure examples and 1000 non-failure examples, then estimate the model on that. This, I thought, would give the model enough information to determine the relationships between sensor readings and class membership.
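The balanced-sampling scheme described above can be sketched as follows; the column names, the 2% simulated failure rate, and the data itself are invented purely for illustration (the real rate is rarer still):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated stand-in for the historical device data.
history = pd.DataFrame({
    "sensor_mean": rng.normal(size=100_000),
    "failed": rng.random(100_000) < 0.02,
})

# Select on the outcome: 1000 failures, 1000 non-failures, then shuffle.
failures = history[history["failed"]].sample(n=1000, random_state=0)
non_failures = history[~history["failed"]].sample(n=1000, random_state=0)
balanced = pd.concat([failures, non_failures]).sample(frac=1, random_state=0)

print(history["failed"].mean())   # population failure rate, ~0.02
print(balanced["failed"].mean())  # 0.5 by construction
```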

However, I am starting to think that my intercept term (and others?) may be problematic because the real initial class probabilities are not respected.

Is the model corrupted by this disregard of the real event probabilities?

kjetil b halvorsen
HEITZ
    If I'm reading this correctly, this sounds akin to a case-control study in epidemiology/biostatistics -- that is, sampling on event status (event/non-event) and then considering differences in exposure status. This is useful for modelling odds ratios and answering causal questions (e.g. is past exposure to Substance X more common amongst brain cancer patients than controls?) but less useful if you're looking at predicting event status based on other features (and it sounds like your data structure is more complex than this simple example anyway). – James Stanley Sep 15 '15 at 22:30
  • See [Does down-sampling change logistic regression coefficients?](http://stats.stackexchange.com/q/67903/17230). – Scortchi - Reinstate Monica Sep 18 '15 at 11:54

2 Answers


What you are doing is 'oversampling' the failures and, as you conclude, this biases your estimates. When you do not draw a random sample but instead sample based on the values of the binary outcome, you are doing so-called 'choice-based sampling'.

The correction of King and Zeng is one option; another is to use the weights argument of glm: if $\pi$ is the fraction of failures in your population and $p$ the fraction in the sample, then each observation gets a weight of $\frac{\pi}{p}$ for the failures and $\frac{1-\pi}{1-p}$ for the non-failures. For details see e.g. W.H. Greene, *Econometric Analysis*. Note that in the latter case you should use White's sandwich estimator for the variance-covariance matrix, as mentioned by @Scortchi in the comments.
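As a minimal sketch of the weighting idea in Python (scikit-learn's `sample_weight` standing in for `glm`'s weights argument; the population fraction `pi` and the one-feature simulated sample are invented for illustration — note scikit-learn reports no standard errors, so the sandwich-estimator caveat above still applies and would require e.g. statsmodels for inference):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

pi = 0.002  # assumed failure fraction in the population (illustrative)
p = 0.5     # failure fraction in the balanced estimation sample

# Simulated balanced sample: one sensor feature, failures shifted upward.
y = np.repeat([1, 0], 1000)
X = rng.normal(loc=y, scale=1.0).reshape(-1, 1)

# Choice-based-sampling weights: pi/p for failures, (1-pi)/(1-p) otherwise.
w = np.where(y == 1, pi / p, (1 - pi) / (1 - p))

# Large C makes the fit effectively unregularised, like glm's MLE.
clf = LogisticRegression(C=1e6, max_iter=1000).fit(X, y, sample_weight=w)
print(clf.intercept_[0], clf.coef_[0, 0])
```

The weighting pulls the intercept back toward the population base rate (a large negative value here) while leaving the slope estimate essentially intact.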

  • (-1) The `glm` manual says "For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes". So these are frequency weights, **not** sampling weights: standard errors for odds ratio estimates will be hugely inflated. (BTW I hope it doesn't seem like I'm picking on you today - you should've also noticed a few up-votes.) – Scortchi - Reinstate Monica Sep 18 '15 at 12:46
  • @Scortchi: no hard feelings; if I am wrong I am wrong, but in this case I am not so sure that I am. Did you check the formula in the reference? If you look at it, these are the weights of the observations in the likelihood function that is maximised. –  Sep 18 '15 at 13:19
  • @Scortchi: W.H. Greene, Econometric Analysis –  Sep 18 '15 at 13:48
  • I don't have it to hand. I've just looked at King & Zeng: they mention this method of weighting - in fact they say it's more robust under model mis-specification than prior correction - but note that "the usual method of computing standard errors is severely biased" & recommend using White's standard errors, which I think means [White (1980). "A heteroskedasticity-consistent covariance matrix estimator & a direct test for heteroskedasticity", *Econometrica*, **48**, 4](https://www.jstor.org/stable/1912934). – Scortchi - Reinstate Monica Sep 18 '15 at 13:54
  • 1
    @Scortchi: You are right about the var-covar estimator, The reference that I menstion (Greene) also says that you should use the ''White's robust sandwich estimator'', Greene refers to White (1982a), Maximum lilekihood estimation of misspecified models, Econometrica 53, 1982, p 1-16. But I think you can no longer say that my answer is wrong ? –  Sep 18 '15 at 14:09
  • Important to mention that in the answer, I think. – Scortchi - Reinstate Monica Sep 18 '15 at 14:13
  • @Scortchi: well, if you look at the question, I think my answer answers it; moreover I have a reference ('for details see ...') where that is mentioned. But as I said, no hard feelings :-) –  Sep 18 '15 at 14:16
  • True but it's hardly a detail. (And I fear not many will check the reference.) Just one little edit ... :( – Scortchi - Reinstate Monica Sep 18 '15 at 14:24
  • @Scortchi: you're right, I changed it –  Sep 18 '15 at 14:54

This approach definitely does corrupt the model. The output of logistic regression is meant to estimate the probability of an event given some configuration of independent variables. Discarding large amounts of data based on the value of the dependent variable distorts this aspect of the model's output.

It sounds like your goal is to classify rather than to use these probabilities directly, so if the skew is leading to too many false negatives, you should produce an ROC curve and determine whether there is some threshold that yields an acceptable balance of false positives and false negatives.
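For instance, with scikit-learn's `roc_curve` (simulated labels and scores standing in for your model's predictions; the Youden's-J criterion used below is just one illustrative choice):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)

# Simulated stand-ins: rare positives (~5%) that tend to score higher.
y = (rng.random(5000) < 0.05).astype(int)
scores = rng.normal(loc=y, scale=1.0)

fpr, tpr, thresholds = roc_curve(y, scores)

# One simple criterion: maximise TPR - FPR (Youden's J). In practice you
# would weight false positives and false negatives by their actual costs.
best = int(np.argmax(tpr - fpr))
print(thresholds[best], tpr[best], fpr[best])
```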

The problem you describe has been studied extensively and is often referred to as one-class classification. When the vast majority of the data is "normal" and the goal is to detect "anomalies" (for example, flight malfunctions), many methods focus on modelling normality rather than modelling the anomalies themselves.
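A tiny sketch of the one-class idea using scikit-learn's `IsolationForest` (the data, dimensions, and test points are invented for illustration; `OneClassSVM` would be another standard choice):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)

# Fit on "normal" sensor readings only; anomalies never enter training.
normal = rng.normal(size=(1000, 4))
clf = IsolationForest(random_state=0).fit(normal)

# predict returns +1 for inliers and -1 for outliers.
candidates = np.array([[0.1, -0.2, 0.0, 0.3],   # typical reading
                       [8.0, 8.0, 8.0, 8.0]])   # far outside the training cloud
print(clf.predict(candidates))
```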

jlimahaverford
  • From what I am reading, it appears I may be able to estimate the model using my 'enhanced' dataset created by selecting on Y, then correct the intercept manually using the King and Zeng method (base-rate correction). Can you speak to that at all? – HEITZ Sep 15 '15 at 21:31
  • While I've heard of the King and Zeng method, I do not know it. I may look into it, but can't make any promises. I do strongly suggest you look into methods of "one class classification" though. – jlimahaverford Sep 15 '15 at 21:34
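For reference, the prior (base-rate) correction asked about above adjusts only the intercept: $\hat\beta_0 - \ln\!\left[\frac{1-\tau}{\tau}\,\frac{\bar y}{1-\bar y}\right]$, where $\tau$ is the population event fraction and $\bar y$ the fraction in the estimation sample. A minimal sketch, with the function name and the numbers invented for illustration:

```python
import math

def prior_corrected_intercept(b0_hat, tau, ybar):
    """King & Zeng prior correction: pull an intercept estimated on a
    sample selected on Y back to the population base rate.

    b0_hat: intercept fitted on the selected sample
    tau:    event fraction in the population
    ybar:   event fraction in the estimation sample
    """
    return b0_hat - math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

# Balanced sample (ybar = 0.5), rare population event (tau = 0.002),
# intercept fitted at 0 on the balanced data:
print(prior_corrected_intercept(0.0, tau=0.002, ybar=0.5))  # -log(499) ~ -6.21
```

When the sample fraction already matches the population fraction, the correction vanishes, as it should.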