
I am trying to predict if a doctor is likely to switch from prescribing Drug A to Drug B.

Based on my understanding of logistic regression, you can use the independent variables to estimate the probability of the dependent variable, and that probability can then be used to classify binary outcomes.

So, I can create a logistic model, and feed it records of doctors who have been observed to switch from Drug A to Drug B. Then, I can apply the model to a broader dataset to estimate who has switched.

But I don't really care about who has switched. I already know that since I have the data. I want to know who is likely to switch in the future.

I am essentially trying to build a model that outputs a binary prediction telling me whether a doctor is likely to switch from writing Drug A to Drug B in the future.

Is logistic regression the way to go to do this?

Eric
    Short answer: it can. Long answer: there are lots of methods for prediction. Which one you choose will largely depend on your data, and you will get barraged with a million opinions on what the _best_ way is. Logistic regression has the advantage that its exponentiated coefficients are odds ratios, which are directly interpretable. It may also be preferable if you have only a few columns: you are unlikely to get very good predictions in that case, so an interpretable model can be more informative. It is also good for small datasets because its estimates have low variance. – Huy Pham Nov 05 '19 at 17:57

1 Answer


There are many predictive models for binary classification, but yes, logistic regression augmented with a decision rule is one way to go about it. But it seems you have some slight misunderstandings with respect to how to apply it.

> So, I can create a logistic model, and feed it records of doctors who have been observed to switch from Drug A to Drug B.

Yes, but you'll also want to feed it the records of doctors who have NOT switched. The model needs to learn which values of the independent variables are typically associated with switching and which are typically associated with not switching.

> But I don't really care about who has switched. I already know that since I have the data. I want to know who is likely to switch in the future.

You care about them in the sense that they will teach your model what switchers and non-switchers look like, so that it can predict future outcomes.

After you've trained the model on data with known outcomes, you can apply it to unknown cases. The model will spit out a probability, and you can then decide what to do with it. This is really where the logistic regression model ends. If you need a binary "will switch" or "will not switch" answer, you have to add a decision rule such as a probability cutoff, which steps outside the bounds of the probability model, but obviously people do it all the time.

So: data cleaning → model fitting → model diagnostics → (if model is good) apply to unknowns → make binary decision if you have to
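
As a rough illustration of the fit, apply, and decide steps, here is a minimal sketch in plain Python. Everything specific in it is an assumption made for the example: the two made-up features per doctor (already scaled to 0-1), the hand-rolled gradient-descent fit, the function names, and the 0.5 cutoff. In practice you would fit the model with a statistics package and run proper diagnostics before trusting it.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, x):
    # w[0] is the intercept; w[1:] are the coefficients.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    # Plain batch gradient descent on the average log-loss.
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j in range(p):
                grad[j + 1] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Hypothetical training data: BOTH switchers (1) and non-switchers (0).
X_known = [
    [0.9, 0.8], [0.8, 0.9], [0.7, 0.6], [0.6, 0.9],  # switched
    [0.1, 0.2], [0.2, 0.1], [0.3, 0.4], [0.2, 0.3],  # did not switch
]
y_known = [1, 1, 1, 1, 0, 0, 0, 0]

w = fit_logistic(X_known, y_known)

# Apply to doctors whose outcome is not yet known.
X_unknown = [[0.8, 0.8], [0.2, 0.2]]
probs = [predict_proba(w, x) for x in X_unknown]

# The binary call is a separate decision rule layered on the probabilities.
CUTOFF = 0.5  # assumed threshold; choose it based on the cost of each error
decisions = [p >= CUTOFF for p in probs]
print(probs, decisions)
```

The cutoff is kept outside the model on purpose: the model only produces probabilities, and how you turn them into "will switch" or "will not switch" is a separate choice.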

klumbard
  • Thanks for the response. Understood on your two comments about building the model. When I apply the model to a new set of data, let's say the model predicts a doctor is a switcher. I look at the data for said doctor and see that they are NOT in fact a switcher. I'm trying to use that to say, well, they haven't switched **yet**, but they may in the future since they have similar characteristics to those who have switched. Basically I'm using the false positives as the ones I actually care about. Sounds like this is not the intended use of logistic regression, but it could be done if I really wanted. – Eric Nov 05 '19 at 18:30
  • I see. Sounds like a more complicated problem than can be handled by standard logistic regression, because there's an implicit longitudinal nature to it. If you're able to reframe your problem, you might be able to do something like this: measure the independent variables at time 0, then check N years later to see whether the doctor has switched ("yes/no"). This is your new dependent variable, and now you're examining with logistic regression whether someone is likely to switch within N years given the independent variables. If this isn't an option for you, maybe look at more complex models for longitudinal binary data. – klumbard Nov 05 '19 at 19:26
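
The reframing in the last comment can be sketched as a labelling step, shown here in plain Python. The record layout, the field names, the dates, and the helper `label_switch_within` are all hypothetical. The sketch also assumes every doctor writes only Drug A at their baseline date and approximates a year as 365 days. Doctors whose N-year window has not yet closed get no label; they are the ones you would score with the fitted model.

```python
from datetime import date, timedelta

def label_switch_within(baseline, first_drug_b, as_of, n_years):
    """Return 1 or 0 when the outcome is known, None when the window is still open."""
    window_end = baseline + timedelta(days=365 * n_years)  # approximate years
    if first_drug_b is not None and first_drug_b <= window_end:
        return 1      # switched inside the window
    if as_of >= window_end:
        return 0      # full window observed, no switch inside it
    return None       # too early to tell

# Hypothetical records: baseline = date the independent variables were measured.
doctors = [
    {"id": "D1", "baseline": date(2015, 1, 1), "first_drug_b": date(2016, 6, 1)},
    {"id": "D2", "baseline": date(2015, 1, 1), "first_drug_b": None},
    {"id": "D3", "baseline": date(2019, 1, 1), "first_drug_b": None},
    {"id": "D4", "baseline": date(2015, 1, 1), "first_drug_b": date(2019, 3, 1)},
]

AS_OF = date(2019, 11, 5)  # assumed date of the data extract
N_YEARS = 2                # assumed prediction horizon

labels = {
    d["id"]: label_switch_within(d["baseline"], d["first_drug_b"], AS_OF, N_YEARS)
    for d in doctors
}

# Known outcomes train the model; open windows are the cases to predict.
training_ids = [i for i, lab in labels.items() if lab is not None]
to_score_ids = [i for i, lab in labels.items() if lab is None]
print(labels, training_ids, to_score_ids)
```

Note that D4 is labelled 0 here: that doctor did switch, but only after the two-year window, so for the question "switches within N years" the answer is no.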