
Is there a theory, or a rule of thumb backed by some theory, on whether one should use cases or controls as the target class when building predictive models? By "case" I mean the minority class.

In credit scoring, which I am not new to, the model is usually built so that a higher score means a higher probability of being "good": FICO score: 300 to 850, VantageScore 3.0: 300 to 850, etc. (from here)

In biomedical papers, authors usually consider models that describe the probability of being a "case": the increased odds of a disease, etc.

From experimenting with datasets, though not to a publishable extent, it looks to me that common methods such as regression, trees, and ensembles thereof do not care which class is the target, minority or majority.

Are there any interesting papers that I missed, or some basic theory I do not understand?

coulminer
  • See [Logistic regression: what happens to the coefficients when we switch the labels (0/1) of the binary outcome?](http://stats.stackexchange.com/q/168637/17230). – Scortchi - Reinstate Monica Mar 05 '16 at 18:19
  • Thanks for the thread. Of course, the contingency tables will be mirrored, hence the log odds will change sign, so there is no problem in the formulas. But is your expectation of the same results from a heuristic algorithm based on gut feeling, or does it stem from something I don't know? My very humble gut feeling says that with imbalanced classes there might be a difference depending on whether the majority or the minority class is used for training. Sorry for not being able to elaborate fully on the feelings. – coulminer Mar 06 '16 at 12:38
  • Just offering logistic regression as an example of an approach to predicting probabilities of class membership that does have this desirable equivariance property. Like @EdM, I'm unaware of any regression/classification algorithms for which the arbitrary labelling of classes would make a difference; one for which it did would already have a black mark against it. – Scortchi - Reinstate Monica Mar 06 '16 at 12:58

1 Answer


The simple answer is that both "cases" and "controls" are used in building a model. You need to know characteristics associated with both in order to distinguish them. Once the model is built, if you know the odds, say, of having the disease then there is no difficulty in figuring the odds of not having the disease.
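To make the point about odds concrete, here is a minimal sketch in plain Python. The toy data and the helpers `fit_logistic` and `predict` are made up for illustration and do not come from any library. It fits a one-predictor logistic regression twice, once with the "cases" coded as 1 and once with the "controls" coded as 1. Under these assumptions the two fits should differ only in the sign of the coefficients, so the two predicted probabilities add up to 1.

```python
import math

def predict(b0, b1, x):
    """Probability that the class coded as 1 occurs at predictor value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def fit_logistic(xs, ys, steps=5000, lr=0.1):
    """Fit intercept b0 and slope b1 by plain gradient descent on the log-loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = predict(b0, b1, x) - y  # gradient of the log-loss w.r.t. the linear predictor
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Hypothetical imbalanced toy data: 3 "cases" (1) among 10 observations.
xs = [0.2, 0.5, 0.9, 1.1, 1.4, 1.8, 2.0, 2.3, 2.7, 3.0]
ys = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1]

case_fit = fit_logistic(xs, ys)                       # target = case
control_fit = fit_logistic(xs, [1 - y for y in ys])   # target = control

p_case = predict(*case_fit, 2.5)
p_control = predict(*control_fit, 2.5)

print("coefficients, case as target:   ", case_fit)
print("coefficients, control as target:", control_fit)
print("p(case) + p(control) at x = 2.5:", p_case + p_control)
```

Nothing here depends on which class is the rarer one: flipping the labels flips the sign of every gradient, so the optimizer retraces the same path mirrored through zero.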

I'm unaware of any approaches to regression/classification that will be more or less efficient computationally depending on which class is chosen as the "target." There is one practical consideration, however: the number of variables you can reasonably include in a model is limited by the number of observations in the least-frequent class. For example, a common rule of thumb for standard logistic regression is to consider no more than 1 predictor variable per 15 observations in the least-frequent class.
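As a rough illustration of that rule of thumb, the arithmetic can be sketched as follows. The function name `max_predictors` and the sample counts are hypothetical, and the divisor of 15 is simply the figure quoted above rather than a hard limit.

```python
def max_predictors(n_cases, n_controls, per_variable=15):
    """Rough ceiling on candidate predictors, set by the smaller of the two classes."""
    return min(n_cases, n_controls) // per_variable

# Hypothetical study: 90 cases and 4910 controls.
budget = max_predictors(90, 4910)            # 90 // 15 = 6
budget_swapped = max_predictors(4910, 90)    # same answer with the labels swapped

print(budget, budget_swapped)
```

The limit comes from the smaller class whichever label it carries, so relabeling the target does not buy any extra predictors.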

Your question, however, raises the important and sometimes overlooked issue of whether a particular analysis should care more about "cases" or "controls." The usual "classification accuracy" scores based on numbers of mis-classifications or areas under receiver operating characteristic curves have an implicit assumption that both types of mis-classifications are equally important. That's seldom the case in practice, even with the same model.

For example, say you had a model predicting the probability that a company would fail within 5 years, and that failures are the "cases." If you are a conservative investment banker, you would probably want to avoid "cases" in your portfolio. If you were a venture capitalist, you would probably want to factor in the potential gains along with the probability of being a "case." Same model, different uses.
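A minimal sketch of "same model, different uses" follows, assuming correct decisions cost nothing and each investor simply rejects a company when the expected cost of investing exceeds the expected cost of passing. The cost figures, the predicted probabilities, and the function names are all invented for illustration. Under those assumptions the rejection cutoff on the predicted failure probability is cost_FP / (cost_FP + cost_FN), where a false positive is rejecting a company that survives and a false negative is investing in one that fails.

```python
def reject_threshold(cost_false_positive, cost_false_negative):
    """Failure probability above which rejecting has the lower expected cost."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)

def decisions(failure_probs, threshold):
    """True means 'reject the company' at the given cutoff."""
    return [p > threshold for p in failure_probs]

# Hypothetical predicted 5-year failure probabilities from one and the same model.
failure_probs = [0.05, 0.20, 0.50, 0.90]

# Hypothetical costs: the banker fears backing a failure far more than missing a survivor,
# while the venture capitalist loses more by passing on a company that would have survived.
banker_cutoff = reject_threshold(cost_false_positive=1, cost_false_negative=10)
vc_cutoff = reject_threshold(cost_false_positive=5, cost_false_negative=1)

banker_rejects = decisions(failure_probs, banker_cutoff)
vc_rejects = decisions(failure_probs, vc_cutoff)

print(banker_cutoff, banker_rejects)
print(vc_cutoff, vc_rejects)
```

The model's probabilities never change; only the cutoff applied to them does, which is something a fixed yes-no classifier cannot accommodate.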

That's one reason why on this site you will often find recommendations to build models that predict probabilities of class membership, as with logistic regression, rather than a simple yes-no classification scheme. This page and its links go into more depth.

EdM
  • Maybe I should update the question if it came across as that black and white. If so, please advise. What I wanted to ask is whether there is any practical difference between training models with the minority class as the target and training them with the majority class as the target. It seems that there is no difference, but since many models are heuristic, maybe there is something to choosing the target class properly? – coulminer Mar 05 '16 at 16:54