
Let's say I have a dataset where each item is labeled with either (1) true positive or (2) unknown (could be true positive, could be true negative).

It seems like if only true positives are labeled, the only penalty you can impose is for predicting negative on a true-positive case. In that scenario, a model that predicts positive for every item will have a perfect score.
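To make the failure mode concrete, here is a small sketch (synthetic data, made-up positive rate and labeling rate) showing that a degenerate always-positive model is never penalized when the only check is recall on the labeled positives:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hidden ground truth: 30% of items are truly positive,
# but we only observe labels for about half of the positives.
y_true = rng.random(n) < 0.30
labeled_positive = y_true & (rng.random(n) < 0.5)

# A degenerate model that predicts "positive" for every item.
y_pred = np.ones(n, dtype=bool)

# The only penalty available is missing a labeled positive,
# so the degenerate model scores perfectly.
recall_on_labeled = y_pred[labeled_positive].mean()
print(recall_on_labeled)  # 1.0
```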

Two questions:

  • Are there metrics to assess models or literature about being able to use this data with only true positive labels to build models?

  • Are there additional pieces of data that can make the data usable? For example, labeling a small set of true negatives or knowing the overall population rate of positive cases.

kjetil b halvorsen
hume
  • I'm just trying to describe a possible concrete example of this. Suppose I have supermarket frequent shopper data from Safeway, and my aim is to predict which shoppers buy diapers anywhere. I know some instances that are positive (because they bought them at Safeway) but other instances are unknown (all we know is that they didn't buy them at Safeway). Does this roughly describe the type of situation you have? – zbicyclist Jul 18 '19 at 19:08
  • What is your aim? If you want to test a theory, I don't think it is possible. If you want to guide decisions based on a risk analysis, then it could be done. – ReneBt Jul 18 '19 at 20:00
  • The aim is to have some loss function for an ML model that does not just end up predicting true for every item. – hume Jul 18 '19 at 20:19

2 Answers


You're asking about positive-unlabeled learning (PU learning). It's a niche field within machine learning. Typically it arises because the data generation practices of an organization are entirely focused on capturing one class ("positives") and ignoring other classes. For example, police will have lots of data about crimes that have been reported and criminals they have arrested, but little data about law-abiding citizens, and only some data about not-yet-apprehended criminals.

Applying machine learning methods to PU data is more involved, because unlabeled data can either be "positive" or another class, so you need to include that uncertainty explicitly in your model; simply substituting an alternative loss function is not sufficient to tackle the problem.

A recent review paper is "Learning from Positive and Unlabeled Data: A Survey" by Jessa Bekker and Jesse Davis. The authors outline current research in seven areas:

  1. How can we formalize the problem of learning from PU data?
  2. What assumptions are typically made about PU data in order to facilitate the design of learning algorithms?
  3. Can we estimate the class prior from PU data and why is this useful?
  4. How can we learn a model from PU data?
  5. How can we evaluate models in a PU setting?
  6. When and why does PU data arise in a real-world setting?
  7. How does PU learning relate to other areas of machine learning?
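As a toy illustration of why the class prior matters (questions 3 and 4 above), here is a sketch, assuming a known population positive rate `pi` and some arbitrary model scores: using the prior to set the decision threshold stops the model from simply predicting positive for everything.

```python
import numpy as np

rng = np.random.default_rng(1)

# Suppose the overall positive rate (class prior) is known, e.g. 30%,
# as the question suggests; scores come from some trained model
# (here just placeholder random scores, higher = more positive-looking).
pi = 0.3
scores = rng.random(10_000)

# One simple use of the prior: choose the threshold so that the
# fraction predicted positive matches the prior, rather than 100%.
threshold = np.quantile(scores, 1 - pi)
predictions = scores >= threshold

print(predictions.mean())  # ~0.3, not 1.0
```

This is only a calibration trick, not a full PU method, but it shows how a single extra number (the prior) rules out the degenerate all-positive solution.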
Sycorax

You seem to be referring to the one-class classification problem.

A highly-cited paper on the topic is 'Support Vector Data Description'.

http://homepage.tudelft.nl/a9p19/papers/ML_SVDD_04.pdf

The idea is to provide a description of a training set of objects and to detect which new objects resemble the training set.
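As a rough sketch of that idea: scikit-learn's `OneClassSVM` is closely related to SVDD (with an RBF kernel the two formulations are equivalent), so it can serve to illustrate training on positives only and flagging which new points resemble them. The data here is synthetic.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)

# Training set: "positive" examples only, clustered around the origin.
X_train = rng.normal(0.0, 1.0, size=(200, 2))

# nu bounds the fraction of training points treated as outliers;
# the model learns a boundary enclosing most of the training data.
clf = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_train)

# Points near the training data are labeled +1 (resemble the class),
# points far away are labeled -1.
near = clf.predict(np.array([[0.0, 0.0]]))
far = clf.predict(np.array([[8.0, 8.0]]))
print(near, far)  # [1] [-1]
```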

GrigorisG