
My dataset contains a single class with noisy examples. Until now I have been converting this into a binary classification problem and using logistic regression. However, this does not feel correct: the negative class contains data that does not truly form a single class, it is simply everything that is not in the "positive" class.

I stumbled upon and tried OneClassSVM with little success (in terms of classification performance).

What are some possible techniques for learning the distribution of just one class? Am I formulating the problem in the correct manner?

My dataset consists of document vectors created by a doc2vec model.

EDIT: Adding more information based on comments. The task is to classify webpages that will be relevant for certain advertisers based on past performance. We have a positive class, where we know there was a positive interaction. I have been training an LR model based on this positive interaction and a sample of pages where no interaction occurred.

EDIT 2:

Again based on the comments I would like to provide a simpler contrived example to see if my thinking about this is wrong.

Let's say we want to classify webpages as being about football. We could:

  • train a classifier of football VS non-football
  • train a model to learn the distribution of football and treat all other articles as outliers

Is the second option an incorrect way of framing this problem?
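To make the two framings concrete, here is a minimal sketch using scikit-learn. The synthetic vectors are hypothetical stand-ins for doc2vec output, and the cluster locations, `nu`, and `gamma` values are assumptions chosen only for illustration, not settings from the actual dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical stand-ins for doc2vec vectors: "football" pages form a
# tight cluster, "everything else" is spread widely.
football = rng.normal(loc=2.0, scale=0.5, size=(200, 5))
other = rng.normal(loc=0.0, scale=2.0, size=(200, 5))

# Option 1: binary classifier, football vs. non-football.
X = np.vstack([football, other])
y = np.array([1] * len(football) + [0] * len(other))
binary_clf = LogisticRegression(max_iter=1000).fit(X, y)

# Option 2: one-class model fitted on football pages only;
# the non-football pages are never shown to it.
one_class = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(football)

# Score a few unseen pages drawn from the same two synthetic sources.
new_pages = np.vstack([
    rng.normal(loc=2.0, scale=0.5, size=(3, 5)),
    rng.normal(loc=0.0, scale=2.0, size=(3, 5)),
])
binary_pred = binary_clf.predict(new_pages)     # 1 = football, 0 = other
one_class_pred = one_class.predict(new_pages)   # +1 = inlier, -1 = outlier
```

Note that the two models use different label conventions, so their outputs need to be mapped onto a common scale before they can be compared.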

dendog
  • I'm not sure if I understand your first paragraph. What exactly is your data? Could you describe it in greater detail? How exactly did you use logistic regression for it? – Tim May 26 '21 at 09:37
  • What would be wrong with your approach? This does not sound different from spam classification or credit card fraud detection. If it isn't spam, then it doesn't really matter *what* it is, just that it is not spam. – Frans Rodenburg May 26 '21 at 09:42
  • @Tim added more information. – dendog May 26 '21 at 09:50
  • @FransRodenburg I am trying to learn the semantic patterns that drive positive interaction. Let's say that is "sport", which forms the positive class, but the negative class is just "everything else", which I felt could be confusing to a binary classifier. – dendog May 26 '21 at 09:52
  • You don't have one class, you have two: "sport" and "not sport". It is up to your classifier to find the features in the data that discern these two. The fact that one is perhaps a more homogeneous group than the other shouldn't matter, as long as there are differences in the semantic patterns between sport and not sport. – Frans Rodenburg May 26 '21 at 09:58
  • Simple example: Let's say your classifier picked up that any text in which the word "sport" or "tennis" or "winning team" appears is more likely to be about sport than a text that doesn't. The fact that there is a large variety of other words that may appear in non-sport texts is irrelevant. – Frans Rodenburg May 26 '21 at 10:02
  • @FransRodenburg appreciate your comments - but they do not have any explanation behind them. The link is good - I will explore that fully to see if it answers this question. – dendog May 26 '21 at 10:34
  • @FransRodenburg would also appreciate if you could elaborate on when one would have a truly "one" class problem? – dendog May 26 '21 at 10:40

1 Answer


From your description it doesn't sound like a one-class classification problem, but rather that you are lacking data. The usual one-class classification scenario is anomaly or novelty detection, where you have "normal" data, learn its distribution, and classify things that do not match the distribution as atypical. This is very different from your case.
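As a rough illustration of that novelty-detection setup, the sketch below fits a single Gaussian to "normal" data and flags points whose squared Mahalanobis distance exceeds a threshold. The Gaussian model, the synthetic data, and the 99% quantile cutoff are all assumptions made for the example; real detectors are often more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data; in practice this would be the observed class.
normal_data = rng.normal(loc=0.0, scale=1.0, size=(500, 3))

# Learn the distribution: here simply a mean and covariance.
mean = normal_data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal_data, rowvar=False))

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each row from the fitted Gaussian."""
    diff = points - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Assumed cutoff: the 99% quantile of distances seen in the training data.
threshold = np.quantile(mahalanobis_sq(normal_data), 0.99)

def is_atypical(points):
    """True for rows that do not match the learned distribution."""
    return mahalanobis_sq(points) > threshold

typical = np.zeros((1, 3))        # sits near the centre of the data
far_away = np.full((1, 3), 10.0)  # far outside anything seen in training
```

Here `is_atypical(far_away)` flags the distant point, while `is_atypical(typical)` does not. Nothing in the fitting step ever sees an example of what "atypical" looks like.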

In your case, you want to "classify webpages that will be relevant for certain advertisers based on past performance". If I understand correctly, you have well-performing pages and you want to be able to pick such pages from the data. Let's say that the performance of the webpages is completely random, so the characteristics of the pages are irrelevant to the performance. In such a case, your one-class classifier will learn nothing useful. If you had data on both cases, you could evaluate the classifier, see that it performs poorly, and conclude that it is useless. If you don't have data on the second class, you cannot check this, so you risk using a classifier that gives bogus results and possibly losing money.
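The thought experiment above can be simulated. In this sketch the labels are drawn independently of the features, so there is nothing to learn; because both classes are present in the held-out set, the evaluation exposes that. The data sizes and the choice of ROC AUC as the metric are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Page features and a "performance" label that is pure chance,
# i.e. completely unrelated to the features.
X = rng.normal(size=(2000, 10))
y = rng.integers(0, 2, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# With both classes in the test set we can measure discrimination.
# For random labels the AUC should land close to 0.5 (chance level).
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"held-out AUC: {auc:.3f}")
```

With only positive examples, this metric could not even be computed, because `roc_auc_score` needs both classes in `y_test`.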

You should gather more data. You could do this by routing some of the traffic randomly to different sites, so that you would capture both well-performing and poorly performing sites. With non-random collection, you would risk cherry-picking and ending up with biased data.

Tim
  • thanks for the reply, not sure I really follow your point, everything is "bogus" or "useless" it seems... I do have data for sites which both have had a positive interaction and those which have not had an interaction. – dendog May 26 '21 at 11:12
  • @dendog then you don't need one-class classification, do you? – Tim May 26 '21 at 11:14
  • please see my second edit, I hope this elaborates on why I thought this could be another way of viewing the same problem. – dendog May 26 '21 at 11:22
  • @dendog my answer seems to comment on exactly the question in your second edit. You technically can treat it as a one-class classification problem, but in many cases this would be a bad idea. By doing so, you are basically building a [confirmation bias](https://en.wikipedia.org/wiki/Confirmation_bias) classifier that is not able to look at any evidence that contradicts its predictions. – Tim May 26 '21 at 11:46