My dataset contains a single class, and its examples are noisy. Up to now I have been converting this into a binary classification problem and using logistic regression. However, this does not feel correct: the negative class is not truly a single class, it is just everything that is not in the "positive" class.
I stumbled upon and tried OneClassSVM with little success (in terms of classification performance).
What are some possible techniques for learning the distribution of just one class? Am I formulating the problem in the correct manner?
My dataset consists of document vectors created by a doc2vec model.
EDIT: Adding more information based on comments. The task is to classify webpages that will be relevant for certain advertisers based on past performance. We have a positive class, where we know there was a positive interaction. I have been training an LR model based on this positive interaction and a sample of pages where no interaction occurred.
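To make the current setup concrete, here is a minimal sketch of roughly what I am doing. The random vectors are only stand-ins for my doc2vec output, and the dimensionality, class sizes, and variable names are all assumptions made for illustration, not my real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-ins for doc2vec vectors (50 dimensions is an assumption).
positives = rng.normal(loc=0.5, scale=1.0, size=(200, 50))   # pages with a positive interaction
unlabeled = rng.normal(loc=0.0, scale=1.0, size=(2000, 50))  # pages with no recorded interaction

# The "negative" class is just a random sample of the no-interaction pages.
sample_idx = rng.choice(len(unlabeled), size=400, replace=False)
negatives = unlabeled[sample_idx]

X = np.vstack([positives, negatives])
y = np.concatenate([np.ones(len(positives)), np.zeros(len(negatives))])

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Probability of the "positive" class for every no-interaction page.
scores = clf.predict_proba(unlabeled)[:, 1]
```

The point of the sketch is that `negatives` is really "unlabeled" rather than a coherent class, which is what makes me doubt the binary framing.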
EDIT 2:
Again based on the comments I would like to provide a simpler contrived example to see if my thinking about this is wrong.
Let's say we want to classify webpages as being about football. We could:
- train a classifier of football VS non-football
- train a model to learn the distribution of football and treat all other articles as outliers
Is the second option an incorrect way of framing this problem?
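For concreteness, the second option could be sketched as follows with scikit-learn's `OneClassSVM`, fitted on football vectors only. The synthetic data, the 20-dimensional vectors, and the `nu` and `gamma` settings are assumptions for illustration; on this toy data the classes are trivially separable, which is not the case for my real vectors.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical document vectors: a tight "football" cluster and everything else.
football = rng.normal(loc=1.0, scale=0.5, size=(300, 20))
other = rng.normal(loc=-1.0, scale=1.0, size=(300, 20))

# Fit on the football class only; nu roughly bounds the fraction of
# training points that may be treated as outliers.
occ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
occ.fit(football)

# predict returns +1 for inliers (football-like) and -1 for outliers.
football_pred = occ.predict(football)
other_pred = occ.predict(other)

inlier_rate_football = float(np.mean(football_pred == 1))
inlier_rate_other = float(np.mean(other_pred == 1))
print(inlier_rate_football, inlier_rate_other)
```

The first option would instead fit a binary classifier on `football` versus `other`, which uses the non-football articles during training rather than only at prediction time.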