For the sake of simplicity, let's say I'm working on the classic example of spam/not-spam emails.
I have a set of 20,000 emails. Of these, I know that 2,000 are spam, but I have no labeled examples of non-spam emails. I'd like to predict whether each of the remaining 18,000 unlabeled emails is spam or not. Ideally, the outcome I'm looking for is a probability (or a p-value) that the email is spam.
What algorithm(s) can I use to make a sensible prediction in this situation?
At the moment, I'm thinking of a distance-based method that would tell me how similar an unlabeled email is to the known spam emails. What options do I have?
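To make the distance-based idea concrete, here is a minimal sketch of what I have in mind. It assumes a plain bag-of-words representation and cosine similarity to the nearest known spam email. The example emails and the function names (`tokenize`, `cosine_similarity`, `spam_similarity`) are made up for illustration, and the result is only a similarity score in [0, 1], not the calibrated probability I'm ultimately after.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def spam_similarity(email, spam_vectors):
    """Highest cosine similarity between `email` and any known spam email."""
    vec = Counter(tokenize(email))
    return max(cosine_similarity(vec, s) for s in spam_vectors)


# Hypothetical toy data standing in for the 2,000 known spam emails.
known_spam = [
    "Win a free prize now, click here",
    "Cheap pills, buy now and save money",
]
spam_vectors = [Counter(tokenize(text)) for text in known_spam]

# Hypothetical toy data standing in for the 18,000 unlabeled emails.
unlabeled = [
    "Click here to win a free prize",
    "Meeting moved to Tuesday at noon",
]

# Higher score means closer to some known spam email.
scores = {text: spam_similarity(text, spam_vectors) for text in unlabeled}
for text, score in scores.items():
    print(f"{score:.2f}  {text}")
```

With this kind of score I could rank the unlabeled emails, but I would still need some way to choose a cutoff or turn the score into a probability without any negative examples.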
More generally, can I use a supervised learning method, or do I necessarily need to have negative cases in my training set to do that? Am I limited to unsupervised learning approaches? What about semi-supervised methods?