
I have a set of user data and I want to build a metric that estimates the probability that a user is a sybil (a "fake" account).

However, I only have a very limited set of users who are known to be sybils with 100% certainty.

How do I use machine learning here?

Also, for now I have built a heuristic metric based on that data, and I need to evaluate it somehow.

To sum up: only a small fraction of my data is labeled, and every labeled example belongs to one class (the negative class, i.e. confirmed sybils). I need to build a metric to score users and, on top of that, I need to evaluate the "goodness" of that metric.

How do I approach this problem?

P.S. It would be good if this process could scale to big datasets.

kjetil b halvorsen
esengie
  • You need some examples of each class. Maybe start with logistic regression, and try to extend it to the case where not all individuals have a known class label. Some ideas can be found here: https://stats.stackexchange.com/questions/174856/semi-supervised-classification-with-unseen-classes – kjetil b halvorsen Feb 18 '18 at 14:49
  • You can also find a lot about sybil detection by googling, for example: https://www.usenix.org/conference/usenixsecurity13/technical-sessions/presentation/wang – kjetil b halvorsen Feb 18 '18 at 14:52
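
One possible way to extend logistic regression to data where only some sybils are labeled is positive-unlabeled learning in the style of Elkan and Noto (2008). This is a swapped-in technique, not something the comments above prescribe. The sketch below is only an illustration under strong assumptions: the confirmed sybils are treated as a random sample of all sybils, the two features and the synthetic data are made up, and every function name (`fit_logistic`, `elkan_noto`, and so on) is hypothetical. It trains a "labeled vs. unlabeled" classifier, estimates the labeling rate `c` on held-out confirmed sybils, and rescales the classifier output by `c`.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, s, lr=0.05, epochs=30, seed=0):
    """Plain SGD logistic regression; returns (weights, bias)."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            err = sigmoid(z) - s[i]
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b

def predict(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def elkan_noto(X_labeled, X_unlabeled, holdout_frac=0.3, seed=0):
    """Return (score_fn, c), where score_fn(x) estimates P(sybil | x)."""
    rng = random.Random(seed)
    labeled = list(X_labeled)
    rng.shuffle(labeled)
    n_hold = max(1, int(len(labeled) * holdout_frac))
    held_out, train_pos = labeled[:n_hold], labeled[n_hold:]
    # Step 1: classifier for "is labeled" (s=1) versus "is unlabeled" (s=0).
    X = train_pos + list(X_unlabeled)
    s = [1] * len(train_pos) + [0] * len(X_unlabeled)
    model = fit_logistic(X, s, seed=seed)
    # Step 2: estimate c = P(labeled | sybil) on held-out confirmed sybils.
    c = sum(predict(model, x) for x in held_out) / len(held_out)
    # Step 3: rescale by c and clip so the result stays a valid probability.
    return (lambda x: min(1.0, predict(model, x) / c)), c

# Synthetic data (assumption): two features, sybils shifted away from genuine users.
rng = random.Random(42)

def sample(center, n):
    return [[rng.gauss(center, 1.0), rng.gauss(center, 1.0)] for _ in range(n)]

genuine = sample(0.0, 800)        # never labeled
hidden_sybils = sample(3.0, 150)  # sybils hiding among the unlabeled users
known_sybils = sample(3.0, 50)    # the small confirmed set

score, c = elkan_noto(known_sybils, genuine + hidden_sybils)
mean_sybil = sum(score(x) for x in hidden_sybils) / len(hidden_sybils)
mean_genuine = sum(score(x) for x in genuine) / len(genuine)
print("estimated c:", round(c, 3))
print("mean score, hidden sybils:", round(mean_sybil, 3))
print("mean score, genuine users:", round(mean_genuine, 3))
```

On real data there is no ground truth for the unlabeled users, so the last three lines are only possible here because the data is synthetic. A rough check that does carry over is to hold out some confirmed sybils and see how highly a score (this one or the existing heuristic) ranks them among all users. Because the SGD loop touches one row at a time, the same idea could in principle be run over larger datasets in batches.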

0 Answers