3

I have a dataset of fraudulent orders from a business. Each order has a number of features such as order_amount, address, state, city, phone_number, and name. Obviously a criminal would not use their real name when placing a fraudulent order, so I was wondering whether there is any machine learning strategy for identifying fake names. I assume there must be some underlying structure to how fake names are chosen, so understanding this structure could allow me to identify them (unless the fake names are selected completely at random). Any thoughts on how to do this?

user1893354
  • Interestingly, Facebook is trying to do this, but apparently with little success. They even deleted Salman Rushdie's account and then had to issue a formal apology, according to this article: http://www.theverge.com/2012/9/17/3322436/facebook-fake-name-pseudonym-middle-name – Flounderer Oct 24 '13 at 22:26
  • I agree. Do you have any data supporting the claim that there is an underlying structure? What if someone picks a real-sounding name that just isn't their own? – mlwida Oct 25 '13 at 08:36

2 Answers

2

I know it must be far too late for you (only 2.5 years late, I'm quite fast to answer!) but I've been looking into this problem as well, and found a paper by David Mandell Freeman (LinkedIn) that might help other people with the same question.

I haven't tested it yet since my dataset isn't labeled 'fake' or 'valid' (greatest problem ever for the learning phase), but I will soon.

Until then, here is the aforementioned paper: http://theory.stanford.edu/~dfreeman/papers/namespam.pdf

The idea is to look at the frequency of substrings of the names rather than the frequency of entire names.
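To make the substring idea concrete, here is a minimal sketch of how it could look, assuming you do have some labeled names. It is not the paper's exact method: the trigram size, the Laplace smoothing, the function names and the tiny made-up name lists are all my own assumptions for illustration.

```python
import math
from collections import Counter


def char_ngrams(name, n=3):
    """Return the character n-grams of a name, with start/end markers."""
    padded = "^" + name.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(names, n=3):
    """Count how often each n-gram occurs across a list of names."""
    counts = Counter()
    for name in names:
        counts.update(char_ngrams(name, n))
    return counts


def log_likelihood(name, counts, vocab_size, n=3, alpha=1.0):
    """Sum of Laplace-smoothed log-probabilities of the name's n-grams."""
    total = sum(counts.values())
    return sum(
        math.log((counts[g] + alpha) / (total + alpha * vocab_size))
        for g in char_ngrams(name, n)
    )


def fake_score(name, real_counts, fake_counts, n=3):
    """Positive values mean the n-grams look more like the 'fake' set."""
    # +1 leaves room for n-grams never seen in either training set.
    vocab_size = len(set(real_counts) | set(fake_counts)) + 1
    return (log_likelihood(name, fake_counts, vocab_size, n)
            - log_likelihood(name, real_counts, vocab_size, n))


# Toy, made-up names purely for illustration; real use needs labeled data.
real_names = ["john smith", "maria garcia", "james johnson", "anna miller"]
fake_names = ["asdf qwer", "xxzz kkjj", "qwerty zxcv", "aaaa bbbb"]

real_counts = train(real_names)
fake_counts = train(fake_names)
```

Calling `fake_score("zxcv asdf", real_counts, fake_counts)` gives a positive value and `fake_score("john miller", real_counts, fake_counts)` a negative one, even though neither full name appears in the training lists. That is the point of scoring substrings instead of whole names.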

ysearka
  • 148
  • 7
0

This is something that I think you can do in two steps. I had the very same problem years ago, but that was before I knew much about ML.

What I think would work really well for name identification is a recurrent neural network, because it's really the sequence of odd letters and numbers that hints at a spammy name as opposed to a more normal one. For example, a name like 9b02ngd9dbee is different from 2beegee1992. An RNN would be perfectly suited to that kind of classification problem.
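Here is a rough sketch of what "feeding a name to an RNN" means, written in plain Python so it has no dependencies. It only shows the forward pass of a simple (Elman-style) recurrent cell with random, untrained weights, so the number it returns is meaningless until the weights are learned from labeled names. The alphabet, hidden size and function names are assumptions of mine; in practice you would use a deep learning library and probably an LSTM or GRU.

```python
import math
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR_TO_ID = {ch: i for i, ch in enumerate(ALPHABET)}
UNKNOWN_ID = len(ALPHABET)      # shared id for any other character
VOCAB_SIZE = len(ALPHABET) + 1
HIDDEN_SIZE = 8                 # arbitrary choice for this sketch


def encode(name):
    """Map each character of the name to an integer id."""
    return [CHAR_TO_ID.get(ch, UNKNOWN_ID) for ch in name.lower()]


def init_params(seed=0):
    """Random, untrained weights (training is not shown here)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "W_xh": matrix(VOCAB_SIZE, HIDDEN_SIZE),   # one row per character
        "W_hh": matrix(HIDDEN_SIZE, HIDDEN_SIZE),  # hidden-to-hidden weights
        "b_h": [0.0] * HIDDEN_SIZE,
        "w_out": [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN_SIZE)],
        "b_out": 0.0,
    }


def rnn_forward(name, params):
    """Read the name one character at a time and return a value in (0, 1)."""
    h = [0.0] * HIDDEN_SIZE
    for char_id in encode(name):
        x_row = params["W_xh"][char_id]  # a one-hot input just selects a row
        h = [
            math.tanh(
                x_row[j]
                + sum(h[k] * params["W_hh"][k][j] for k in range(HIDDEN_SIZE))
                + params["b_h"][j]
            )
            for j in range(HIDDEN_SIZE)
        ]
    # The final hidden state summarizes the whole character sequence.
    logit = sum(h[j] * params["w_out"][j] for j in range(HIDDEN_SIZE))
    logit += params["b_out"]
    return 1.0 / (1.0 + math.exp(-logit))
```

The useful property is that the hidden state `h` is updated character by character, so names of any length end up as one fixed-size vector, and the order of the characters matters.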

Then you combine that with the other features you mentioned (the order amount, maybe the frequency of orders, etc.) as raw inputs to a few simple fully connected layers, and the output of those layers is the end result.
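As a sketch of that second step, assuming the name has already been turned into a fixed-length vector by some name model: concatenate it with the numeric order features and pass everything through one small hidden layer and a sigmoid output. The weights here are random and untrained, and `name_vector`, `order_frequency` and the log scaling are my own assumptions, not something from the question.

```python
import math
import random


def combine_features(name_vector, order_amount, order_frequency):
    """Concatenate the name representation with scaled numeric features."""
    # log1p compresses the wide range that amounts and counts can take.
    return list(name_vector) + [math.log1p(order_amount),
                                math.log1p(order_frequency)]


def init_params(input_size, hidden_size=4, seed=0):
    """Random, untrained weights for a two-layer fully connected network."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-0.1, 0.1) for _ in range(input_size)]
               for _ in range(hidden_size)],
        "b1": [0.0] * hidden_size,
        "W2": [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]],
        "b2": [0.0],
    }


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum per output unit plus bias."""
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]


def fraud_probability(features, params):
    """Hidden layer with ReLU, then a single sigmoid output in (0, 1)."""
    hidden = [max(0.0, v) for v in dense(features, params["W1"], params["b1"])]
    logit = dense(hidden, params["W2"], params["b2"])[0]
    return 1.0 / (1.0 + math.exp(-logit))
```

For example, `combine_features([0.1, -0.2, 0.3], 120.0, 2.0)` yields a 5-element input, so the matching parameters would come from `init_params(input_size=5)`. In a real setup the name model and these layers would be trained together on labeled orders.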

1mike12
  • 101
  • 2