Email and IP String preprocessing for classification task

Question

I am relatively new to data science, so please pardon the novice question. What methods are available for converting email addresses and IP addresses into vectors for online learning algorithms? The goal is to classify transactions as fraud or non-fraud. For further context: the other relevant fields are categorical and have already been vectorised.

Related: https://stats.stackexchange.com/questions/68441/application-of-lsa-lsi-is-it-common-to-include-the-use-of-an-edit-distance — Nemo, Aug 20 '17 at 08:41

score 8 · Accepted Answer · answered Aug 06 '15 at 13:04

This is a really interesting question! String vectorization is an area of active research right now, and there are a ton of interesting approaches out there.

First of all, IP addresses are hierarchical, and an IPv4 address can be split on the dots into 4 categorical variables, each with 256 levels (watch out for IPv4 vs IPv6, though!). In a linear model, you can use the top-level IP block directly, perhaps interacted with the 2nd, 3rd, and 4th blocks depending on how much data you have. In a tree-based model (e.g. a random forest or GBM), try converting the IP address to an integer and modeling it directly. A random forest or GBM should be able to identify interesting blocks of the IP range for your model. Most databases have functions to do this conversion, and I know there's a really good R package too.
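As a minimal sketch of the two IPv4 encodings described above (per-octet categoricals for a linear model, a single integer for a tree-based model), here is one way to do it with Python's standard `ipaddress` module. The function name `ipv4_features` and the dictionary keys are illustrative assumptions, and IPv6 is simply rejected rather than handled.

```python
import ipaddress


def ipv4_features(ip_string):
    """Return the integer form and the four octets of an IPv4 address."""
    addr = ipaddress.ip_address(ip_string.strip())
    if addr.version != 4:
        # Assumption: IPv6 needs its own encoding, so it is not handled here.
        raise ValueError(f"expected an IPv4 address, got: {ip_string!r}")
    octets = list(addr.packed)  # four bytes, most significant first
    return {
        "ip_int": int(addr),   # single numeric feature for a tree-based model
        "octet_1": octets[0],  # top-level block, usable as a categorical
        "octet_2": octets[1],
        "octet_3": octets[2],
        "octet_4": octets[3],
    }


features = ipv4_features("192.168.1.10")
print(features["ip_int"])  # 3232235786
```

The octet fields would still need one-hot or hashed encoding before going into a linear model; the integer can be passed to a tree-based model as is.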

For email addresses, start by splitting on the @ symbol into address and domain. The domain is probably useful on its own as a categorical variable, but you might want to add a further variable for .com vs .edu vs .gov, etc. (The urltools package in R can help you extract top-level domains; someone really needs to write an emailtools package!) For the address part (the bit before the @ symbol), you could use a character n-gram vectorizer to create a very wide, very sparse matrix, which you can then use directly in your model or process further using something like SVD to reduce its dimensionality. You could also try a word vectorizer, splitting on symbols like ., -, and _.
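The email steps above could be sketched roughly as follows, using only the Python standard library. All function names here are hypothetical, and hashing the n-grams into a fixed number of buckets is an assumption added for the online-learning setting (it keeps the feature space fixed as new addresses arrive); the answer itself only asks for a character n-gram vectorizer.

```python
import re
import zlib


def split_email(email):
    """Split an email into (address part, domain, top-level domain)."""
    if "@" not in email:
        raise ValueError(f"not an email address: {email!r}")
    local, _, domain = email.strip().lower().rpartition("@")
    tld = domain.rsplit(".", 1)[-1] if "." in domain else ""
    return local, domain, tld


def char_ngrams(text, n_min=2, n_max=3):
    """All character n-grams of length n_min..n_max, shortest first."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def hashed_ngram_counts(text, n_buckets=1024):
    """Sparse {bucket index: count} vector of hashed character n-grams."""
    counts = {}
    for gram in char_ngrams(text):
        # crc32 is stable across runs, unlike Python's built-in hash().
        idx = zlib.crc32(gram.encode("utf-8")) % n_buckets
        counts[idx] = counts.get(idx, 0) + 1
    return counts


def word_tokens(local):
    """Split the address part on '.', '-' and '_' for a word vectorizer."""
    return [tok for tok in re.split(r"[._-]+", local) if tok]


local, domain, tld = split_email("John.Doe_99@mail.example.com")
print(local, domain, tld)          # john.doe_99 mail.example.com com
print(word_tokens(local))          # ['john', 'doe', '99']
sparse_vector = hashed_ngram_counts(local)
```

If scikit-learn is available, its `HashingVectorizer` with `analyzer="char"` covers similar ground to the hand-rolled hashing here.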

There's a TON of information in those 2 fields. Good luck extracting it!

Thank you for your explanatory answer. Prior to posting this question I did some looking around and found locality-preserving hash functions. I pursued the approach after reading a few papers, and the major limitation is the huge signatures they generate. Is the LSH approach overkill or a dead end for my scenario? — Segmented, Aug 06 '15 at 13:29
@Segmented I've never actually tried LSH. For IP addresses, converting them to integers and putting them in a GBM has worked really well for me in the past (the integer representation preserves the locality of IP addresses really well). For email addresses, splitting out the domain and then using character n-grams has also worked, but is a bit difficult to implement. Maybe try something simple first and see what your results are like? — Zach, Aug 06 '15 at 13:39
Thank you for patiently explaining. I am accepting your answer as it defines a clear methodology for approaching such a problem. — Segmented, Aug 06 '15 at 14:03
