How can I make use of zip codes when I am building a model for fraud detection

Question

I have gone through a few articles, but I am still not sure what to do with zip codes. From a business standpoint, it might be useful to take into account fraudulent transactions that come from unknown locations. But I don't know how to use zip codes in my data, as dummy encoding might not be a good solution.

how to represent geography or zip code in machine learning model or recommender system?

Have a look at [principled-way-of-collapsing-categorical-variables-with-many-levels](https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels) — kjetil b halvorsen, Mar 17 '19 at 10:20

score 0 · Answer 1 · answered Jun 29 '19 at 15:01

Why do you think that "... as dummy encoding might not be good solution"?

If it is because of memory issues, use software that works with sparse matrices, such as glmnet. Since there are very many distinct zip codes, have a look at [Principled way of collapsing categorical variables with many levels?](https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels) and the suggestions there. I would try the fused lasso.
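
To make the memory point concrete, here is a minimal sketch in plain Python (not glmnet, which is an R package). It shows why dummy-encoded zip codes are cheap to store sparsely, and one very simple way to collapse levels. The count threshold `min_count`, the `"OTHER"` label, and the sample zip codes are assumptions made for illustration; count-based lumping is only one of the options discussed in the linked question and is not the fused lasso.

```python
from collections import Counter


def collapse_rare_levels(zips, min_count=2, other="OTHER"):
    """Replace zip codes seen fewer than min_count times with one shared level."""
    counts = Counter(zips)
    return [z if counts[z] >= min_count else other for z in zips]


def sparse_one_hot(levels):
    """Dummy-encode as (row, column) pairs instead of a dense 0/1 matrix.

    Every row has exactly one non-zero entry, so storing its column index
    is enough: memory grows with the number of rows, not rows x levels.
    """
    columns = {level: j for j, level in enumerate(sorted(set(levels)))}
    entries = [(i, columns[level]) for i, level in enumerate(levels)]
    return entries, columns


# Made-up zip codes, one per transaction.
zips = ["10001", "10001", "94105", "60614", "94105", "73301"]

collapsed = collapse_rare_levels(zips, min_count=2)
entries, columns = sparse_one_hot(collapsed)

print(collapsed)
print(columns)
print(entries)
```

The `(row, column)` pairs are the same information a sparse matrix library stores in coordinate format, so the number of distinct zip codes mostly affects the number of coefficients to estimate rather than the memory needed for the design matrix.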

It is also possible that for your application, some domain knowledge would be helpful for feature construction.
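
As one hedged example of such a feature, the "unknown location" idea from the question could be turned into a flag that marks transactions from a zip code the customer has not used before. This sketch assumes the data contains a customer identifier and is sorted by time; neither is stated in the question, and the field names and sample rows are hypothetical.

```python
from collections import defaultdict


def new_zip_flags(transactions):
    """Flag transactions whose zip code the customer has not used before.

    `transactions` is an iterable of (customer_id, zip_code) pairs in
    chronological order. A customer's first transaction is always flagged,
    because there is no history to compare it with.
    """
    seen = defaultdict(set)  # customer_id -> zip codes seen so far
    flags = []
    for customer_id, zip_code in transactions:
        flags.append(int(zip_code not in seen[customer_id]))
        seen[customer_id].add(zip_code)
    return flags


# Made-up transactions in time order.
transactions = [
    ("a", "10001"),
    ("a", "10001"),
    ("b", "94105"),
    ("a", "60614"),
    ("b", "94105"),
]

flags = new_zip_flags(transactions)
print(flags)
```

A single 0/1 column like this can sit alongside, or replace, the dummy-encoded zip code, depending on how well it works in validation.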

score 0 · Answer 2 · answered Feb 06 '22 at 14:22

- If you have many independent observations per zip code and ample compute resources, you might try using the zip code as-is.
- Instead of including a categorical variable for zip code, you might acquire demographic data for each zip code, join it to your data set, and include those numeric variables instead. Perhaps income level, voting behavior, etc. could explain much of the variance in the response, saving you hundreds or thousands of dummy variables.
- Statistics has an elegant and powerful way to include categorical variables with many levels: random effects. In this case, the coefficients for the zip codes are assumed to come from a common normal distribution and are fit by maximum likelihood instead of least squares. This keeps the zip code coefficients closer to each other (which controls overfitting) and keeps the standard errors of the estimates under control. The most commonly used function for this is lmer from the R package lme4; for a binary response such as fraud, the same package provides glmer.
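
The last two options can be illustrated with a small plain-Python sketch. The first function joins hypothetical demographic columns onto each row by zip code. The second computes a partially pooled fraud rate per zip code, which mimics the shrinkage a random intercept produces but is not a fitted mixed model: the pooling strength `k` is an assumed constant here, whereas lme4 would estimate the corresponding variance ratio by maximum likelihood. All field names and numbers are made up.

```python
from collections import defaultdict


def join_demographics(rows, demographics, missing):
    """Left-join per-zip numeric columns onto each row; use `missing` if absent."""
    return [{**row, **demographics.get(row["zip"], missing)} for row in rows]


def pooled_fraud_rates(rows, k=2.0):
    """Shrink each zip's observed fraud rate toward the overall rate.

    rate_z = (fraud_z + k * overall) / (n_z + k)

    Zip codes with few rows are pulled strongly toward the overall rate;
    zip codes with many rows stay close to their own observed rate.
    """
    totals = defaultdict(lambda: [0, 0])  # zip -> [fraud count, row count]
    for row in rows:
        totals[row["zip"]][0] += row["fraud"]
        totals[row["zip"]][1] += 1
    overall = sum(f for f, _ in totals.values()) / sum(n for _, n in totals.values())
    return {z: (f + k * overall) / (n + k) for z, (f, n) in totals.items()}


# Made-up data: 8 rows for one zip code, 1 row each for two others.
rows = (
    [{"zip": "10001", "fraud": 1}] * 2
    + [{"zip": "10001", "fraud": 0}] * 6
    + [{"zip": "94105", "fraud": 1}, {"zip": "60614", "fraud": 0}]
)
demographics = {  # hypothetical values
    "10001": {"median_income": 90000.0},
    "94105": {"median_income": 120000.0},
}

joined = join_demographics(rows, demographics, missing={"median_income": None})
rates = pooled_fraud_rates(rows, k=2.0)
print(joined[0], joined[-1])
print(rates)
```

In this toy data the raw fraud rates are 0.25, 1.0 and 0.0; the two zip codes with a single row move most of the way toward the overall rate of 0.3, while the zip code with eight rows barely moves. If a pooled rate like this is used as a model feature, it should be computed on training data only, since it is derived from the response.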
