
I am doing research and a project on the detection of fraudulent transactions in the financial system. For this research we are working with unsupervised learning; more precisely, we are using the Isolation Forest model. I have a question about the answer by the user @grochmal (I include the answer at the end of my post). I find the summary you made in the bold-highlighted paragraph very interesting, but could you explain to me:

  1. Why do you eliminate the columns with very low and very high variance?

  2. Why, within the groups of correlated columns, do you select the column with the variance closest to 0.5?

  3. Why shouldn't we perform dimensionality reduction (FSIV, t-SNE, PCA, etc.) on our dataset?

  4. And finally, why is it important to keep independent columns or features in the dataset prior to modeling with Isolation Forest? In other words, why is it important to keep columns or features that have no correlation with each other?

Thank you very much for your time. Below I leave the link to the answer and its text.

https://stats.stackexchange.com/a/407538/317756

"I'll argue that fraud detection will have very few positive cases and a big number of negative cases in any reasonably real-like dataset. In other words, you will be solving an anomaly detection problem, where the fraud is the anomaly and normal (non-fraud) cases are the remaining samples.

Answering the question itself: In anomaly detection there are two important parts to any detection system:

  1. The capability to add new features easily.
  2. The capability to explain which features caused a point to be considered an anomaly (fraud).

This leads to two things that we can say about fraud detection: (1) you cannot perform generic feature selection (e.g. feature hashing) or dimensionality reduction, since that will destroy the explainability of your system. And (2) you need a model that will cope with extra (likely correlated) features.

Personal Experience

From my experience trying to build an anomaly detection system for financial data (not strictly fraud detection but pretty close) I will argue that you want to start with as many features as you can. In my case we just took the entire record that the payment system we looked at kept in its log (an SQL table) for the transaction, and added to it what we could take from the application logs as well.

We ended up with something including: several geographic locations for the transaction origin and issuer (city, country); all data on the card chip sent both ways; ciphers used and ciphers negotiated; timestamps in both timezones (origin and issuer); time to process the transaction; and a couple of other, more business-specific parts of the log (e.g. Visa or Mastercard).

A lot of the above was heavily correlated (e.g. the date with the time of the transaction and with the time of day), so we went over each column, estimated the variance (number of distinct values in the column divided by the total number of data points) and did the following (sketched in the snippet after this list):

  1. If the variance was too low (only 1, 2 or 3 distinct values) we threw away the column.
  2. If the variance was too high (>0.8 using the estimate above) we threw away the column.
  3. We correlated the columns and, within groups of highly correlated columns, we chose the one column with variance (estimated as above) closest to 0.5.
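
Since questions 1 and 2 refer to this step, here is a minimal sketch of how I understand the estimate-and-filter procedure (my own illustration, not code from the original answer); it assumes a pandas DataFrame of numeric columns and uses an arbitrary 0.9 correlation threshold for forming the groups:

```python
import pandas as pd


def variance_estimate(col: pd.Series) -> float:
    """Number of distinct values in the column divided by the total number of rows."""
    return col.nunique() / len(col)


def select_columns(df: pd.DataFrame, corr_threshold: float = 0.9) -> list[str]:
    est = {c: variance_estimate(df[c]) for c in df.columns}

    # Rules 1 and 2: drop columns with only 1-3 distinct values and columns
    # whose estimate is too high (> 0.8).
    keep = [c for c in df.columns if df[c].nunique() > 3 and est[c] <= 0.8]

    # Rule 3: within each group of highly correlated columns, keep only the
    # column whose estimate is closest to 0.5.
    corr = df[keep].corr().abs()
    selected, dropped = [], set()
    for c in keep:
        if c in dropped:
            continue
        group = [o for o in keep if o not in dropped and corr.loc[c, o] >= corr_threshold]
        best = min(group, key=lambda o: abs(est[o] - 0.5))
        selected.append(best)
        dropped.update(group)
    return selected
```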

Based on the variance estimate we also decided how to engineer the features: For timestamps we mostly took the hour of the day (i.e. without the date part), since that was the column with the most reasonable variance estimate. For text features with a very limited scope (e.g. ciphers) we used one-hot encoding. For long strings with a lot of variance we found that the length of the string is quite a good indicator. And so on, column by column.
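
For instance, the per-column feature engineering could look roughly like this (again my own sketch; `tx_timestamp`, `cipher` and `user_agent` are made-up column names):

```python
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)

    # Timestamps: keep only the hour of the day, dropping the date part.
    out["tx_hour"] = pd.to_datetime(df["tx_timestamp"]).dt.hour

    # Text with a very limited scope (e.g. cipher names): one-hot encode.
    out = out.join(pd.get_dummies(df["cipher"], prefix="cipher"))

    # Long, high-variance strings: the length alone is already a decent indicator.
    out["user_agent_len"] = df["user_agent"].str.len()

    return out
```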

We then used an isolation forest to find the anomalies. Every time we need to add more features because someone (a human reporting a problem) finds another good indicator, we correlate the new indicator column with the columns we already have and decide whether we can improve the detection engine by adding the feature or by tweaking sample weights.
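
A sketch of that modelling step as I understand it (the IsolationForest parameters and the 0.95 threshold are my own arbitrary choices, not values from the answer):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest


def fit_detector(features: pd.DataFrame) -> IsolationForest:
    model = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
    return model.fit(features)


def worth_adding(features: pd.DataFrame, new_col: pd.Series, max_corr: float = 0.95) -> bool:
    # A new indicator is only worth adding if no existing column already
    # carries (almost) the same information.
    return bool(features.corrwith(new_col).abs().max() < max_corr)
```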

In summary (TL;DR)

You are facing an anomaly detection problem. You want to put in as many features as you can, and then you need some way of estimating correlation and variance to evaluate the feature columns in their raw state (i.e. no dimensionality reduction). Then select the columns (features) by looking at those estimates."
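
Finally, to check that I am reading the summary correctly, this is how I currently picture the whole pipeline, reusing the sketch functions from above (`raw` is a hypothetical DataFrame of raw transaction records):

```python
numeric = raw.select_dtypes("number")
features = engineer_features(raw).join(raw[select_columns(numeric)])

model = fit_detector(features)
labels = model.predict(features)  # -1 = anomaly (candidate fraud), 1 = normal
```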
