I've been asked to detect "unusual trends and anomalies" using data similar to ATM transaction data. Each entry has a mixture of numerical and categorical variables, things like transaction ID, timestamp, transaction type, transaction amount, etc. There are about 10 categorical and 10 numerical variables. The goal of the project would be to write a script that gives an alert in real time when unusual trends/anomalies are detected in newly logged data.
The exact definitions of "unusual trend" and "anomaly" haven't been given to me, and there are no labels to tell me which rows of the dataset are "usual".
To detect anomalies (outliers) I would like to use a distance-based measure. I'm not used to calculating distances over categorical data, but I believe I could use something like Gower similarity. Alternatively, I could one-hot encode the categorical variables into binary indicator vectors, e.g. if there are only two transaction types, "withdrawal" = [1 0]. Would it be appropriate to look for outliers using these derived, all-numerical variables?
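To make the Gower idea concrete, here is a minimal sketch of what I have in mind: a hand-rolled Gower-style dissimilarity (not a library implementation; the column layout, toy values, and `ranges` dict are all made up for illustration). Numeric variables contribute a range-normalized absolute difference, categorical variables contribute a 0/1 mismatch, and the result is the average over variables.

```python
# Hypothetical Gower-style distance for one pair of mixed-type rows.
# num_idx / cat_idx: which positions are numeric vs. categorical.
# ranges: observed max-min range of each numeric column (for normalization).
def gower_distance(a, b, num_idx, cat_idx, ranges):
    d = 0.0
    for i in num_idx:
        d += abs(a[i] - b[i]) / ranges[i]   # range-normalized numeric term
    for i in cat_idx:
        d += 0.0 if a[i] == b[i] else 1.0   # simple mismatch for categories
    return d / (len(num_idx) + len(cat_idx))

# Toy rows: [amount, duration_s, txn_type, card_present]
x = [120.0, 30.0, "withdrawal", "yes"]
y = [20.0, 45.0, "deposit", "yes"]
ranges = {0: 200.0, 1: 60.0}  # assumed ranges of the two numeric columns

print(gower_distance(x, y, num_idx=[0, 1], cat_idx=[2, 3], ranges=ranges))
# (0.5 + 0.25 + 1 + 0) / 4 = 0.4375
```

Rows whose average distance to their nearest neighbors is large would then be flagged as outlier candidates.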
I'm less sure how to detect unusual trends. Outlier detection seems inappropriate here, since an "unusual trend" might not contain any data points that are outliers by themselves. If it were a time series, I'd want to use something like a seasonal ARIMA, an autocorrelation function, or something similar. How appropriate is it to transform data like mine (irregular time steps, mixed categorical + numerical variables) into a time series? If that's not a good approach, what kinds of models are appropriate for detecting trends in this kind of data?
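For the conversion I'm imagining, here is a sketch of what I would try with pandas: aggregate the irregularly timed log into fixed-interval series (hourly counts and mean amounts, plus per-type counts). The column names (`timestamp`, `txn_type`, `amount`) and the toy rows are my own invention, not from the real data.

```python
import pandas as pd

# Toy transaction log with irregular timestamps (invented for illustration).
log = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 09:05", "2024-01-01 09:40",
        "2024-01-01 10:10", "2024-01-01 10:55",
    ]),
    "txn_type": ["withdrawal", "deposit", "withdrawal", "withdrawal"],
    "amount": [100.0, 50.0, 20.0, 80.0],
}).set_index("timestamp")

# Hourly volume and mean amount: regular numeric series that ARIMA-style
# models or simple control charts could then be applied to.
hourly = log.resample("1h").agg(
    n_txns=("amount", "size"),
    mean_amount=("amount", "mean"),
)

# Categorical variables become one count series per category level.
counts_by_type = (
    log.groupby([pd.Grouper(freq="1h"), "txn_type"])
       .size()
       .unstack(fill_value=0)
)

print(hourly)
print(counts_by_type)
```

The appeal is that after aggregation, each categorical variable just becomes a handful of count series, so the usual time-series machinery applies; the obvious cost is choosing the bin width and losing within-bin detail.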
Thanks a lot. Any help or insight is hugely appreciated!