I've been asked to detect "unusual trends and anomalies" using data similar to ATM transaction data. Each entry has a mixture of numerical and categorical variables, things like transaction ID, timestamp, transaction type, transaction amount, etc. There are about 10 categorical and 10 numerical variables. The goal of the project would be to write a script that gives an alert in real time when unusual trends/anomalies are detected in newly logged data.
The exact definitions of "unusual trend" and "anomaly" haven't been given to me, and there are no labels to tell me which rows of the dataset are "usual".
To detect anomalies (outliers) I would like to use a distance-based measure. I'm not used to calculating distances over categorical data, but I believe I could use something like Gower similarity. Alternatively, I could one-hot encode the categorical variables into binary indicator vectors, e.g. if there are only two transaction types, "withdrawal" = [1 0]. Would it be appropriate to look for outliers using these derived, all-numerical variables?
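To make the Gower idea concrete, here is a minimal sketch of what I have in mind: a hand-rolled Gower-style dissimilarity (not a library implementation; the column layout, toy values, and `ranges` dict are all made up for illustration). Numeric variables contribute a range-normalized absolute difference, categorical variables contribute a 0/1 mismatch, and the result is the average over variables.

```python
# Hypothetical Gower-style distance for one pair of mixed-type rows.
# num_idx / cat_idx: which positions are numeric vs. categorical.
# ranges: observed max-min range of each numeric column (for normalization).
def gower_distance(a, b, num_idx, cat_idx, ranges):
    d = 0.0
    for i in num_idx:
        d += abs(a[i] - b[i]) / ranges[i]   # range-normalized numeric term
    for i in cat_idx:
        d += 0.0 if a[i] == b[i] else 1.0   # simple mismatch for categories
    return d / (len(num_idx) + len(cat_idx))

# Toy rows: [amount, duration_s, txn_type, card_present]
x = [120.0, 30.0, "withdrawal", "yes"]
y = [20.0, 45.0, "deposit", "yes"]
ranges = {0: 200.0, 1: 60.0}  # assumed ranges of the two numeric columns

print(gower_distance(x, y, num_idx=[0, 1], cat_idx=[2, 3], ranges=ranges))
# (0.5 + 0.25 + 1 + 0) / 4 = 0.4375
```

Rows whose average distance to their nearest neighbors is large would then be flagged as outlier candidates.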
I'm less sure how to detect unusual trends. Outlier detection seems inappropriate here, since an "unusual trend" might not contain any data points that are outliers by themselves. If it were a time series, I'd want to use something like a seasonal ARIMA, an autocorrelation function, or something similar. How appropriate is it to transform data like mine (irregular time steps, mixed categorical + numerical variables) into a time series? If that's not a good approach, what kinds of models are appropriate for detecting trends in this kind of data?
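For the conversion I'm imagining, here is a sketch of what I would try with pandas: aggregate the irregularly timed log into fixed-interval series (hourly counts and mean amounts, plus per-type counts). The column names (`timestamp`, `txn_type`, `amount`) and the toy rows are my own invention, not from the real data.

```python
import pandas as pd

# Toy transaction log with irregular timestamps (invented for illustration).
log = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 09:05", "2024-01-01 09:40",
        "2024-01-01 10:10", "2024-01-01 10:55",
    ]),
    "txn_type": ["withdrawal", "deposit", "withdrawal", "withdrawal"],
    "amount": [100.0, 50.0, 20.0, 80.0],
}).set_index("timestamp")

# Hourly volume and mean amount: regular numeric series that ARIMA-style
# models or simple control charts could then be applied to.
hourly = log.resample("1h").agg(
    n_txns=("amount", "size"),
    mean_amount=("amount", "mean"),
)

# Categorical variables become one count series per category level.
counts_by_type = (
    log.groupby([pd.Grouper(freq="1h"), "txn_type"])
       .size()
       .unstack(fill_value=0)
)

print(hourly)
print(counts_by_type)
```

The appeal is that after aggregation, each categorical variable just becomes a handful of count series, so the usual time-series machinery applies; the obvious cost is choosing the bin width and losing within-bin detail.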
Thanks a lot. Any help or insight is hugely appreciated!