I’m hoping someone might be able to point me in the right direction for this problem. I’ve anonymised it by describing it in terms of supermarket purchase data rather than the real context, so it might seem a bit nonsensical, but the idea should be the same. The data fields are all categorical.
I have a large dataset of product purchases (each row is a purchase), with 10-15 attributes associated with each purchase. I want to identify the purchases behind instances where either the count or the relative frequency of a particular value in a particular field becomes unusually high. For example, let’s say one field is ‘location of purchase’. If Manchester purchases are normally 1.5% of the population but from Monday 9am to Monday 12pm are 70% of the population, that’s unusual. Furthermore, I want to be able to identify that if 60% of those purchases are for tomatoes, that’s also unusual, because tomatoes do not normally make up 60% of purchases in Manchester. I need to do this for period n using only the time periods 1, 2, …, n-1.
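To make the two quantities concrete, here is a minimal sketch of what I mean by the share of a value within a period and the share of a second value within that subset. The toy rows, the `(period, location, product)` layout and the `share` helper are all made up for illustration; the real data has more fields.

```python
# Hypothetical toy data: one (period, location, product) tuple per purchase.
purchases = [
    (1, "London", "bread"), (1, "Manchester", "tomatoes"),
    (1, "London", "milk"), (1, "Leeds", "bread"),
    (2, "Manchester", "tomatoes"), (2, "Manchester", "tomatoes"),
    (2, "Manchester", "milk"), (2, "London", "bread"),
]

def share(rows, predicate):
    """Fraction of rows satisfying predicate (0.0 if there are no rows)."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for r in rows if predicate(r)) / len(rows)

# Restrict to the period of interest, then to one location within it.
period_2 = [r for r in purchases if r[0] == 2]
manchester_2 = [r for r in period_2 if r[1] == "Manchester"]

# Share of Manchester among all purchases in period 2.
manchester_share = share(period_2, lambda r: r[1] == "Manchester")

# Share of tomatoes among Manchester purchases in period 2.
tomato_share_in_manchester = share(manchester_2, lambda r: r[2] == "tomatoes")

print(manchester_share, tomato_share_in_manchester)
```

The same two numbers computed for each of the periods 1, 2, …, n-1 would form the history that period n is compared against.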
The things I have considered are:
- Computing mean and standard deviation of number of purchases for a particular attribute and particular value in historic time periods (averaging across the historical periods) and then measuring number of standard deviations from the mean at the current time. I could do the same thing for the proportion of purchases. The problem is I then need to have a threshold.
- Computing the entropy of the current time period (using either number or proportion of purchases as input) and comparing it to the entropies of previous periods, but again I need a threshold.
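To show what I mean by the two ideas above, here is a rough sketch using only the Python standard library. The function names `z_score` and `entropy` and all the numbers are hypothetical, and the sketch deliberately stops short of choosing a threshold, since that is the part I don't know how to do.

```python
import math
from statistics import mean, stdev

def z_score(history, current):
    """Number of standard deviations between `current` and the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least 2 periods
    if sigma == 0:
        # Flat history: any deviation at all is infinitely many standard deviations.
        return 0.0 if current == mu else math.copysign(math.inf, current - mu)
    return (current - mu) / sigma

def entropy(counts):
    """Shannon entropy in bits of a categorical distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical share of Manchester purchases in periods 1..n-1, then in period n.
history = [0.014, 0.016, 0.015, 0.015]
current = 0.70
z = z_score(history, current)

# Hypothetical purchase counts per location: a typical period vs. a spike.
usual = [25, 25, 25, 25]
spike = [97, 1, 1, 1]
h_usual = entropy(usual)  # uniform over four values: 2.0 bits
h_spike = entropy(spike)  # concentration in one value lowers the entropy

print(z, h_usual, h_spike)
```

Both numbers move in the expected direction when one value dominates (a large z-score, a drop in entropy), but neither says how large a change should count as unusual.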
Aside from the problems I mentioned these also feel a bit basic; I am sure there is some better method but I can’t think of what it might be. Some thoughts that I have had that may or may not help someone else answer this:
- I don’t think this can be solved by anomaly detection approaches, since the ‘anomaly’ relates to many data points and the data for later periods n+1, n+2, ... is not available when period n is assessed
- I’m more of a machine learning than statistics person but I can’t think of any machine learning approaches to this (open to hearing any ideas)