
I’m hoping someone might be able to point me in the right direction with this problem. I’ve anonymised it by describing it in terms of supermarket purchase data rather than the real context, so it might seem a bit nonsensical, but the idea is the same. The data fields are all categorical.

I have a large dataset of product purchases (each row is a purchase), with 10-15 attributes associated with each purchase. I want to identify purchases where either the count or the relative frequency of a particular value in a particular field becomes unusually high. For example, say one field is ‘location of purchase’. If Manchester purchases are normally 1.5% of the population but from Monday 9 am to Monday 12 am are 70% of the population, that’s unusual. Furthermore, I want to be able to flag that if 60% of those Manchester purchases are for tomatoes, that’s unusual too, because purchases of tomatoes in Manchester are not normally 42% of the population. I need to do this for period n using only periods 1, 2, …, n-1.

The things I have considered are:

  • Computing the mean and standard deviation of the number of purchases for a particular attribute and value across the historical time periods, and then measuring how many standard deviations from that mean the current period is. I could do the same for the proportion of purchases. The problem is that I then need a threshold.
  • Computing the entropy of the current time period (using either the number or the proportion of purchases as input) and comparing it to the entropies of previous periods, but again I need a threshold. (A sketch of both ideas follows this list.)
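For concreteness, here is a minimal sketch of both ideas in Python/pandas. The column names `period` and `location` are hypothetical stand-ins for the real fields, and the period index is assumed to be ordered so that `index < current_period` selects exactly the history:

```python
import numpy as np
import pandas as pd
from scipy.stats import entropy

def category_zscores(df, period_col, cat_col, current_period):
    """z-score of each category's proportion in the current period,
    measured against its mean/std over periods 1..n-1 only."""
    props = (df.groupby(period_col)[cat_col]
               .value_counts(normalize=True)
               .unstack(fill_value=0.0))            # periods x categories
    history = props.loc[props.index < current_period]
    current = props.loc[current_period]
    mu = history.mean()
    sigma = history.std(ddof=0).replace(0, np.nan)  # avoid divide-by-zero
    return (current - mu) / sigma

def period_entropy(df, period_col, cat_col):
    """Shannon entropy of the category distribution in each period;
    a sharp drop means one value (e.g. Manchester) is dominating."""
    return (df.groupby(period_col)[cat_col]
              .apply(lambda s: entropy(s.value_counts(normalize=True))))
```

Either output still needs a cut-off, which is exactly the thresholding problem described above; the sketch only makes the quantities concrete.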

Aside from the problems I mentioned, these also feel a bit basic; I’m sure there is a better method, but I can’t think what it might be. Some thoughts I’ve had that may or may not help someone answer this:

  • I don’t think this can be solved by standard anomaly detection approaches, since the ‘anomaly’ relates to many data points and the data at times n, n+1, n+2, ... is not available
  • I’m more of a machine learning person than a statistics person, but I can’t think of any machine learning approach to this (open to hearing any ideas)

1 Answer


The things you have considered are pretty standard. Based on your description, it sounds like you do have thresholds that make sense to you, but they are combinations of attributes rather than a single number for one attribute. The real solution, then, will depend on your domain expertise and on which analyses reach your version of a threshold. Some things to look into:

  1. z-score of the change in an attribute. You may find that analyzing the derivative of your attribute series comes closer to your intuition, and that a threshold of 4 or 5 standard deviations captures the changes you're looking for (first sketch after this list).
  2. Mahalanobis distance is quite useful for multivariate distributions. You can fit it to a window of attributes and then check whether a single row falls within that distribution. In your case, you might need to label your history to capture distances that make sense to you (compute the distance and run it through a random forest classifier or something), but you can also compute histograms and select bins as outliers (second sketch after this list).
  3. There is a good review here, including a few other techniques for multivariate distributions, which focuses on outliers. I've had success with the projection methods, but none of them is a one-size-fits-all solution.
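A minimal sketch of point 1, assuming `counts` is a pandas Series of purchase counts for one attribute value, indexed by ordered time periods (the name and the default threshold of 4 are illustrative only). The expanding statistics are shifted by one period so that period n is scored using only periods 1..n-1:

```python
import pandas as pd

def change_zscore_flags(counts: pd.Series, threshold: float = 4.0) -> pd.Series:
    delta = counts.diff()                     # discrete "derivative" of the counts
    mu = delta.expanding().mean().shift(1)    # history-only mean of the changes
    sigma = delta.expanding().std().shift(1)  # history-only std of the changes
    z = (delta - mu) / sigma
    return z.abs() > threshold                # True where the change is unusual
```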
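And a sketch of point 2. Since the fields here are categorical, one possible encoding (an assumption on my part, not part of the answer) is to first reduce each period to a numeric vector of per-value proportions, then measure how far the current period's vector sits from the historical cloud:

```python
import numpy as np

def mahalanobis_distance(history: np.ndarray, current: np.ndarray) -> float:
    """history: (n_periods, n_features); current: (n_features,)."""
    mu = history.mean(axis=0)
    cov = np.cov(history, rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse guards against singular cov
    diff = current - mu
    return float(np.sqrt(diff @ inv_cov @ diff))
```

The distance itself still needs a cut-off or labelled history, as noted above.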

In general, "outlier" is very contextual and depends on your notion of what is unusual. There isn't a black-box tool for that which isn't simply statistical (i.e. "basic"). But with machine learning you can label and train using some of those methods!

wwwslinger
  • Thanks for your thoughts and the link. The thing is, I don't think I can use standard outlier detection methods, because being an outlier here means a volume of data points in the same region. I was thinking of computing aggregations on the data (count/percentage), which would make it possible to compute a z-score as you suggest. The Mahalanobis distance assumes continuous data. – soundofsilence Sep 22 '19 at 23:27