How do you encode relationships between High-Cardinality, Categorical features?

Question

I realize that variants of this question have been asked here before, but none of them address the situation where there is ONLY high-cardinality categorical data, and where the labels themselves are not important but the combinations of labels are.

The data will often look something like this (in this case, log data from a firewall):

timestamp          source ip      dest ip    dest url          categories
12:00:00.000       4.4.4.4        1.2.3.4    www.badguy.com    malicious, shopping, security
12:00:00.001       3.3.3.3        6.7.8.9    www.badguy.com    malicious, shopping, security

Ultimately, I am trying to group billions of lines of nominal features into clusters of arbitrary 'personas' that we can then use to help detect anomalous behavior and/or predict which 'persona' a particular behavior could be attributed to. About the only feature that could possibly be one-hot encoded (OHE) is 'categories', but it gets ugly very fast, since we have trillions of possible combinations of categories, IPs, and URLs.

I have considered just calculating the probability that an exact combination of features occurs and using that as a feature (maybe with weighting based on domain knowledge), but I don't see how that could serve as anything meaningful in a classification algorithm.
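To make that idea concrete, here is a minimal sketch of the frequency feature. It assumes each log line is reduced to a tuple of (source ip, dest url, categories); the third row and the name `combination_frequency` are hypothetical, added only for illustration.

```python
from collections import Counter

# Hypothetical rows: (source ip, dest url, categories), modeled on the sample log.
# The third row is invented so that one combination repeats.
rows = [
    ("4.4.4.4", "www.badguy.com", "malicious, shopping, security"),
    ("3.3.3.3", "www.badguy.com", "malicious, shopping, security"),
    ("4.4.4.4", "www.badguy.com", "malicious, shopping, security"),
]

def combination_frequency(rows):
    """Map each exact feature combination to its empirical probability."""
    counts = Counter(rows)
    total = len(rows)
    return {combo: n / total for combo, n in counts.items()}

freq = combination_frequency(rows)

# One numeric feature per row: how common its exact combination is.
features = [freq[row] for row in rows]
```

This collapses every row to a single number, so two unrelated combinations that happen to be equally common become indistinguishable, which is the limitation I am worried about.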

Does anyone have any suggestions for encoding feature combinations for use in a simple clustering model?

I should also mention that I have looked at [this](http://www.kdnuggets.com/2016/08/include-high-cardinality-attributes-predictive-model.html) fantastic article, but again it doesn't apply. We aren't trying to predict income (numeric) from zip codes (nominal, low cardinality), for instance. It's more like we're trying to cluster zip codes, first names and house color into their natural groupings, but there are millions of zip codes, first names, and house colors. — TheProletariat, Apr 26 '17 at 22:39
You *need* to perform good feature extraction. Too much noise, too little signal. — Has QUIT--Anony-Mousse, Apr 27 '17 at 07:46
See also https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels — kjetil b halvorsen, Jun 25 '19 at 20:22

score 0 · Answer 1 · answered Aug 03 '19 at 18:47

This is more of an extended comment. The question is interesting and should have a proper answer. You might look at your nominal values as analogous to words in a text, and look at methods for word embedding. The embeddings then give numerical values which can be used as inputs to a model.

There are many posts about that at this site, see this list. Some other similar posts are "Where to find a guide to encoding categorical features?" and "How to use more features in text based machine learning models beyond the text itself?"
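As a rough illustration of the words-in-a-text analogy, here is a minimal sketch that treats each log row as a "sentence" and each nominal value as a "word". It uses a simple count-based embedding (positive pointwise mutual information over within-row co-occurrences) as a stand-in for a learned embedding such as word2vec; that substitution, the sample rows, the `www.shop.example` URL, the single category per row, and the function names are all assumptions for illustration, not something taken from the question.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical log rows; each row is a "sentence" whose "words" are the
# nominal values, prefixed with the column name to keep columns distinct.
rows = [
    {"src": "4.4.4.4", "url": "www.badguy.com", "cat": "malicious"},
    {"src": "3.3.3.3", "url": "www.badguy.com", "cat": "malicious"},
    {"src": "5.5.5.5", "url": "www.shop.example", "cat": "shopping"},
    {"src": "6.6.6.6", "url": "www.shop.example", "cat": "shopping"},
]

def tokens(row):
    """Turn one row into a list of 'column=value' tokens."""
    return [f"{col}={val}" for col, val in row.items()]

def ppmi_vectors(rows):
    """Sparse PPMI vector per token, built from within-row co-occurrence."""
    pair_counts = Counter()
    token_counts = Counter()
    for row in rows:
        toks = tokens(row)
        token_counts.update(toks)
        for a, b in combinations(toks, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    n = len(rows)
    vectors = {tok: {} for tok in token_counts}
    for (a, b), c in pair_counts.items():
        # PMI = log(P(a, b) / (P(a) * P(b))), with probabilities estimated per row.
        pmi = math.log((c / n) / ((token_counts[a] / n) * (token_counts[b] / n)))
        if pmi > 0:  # keep only positive associations
            vectors[a][b] = pmi
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0

vecs = ppmi_vectors(rows)
sim_same = cosine(vecs["src=4.4.4.4"], vecs["src=3.3.3.3"])  # same url and category
sim_diff = cosine(vecs["src=4.4.4.4"], vecs["src=5.5.5.5"])  # nothing in common
```

The point is that two source IPs end up with similar vectors because they co-occur with the same URLs and categories, not because of their labels. On real data the sparse vectors would need dimensionality reduction (or a learned embedding) before being passed to a clustering model.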
