I realize that variations of this question have been asked here before, but none address the situation where the data is ONLY high-cardinality and categorical, and where the labels themselves are not important but the combinations of labels are.
The data will often look something like this (in this case, log data from a firewall):
timestamp | source ip | dest ip | dest url | categories
12:00:00.000 | 4.4.4.4 | 1.2.3.4 | www.badguy.com | malicious, shopping, security
12:00:00.001 | 3.3.3.3 | 6.7.8.9 | www.badguy.com | malicious, shopping, security
Ultimately, I am trying to group billions of log lines of nominal features into clusters of arbitrary 'personas' that we can then use to help detect anomalous behavior and/or predict which 'persona' a particular behavior could be attributed to. About the only feature that could plausibly be one-hot encoded (OHE) is 'categories', but even that gets ugly very fast, since there are trillions of possible combinations of categories, IPs, and URLs.
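To make the 'categories' encoding concrete, here is a minimal sketch of what I mean, using only the standard library. The first two strings come from the sample above; the third ("news, shopping") is a made-up row I added so the vectors differ, and the names `parse_categories` and `multi_hot` are just illustrative.

```python
# Category strings: the first two are from the sample log,
# the third is a hypothetical row added for contrast.
category_strings = [
    "malicious, shopping, security",
    "malicious, shopping, security",
    "news, shopping",  # hypothetical
]

def parse_categories(raw):
    """Split a comma-separated category string into clean labels."""
    return [label.strip() for label in raw.split(",") if label.strip()]

# Build a fixed, sorted vocabulary so every row maps to the same columns.
vocab = sorted({label for raw in category_strings for label in parse_categories(raw)})
column = {label: i for i, label in enumerate(vocab)}

def multi_hot(raw):
    """Return a 0/1 vector with one slot per known category."""
    vec = [0] * len(vocab)
    for label in parse_categories(raw):
        vec[column[label]] = 1
    return vec

encoded = [multi_hot(raw) for raw in category_strings]
```

This stays manageable only because the vocabulary of individual categories is small; doing the same for IPs, URLs, or their combinations is where it blows up.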
I have considered just calculating the probability that an exact combination of features occurs and using that as a feature (maybe with weighting based on domain knowledge), but I don't see how that could serve as anything meaningful in a classification algorithm.
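For reference, this is roughly the probability calculation I have in mind, as a rough sketch rather than what we actually run. The first two rows are from the sample above; the third is a hypothetical repeat of the first so the frequencies are not all equal. I am assuming the order of categories within a row does not matter, and the domain-knowledge weighting is left out.

```python
from collections import Counter

# (source ip, dest ip, dest url, categories); the last row is hypothetical.
log_rows = [
    ("4.4.4.4", "1.2.3.4", "www.badguy.com", "malicious, shopping, security"),
    ("3.3.3.3", "6.7.8.9", "www.badguy.com", "malicious, shopping, security"),
    ("4.4.4.4", "1.2.3.4", "www.badguy.com", "security, malicious, shopping"),
]

def combo_key(row):
    """Hashable key for the exact feature combination in a row."""
    src, dst, url, cats = row
    # Assumption: category order is irrelevant, so use a frozenset.
    cat_set = frozenset(c.strip() for c in cats.split(","))
    return (src, dst, url, cat_set)

combo_counts = Counter(combo_key(row) for row in log_rows)
total_rows = sum(combo_counts.values())

def combo_probability(row):
    """Empirical probability of seeing this exact combination."""
    return combo_counts[combo_key(row)] / total_rows

combo_probs = [combo_probability(row) for row in log_rows]
```

Each row collapses to a single number, which is exactly my concern: two unrelated combinations that happen to be equally frequent become indistinguishable.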
Does anyone have any suggestions for encoding feature combinations for use in a simple clustering model?