I'd like to begin by apologizing for this long post. With that out of the way, here goes.
We build email bucketing tools, and the biggest challenge we had was getting an IDK / IRR output from the ensemble model we use. Our final layer is a softmax (no surprises there) and the number of buckets can range from 50 to 200. But a lot of our customers are still sorting out / rehashing their buckets, so when the model sees a totally new category it should be able to just flag it to a human agent and say, "hey, I don't understand this."
Sadly this is impossible with a purely supervised model, since by design it's constrained to pick one of the buckets based on the softmax score. I have been racking my brain and came up with an approach that seems to work OK, but I can't explain why it works as well as it does :S ... please hear me out.
Approach
- train word embeddings (GloVe in this case) on data specific to that customer, to make them as contextual as possible
- train the supervised model as usual
- the additional step is to summarize the vectors for all sentences in a particular bucket: for every sentence I sum/average the individual word vectors, and then for the entire bucket I average the d-dimensional sentence vectors I get that way
- now when a new sentence comes in for inference, I first convert it into a vector using the same approach, then use the Kolmogorov-Smirnov 2-sample test to check whether the incoming vector matches any of the bucket average vectors. If the maximum p-value across all buckets is < 0.05, I take it that this sentence has trouble being classified (since it most likely comes from a very different distribution) and I simply flag it; if it's above 0.05, I pass it to the model to bucket it (rough sketch of this step right after the list)
- we tested this approach on a fair amount of data (a few thousand emails) and achieved F1 scores of 70-80% for almost all buckets and 90% for the IRR bucket. Before this approach our F1 scores for the buckets were in the 75-85% range, so our efficacy on known categories has gone down a bit, but with the huge advantage of being able to flag unknown categories
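For reference, a minimal sketch of the summarize-then-flag step described above, assuming pre-trained GloVe vectors in a plain dict, scipy, and an sklearn-style classifier; the names (sentence_vector, bucket_centroids, flag_or_classify, clf) are purely illustrative, not our production code:

```python
import numpy as np
from scipy.stats import ks_2samp

def sentence_vector(tokens, glove, dim=100):
    """Average the GloVe vectors of the words in one sentence."""
    vecs = [glove[w] for w in tokens if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def bucket_centroids(bucketed_sentences, glove, dim=100):
    """Average each bucket's sentence vectors into one d-dimensional summary vector."""
    return {
        bucket: np.mean([sentence_vector(s, glove, dim) for s in sents], axis=0)
        for bucket, sents in bucketed_sentences.items()
    }

def flag_or_classify(tokens, centroids, glove, clf, dim=100, alpha=0.05):
    """Run the 2-sample KS test of the incoming vector against every bucket summary;
    if no bucket reaches p >= alpha, flag as IRR, otherwise let the supervised model decide."""
    v = sentence_vector(tokens, glove, dim)
    max_p = max(ks_2samp(v, c).pvalue for c in centroids.values())
    if max_p < alpha:
        return "IRR"                              # likely out-of-distribution -> human agent
    return clf.predict(v.reshape(1, -1))[0]       # in-distribution -> normal bucketing
```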
Questions
- does it make sense to compare these vectors as distributions? AFAIK, GloVe computes the vectors so that they behave nicely on tasks like analogies and so that words appearing in similar contexts point in similar directions, but can the individual components of a vector be treated as samples from a distribution?
- I tried pretty much every distribution similarity test, especially Shapiro-Wilk (where, since it isn't a 2-sample test, I compared each sample against the normal distribution and only used the result if either null hypothesis could be rejected), and strangely the p-values were quite large (definitely a few times larger than the 0.05 threshold). Only the KS test gave me excellent results, i.e. very low p-values whenever the sentences came from very different domains (rough comparison sketch below)
- the KS statistic is the maximum absolute difference between the two empirical CDFs, so is it fair to assume that, since the two distributions are nowhere close to normal, this would be a good test to gauge the difference?
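To make that comparison concrete, this is roughly the shape of what I was running, with toy data standing in for the real sentence vectors (none of this is the production code): Shapiro-Wilk only tests a single sample for normality, while ks_2samp compares the two samples' empirical CDFs directly.

```python
import numpy as np
from scipy.stats import shapiro, ks_2samp

rng = np.random.default_rng(0)
# two toy "sentence vectors" drawn from clearly different, non-normal distributions
a = rng.uniform(-1, 1, size=300)
b = rng.exponential(scale=0.5, size=300)

# Shapiro-Wilk only asks "is this ONE sample normal?" -- it never compares a to b directly
print("shapiro(a) p =", shapiro(a).pvalue)
print("shapiro(b) p =", shapiro(b).pvalue)

# the 2-sample KS test asks "could a and b come from the SAME distribution?"
# via the maximum gap between their empirical CDFs
print("ks_2samp(a, b) p =", ks_2samp(a, b).pvalue)
```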
Again, apologies for the long post.