I'd like to begin by apologizing for this long post. With that out of the way, here goes.
We build email bucketing tools, and the biggest challenge we had was getting an IDK / IRR output from the ensemble model we use. Our final layer is a softmax (no surprises there) and the number of buckets can range from 50 to 200. But a lot of our customers are still sorting out / rehashing their buckets, so when the model sees a totally new category it should be able to just flag it to a human agent and say, "hey, I don't understand this."
Sadly this is impossible with a purely supervised model, since by design it's constrained to pick one of the buckets based on the softmax score. I have been racking my brain and came up with an approach that seems to work OK, but I can't explain why it works as well as it does :S ... please hear me out.
Approach
- train word embeddings (GloVe in this case) on data specific to that customer, to make them as contextual as possible
- train the supervised model as usual
- the additional step is to summarize the vectors for all sentences in a particular bucket: for every sentence I sum/average the individual word vectors, and then for the entire bucket I average the d-dimensional sentence vectors I get that way
- now when a new sentence comes in for inference, I first convert it into a vector using the same approach, then use the Kolmogorov-Smirnov 2-sample test to check whether the incoming vector matches any of the bucket average vectors. If the maximum p-value across all buckets is < 0.05, I take it that this sentence has trouble being classified (since it most likely comes from a very different distribution) and I simply flag it; if it's above 0.05, I pass it to the model to bucket it (rough sketch of this step right after the list)
- we tested this approach on a fair amount of data (a few thousand emails) and achieved F1 scores of 70-80% for almost all buckets and 90% for the IRR bucket. Before this approach our F1 scores for the buckets were in the 75-85% range, so our efficacy on known categories has gone down a bit, but with the huge advantage of being able to flag unknown categories
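For reference, a minimal sketch of the summarize-then-flag step described above, assuming pre-trained GloVe vectors in a plain dict, scipy, and an sklearn-style classifier; the names (sentence_vector, bucket_centroids, flag_or_classify, clf) are purely illustrative, not our production code:

```python
import numpy as np
from scipy.stats import ks_2samp

def sentence_vector(tokens, glove, dim=100):
    """Average the GloVe vectors of the words in one sentence."""
    vecs = [glove[w] for w in tokens if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def bucket_centroids(bucketed_sentences, glove, dim=100):
    """Average each bucket's sentence vectors into one d-dimensional summary vector."""
    return {
        bucket: np.mean([sentence_vector(s, glove, dim) for s in sents], axis=0)
        for bucket, sents in bucketed_sentences.items()
    }

def flag_or_classify(tokens, centroids, glove, clf, dim=100, alpha=0.05):
    """Run the 2-sample KS test of the incoming vector against every bucket summary;
    if no bucket reaches p >= alpha, flag as IRR, otherwise let the supervised model decide."""
    v = sentence_vector(tokens, glove, dim)
    max_p = max(ks_2samp(v, c).pvalue for c in centroids.values())
    if max_p < alpha:
        return "IRR"                              # likely out-of-distribution -> human agent
    return clf.predict(v.reshape(1, -1))[0]       # in-distribution -> normal bucketing
```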
Questions
- does it make sense to compare these vectors as distributions? AFAIK, GloVe computes the vectors so that they behave nicely on tasks like analogies and so that words appearing in similar contexts point in similar directions, but can the individual components of a vector be treated as samples from a distribution?
- I tried pretty much every distribution similarity test, especially Shapiro-Wilk (where, since it isn't a 2-sample test, I compared each sample against the normal distribution and only used the result if either null hypothesis could be rejected), and strangely the p-values were quite large (definitely a few times larger than the 0.05 threshold). Only the KS test gave me excellent results, i.e. very low p-values whenever the sentences came from very different domains (rough comparison sketch below)
- the KS statistic is the maximum absolute difference between the two empirical CDFs, so is it fair to assume that, since the two distributions are nowhere close to normal, this would be a good test to gauge the difference?
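To make that comparison concrete, this is roughly the shape of what I was running, with toy data standing in for the real sentence vectors (none of this is the production code): Shapiro-Wilk only tests a single sample for normality, while ks_2samp compares the two samples' empirical CDFs directly.

```python
import numpy as np
from scipy.stats import shapiro, ks_2samp

rng = np.random.default_rng(0)
# two toy "sentence vectors" drawn from clearly different, non-normal distributions
a = rng.uniform(-1, 1, size=300)
b = rng.exponential(scale=0.5, size=300)

# Shapiro-Wilk only asks "is this ONE sample normal?" -- it never compares a to b directly
print("shapiro(a) p =", shapiro(a).pvalue)
print("shapiro(b) p =", shapiro(b).pvalue)

# the 2-sample KS test asks "could a and b come from the SAME distribution?"
# via the maximum gap between their empirical CDFs
print("ks_2samp(a, b) p =", ks_2samp(a, b).pvalue)
```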
Again, apologies for the long post.