As @DJohnson stated, "ground truth" is difficult in such situations, so there is no simple "absolute reference" to compare against. But you can directly compare two results computed over multiple classes, if that helps in your case.
I assume you are able to obtain a confusion matrix for both results:
- For each result, compute the true/false positive/negative rates, AUC, EER, etc. for each individual class. This leaves you with one set of those rates per class.
- For comparing your results: look at the distribution of those per-class values for each result. The distribution gives you an idea of how well a result performs across its classes (e.g. average performance plus performance spread). For a direct comparison, you could compare the numeric values of mean/median and sd/MAD performance, but generating and comparing e.g. one boxplot per result might be easier.
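A minimal sketch of that comparison, assuming you have a multi-class confusion matrix per result (the matrices and class counts below are made up for illustration); it derives per-class TPR/FPR and summarizes their distribution with mean/median and sd/MAD:

```python
import numpy as np

def per_class_rates(cm):
    """Per-class TPR and FPR from a multi-class confusion matrix.

    cm[i, j] = number of samples with true class i predicted as class j
    (one-vs-rest counts for each class along the diagonal).
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp   # true class i, predicted as something else
    fp = cm.sum(axis=0) - tp   # predicted class i, true class differs
    tn = cm.sum() - tp - fn - fp
    return tp / (tp + fn), fp / (fp + tn)

def summarize(values):
    """Mean/median (location) and sd/MAD (spread) over the classes."""
    v = np.asarray(values, dtype=float)
    mad = np.median(np.abs(v - np.median(v)))
    return {"mean": v.mean(), "median": np.median(v),
            "sd": v.std(ddof=1), "mad": mad}

# Two hypothetical 3-class results to compare
cm_a = [[50, 3, 2], [4, 45, 6], [1, 5, 49]]
cm_b = [[40, 10, 5], [8, 42, 5], [6, 4, 45]]

tpr_a, _ = per_class_rates(cm_a)
tpr_b, _ = per_class_rates(cm_b)
print("result A:", summarize(tpr_a))
print("result B:", summarize(tpr_b))
```

For the visual version, feeding the two per-class vectors to a boxplot (e.g. `matplotlib.pyplot.boxplot([tpr_a, tpr_b])`) shows location and spread side by side at a glance.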