
I have k models M1, M2, ..., Mk. Each of these models outputs a probability distribution over L classes C1, C2, ..., CL. I also have weights w1, w2, ..., wk for the models, with sum(w1, w2, ..., wk) = 1; the weights represent how much "confidence" I have in each model's output. The objective is to combine the probability distributions from the k models into a single probability distribution. This can easily be accomplished by taking a weighted average of the individual probability distributions. In other words,

$$ \text{Combined probability distribution} = \sum_{i = 1}^{k} w_i m_i $$

where $ m_i $ is the probability distribution for model $ M_i $ over classes $ C_j $.

For example, let's say we have three models M1, M2, and M3, with the following weights and probability distributions over four classes:

| Model | Weight | Probability distribution |
|-------|--------|---------------------------|
| M1    | 0.7    | 0.90, 0.05, 0.05, 0.00    |
| M2    | 0.2    | 0.80, 0.10, 0.05, 0.05    |
| M3    | 0.1    | 0.70, 0.15, 0.10, 0.05    |

This gives an overall probability distribution of 0.7*[0.90, 0.05, 0.05, 0.00] + 0.2*[0.80, 0.10, 0.05, 0.05] + 0.1*[0.70, 0.15, 0.10, 0.05] = [0.86, 0.07, 0.055, 0.015]
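For concreteness, here is a minimal NumPy sketch of that weighted average (the variable names are my own, chosen only for illustration):

```python
import numpy as np

# Model weights (summing to 1) and per-model probability distributions
# over the four classes, taken from the table above.
weights = np.array([0.7, 0.2, 0.1])
probs = np.array([
    [0.90, 0.05, 0.05, 0.00],  # M1
    [0.80, 0.10, 0.05, 0.05],  # M2
    [0.70, 0.15, 0.10, 0.05],  # M3
])

# Weighted average across models: sum_i w_i * m_i
combined = weights @ probs
print(combined)  # [0.86  0.07  0.055 0.015]
```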

Question: What do I do when one of the models, say M1 in the example above, is missing this probability distribution for some samples? In other words, I only have probability distributions from M2 and M3 for some samples, and all probability distributions for other samples?

Simply combining the distributions from M2 and M3 and renormalizing them does not seem correct, since the weights for M2 and M3 are quite low. One option is to assume a uniform distribution over the classes for M1 and then do the same computation, like so:

0.7*[0.25, 0.25, 0.25, 0.25] + 0.2*[0.80, 0.10, 0.05, 0.05] + 0.1*[0.70, 0.15, 0.10, 0.05] = [0.405, 0.21, 0.195, 0.19]
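A sketch of that uniform-fill idea, under the same assumptions as the snippet above:

```python
import numpy as np

# Same weights as above; M1's distribution is missing for this sample.
weights = np.array([0.7, 0.2, 0.1])
probs = np.array([
    [np.nan, np.nan, np.nan, np.nan],  # M1: missing
    [0.80, 0.10, 0.05, 0.05],          # M2
    [0.70, 0.15, 0.10, 0.05],          # M3
])

# Replace any missing model's row with a uniform distribution over the classes.
n_classes = probs.shape[1]
missing = np.isnan(probs).any(axis=1)
probs[missing] = 1.0 / n_classes

print(weights @ probs)  # [0.405 0.21  0.195 0.19 ]
```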

Is this a good way, statistically speaking? Are there better ways to accomplish this?

kjetil b halvorsen
Data Max

2 Answers


First of all, if you know that for one category the estimated probability is 0.9, and for two others it is 0.05, then assuming that they are all uniform completely contradicts your data. Worse, you give that model's results the highest weight (0.7) of all the models, so they will dominate the combined result. It would probably be wiser to just ignore those results, rather than replacing them with arbitrary values.

Since you didn't give us details on why you have missing data in the estimated probabilities, let me make an educated guess. I assume that you have several models, trained on different data sets (or re-samples of the same data), which resulted in the classes not being represented equally in the training sets. As a result, some of the categories were never predicted by some of the models, which is equivalent to those models assigning zero probability to those categories.

What can be done to prevent such cases is to use a Bayesian approach, where a non-zero prior leads to non-zero posterior probabilities. A simple example of such an approach is Laplace smoothing, used to preprocess the data for a naive Bayes model. You can also use it to smooth the predicted probabilities

$$ \tilde p_i = \frac{p_i + \alpha}{\sum_j (p_j + \alpha)} $$

where $\alpha \in (0, 1)$ is a small value. If those were counts, it would be a pseudo-count. To transform probability into a count you would take $Np_i = n_i$, so the pseudo-count would be $N\alpha$. This would help you get rid of the zero probabilities and replace them with small, non-zero probabilities that may be more reasonable.
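A minimal sketch of this smoothing applied to a single predicted distribution (the value of `alpha` below is arbitrary):

```python
import numpy as np

def smooth(p, alpha=0.01):
    """Add a small alpha to every class probability, then renormalize,
    so that zero probabilities become small but non-zero."""
    p = np.asarray(p, dtype=float)
    return (p + alpha) / (p + alpha).sum()

print(smooth([0.90, 0.05, 0.05, 0.00]))
# approximately [0.875 0.058 0.058 0.010] -- the zero is now a small positive value
```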

As a side note, using a weighted average is not the only way to combine the probabilities, as you can learn from the Combining probabilities/information from different sources thread.

Tim

Your "models" appear to be several empirical distributions, perhaps each obtained from small data samples.

Question: What do I do when one of the models[datasets], say M1 in the example above, is missing this probability distribution for some samples [has zero observations for some classes]?

You could sum observations across the multiple data sets to obtain an aggregated empirical distribution: $$ \hat{p}_\text{class} = \frac {\sum\limits_{\text{dataset}=1}^K \text{#obs}_\text{class,dataset}}{\sum\limits_{\text{class}=1}^L\sum\limits_{\text{dataset}=1}^K \text{#obs}_\text{class,dataset}} $$

This aggregation could incorporate a Bayesian prior if desired.
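For illustration, a small sketch of this aggregation, assuming the per-dataset class counts are available (the counts below are made up, and the pseudo-count plays the role of a simple prior):

```python
import numpy as np

# Hypothetical per-dataset class counts (rows = datasets, columns = classes);
# the zero in the first row is a class never observed in that dataset.
counts = np.array([
    [90, 5, 5, 0],
    [16, 2, 1, 1],
    [14, 3, 2, 1],
])

prior = 1  # optional pseudo-count per class, acting as a simple Bayesian prior
agg = counts.sum(axis=0) + prior
p_hat = agg / agg.sum()
print(p_hat)  # aggregated empirical distribution over the classes
```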

krkeane