2

Given a set of extracted data from different sources with different accuracies, how can I combine the accuracies of the sources that give the same output?

Example :

Data from source A are 80% correct
Data from source B are 85% correct
Data from source C are 90% correct

If two of the sources give the same result (ResultA) and the third disagrees (ResultB), what is the probability that ResultA is correct? This is not a homework question. I am a software developer and I don't have a clue about statistics and probability.

Update :

I've done an experiment using a random number generator.

Test 1 - 2 possible outcomes (0/1), three methods (Acc: 0.5, 0.3, 0.1)

Samples      : 100000000
Method A     : 0.49993692
Method B     : 0.30023622
Method C     : 0.09994145
Method A+B   : 0.794567779569577
Method B+C   : 0.0455372643070089
Method C+A   : 0.205615801945512
Method A+B+C : 0.0455215295368209

Test 2 - 2 possible outcomes (0/1), three methods (Acc: 0.8, 0.85, 0.9)

Samples      : 100000000
Method A     : 0.80003639
Method B     : 0.8500426
Method C     : 0.90005791
Method A+B   : 0.715942797491352
Method B+C   : 0.927408281972288
Method C+A   : 0.864147967527417
Method A+B+C : 0.995137034088319

Those are the numbers I am looking for, but I don't know how to calculate them...
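For reference, the experiment can be reproduced with a short simulation. This is a sketch in Python under my reading of the setup (function and variable names are my own): `Method A+B` is taken to mean the fraction of trials in which A and B agree, C disagrees, and the value A and B report equals the truth, and similarly for the other combinations.

```python
import random

def simulate(accuracies, samples=200_000, seed=42):
    """Estimate the conditional accuracy of each agreement pattern.

    'A+B' = fraction of trials where A and B agree, C disagrees, and
    the value A and B report is the true one; likewise for the rest.
    """
    rng = random.Random(seed)
    acc_a, acc_b, acc_c = accuracies
    counts = {k: [0, 0] for k in ("A+B", "B+C", "C+A", "A+B+C")}  # [correct, total]

    for _ in range(samples):
        truth = rng.randint(0, 1)
        # Each method reports the truth with its own accuracy, else the flip.
        a = truth if rng.random() < acc_a else 1 - truth
        b = truth if rng.random() < acc_b else 1 - truth
        c = truth if rng.random() < acc_c else 1 - truth

        if a == b == c:
            key, value = "A+B+C", a
        elif a == b:
            key, value = "A+B", a
        elif b == c:
            key, value = "B+C", b
        else:  # with binary outcomes the only remaining case is a == c
            key, value = "C+A", a
        counts[key][1] += 1
        counts[key][0] += value == truth

    return {k: correct / total for k, (correct, total) in counts.items()}

print(simulate((0.8, 0.85, 0.9)))
```

With the accuracies from Test 2 this reproduces the table above to within sampling noise.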

2 Answers

1

Assume there is a single binary true state (0 or 1) that is measured three times, and that the three observations are independent. Then we can calculate the probability of each state with a simple likelihood-based analysis, without any heavy equations.

So for your example, A says 1, B says 1, and C says 0. The two hypotheses are that the true state is 0 or that it is 1.

If it's truly 0, the independent observation probabilities multiply like:

(1-0.80) * (1-0.85) * 0.90 = 0.027

If it's truly 1, the independent observation probabilities multiply like:

0.80 * 0.85 * (1-0.90) = 0.068

The idea is that each measurement was either correct (probability p) or incorrect (probability 1-p) and contributes the corresponding factor to the product.

We have to normalize these quantities to give a probability as the final answer:

The probability it's truly 1 is then 0.068 / (0.027 + 0.068) = 0.716.

The main idea is that you use your observation model to evaluate the two competing hypotheses (which state generated the data), and report a normalized probability.
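As a sketch, this calculation can be coded directly. The Python below follows the reasoning in this answer; the function name and the parameterization (accuracies of the agreeing sources vs. the dissenters) are my own:

```python
def prob_agreed_value_correct(p_agree, p_disagree):
    """Posterior probability that the value reported by the agreeing
    sources is correct, assuming a uniform prior over the two states
    and independent errors.

    p_agree    : accuracies of the sources reporting the same value
    p_disagree : accuracies of the sources reporting the other value
    """
    like_true = 1.0
    for p in p_agree:
        like_true *= p           # each agreeing source was correct
    for p in p_disagree:
        like_true *= 1 - p       # each dissenter was wrong
    like_false = 1.0
    for p in p_agree:
        like_false *= 1 - p      # each agreeing source was wrong
    for p in p_disagree:
        like_false *= p          # each dissenter was right
    return like_true / (like_true + like_false)

# A (0.80) and B (0.85) say 1, C (0.90) says 0:
print(prob_agreed_value_correct([0.80, 0.85], [0.90]))  # ≈ 0.7158
```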

matted
  • Thanks for your answer. It seems that your calculation gives the opposite result from my experiment. Do you have any idea why that is? – testificate Jul 26 '12 at 17:18
  • How is it the opposite result? Your empirical calculation of "method A+B" gives 0.7159, where my exact calculation gives (rounded) 0.716. – matted Jul 27 '12 at 02:41
  • Indeed, I failed to spot it at first! Now I understand your logic and I have a working implementation. Is there a generalized formula that can handle more than 2 possible outcomes? – testificate Jul 27 '12 at 12:06
0

The answer above is a special case of a more general result. First of all, it makes certain assumptions about the prior distribution of the true state, as well as conditional-independence assumptions, as noted by matted.

I'll try to provide the answer for the general case, showing where these assumptions come into the picture. Assuming each data source is a noisy channel that outputs the true value with some probability, the question can be formulated as follows:

$P^* = P(y=1 | \hat{y}_A=1, \hat{y}_B=0, \hat{y}_C=1)=$ ?

Using Bayes' rule, we can expand this conditional probability:

$P^* = \frac{P(y=1,\hat{y}_A=1, \hat{y}_B=0, \hat{y}_C=1)}{P(\hat{y}_A=1, \hat{y}_B=0, \hat{y}_C=1)}$

$P^* = \frac{P(y=1)P(\hat{y}_A=1, \hat{y}_B=0, \hat{y}_C=1 | y=1)}{P(y=1)P(\hat{y}_A=1, \hat{y}_B=0, \hat{y}_C=1 | y=1) + P(y=0)P(\hat{y}_A=1, \hat{y}_B=0, \hat{y}_C=1 | y=0)}$

The formula above can be simplified further by assuming that the observations from these noisy channels are conditionally independent of each other given the true value. Please note that this is a weaker assumption than saying the noisy observations are marginally independent, hence less restrictive. The conditional independence assumption proves to be quite useful since it lets the likelihood factor into a product:

$P^* = \frac{P(y=1)\prod_{x\in\{A,B,C\}}P(\hat{y}_x | y=1)}{\sum_{\bar{y}\in\{0,1\}}P(\bar{y})\prod_{x\in\{A,B,C\}}P(\hat{y}_x | \bar{y})}$

The solution above is a special case of this formula, since it assumes a uniform prior distribution on $y$, i.e. $P(y=1)=0.5$. Also note that each noisy output is a Bernoulli trial, i.e. $P(\hat{y}_x | y) = P_x^{I(\hat{y}_x = y)}(1-P_x)^{I(\hat{y}_x \neq y)}$ $\forall x\in\{A,B,C\}$, where $P_x$ is the probability that noisy channel $x$ outputs the true value of $y$, and $I(s)$ is the indicator function, which returns 1 if the statement $s$ is true and 0 otherwise.
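The general formula translates directly into code. A Python sketch (names are mine): to handle more than two possible outcomes I additionally assume a symmetric noise model, where a channel with accuracy $P_x$ outputs each of the $k-1$ wrong values with probability $(1-P_x)/(k-1)$ — that extra assumption is not part of the derivation above.

```python
def posterior(observations, accuracies, k, prior=None):
    """Posterior over k possible true values given one observation per
    source, via Bayes' rule with conditional independence.

    Assumes (my addition) symmetric noise: a source with accuracy p
    reports the truth with probability p and each of the k-1 wrong
    values with probability (1-p)/(k-1).
    """
    if prior is None:
        prior = [1.0 / k] * k  # uniform prior over the k states
    scores = []
    for y in range(k):
        score = prior[y]  # P(y), times the factored likelihood below
        for obs, p in zip(observations, accuracies):
            score *= p if obs == y else (1 - p) / (k - 1)
        scores.append(score)
    total = sum(scores)  # the denominator: sum over all hypotheses
    return [s / total for s in scores]

# Binary case from the formula above: A and C say 1, B says 0.
print(posterior([1, 0, 1], [0.80, 0.85, 0.90], k=2))
```

In the binary case this reduces exactly to the expression for $P^*$; with the question's accuracies it gives about 0.864 for the state reported by A and C, matching the OP's `Method C+A` figure.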