2

I have read that Dempster-Shafer theory can be thought of as a generalization of Bayesian theory.

Say I have data from disparate sources that indicate the class of some object. Whether from prior beliefs about the world or from examining some corpus of training data, I can relate features to classes through conditional probabilities and do some reasoning. (Or is it the marginals I should be caring about?) Alternatively, I can collect indicators together and pass them to a classifier, which in training will come to have its own prior beliefs.

Can Dempster-Shafer be thought of as a kind of learner, then, just as Bayes Nets are? Is there a context where I should prefer a Dempster-Shafer model or Bayes Net to a model not based on belief propagation? Is there a context where I should prefer the opposite?

I have never been extremely comfortable with belief propagation ("message passing"). I gather it is merely an efficient method to calculate marginal distributions, much as backpropagation just happens to be an efficient way to calculate gradients in a neural net. But what, in your words, do these marginal distributions mean? What are they for?

Pavel Komarov
  • 1,097
  • 9
  • 19

2 Answers

0

As I understand it, Dempster-Shafer is more like a generalized Bayes Rule, using a generalized probability definition, rather than a learner or model. It also has possibly severe issues if not used properly:

> Jøsang proved that Dempster's rule of combination actually is a method for fusing belief constraints. It only represents an approximate fusion operator in other situations, such as cumulative fusion of beliefs, but generally produces incorrect results in such situations. The confusion around the validity of Dempster's rule therefore originates in the failure of correctly interpreting the nature of situations to be modeled. Dempster's rule of combination always produces correct and intuitive results in situation of fusing belief constraints from different sources.

It's very intriguing, though!

Wayne
  • 19,981
  • 4
  • 50
  • 99
  • I don't understand the Wikipedia quote. When is DS valid and invalid? Why? Just as Bayes Rule has a Net, could DS be implemented in a Net? Again, when should one prefer a message-passing model to alternative models? – Pavel Komarov Sep 10 '18 at 13:01
0

I've done some further study. It led me to further questions, which I will try to answer elsewhere. But let's address the ones here.

First, my journey: I found and read the well-written and surprisingly helpful *An Evidential Reasoning Approach to Composite Combat Identification (CCID)* (paywalled). It gives a great explanation of the math underlying D-S, extends it to something the authors call the Valuation-Based System, and walks through simple examples of how to propagate belief. It's much more informative than the Wikipedia article, which means I may have to author some updates. Here are a few key takeaways:

  1. Dempster-Shafer is based on a method of writing probabilistic beliefs in a funky way, as "basic probability assignments" (BPAs), which are unfamiliar to those of us who like probability mass functions but can sometimes be expressed more compactly. For example, in the completely uncertain "vacuous" case, you can carry around just $m(\{\text{set of all possibilities}\}) = 1$ rather than $p(\text{possibility } 1) = \frac{1}{N}, p(\text{possibility } 2) = \frac{1}{N}, \ldots, p(\text{possibility } N) = \frac{1}{N}$. Whether either of these is really better is up for debate (personally I prefer the pmf way), but the point is you can construct beliefs from evidence (new examples, like sensor Z has identified some object to be an X or a Y with some confidence) or from training data (like I know that in 90% of cases when Z happens the cause was either an X or a Y).
  2. Dempster's rule of combination is a way to fuse beliefs of this form together, much as in a pmf-based Bayesian world you might multiply likelihoods. For it to be valid, the pieces of evidence have to be independent; I delved into what independence means in this kind of context in a different question.
  3. The Valuation-Based System is the analog of a Bayes Net, or at least uses a graph which is the analog. The authors claim there is an efficient algorithm for reasoning with this structure that can avoid operations in exponentially-large power-set space, which was a concern I had already thought of and raised over in my other questions.
  4. You can use Dempster's rule to update the "local" BPA at the node corresponding to the variable the new evidence is about. Then, by combining with neighboring nodes' BPAs and passing the results on as temporary "messages", the information spreads through the network until the final message fused at each node reflects the beliefs of the whole network. A simple "pignistic" transform then expresses that belief as a pmf. Because the BPAs at those nodes and their corresponding pmfs integrate all information except for their one variable, they are marginal.
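Points 1, 2, and 4 can be sketched in a few lines of Python. This is my own minimal illustration, not code from the CCID paper: the frame of discernment, the sensor reports, and the function names are all hypothetical, and the sketch assumes the two sources are independent.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination for two BPAs.
    Each BPA maps frozensets of hypotheses to masses summing to 1."""
    combined = {}
    conflict = 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mb * mc
        else:
            conflict += mb * mc  # mass that would land on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources are incompatible")
    # Renormalize by the non-conflicting mass.
    return {a: v / (1.0 - conflict) for a, v in combined.items()}

def pignistic(m):
    """Pignistic transform: spread each subset's mass evenly over its
    elements, yielding an ordinary pmf."""
    p = {}
    for a, mass in m.items():
        for x in a:
            p[x] = p.get(x, 0.0) + mass / len(a)
    return p

# Sensor Z says "X or Y" with confidence 0.9; the rest is ignorance.
theta = frozenset({"X", "Y", "W"})
m1 = {frozenset({"X", "Y"}): 0.9, theta: 0.1}
# A second, independent source favors X directly.
m2 = {frozenset({"X"}): 0.6, theta: 0.4}

fused = combine(m1, m2)
print(pignistic(fused))  # most of the pignistic mass lands on X
```

Note how the vacuous BPA really is compact here: total ignorance is the single entry `{theta: 1.0}` regardless of how many hypotheses the frame contains.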

So, my questions:

  • Is it the marginals I should be caring about? What do the marginal distributions mean? What are they for?

$\qquad$Yes. Message passing is just an efficient way to calculate marginals. Marginals represent a unified picture of what the system believes some output variable should be. We care about these single-variable distributions because they can be used for all kinds of further decision-making. We don't need conditional distributions here because the model is already conditioned to account for evidence introduced earlier.
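To make "a unified picture of one output variable" concrete, here is a toy marginalization. The joint table is hypothetical (my own numbers, not from any source above); the point is only that summing out the other variables leaves a single-variable distribution you can act on:

```python
import numpy as np

# Hypothetical joint belief over (class, sensor reading), already
# conditioned on all evidence introduced so far.
# Rows are classes, columns are sensor readings.
joint = np.array([[0.20, 0.10],
                  [0.05, 0.25],
                  [0.30, 0.10]])

# Summing out the reading leaves the marginal over class: the system's
# unified belief about what the object is.
marginal_class = joint.sum(axis=1)

# Downstream decision-making can act directly on the marginal.
best_class = int(marginal_class.argmax())
print(marginal_class, best_class)
```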

  • Can Dempster-Shafer be thought of as a kind of learner, then, just as Bayes Nets are?

$\qquad$Caveat. Bayes Nets aren't exactly learners, at least not in the sense I meant. They're graphical representations of variables and conditional dependencies. They're more in the category of Hidden Markov Models than in the category of Neural Nets. They don't really learn from examples in the same way a supervised learner might, but they can use examples to estimate conditional probabilities to be used later in reasoning (a kind of "training"). Bayes Nets are great for modeling known relationships and then making inferences about unobserved parts of the system or doing maximum likelihood estimation.
$\qquad$Yes and No. Dempster-Shafer is a rule much like Bayes Rule is a rule, but an analog to a Bayes Net is possible; the authors implemented it.

  • Is there a context where I should prefer a Dempster-Shafer model or Bayes Net to a model not based on belief propagation? Is there a context where I should prefer the opposite?

$\qquad$Yes. Belief propagation ("message passing") is an efficient way to reason with probability distributions which are related to each other in a way you can draw as a network. If your evidence or examples are better or more naturally expressed as probabilities over possible outputs and you know the topology of that network, then these models are for you. If you instead have examples which are more like points in space and don't know any relationship among them, then use something like supervised or unsupervised learning.

  • When is D-S valid and invalid? Why?

$\qquad$It is valid to reason this way when the pieces of evidence being combined are statistically independent. It is invalid when they are not, because the combination rule relies on independence and will double-count shared information without it.
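A toy sketch of the double-counting failure (my own illustration, with a hypothetical two-hypothesis frame): fusing a report with an exact copy of itself, i.e. maximally dependent "evidence", inflates support even though no new information arrived.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination (minimal version)."""
    out, conflict = {}, 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            out[inter] = out.get(inter, 0.0) + mb * mc
        else:
            conflict += mb * mc
    return {a: v / (1.0 - conflict) for a, v in out.items()}

theta = frozenset({"X", "Y"})
m = {frozenset({"X"}): 0.7, theta: 0.3}

# One report vs. that same report fused with a copy of itself.
once = m
twice = combine(m, m)
print(once[frozenset({"X"})], twice[frozenset({"X"})])
```

Support for X jumps from 0.7 to 0.91 purely from counting the same evidence twice, which is exactly the failure mode the independence requirement guards against.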

Pavel Komarov
  • 1,097
  • 9
  • 19
  • [The answer to this question](https://ai.stackexchange.com/questions/7761/how-does-the-dempster-shafer-theory-of-evidence-differ-from-the-bayesian-reasoni) is also helpful. – Pavel Komarov Sep 13 '18 at 21:58