This is obviously an interesting (and useful) piece of information to have, and it's been studied extensively under the name "feature selection" (or feature ranking, etc.). There are lots of schemes, many of which are classifier-agnostic. That said, the Naive Bayesian classifier is something of a special case in that this information is directly accessible, in the form of the conditional probabilities.
As you may recall, the classifier uses the probability $P(\textrm{class }| \textbf{ observations})$ to make classifications and, by Bayes' Rule, that quantity is proportional to $P(\textbf{observations }| \textrm{ class}) \cdot P(\textrm{class})$. We make one more assumption, namely that $P(\textbf{observations }| \textrm{ class})$ can be approximated by $\prod_i P(\textrm{observation}_i| \textrm{ class})$. (If this is unclear, I previously worked an example that some people seemed to like in this answer.)
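To make the factorization concrete, here's a minimal sketch in Python with made-up priors and likelihoods (none of the numbers below come from any real example; they're purely illustrative):

```python
import numpy as np

# Made-up priors P(class) and conditional probabilities P(observation_i | class).
priors = {"red": 0.4, "green": 0.6}
likelihoods = {
    "red":   [0.9, 0.2],   # P(obs_1 | red),   P(obs_2 | red)
    "green": [0.1, 0.7],   # P(obs_1 | green), P(obs_2 | green)
}

# P(class | observations) is proportional to P(class) * prod_i P(obs_i | class).
scores = {c: priors[c] * np.prod(likelihoods[c]) for c in priors}
total = sum(scores.values())
posteriors = {c: score / total for c, score in scores.items()}
print(posteriors)  # e.g. {'red': 0.63..., 'green': 0.36...}
```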
Those conditional probabilities are exactly what you're after! If they're similar when conditioned on all the classes, then that feature isn't contributing much to the classification; if they're quite different, then it is. Let's look at some of the data from the example linked above. We calculated the following probabilities as part of the training procedure:
- P(feature1=A|class=red)=1
- P(feature1=A|class=green)=1/8
- P(feature1=B|class=red)=0
- P(feature1=B|class=green)=7/8
This suggests that feature1 is very useful for separating the red and green classes. We would be less enthralled if we saw values more like this:
- P(feature1=A|class=red)=0.499
- P(feature1=A|class=green)=0.501
- P(feature1=B|class=red)=0.501
- P(feature1=B|class=green)=0.499
though obviously it depends on how precisely each of those values has been estimated. Note that this method can also reveal some potentially interesting things about our data. For example, if we saw something like:
- P(feature1=A|class=red)=0.5
- P(feature1=A|class=green)=0.0
- P(feature1=B|class=red)=0.5
- P(feature1=B|class=green)=1.0
This would tell us that observing 'A' for feature 1 provides a lot of information (it essentially rules out the green class), but observing 'B' doesn't help much. For example, observing a beard is an excellent way to tell a man from a woman, but not observing a beard doesn't help at all. In this case, you'd also need to consider how often you observe 'A' overall: a value that is very informative but rarely seen won't improve your classifier on most examples.
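If you happen to be fitting the model with something like scikit-learn, the fitted estimator exposes these conditionals directly, so you can do this kind of eyeballing programmatically. A minimal sketch, assuming binary features and a `BernoulliNB` model (the data here is random noise, just to make the snippet run end-to-end):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Fake binary feature matrix and labels, purely so the example runs.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = rng.integers(0, 2, size=200)

clf = BernoulliNB().fit(X, y)

# feature_log_prob_ holds log P(feature_i = 1 | class); exponentiate for readability.
cond = np.exp(clf.feature_log_prob_)   # shape: (n_classes, n_features)

# A crude two-class "importance": how far apart the conditionals are for each feature.
gap = np.abs(cond[0] - cond[1])
print("conditionals:\n", cond)
print("features ranked by gap:", np.argsort(gap)[::-1])
```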
The preceding suggestions are all pretty informal. You could attempt to quantify the impact of your variables in several ways. One simple idea would be to use the Kullback-Leibler Divergence, which is a measure of the dissimilarity between two probability distributions: it is zero if they're identical and grows as they become more dissimilar. It is also dead-simple to compute:
$$D_{KL}(P || Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)} $$
$$D_{KL}(P || Q) = \int_{-\infty}^{\infty} P(x) \ln \frac{P(x)}{Q(x)} \, dx$$
(depending on your distributions) and it has some handy information-theoretic properties too. In this case, you'd want to compute $D_{KL}(P(\textrm{feature}_i | \textrm{class = red}) || P(\textrm{feature}_i | \textrm{class = green}))$ for each feature $i$, rank the features, and take the highest-ranking ones. This works, but there are a few caveats. In particular, $D_{KL}(P||Q) \ne D_{KL}(Q||P)$, so be careful how you implement this, particularly for multi-class problems. I suppose you could also flip this around and compute the divergence of the class labels conditioned on each feature value, though that might be a little odd unless you have many different class labels.
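Here's a rough sketch of that computation for the toy red/green numbers above, using `scipy.stats.entropy` (which computes the discrete KL divergence when given two distributions). The smoothing constant is my own addition, since $D_{KL}$ blows up whenever $Q(x)=0$ and $P(x)>0$:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes D_KL(p || q) for discrete p, q

# Conditional distributions of feature1 over its values (A, B), one per class,
# taken from the toy example above.
p_red   = np.array([1.0, 0.0])   # P(feature1 = A | red),   P(feature1 = B | red)
p_green = np.array([1/8, 7/8])   # P(feature1 = A | green), P(feature1 = B | green)

# Smooth and renormalize so zero probabilities don't send the divergence to infinity.
eps = 1e-6
p_red   = (p_red + eps) / (p_red + eps).sum()
p_green = (p_green + eps) / (p_green + eps).sum()

# The divergence is asymmetric, so you might average the two directions before ranking.
kl_rg = entropy(p_red, p_green)
kl_gr = entropy(p_green, p_red)
print(kl_rg, kl_gr, 0.5 * (kl_rg + kl_gr))
```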
Finally, there are also some other approaches for comparing distributions. You could use something like a Kolmogorov–Smirnov test (for continuous features) or a Chi-Squared test (for discrete features) to check whether the conditional distributions are equal. However, keep in mind that a statistically insignificant difference can still contribute to your classifier's performance, and conversely, a significant difference may not have much impact if its magnitude is very small.
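For instance, a rough sketch with scipy, using made-up counts that echo the toy probabilities above for the discrete case and random draws for the continuous one:

```python
import numpy as np
from scipy.stats import chi2_contingency, ks_2samp

# Discrete feature: a contingency table of raw counts (rows = classes, columns = values A, B).
counts = np.array([[5, 0],    # class red:   5 A's, 0 B's
                   [1, 7]])   # class green: 1 A,   7 B's
chi2, p_disc, dof, expected = chi2_contingency(counts)
print("chi-squared p-value:", p_disc)

# Continuous feature: compare the raw feature values observed within each class.
rng = np.random.default_rng(0)
red_values   = rng.normal(0.0, 1.0, size=50)
green_values = rng.normal(0.5, 1.0, size=50)
stat, p_cont = ks_2samp(red_values, green_values)
print("KS p-value:", p_cont)
```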