This is obviously an interesting (and useful) piece of information to have, and it's been studied extensively under the name "feature selection" (or feature ranking, etc.). There are lots of schemes, many of which are classifier-agnostic. That said, the Naive Bayesian classifier is something of a special case in that this information is directly accessible, in the form of the conditional probabilities.
As you may recall, the classifier uses the probability $P(\textrm{class }| \textbf{ observations})$ to make classifications and, by Bayes' Rule, that quantity is proportional to $P(\textbf{observations }| \textrm{ class}) \cdot P(\textrm{class})$. We make one more assumption, namely that $P(\textbf{observations }| \textrm{ class})$ can be approximated by $\prod_i P(\textrm{observation}_i| \textrm{ class})$. (If this is unclear, I previously worked an example that some people seemed to like in this answer.)
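To make the factorization concrete, here's a minimal sketch in Python with made-up priors and likelihoods (none of the numbers below come from any real example; they're purely illustrative):

```python
import numpy as np

# Made-up priors P(class) and conditional probabilities P(observation_i | class).
priors = {"red": 0.4, "green": 0.6}
likelihoods = {
    "red":   [0.9, 0.2],   # P(obs_1 | red),   P(obs_2 | red)
    "green": [0.1, 0.7],   # P(obs_1 | green), P(obs_2 | green)
}

# P(class | observations) is proportional to P(class) * prod_i P(obs_i | class).
scores = {c: priors[c] * np.prod(likelihoods[c]) for c in priors}
total = sum(scores.values())
posteriors = {c: score / total for c, score in scores.items()}
print(posteriors)  # e.g. {'red': 0.63..., 'green': 0.36...}
```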
Those conditional probabilities are exactly what you're after! If they're similar when conditioned on all the classes, then that feature isn't contributing much to the classification; if they're quite different, then it is. Let's look at some of the data from the example linked above. We calculated the following probabilities as part of the training procedure:
- P(feature1=A|class=red)=1
- P(feature1=A|class=green)=1/8
- P(feature1=B|class=red)=0
- P(feature1=B|class=green)=7/8
This suggests that feature1 is very useful for separating the red and green classes. We would be less enthralled if we saw values more like this:
- P(feature1=A|class=red)=0.499
- P(feature1=A|class=green)=0.501
- P(feature1=B|class=red)=0.501
- P(feature1=B|class=green)=0.499
though obviously it depends on how precisely each of those values has been estimated. Note that this method can also reveal some potentially interesting things about our data. For example, if we saw something like:
- P(feature1=A|class=red)=0.5
- P(feature1=A|class=green)=0.0
- P(feature1=B|class=red)=0.5
- P(feature1=B|class=green)=1.0
This would tell us that observing 'A' for feature 1 provides a lot of information (it essentially rules out the green class), but observing 'B' doesn't help much. For example, observing a beard is an excellent way to tell a man from a woman, but not observing a beard doesn't help at all. In this case, you'd also need to consider how often you observe 'A' overall: a value that is very informative but rarely seen won't improve your classifier on most examples.
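If you happen to be fitting the model with something like scikit-learn, the fitted estimator exposes these conditionals directly, so you can do this kind of eyeballing programmatically. A minimal sketch, assuming binary features and a `BernoulliNB` model (the data here is random noise, just to make the snippet run end-to-end):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Fake binary feature matrix and labels, purely so the example runs.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = rng.integers(0, 2, size=200)

clf = BernoulliNB().fit(X, y)

# feature_log_prob_ holds log P(feature_i = 1 | class); exponentiate for readability.
cond = np.exp(clf.feature_log_prob_)   # shape: (n_classes, n_features)

# A crude two-class "importance": how far apart the conditionals are for each feature.
gap = np.abs(cond[0] - cond[1])
print("conditionals:\n", cond)
print("features ranked by gap:", np.argsort(gap)[::-1])
```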
The preceding suggestions are all pretty informal. You could attempt to quantify the impact of your variables in several ways. One simple idea would be to use the Kullback-Leibler Divergence, which is a measure of the dissimilarity between two probability distributions: it is zero if they're identical and grows as they become more dissimilar. It is also dead-simple to compute:
$$D_{KL}(P || Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)} $$
$$D_{KL}(P || Q) = \int_{-\infty}^{\infty} P(x) \ln \frac{P(x)}{Q(x)} \, dx$$
(depending on your distributions) and it has some handy information-theoretic properties too. In this case, you'd want to compute $D_{KL}(P(\textrm{feature}_i | \textrm{class = red}) || P(\textrm{feature}_i | \textrm{class = green}))$ for each feature $i$, rank the features, and take the highest-ranking ones. This works, but there are a few caveats. In particular, $D_{KL}(P||Q) \ne D_{KL}(Q||P)$, so be careful how you implement this, particularly for multi-class problems. I suppose you could also flip this around and compute the divergence of the class labels conditioned on each feature value, though that might be a little odd unless you have many different class labels.
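Here's a rough sketch of that computation for the toy red/green numbers above, using `scipy.stats.entropy` (which computes the discrete KL divergence when given two distributions). The smoothing constant is my own addition, since $D_{KL}$ blows up whenever $Q(x)=0$ and $P(x)>0$:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes D_KL(p || q) for discrete p, q

# Conditional distributions of feature1 over its values (A, B), one per class,
# taken from the toy example above.
p_red   = np.array([1.0, 0.0])   # P(feature1 = A | red),   P(feature1 = B | red)
p_green = np.array([1/8, 7/8])   # P(feature1 = A | green), P(feature1 = B | green)

# Smooth and renormalize so zero probabilities don't send the divergence to infinity.
eps = 1e-6
p_red   = (p_red + eps) / (p_red + eps).sum()
p_green = (p_green + eps) / (p_green + eps).sum()

# The divergence is asymmetric, so you might average the two directions before ranking.
kl_rg = entropy(p_red, p_green)
kl_gr = entropy(p_green, p_red)
print(kl_rg, kl_gr, 0.5 * (kl_rg + kl_gr))
```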
Finally, there are also some other approaches for comparing distributions. You could use something like a Kolmogorov–Smirnov test (for continuous features) or a Chi-Squared test (for discrete features) to check whether the conditional distributions are equal. However, keep in mind that a statistically insignificant difference can still contribute to your classifier's performance, and conversely, a significant difference may not have much impact if its magnitude is very small.
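For instance, a rough sketch with scipy, using made-up counts that echo the toy probabilities above for the discrete case and random draws for the continuous one:

```python
import numpy as np
from scipy.stats import chi2_contingency, ks_2samp

# Discrete feature: a contingency table of raw counts (rows = classes, columns = values A, B).
counts = np.array([[5, 0],    # class red:   5 A's, 0 B's
                   [1, 7]])   # class green: 1 A,   7 B's
chi2, p_disc, dof, expected = chi2_contingency(counts)
print("chi-squared p-value:", p_disc)

# Continuous feature: compare the raw feature values observed within each class.
rng = np.random.default_rng(0)
red_values   = rng.normal(0.0, 1.0, size=50)
green_values = rng.normal(0.5, 1.0, size=50)
stat, p_cont = ks_2samp(red_values, green_values)
print("KS p-value:", p_cont)
```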