
In the book “Programming Collective Intelligence”, Segaran explains the Fisher method for categorizing text as an alternative to the Naive Bayes classifier. The Fisher method uses the inverse chi-square distribution, which I do not really understand.

I watched this video, found on stats.stackexchange, about the chi-square distribution to understand at least the “forward” function: http://www.youtube.com/watch?v=dXB3cUGnaxQ

Segaran explains in his book that the inverse chi-square is used to somehow get the probability “that a random set of probabilities would return such a high number”. By “high number” he means that an item fitting a specific category has many features with high probabilities in that category. He also seems to take into account that “if the probabilities were independent and random, the result of this calculation would fit a chi-squared distribution”. But as he mentioned earlier, the words are not independent (independence is also a false assumption in Naive Bayes). So how does this work then?

And if I understand it correctly now, the inverse chi-square function somehow checks whether many of my words have a high probability of being in the text, and it returns a high overall probability only if all words have such a high probability?

I’m sort of confused.

PS: The whole paragraph: “Fisher showed that if the probabilities were independent and random, the result of this calculation would fit a chi-squared distribution. You would expect an item that doesn’t belong in a particular category to contain words of varying feature probabilities for that category (which would appear somewhat random), and an item that does belong in that category to have many features with high probabilities. By feeding the result of the Fisher calculation to the inverse chi-square function, you get the probability that a random set of probabilities would return such a high number.”
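
For concreteness, the calculation described in the quoted paragraph can be sketched roughly as follows. This is only a sketch under stated assumptions, not the book's exact code: the function names `chi2_survival_even_df` and `fisher_combined` are hypothetical, the "Fisher calculation" is assumed to be -2 times the sum of the logs of the per-feature probabilities, and the "inverse chi-square function" is assumed to be the upper-tail probability of a chi-square distribution with 2n degrees of freedom for n features.

```python
import math


def chi2_survival_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with an
    even number of degrees of freedom, using the closed-form series
    exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    (Hypothetical helper name, chosen for this illustration.)
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    m = x / 2.0
    term = math.exp(-m)      # i = 0 term of the series
    total = term
    for i in range(1, df // 2):
        term *= m / i        # builds (x/2)^i / i! incrementally
        total += term
    return min(total, 1.0)   # guard against floating-point overshoot


def fisher_combined(probs):
    """Combine probabilities in (0, 1] into a single score.
    Assumption: statistic = -2 * sum(ln p), compared against a
    chi-square distribution with 2 * len(probs) degrees of freedom.
    """
    statistic = -2.0 * sum(math.log(p) for p in probs)
    return chi2_survival_even_df(statistic, 2 * len(probs))


if __name__ == "__main__":
    print(fisher_combined([0.9, 0.9, 0.9]))  # uniformly high inputs
    print(fisher_combined([0.1, 0.1, 0.1]))  # uniformly low inputs
```

Under these assumptions, a single input is returned unchanged, uniformly high feature probabilities give a combined score close to 1, and uniformly low ones give a score close to 0. The chi-square reference distribution is only exact if the inputs are independent and uniformly distributed, which is exactly the assumption questioned above.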

kjetil b halvorsen
aufziehvogel
    The quote seems a little confused, since the inverse of the chi-square returns a quantile, not a probability. The input to the inverse of the CDF is a probability, not the output. – Glen_b Oct 29 '13 at 22:58
    The likely intent is [Fisher's method for combining p-values](http://en.wikipedia.org/wiki/Fisher%27s_method) into a single overall p-value – Glen_b May 01 '14 at 04:19

1 Answer


This document answers your question extensively: “Why Chi? Motivations for the Use of Fisher's Inverse Chi-Square Procedure in Spam Classification” by Gary Robinson.

Bastien R
    Welcome to the site, @BabOuSunshine. Would you mind giving a precis of the info found in that document? This will help readers decide if it's what they're looking for, & may help in case of linkrot. – gung - Reinstate Monica May 02 '13 at 14:56