3

I'm building a random forest classifier. Every tutorial I find gives a very simple example of how to calculate entropy with Boolean attributes.
In my problem the attribute values are computed with the tf-idf scheme, so they are real numbers.
Is there some clever way of applying the information gain function so that it works with real-number weights, or should I use a discretization like:

0 = 0  
(0, 0.1] = 1  
(0.1, 0.2] = 2

etc.
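For illustration, a minimal Python sketch of that fixed-width bucketing (the 0.1 width and the `bucket_weight` name are just for the example):

```
import math

def bucket_weight(w, width=0.1):
    """Map a tf-idf weight to an integer bucket:
    0 -> 0, (0, 0.1] -> 1, (0.1, 0.2] -> 2, and so on."""
    return 0 if w == 0 else math.ceil(w / width)

print(bucket_weight(0.0))   # 0
print(bucket_weight(0.05))  # 1
print(bucket_weight(0.15))  # 2
```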

EDIT
I have this function:

$$ IG(A) = E(C) - E(C,A) $$

$$ E(C) = -\sum\limits_{i=1}^{|C|} P(c_i) \log P(c_i) $$

and
$$ E(C,A) = \sum\limits_{a\in A} P(a) \cdot E(C \mid a) $$

The problem is that I have an infinite number of possible values of $A$, and I think I should perform discretization of these values, shouldn't I?
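As a reference, a minimal Python sketch of those two formulas for a categorical attribute (the function names are only illustrative):

```
import math
from collections import Counter

def entropy(labels):
    """E(C) = -sum_i P(c_i) * log2(P(c_i)) over the class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def conditional_entropy(attr_values, labels):
    """E(C, A) = sum over a in A of P(a) * E(C | a)."""
    n = len(labels)
    groups = {}
    for a, c in zip(attr_values, labels):
        groups.setdefault(a, []).append(c)
    return sum((len(g) / n) * entropy(g) for g in groups.values())

def information_gain(attr_values, labels):
    """IG(A) = E(C) - E(C, A)."""
    return entropy(labels) - conditional_entropy(attr_values, labels)
```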

kam
  • I think this is a duplicate question. There are plenty of questions like that on CV. See this question here: http://stats.stackexchange.com/questions/95839/gini-decrease-and-gini-impurity-of-children-nodes/95842#95842 – rapaio Jun 20 '14 at 10:29
  • My question is how I can use real-number attributes for calculating information gain. Normally my attribute would have a known number of possible values; here I have real numbers, and there probably won't be two documents whose attribute a1 has the same value. – kam Jun 20 '14 at 11:01
  • My answer to the question I linked explains how you can compute InfoGain for numerical attributes like yours. – rapaio Jun 20 '14 at 11:03
  • Ok, so if you have a classifier, you must have a target attribute, the attribute that has to be predicted. Then you have your numerical attribute, computed from tf-idf. You use **both** of them to compute InfoGain or any entropy-related value, not only the numerical value. Sort by the numerical attribute, consider each split value on it, then do the counts on the target attribute for each split value. This is explained in my answer. – rapaio Jun 20 '14 at 11:09
  • This is the last try. The formula you use works if A is a categorical/nominal variable; thus a in A means all values of variable A. If A is numeric, as in your case, then you split A into two groups, one with values less than a threshold and the other with the rest. Then you compute the entropies of those groups, A_left and A_right, for each possible threshold, and your IG = E(C) - E(C,A_left) - E(C,A_right) (sketched in code after these comments). – rapaio Jun 20 '14 at 11:47
  • I understand that, but if the attribute value is 0, then the term hasn't occurred in the document. So if I split documents like left_node < 0.5, right_node >= 0.5, then in the left node I will have both documents with no occurrence and documents with a small weight. Shouldn't 0 be a separate node in that case? – kam Jun 20 '14 at 11:53
  • If tf-idf is 0, it looks to me like any other possible value. The idea is that the threshold value is varied and only the biggest value of IG and the corresponding threshold are retained in order to perform the split. Additionally, if you have 2 docs with tf-idf equal to 0 and 100 docs with a different value, I expect the biggest value of IG not to be the one which separates the 2 docs from the other 100, but one somewhere closer to the middle. – rapaio Jun 20 '14 at 12:01
  • Ok thanks, I thought 0 should be treated specially (in the Boolean representation it would be 0 or 1, and in tf-idf 0 is always 0 and 1 is any real number), but I assume that is exactly what tf-idf is for: if a weight is small it is less relevant than a bigger weight and probably won't give any IG. Sorry for the trouble. – kam Jun 20 '14 at 13:51
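To make the threshold-based procedure from the comments above concrete, here is a minimal sketch. It reuses `entropy()` from the sketch in the question; the helper name and the midpoint thresholds are my own choices, not code from either poster:

```
def best_split_ig(values, labels):
    """Try each candidate threshold on a numeric attribute and return the
    (threshold, information gain) pair with the highest gain: sort by the
    numeric value, split into a left group (<= threshold) and a right group,
    and weight each group's entropy by its size."""
    base = entropy(labels)                  # E(C), from the sketch above
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best_threshold, best_ig = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue                        # no boundary between identical values
        threshold = (lo + hi) / 2
        left = [c for v, c in pairs if v <= threshold]
        right = [c for v, c in pairs if v > threshold]
        ig = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if ig > best_ig:
            best_threshold, best_ig = threshold, ig
    return best_threshold, best_ig
```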

1 Answer

1

Yes, you want to discretize your data. In fact, it's generally good practice to do this in machine learning, as it opens things up to more general algorithms.

The way to do it is an optimization problem; you can also think of it as a clustering problem: you want to choose a bucketing of your real values such that the resulting categories maximize the distance between buckets while keeping the distance within each bucket minimal.
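As a much simpler stand-in for that optimization, one could use equal-frequency (quantile) bucketing; a rough sketch with numpy (not a real clustering, just one possible choice for illustration):

```
import numpy as np

def quantile_buckets(values, n_buckets=5):
    """Equal-frequency bucketing: the bucket edges are quantiles of the data,
    so each bucket gets roughly the same number of points."""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.linspace(0, 1, n_buckets + 1)[1:-1])
    # assign each value the index of the bucket whose edges enclose it
    return np.searchsorted(edges, values, side="left")
```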

Now, smoothing the probability estimates will be very important here, since some buckets may end up with only a tiny number of data points. Make sure you smooth aggressively towards 1/J for buckets which have a small number of examples in them. How to do this is unfortunately somewhat underdocumented in the literature, and only Laplacian smoothing (which is terrible, and has no justification at all) appears to be well known.

As a rule of thumb / rough hack, you could insist that each bucket has some minimum number of data points, or you could use something like a p-value.
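A rough sketch of that kind of shrinkage, assuming J is the number of classes whose probabilities are estimated inside a bucket and that a single pseudo-count strength parameter is acceptable (both are my assumptions, not prescriptions from the answer):

```
def smoothed_class_probs(counts, strength=10.0):
    """Shrink a bucket's empirical class probabilities towards the uniform
    prior 1/J. Buckets with few examples are pulled strongly towards 1/J;
    buckets with many examples keep roughly their empirical frequencies."""
    J = len(counts)              # number of classes (assumption)
    n = sum(counts)              # number of examples in this bucket
    return [(c + strength / J) / (n + strength) for c in counts]

print(smoothed_class_probs([2, 0]))     # close to uniform: ~[0.58, 0.42]
print(smoothed_class_probs([150, 50]))  # close to empirical: ~[0.74, 0.26]
```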

samthebest
  • Is smoothing really necessary? I mean, I have the weights saved as real values, so I can create weights for intervals: if the tf-idf weight is less than 0.1 it gets value 1, if less than 0.2, value 2. It is more like rounding than discretization. Will it work, or should I just go back to the Boolean model? – kam Jun 20 '14 at 13:34
  • Smoothing is not necessary as long as you have a large number of examples for each bucket. – samthebest Jun 20 '14 at 14:44