Feature construction and normalization in machine learning

Question

Lets say I want to create a Logistic Classifier for a movie M. My features would be something like age of the person, gender, occupation, location. So training set would be something like:

Age Gender Occupation Location Like(1)/Dislike(0)
23 M Software US 1
24 F Doctor UK 0

and so on.... Now my question is how should I scale and represent my features. One way I thought: Divide age as age groups, so 18-25, 25-35, 35-above, Gender as M,F, Location as US, UK, Others. Now create a binary feature for all these values, hence age will be having 3 binary features each corresponding to an age group and so on. So, a 28 years Male from US would be represented as 010 10 100 (010-> Age Group 25-35, 10 -> Male, 100 -> US)

What could be the best way to represent features here ? Also, I noticed in some e.gs. of sklearn that all the features have been scaled/normalized in some way, e.g. Gender is represented by two values, 0.0045 and -.0.0045 for Male and female. I don't have any clue on how to do scaling/mormalization like this ?

Its not clear to me why you want to scale your features? Often features are normalized to have 0 mean unity std dev. You might need to define the problem in terms of what the classes are that you are trying to classify, logistic regression is useful for binary classification. — BGreene, Oct 19 '12 at 20:19
You certainly don't want to categorize ages. How is "rating of movie" measured? Is it a 1 to 10 scale, a "like/dislike" or what? — Peter Flom, Oct 19 '12 at 21:54
For simplicity, lets assume that there are only two classes, Like and Dislike. Like being 1 and Dislike being 0. Have changed the problem statement to reflect this. — snow_leopard, Oct 20 '12 at 04:24

Emile · Accepted Answer · 2012-10-23T08:52:09.917

Binary case

If you want your features to be binary, the good representations for categorical (resp. real) values are the one hot (resp. thermometer) encoding. You don't need to normalize them.

For the one hot encoding of a categorical feature, you simply reserve one bit for each class. The length of this encoding is therefore the number of classes of your feature. Lets take your example of country,

00001 for US
00010 for UK
00100 for Asia
01000 for Europe
10000 for other

For the thermometer encoding of a real/integer feature, you have to choose a length and the thresholds. For your example of age, you've chosen to split age according to the thresholds 18,25 and 35. The coding will be

000 for 0-17
001 for 18-25
011 for 25-34
111 for 35-above

Putting both together, you obtain here an encoding of size 5+3=8 bits. For a 30 year old UK resident we have $$\overbrace{0 \cdot 0 \cdot 0 \cdot 1 \cdot 0}^{UK}\cdot \overbrace{0 \cdot 1 \cdot 1 }^{30yo}$$

Continuous case

If your regression model allows it, you should prefere to keep a real value for a real/integer feature which contains more information. Let's reconsider your example. This time we simply let the value for age as an integer. The encoding for a 30 year old UK resident is thus $$\overbrace{0 \cdot 0 \cdot 0 \cdot 1 \cdot 0 }^{UK}\cdot \overbrace{30 }^{30yo}$$

As BGreene said, you should then normalize this value to keep a mean of 0 and a standard deviation of 1, which insure stability of many regression models. In order to do that, simply subtract the empirical mean and divide by the empirical standard deviation.

Y_normalized = ( Y - mean(Y) ) / std(Y)

If the mean of all the age of all persons in your data base is 25, and its standard deviation is 10, the normalized value for a 30y.o. person will be $(30-25)/10 = 0.5$, leading to the representation $$\overbrace{0 \cdot 0 \cdot 0 \cdot 1 \cdot 0}^{UK}\cdot \overbrace{0.5 }^{30yo}$$

Cool... so lets say we have an example person as follows: Country: UK, AgeGroup : 25-34. This will lead to values as **Country: 2**, **Agegroup: 4** if we use **one hot** encoding. Now while creating a feature vector we should normalize these. So lets say they come as 0.4 and 0.6, then our input feature vector to model essentially becomes [0.4, 0.6], correct ? — snow_leopard, Oct 22 '12 at 13:13
hmm.. if I use "one hot" encoding should I convert the encoding value to its Integer representation as a feature, e.g. 0010 becomes 2. OR should I treat this as a set of 4 features out of which only one will be ON ? In the former case does not it introduce a notion that 1000 is further to 0001 then 0100 which might not be the intention as we don't want US feature value to be closer to UK feature value then Asia value or something else ? — snow_leopard, Oct 22 '12 at 13:36
I edited my answer to clarify these points. You don't need to normalize binary features, and you have to treat them as a vector, don't convert them into an integer. — Emile, Oct 23 '12 at 08:55
@Emile, could you explain *why* you say it is unnecessary to normalize/scale binary features? — blacksite, Jun 11 '20 at 16:18

Feature construction and normalization in machine learning

1 Answers1

Binary case

Continuous case