Let's say I want to build a logistic regression classifier for a movie M. My features would be something like the person's age, gender, occupation, and location. So the training set would look something like:
- Age Gender Occupation Location Like(1)/Dislike(0)
- 23 M Software US 1
- 24 F Doctor UK 0
and so on. Now my question is how I should scale and represent my features. One way I thought of: divide age into age groups (18-25, 25-35, 35-above), gender into M, F, and location into US, UK, Others. Then create a binary feature for each of these values, so age will have 3 binary features, one per age group, and so on. So a 28-year-old male from the US would be represented as 010 10 100 (010 -> age group 25-35, 10 -> male, 100 -> US).
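To make the idea concrete, here is a minimal sketch of that encoding in plain Python. The names `age_group`, `one_hot`, and `encode` are hypothetical helpers made up for illustration. Since the groups 18-25 and 25-35 share a boundary, the sketch assumes that ages 25 and 35 go into the upper group, and that any location other than US or UK is mapped to Others.

```python
# Category lists taken from the groups described above.
AGE_GROUPS = ["18-25", "25-35", "35-above"]
GENDERS = ["M", "F"]
LOCATIONS = ["US", "UK", "Others"]


def age_group(age):
    """Map a numeric age to one of the three groups.

    Assumption: boundary ages (25, 35) fall into the upper group,
    and ages below 18 are lumped into "18-25".
    """
    if age < 25:
        return "18-25"
    if age < 35:
        return "25-35"
    return "35-above"


def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]


def encode(age, gender, location):
    """Concatenate the one-hot blocks for age group, gender, and location."""
    loc = location if location in ("US", "UK") else "Others"
    return (
        one_hot(age_group(age), AGE_GROUPS)
        + one_hot(gender, GENDERS)
        + one_hot(loc, LOCATIONS)
    )


print(encode(28, "M", "US"))  # [0, 1, 0, 1, 0, 1, 0, 0]
```

The printed vector matches the 010 10 100 layout: three positions for the age group, two for gender, three for location.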
What would be the best way to represent the features here? Also, I noticed in some sklearn examples that all the features have been scaled/normalized in some way, e.g. gender is represented by two values, 0.0045 and -0.0045, for male and female. I don't have any clue how to do scaling/normalization like this.
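For reference, one common kind of scaling is standardization: subtract the column mean and divide by the column standard deviation, which turns a 0/1 column into two values of opposite sign. Whether the examples I saw used exactly this is an assumption on my part, and it will not necessarily reproduce the 0.0045 values, which depend on the data and the method used. Below is a minimal sketch in plain Python; `standardize` is a hypothetical helper and the sample column is made up. As far as I know, sklearn's `StandardScaler` applies the same transformation per column, using the population standard deviation.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a numeric column to zero mean and unit population std."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column has no spread; map every entry to 0.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Made-up sample: gender coded as 1 = male, 0 = female.
gender = [1, 0, 1, 0]
print(standardize(gender))  # [1.0, -1.0, 1.0, -1.0]
```

With an unbalanced sample the two values are still of opposite sign but no longer symmetric around zero, because the mean shifts toward the more frequent class.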