
For instance, I have a set of categories from a variable called 'category'. From Python, here are my categories and their corresponding counts.

Category_Others        1430
Application/Website    1345
Backups                900
Software/OS            800
Hardware/Virtual       700
Network                302

There are two options I am considering; however, I am not sure which one is more appropriate.

I assign a value to each category as follows:

Category_Others ---- 0      
Application/Website ---- 1
Backups ---- 2               
Software/OS ---- 3        
Hardware/Virtual ---- 4     
Network ---- 5

Or I introduce 5 new indicator variables into my data frame. Then I end up with 5 more columns (i.e. one column per category, with one category dropped as the reference).

i.e. on Python

category = pd.get_dummies(incident['category'], drop_first=True)   
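To make the second option concrete, here is a small runnable sketch (the sample rows are hypothetical; only the category names come from the question) showing what `get_dummies` produces with and without `drop_first=True`:

```python
import pandas as pd

# Hypothetical sample of the 'category' column, one row per level.
incident = pd.DataFrame({
    "category": ["Category_Others", "Application/Website", "Backups",
                 "Software/OS", "Hardware/Virtual", "Network"]
})

# Full one-hot encoding: one indicator column per category (6 columns).
full = pd.get_dummies(incident["category"])

# drop_first=True drops one level, leaving 5 columns; the dropped level
# is represented by an all-zero row in the remaining columns.
dropped = pd.get_dummies(incident["category"], drop_first=True)

print(full.shape)     # (6, 6)
print(dropped.shape)  # (6, 5)
```

With the full encoding, every row's indicators sum to exactly 1, which is the linear dependence discussed in the accepted answer below.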
lusicat
  • Here is a good blog post: https://blog.myyellowroad.com/using-categorical-data-in-machine-learning-with-python-from-dummy-variables-to-deep-category-66041f734512 – FatihAkici Apr 06 '18 at 20:09

3 Answers


The first approach (assigning 0 to 5) is not recommended, because that encoding implies Network is five times Application/Website, which is not true. In other words, this encoding imposes an artificial ordering and spacing constraint on the model.
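To see that constraint concretely, here is a hedged sketch (the category means are made up for illustration) comparing a linear regression fit on integer codes versus on one-hot columns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-category means: deliberately NOT monotone in the
# arbitrary codes 0..5.
means = np.array([10.0, 50.0, 20.0, 40.0, 15.0, 30.0])
codes = np.repeat(np.arange(6), 20)   # 20 observations per category
y = means[codes]

# Integer coding forces a single slope: predictions lie on one line,
# so the non-monotone pattern across categories cannot be recovered.
lin = LinearRegression().fit(codes.reshape(-1, 1), y)
print(lin.predict(np.arange(6).reshape(-1, 1)))

# One-hot coding fits each category mean exactly.
onehot = np.eye(6)[codes]
oh = LinearRegression().fit(onehot, y)
print(oh.predict(np.eye(6)))  # recovers [10. 50. 20. 40. 15. 30.]
```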

In machine learning, people usually use one-hot encoding, which is similar to your second approach, though many people do not use drop_first=True.

There is another small issue with one-hot encoding: for a discrete variable with 6 possible values, should we encode with 6 columns or with 5?

People in the statistics community usually encode with 5 columns. Many people in machine learning use the default setting in Python, which does not drop the first column (drop_first=False).

If we encode with 6 columns, there are linear dependencies in the design matrix (together with the intercept), which causes problems in some models. But Python libraries usually apply regularization to linear models by default, so it is fine there.
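The linear dependence is easy to check numerically. A minimal sketch (with three hypothetical levels A/B/C) showing that the full encoding plus an intercept is rank-deficient, while dropping one level restores full column rank:

```python
import numpy as np
import pandas as pd

cats = pd.Series(["A", "B", "C", "A", "B", "C", "A", "B"])

# Full one-hot encoding: each row's indicators sum to 1, which equals
# the intercept column, so the columns are linearly dependent.
X_full = pd.get_dummies(cats).to_numpy(dtype=float)
intercept = np.ones((len(cats), 1))

design_full = np.hstack([intercept, X_full])        # 4 columns
design_drop = np.hstack([intercept, X_full[:, 1:]])  # drop one level: 3 columns

print(np.linalg.matrix_rank(design_full))  # 3, not 4: rank-deficient
print(np.linalg.matrix_rank(design_drop))  # 3: full column rank
```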

Finally, I want to point you to some statistical ways of encoding, in addition to the machine-learning way:

https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
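As a small illustration of one system from that page, here is a hand-rolled sketch of sum (deviation) coding next to the usual treatment (dummy) coding; the level names A/B/C are hypothetical and the data is balanced:

```python
import pandas as pd

cats = pd.Series(["A", "B", "C", "A", "B", "C"])

# Treatment (dummy) coding: the reference level "A" is all zeros,
# so effects are measured against level A.
treat = pd.get_dummies(cats, drop_first=True).astype(int)

# Sum (deviation) coding: same columns, but reference-level rows are -1,
# so effects are measured against the grand mean instead.
sum_coded = treat.copy()
sum_coded.loc[cats == "A", :] = -1

print(treat.to_numpy().tolist())
print(sum_coded.to_numpy().tolist())
```

Computationally all these codings span the same model; they differ only in the basis and therefore in how the coefficients are interpreted.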

Haitao Du
  • +1 See my answer here: https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding/329281#329281 for more about this issue. – kjetil b halvorsen Apr 06 '18 at 18:28
  • 5 vs. 6 columns and the question whether they need to sum up to 1 or not: this actually corresponds to whether your categories are mutually exclusive or not (how do you want to treat a Backup of a Website?). Sensible choice will depend on your application. – cbeleites unhappy with SX Apr 06 '18 at 18:51
  • @cbeleites agreed, thanks for the comment, and also for the link I provide in my answer. All of them are the same from a computational perspective, but with different bases / interpretations. – Haitao Du Apr 06 '18 at 18:52

I would add to the answer that it depends on your algorithm of choice. For example, if your feature set consists of a mix of categorical and numeric variables, and you want to use a tree based algorithm, then one-hot encoding could hurt you, compared to leaving them categorical. This is because:

  1. Tree based methods understand categories (they don't treat category 4 as twice as large as category 2, they'll just split on category < 4 or > 4)

  2. One-hot encoding means that certain binary variables may be split on higher up the tree than they would be otherwise, and the resulting sub-tree may then split on numerical variables, which can leave a leaf node less pure (less class homogeneity) than it would be had the tree split on the categorical value. See here for a nice write-up:

    https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/
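Point 1 above can be sketched concretely. A tree splits integer codes only with ordered thresholds (code <= t), not by magnitude, so rescaling the codes does not change what the tree learns (all data here is hypothetical; note that scikit-learn trees do treat the codes as ordered numbers, while libraries such as LightGBM support genuinely categorical splits):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# One integer-coded categorical feature with 6 levels.
codes = rng.integers(0, 6, size=200)
# Target depends only on which side of "code < 3" a row falls.
y = (codes >= 3).astype(int)

tree_a = DecisionTreeClassifier(random_state=0)
tree_a.fit(codes.reshape(-1, 1), y)

# Doubling the codes preserves their order, so the tree recovers the
# same partition of categories: magnitude never matters, only order.
tree_b = DecisionTreeClassifier(random_state=0)
tree_b.fit((codes * 2).reshape(-1, 1), y)

print(tree_a.score(codes.reshape(-1, 1), y))        # 1.0
print(tree_b.score((codes * 2).reshape(-1, 1), y))  # 1.0
```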

Finally, this site has a bunch of other answers related to this topic that could help provide more information.

ilanman

We can also use discriminant analysis, which is designed for classification problems where the target has many class levels, without requiring any manual encoding of that target.
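A minimal sketch of that idea, using scikit-learn's LDA on entirely synthetic data with 6 class labels (to match the 6 categories in the question); the data-generating choices here are assumptions for illustration only:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 300

# 6 class labels, passed to LDA directly: no dummy coding of the target.
y = rng.integers(0, 6, size=n)

# Synthetic features: class k gets a +4 shift in feature k,
# so the classes are well separated.
X = rng.normal(size=(n, 6))
X[np.arange(n), y] += 4.0

clf = LinearDiscriminantAnalysis().fit(X, y)
pred = clf.predict(X)
print(pred.shape)          # (300,)
print(clf.score(X, y))     # high training accuracy on separable classes
```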