I have a set of medical records, one of the columns has the name of 2000 different doctors. What do I need to do in order to convert these strings to numbers? I want to use a decision tree.

gunes
  • 49,700
  • 3
  • 39
  • 75
  • Please don't say your question is urgent or ask people to answer quickly. Remember that you are asking strangers to volunteer their time to help you for free. People will respond at the rate that is comfortable for them. – gung - Reinstate Monica Mar 22 '20 at 12:55
  • Apparently, you seem to ignore moderator warnings. – gunes Mar 22 '20 at 13:08
  • See https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels, https://stats.stackexchange.com/questions/410939/label-encoding-vs-dummy-variable-one-hot-encoding-correctness/414729#414729 – kjetil b halvorsen Mar 22 '20 at 13:44

1 Answer

Use one-hot encoding: create one new binary feature per doctor. For example, with two doctors, Dr. A’s features will be [1,0] and Dr. B’s will be [0,1]; with 2000 doctors you get 2000 binary columns in the same way. Assigning arbitrary numbers to each doctor is not the correct approach because it induces an implicit ordering: normally there is no sense in which Dr. A < Dr. B, but once you assign numbers the model will treat the doctors as comparable when they shouldn’t be.
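As a rough illustration of the encoding described above, here is a minimal pure-Python sketch. The function name `one_hot_encode` and the sample doctor names are made up for this example; in practice a library routine such as pandas' `get_dummies` would likely be more convenient.

```python
def one_hot_encode(names):
    """Return (categories, rows) with one binary column per distinct name."""
    # Sort the distinct names so the column order is reproducible.
    categories = sorted(set(names))
    index = {name: i for i, name in enumerate(categories)}
    rows = []
    for name in names:
        row = [0] * len(categories)
        row[index[name]] = 1  # exactly one column is set per record
        rows.append(row)
    return categories, rows


# Hypothetical doctor column taken from four records.
doctors = ["Dr. A", "Dr. B", "Dr. A", "Dr. C"]
categories, encoded = one_hot_encode(doctors)
```

Each row contains a single 1, so no doctor is numerically "larger" than another. With 2000 distinct doctors each row would have 2000 entries, which is why a sparse representation is usually preferable at that scale.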

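Specifically for decision trees, a number of libraries don't require you to convert your categorical features into numeric ones, so check your library first: you may be able to use the column as is. Others, such as scikit-learn's decision trees, do need numeric input, in which case the encoding above applies.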

gunes