How to deal with Nominal categorical with label encoding?

Question

So if my dataset looks like this:

  names life_style instrument  times
0   sid   creative      piano    1.5
1  aadi   artistic     guitar    1.4
2  aman  traveller       drum    1.1
3   sid   artistic     guitar    1.5
4  aadi   creative       drum    1.4

Now i want to deal with those Nominal categorical variables , Easy and go to approach is use Label encoding , But suppose if i am using sklearn label encoder then:

from sklearn.preprocessing import LabelEncoder
big_data = dataset_pd.apply(LabelEncoder().fit_transform)

which will output:

   names  life_style  instrument  times
0      2           1           2      2
1      0           0           1      1
2      1           2           0      0
3      2           0           1      2
4      0           1           0      1

Now it is converting each column but each column have same numeric values range from 0 to 5. The instrument variable is now similar to 'names' variable since both will have similar data points, which is certainly not a right approach. I have few questions :

How should i treat those values without loosing information ? better approach for this type of data points ?
I am thinking to use random forest for this type of data , Any suggestion on model also will be helpful for me.
If some variable are appearing only once in huge dataset should i remove those variables?

Usually one would do `OneHotEncoding` to escape this problem. — hellpanderrr, Jul 30 '18 at 22:38

kjetil b halvorsen · Answer 1 · 2020-08-26T15:13:03.790

How to encode a nominal categorical variable? In part that will depend on which kind of model/algorithm you will use later, but the standard advice can be found here Where to find a guide to encoding categorical features?. If you have a categorical variable with many levels, see here Principled way of collapsing categorical variables with many levels?.

If you are using a random forest, it might well be that the program will take care of the encoding, or have special requirements as to how you give it the variables. More ideas here: Random Forest Regression with sparse data in Python

If some level occurs only once in the dataset, it might be a good idea to use regularization. Maybe that level occurs later in production while using the model!

How to deal with Nominal categorical with label encoding?

1 Answers1

Linked