1

So if my dataset looks like this:

  names life_style instrument  times
0   sid   creative      piano    1.5
1  aadi   artistic     guitar    1.4
2  aman  traveller       drum    1.1
3   sid   artistic     guitar    1.5
4  aadi   creative       drum    1.4

Now i want to deal with those Nominal categorical variables , Easy and go to approach is use Label encoding , But suppose if i am using sklearn label encoder then:

from sklearn.preprocessing import LabelEncoder
big_data = dataset_pd.apply(LabelEncoder().fit_transform)

which will output:

   names  life_style  instrument  times
0      2           1           2      2
1      0           0           1      1
2      1           2           0      0
3      2           0           1      2
4      0           1           0      1

Now it is converting each column but each column have same numeric values range from 0 to 5. The instrument variable is now similar to 'names' variable since both will have similar data points, which is certainly not a right approach. I have few questions :

  • How should i treat those values without loosing information ? better approach for this type of data points ?

  • I am thinking to use random forest for this type of data , Any suggestion on model also will be helpful for me.

  • If some variable are appearing only once in huge dataset should i remove those variables?

kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467
Aaditya Ura
  • 263
  • 1
  • 3
  • 8

1 Answers1

1

How to encode a nominal categorical variable? In part that will depend on which kind of model/algorithm you will use later, but the standard advice can be found here Where to find a guide to encoding categorical features?. If you have a categorical variable with many levels, see here Principled way of collapsing categorical variables with many levels?.

If you are using a random forest, it might well be that the program will take care of the encoding, or have special requirements as to how you give it the variables. More ideas here: Random Forest Regression with sparse data in Python

If some level occurs only once in the dataset, it might be a good idea to use regularization. Maybe that level occurs later in production while using the model!

kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467