I have a dataframe with a categorical variable that has over 1,600 categories. Is there an alternative encoding so that I don't end up with over 1,600 columns?

I found this interesting link on feature hashing: http://amunategui.github.io/feature-hashing/#sourcecode

But the approach there converts the data to a class/object, which I don't want. I want my final output to be a dataframe so that I can test it with different machine learning models.

Is there any way I can implement this?
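
For illustration, here is a minimal sketch of the hashing trick that keeps the result as a dataframe. It assumes Python with pandas, which the question does not state, and it is hand-rolled with `hashlib` rather than taken from the linked post. The function name `hash_encode`, the column names `service` and `tenure`, and the bucket count are all hypothetical.

```python
import hashlib

import pandas as pd


def hash_encode(df, column, n_features=64):
    """Replace `column` with `n_features` signed hash-bucket columns."""
    names = [f"{column}_hash_{i}" for i in range(n_features)]
    rows = []
    for value in df[column].astype(str):
        digest = hashlib.md5(value.encode("utf-8")).digest()
        # The first 8 bytes pick the bucket (output column) for this category.
        bucket = int.from_bytes(digest[:8], "big") % n_features
        # A separate byte picks the sign, so colliding categories can partly cancel.
        sign = 1 if digest[8] & 1 else -1
        row = [0] * n_features
        row[bucket] = sign
        rows.append(row)
    hashed = pd.DataFrame(rows, columns=names, index=df.index)
    # Drop the original categorical column and append the hashed columns.
    return pd.concat([df.drop(columns=[column]), hashed], axis=1)


# Hypothetical toy data standing in for the real 1,600-category column.
df = pd.DataFrame(
    {
        "service": ["broadband", "mobile", "tv", "mobile"],
        "tenure": [3, 12, 7, 1],
    }
)
encoded = hash_encode(df, "service", n_features=8)
print(encoded.shape)
```

The number of output columns is fixed by `n_features` regardless of how many categories exist, at the cost of collisions: distinct categories can land in the same bucket, so a larger `n_features` trades width for fewer collisions.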

kjetil b halvorsen
vinaykva
  • What are you using the categorical values for? As input to another classifier? Or do you want to use the 1-of-1600 categories directly? Or ...? – Hugh Perkins Feb 13 '18 at 14:02
  • @HughPerkins type of service customer enrolled for. – vinaykva Feb 13 '18 at 15:27
  • That answers the question 'what do the categories represent?' but not my intended question of 'how are you going to use the output of this network?' – Hugh Perkins Feb 14 '18 at 01:55
  • I like to consider my output variable and then look at subsets of the one-hot encoding using variable importance. In this way I have been able to reduce columns that would have had O(nCr(20,3)) one-hot candidates to 3 columns, though it was very data dependent. The variable importance step saw ~1100 columns and returned the important ones, which were 3 columns. That was much more human-understandable, and substantially improved prediction quality. – EngrStudent Oct 15 '18 at 16:17
  • If you use sparse matrices, 1600 columns isn't a problem. – kjetil b halvorsen Dec 18 '18 at 23:51
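
The sparse suggestion in the last comment can be sketched as follows, again assuming Python with pandas (not stated in the thread). The data is synthetic and the column names are hypothetical. `pd.get_dummies(..., sparse=True)` still returns a dataframe, but the dummy columns are stored sparsely.

```python
import pandas as pd

# Synthetic stand-in for a column with 1,600 distinct categories.
n_categories = 1600
df = pd.DataFrame(
    {
        "service": [f"svc_{i % n_categories}" for i in range(5000)],
        "tenure": range(5000),
    }
)

# One column per category, backed by sparse arrays instead of dense ones.
onehot = pd.get_dummies(df, columns=["service"], sparse=True)

print(onehot.shape)
print(onehot["service_svc_0"].dtype)
```

Whether a given model accepts sparse columns directly depends on the library, so a conversion step may still be needed before fitting.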

0 Answers