I have a dataframe with around 37,000 rows and 54 columns. Two of these columns, 'user_id' and 'mail_id', are provided in a very creepy format, as shown below:

user_id                                           mail_id       
AR+tMy3H/E+Re8Id20zUIz+amJkv6KU12o+BrgIDin0=      DQ/4I+GIOz2ZoIiK0Lg0AkwnI35XotghgUK/MYc101I=
1P4AOvdzJzhDSHi7jJ3udWv4ajpKxOn4T/rCLv4PrXU=      BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
OEfFUcsTAGInCfsHuLZuIgdSNtuNsg8EdfN98VUZVTs=      BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=   
1P4AOvdzJzhDSHi7jJ3udWv4ajpKxOn4T/rCLv4PrXU=      EHNBRbi6i9KO6cMHsuDPFjZVp2cY3RH+BiOKwPwzLQs=
CYRcuV0cR0algMZJ1N6+3uKcqi8iu+6tJNzmBbmgN7o=      K0y/NW59TJkYc5y0HUwDeAXrewYT0JQlkcozz0s2V5Q=

After a detailed analysis of my data I figured out that I cannot drop these two columns from my dataframe, as they are too important for prediction. I could hash these two features, but there is one more interesting thing: there are only 2,000 distinct user_ids and mail_ids, so one-hot encoding could help a lot. My question is: if I one-hot encode these columns using the 'get_dummies' method in pandas with sparse=True, will it be memory efficient, or is there a more efficient way to do it?

enterML

1 Answer

That "creepy" format is just a form of anonymization. In Python you can use the base64 library to call b64encode(b'hi my name is derek') and get aGkgbXkgbmFtZSBpcyBkZXJlaw== as output. You'll notice the similarity to the above.

When I use hashlib and run base64.b64encode(hashlib.sha1(b'derek').hexdigest().encode()), I get my name hashed and encoded as Base64, which is likely what you have above. It might be fun to experiment and see whether base64.b64decode(user_id) gives you anything useful (unlikely, since SHA-1 and other popular hashes are one-way).
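Here is a minimal sketch of that experiment using only the standard library. The choice of SHA-256 and of encoding the raw digest (rather than the hex string) is my assumption, not something the question confirms; it is suggested only by the fact that the sample id decodes to 32 bytes.

```python
import base64
import hashlib

# Plain Base64 is reversible: encoding and decoding round-trip.
encoded = base64.b64encode(b"hi my name is derek")
print(encoded.decode("ascii"))
decoded = base64.b64decode(encoded)
print(decoded)

# Hash first, then Base64-encode the raw digest.
# ASSUMPTION: SHA-256 is a guess; the question does not say which hash was used.
digest = hashlib.sha256(b"derek").digest()
token = base64.b64encode(digest).decode("ascii")

# One id copied from the question's sample rows.
sample = "AR+tMy3H/E+Re8Id20zUIz+amJkv6KU12o+BrgIDin0="
raw = base64.b64decode(sample)

# Both are 44 Base64 characters wrapping 32 raw bytes.
print(len(token), len(raw))
```

Decoding the sample id only gives you opaque bytes, which is what you would expect if the values are digests rather than encoded text.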

But anyway, on to your point because that was a tangent:

Yeah, you can hash those together and use pandas.get_dummies if you like. I usually use sklearn for this type of thing, and I like to work within that ecosystem more than with pandas. Either should be comparable from a memory standpoint, since both can store the result sparsely: pandas.get_dummies(sparse=True) returns a DataFrame whose columns are backed by pandas sparse arrays, and sklearn's OneHotEncoder returns a SciPy sparse matrix by default, instead of a full dense NumPy array.

Sparse matrices are as good as it gets for one-hot encoding problems!
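As a rough illustration, the sketch below one-hot encodes a toy frame with and without sparse=True. The ids u1, m1 and so on are hypothetical stand-ins for the real hashed values, and the frame is far too small to show real memory savings; it only shows that the result has the same shape but sparse storage.

```python
import pandas as pd

# Toy data: made-up short ids standing in for the hashed user_id / mail_id values.
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u2", "u4"],
    "mail_id": ["m1", "m2", "m2", "m3", "m4"],
})

# Dense one-hot encoding: every cell is materialized.
dense = pd.get_dummies(df, columns=["user_id", "mail_id"])

# Sparse one-hot encoding: only the non-zero entries are stored.
sparse = pd.get_dummies(df, columns=["user_id", "mail_id"], sparse=True)

print(dense.shape, sparse.shape)   # same shape either way
print(sparse.dtypes.iloc[0])       # a pandas Sparse dtype
print(sparse.sparse.density)       # fraction of cells actually stored

# Compare the footprint on your real data; on a tiny frame the
# sparse version is not necessarily smaller.
print(dense.memory_usage(deep=True).sum())
print(sparse.memory_usage(deep=True).sum())
```

On the real data, assuming roughly 2,000 distinct values in each column, the dense result would hold 37,000 × 4,000 = 148,000,000 cells, while only 2 × 37,000 = 74,000 of them are non-zero, which is where the sparse representation pays off.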

tdy
Derek Janni