I have a dataframe with around 37,000 rows and 54 columns. Two of these 54 columns, namely 'user_id' and 'mail_id', are provided in a very cryptic format, as shown below:
user_id mail_id
AR+tMy3H/E+Re8Id20zUIz+amJkv6KU12o+BrgIDin0= DQ/4I+GIOz2ZoIiK0Lg0AkwnI35XotghgUK/MYc101I=
1P4AOvdzJzhDSHi7jJ3udWv4ajpKxOn4T/rCLv4PrXU= BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
OEfFUcsTAGInCfsHuLZuIgdSNtuNsg8EdfN98VUZVTs= BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
1P4AOvdzJzhDSHi7jJ3udWv4ajpKxOn4T/rCLv4PrXU= EHNBRbi6i9KO6cMHsuDPFjZVp2cY3RH+BiOKwPwzLQs=
CYRcuV0cR0algMZJ1N6+3uKcqi8iu+6tJNzmBbmgN7o= K0y/NW59TJkYc5y0HUwDeAXrewYT0JQlkcozz0s2V5Q=
After a detailed analysis of my data I figured out that I cannot drop these two columns from my dataframe, as they are too important for prediction. I could hash these two features, but there is one more interesting thing: there are only about 2,000 distinct user_ids and mail_ids, so one-hot encoding could help a lot. My question is: if I convert these columns to one-hot encoding using the 'get_dummies' method in pandas with sparse=True, will it be memory efficient, or is there a more efficient way to do it?
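For reference, here is a minimal sketch of what I am considering, using synthetic stand-in data (37,000 rows drawn from 2,000 hypothetical id strings, mimicking my real columns):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: 37,000 rows, ~2,000 distinct ids per column,
# mimicking the real 'user_id' and 'mail_id' columns.
rng = np.random.default_rng(0)
ids = [f"id_{i}" for i in range(2000)]
df = pd.DataFrame({
    "user_id": rng.choice(ids, size=37000),
    "mail_id": rng.choice(ids, size=37000),
})

# Sparse one-hot encoding: each dummy column is stored as a SparseArray,
# so only the positions of the 1s consume memory.
sparse_dummies = pd.get_dummies(df, columns=["user_id", "mail_id"], sparse=True)

# Dense one-hot encoding for comparison: 37,000 x 4,000 materialized values.
dense_dummies = pd.get_dummies(df, columns=["user_id", "mail_id"], sparse=False)

print("sparse bytes:", sparse_dummies.memory_usage(deep=True).sum())
print("dense bytes: ", dense_dummies.memory_usage(deep=True).sum())
```

On my understanding, the sparse version should use far less memory than the dense one, since each row has exactly one 1 per encoded column; but I am not sure whether this is the idiomatic approach or whether something like converting to the 'category' dtype (or scikit-learn's OneHotEncoder, which returns a SciPy sparse matrix) would be better.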