
I have a categorical variable, Industry, that takes many different values in a dataset of over 400K datapoints. The dataset is highly imbalanced, with a ratio of roughly 99:1. What I am doing is significantly undersampling the majority class to create a balanced 50/50 training set of 8,000 datapoints. (I lose most of the original points this way, but I need predictions made on most of them anyway, and since they are mainly from the majority class I am not losing much rare-class data.)

What I would like to do is mean encode this variable, Industry, against the target. For example, if a level such as NYC appears 1,000 times in the training data and 100 of those rows are in the positive class, then every datapoint with NYC gets the value 100/1000 = 0.1 for this new feature, and so on.
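A minimal pandas sketch of the encoding I have in mind, with toy numbers standing in for my real data (the level names and counts here are purely illustrative):

```python
import pandas as pd

# Toy data standing in for the real dataset (names and counts are illustrative)
train = pd.DataFrame({
    "Industry": ["NYC"] * 1000 + ["Other"] * 500,
    "target":   [1] * 100 + [0] * 900 + [1] * 100 + [0] * 400,
})

# Mean-encode Industry: per-level positive rate computed on the training data
encoding = train.groupby("Industry")["target"].mean()
train["Industry_enc"] = train["Industry"].map(encoding)

print(encoding["NYC"])    # 100 / 1000 = 0.1
print(encoding["Other"])  # 100 / 500  = 0.2
```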

The problem is that undersampling to balance the dataset skews this ratio: it creates artificially good ratios for the minority class, because that class now represents 50% of my training data rather than 1%. Hence, the encoded column in training will significantly overstate the ratio compared to the test set and will not help my machine learning algorithms at all.
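To make the skew concrete, here is a toy illustration (numbers invented, not from my data) of how a 50/50 undersample inflates a level's positive rate:

```python
import pandas as pd

# Illustrative: one Industry level with a 1% positive rate in the full data
full = pd.DataFrame({
    "Industry": ["Retail"] * 10000,
    "target":   [1] * 100 + [0] * 9900,
})

# 50/50 undersample: keep all 100 positives, sample 100 negatives
pos = full[full["target"] == 1]
neg = full[full["target"] == 0].sample(n=100, random_state=0)
balanced = pd.concat([pos, neg])

rate_full = full.groupby("Industry")["target"].mean().iloc[0]
rate_balanced = balanced.groupby("Industry")["target"].mean().iloc[0]
print(rate_full)      # 0.01
print(rate_balanced)  # 0.50 -- the encoding is inflated by the resampling
```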

What should I do in this case to create a good mean encoded ratio?

gung - Reinstate Monica
bernando_vialli
  • Have you tried modeling with the raw data and not downsampling? – Matthew Drury Jul 16 '18 at 19:32
  • Yes, that gave me rather poor results because the model had a hard time identifying the rare class. What I currently have actually overfits the variable I am asking about, because I computed the ratio from the combined training and testing datasets, but of course that is data leakage and not the right way to build a model. – bernando_vialli Jul 16 '18 at 19:36
  • I suspect that this is enormously relevant: [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) – Stephan Kolassa Jul 16 '18 at 19:41
  • Well, I am looking at other metrics besides accuracy, like the ROC score and confusion-matrix-derived scores (specificity, sensitivity, etc.). – bernando_vialli Jul 16 '18 at 19:44
  • Once you had fit the probability model, how did you use it to assign classes? It's common to see people use a threshold of 0.5, under some belief that it is optimal, but this is ALWAYS a bad idea. You need to tune your classification threshold in all cases. – Matthew Drury Jul 16 '18 at 19:44
  • @mkheifetz don't get fooled by the title of that question, the advice in Stephan's answer is much, much more general. Please read it carefully. – Matthew Drury Jul 16 '18 at 19:45
  • That's how I assigned them, yes, but I compared how the different values did using predict_proba in scikit-learn. For example, everything that scores roughly 70% and above is very good, and everything below is poor; the 50–70% band could be better, in the sense that the success rate in that interval is lower than the average success ratio (success being one of the two outcomes). So I think what I currently have is somewhat good (though nothing is in production yet), but I know I am currently leaking a useful variable, which I know is wrong. – bernando_vialli Jul 16 '18 at 19:47
  • OK, I will read it carefully. Regarding your other comment: is tuning the classification threshold equivalent to just viewing the results of predict_proba, do I understand that correctly? – bernando_vialli Jul 16 '18 at 19:49

1 Answer


I hope you have followed the good advice in the comments. Why do you use downsampling? It is most often used to solve a non-problem; see Why downsample? and many of its answers. With only 400k rows, memory shouldn't be a problem, and if it is, get better software. The real problem may be the use of accuracy, which is an improper scoring rule; see Is accuracy an improper scoring rule in a binary classification setting?.
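As an alternative to downsampling, here is a sketch of what this could look like in scikit-learn: reweight the loss instead of discarding rows, and evaluate with a threshold-free metric rather than accuracy. The data below is synthetic, just to illustrate the pattern:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Imbalanced toy problem (~1% positives), standing in for the 400k-row data
X, y = make_classification(n_samples=20000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights the loss instead of throwing rows away
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# Evaluate with a threshold-free metric rather than accuracy
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(auc)
```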

Then the question about target (or mean) encoding. That is an idea from machine learning used for categorical variables with very many levels. Your variable Industry probably does not have that many levels, so you could try other ideas, like dummy variables with regularization, maybe with glmnet. Glmnet uses sparse matrices, so many levels isn't a big problem. If there are many levels, see some of the ideas here: Principled way of collapsing categorical variables with many levels?.
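A rough scikit-learn analogue of the dummy-variables-with-regularization idea (glmnet itself is an R package; the data frame and column names below are made up for illustration):

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Hypothetical frame with a categorical Industry column and a binary target
df = pd.DataFrame({
    "Industry": ["finance", "retail", "tech", "retail", "tech", "finance"] * 50,
    "target":   [1, 0, 0, 1, 0, 1] * 50,
})

# Sparse one-hot dummies + penalized logistic regression, an sklearn
# analogue of fitting dummy variables with glmnet-style regularization
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)
model.fit(df[["Industry"]], df["target"])
probs = model.predict_proba(df[["Industry"]])[:, 1]
```

OneHotEncoder produces a sparse matrix by default, so even a variable with many levels stays cheap to fit, much like glmnet's sparse-matrix handling.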

Finally, if you still go for target encoding, see my answer here: Strange encoding for categorical features.
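For completeness, one common way to reduce the leakage described in the question is out-of-fold target encoding with smoothing toward the global mean. A hedged sketch follows; the helper function and its parameters are my own invention for illustration, not a library API:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5, smoothing=20):
    """Out-of-fold target encoding with additive smoothing toward the
    global mean; each row's value is computed from the other folds only,
    so no row sees its own target."""
    enc = pd.Series(np.nan, index=df.index)
    prior = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for tr_idx, val_idx in kf.split(df):
        tr = df.iloc[tr_idx]
        stats = tr.groupby(col)[target].agg(["mean", "count"])
        # Shrink rare levels toward the prior to stabilize their estimates
        smoothed = (stats["count"] * stats["mean"] + smoothing * prior) / (
            stats["count"] + smoothing
        )
        enc.iloc[val_idx] = df.iloc[val_idx][col].map(smoothed).fillna(prior).values
    return enc

# Toy usage with made-up levels
df = pd.DataFrame({
    "Industry": ["a", "b"] * 100,
    "y": [1, 0, 0, 0] * 50,
})
df["Industry_enc"] = oof_target_encode(df, "Industry", "y")
```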

kjetil b halvorsen