
I have a simple dataset with a balanced target y (0 or 1) and an imbalanced binary feature x (many 0's, few 1's).

I aim for high precision (I don't care about recall).

I can get a precision of 0.53 simply by assigning y=1 whenever x=1, but when I train a DecisionTree, XGBoost, or random forest, they all produce a model that outputs 1 for every feature value, i.e. they can't find that simple rule (y=1 iff x=1). The precision I get with these algorithms is only 0.38.

What algorithm should I use, and how can I make an ML algorithm learn that simple rule so as to maximize precision and not degenerate into always outputting 1?

Note that the actual problem will involve many features, so I need a robust ML algorithm.

# sample synthetic data; DecisionTree fails to find the simple rule
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.choice([0, 1], size=10000, p=[.99, .01])})
df['y'] = np.random.randint(0, 2, 10000)
df.loc[df.x == 1, 'y'] = 1

# precision using the rule: y=1 if x==1 else y=0
df.query('x==1')['y'].mean()  # = 1.0
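
To make the reported comparison reproducible, here is a minimal sketch (assuming scikit-learn) that fits a default DecisionTreeClassifier on the same data and prints its precision next to the precision of the hand-written rule:

# Fit a default decision tree on the synthetic data and compare its precision
# to the hand-written rule "predict 1 iff x == 1".
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score

X = df[['x']]
y = df['y']

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print('tree precision:', precision_score(y, tree.predict(X)))
print('rule precision:', precision_score(y, (df['x'] == 1).astype(int)))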
alexprice
  • Why do you need this performance on just this one variable? I can't see a way of doing better than the prior probability of each group (whatever precision that gives). – Dave Oct 26 '21 at 20:59
  • It is a minimal example; the real problem will have tens of imbalanced binary features and a balanced target. – alexprice Oct 26 '21 at 21:53

1 Answer


You are not saying which decision tree implementation you are using, but if it is sklearn.tree.DecisionTreeClassifier, note that it only allows entropy or the Gini impurity as the splitting criterion. A tree grown to maximize precision would be a different tree, so you may want to write your own classifier.
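
For illustration, here is a rough sketch of what a hand-rolled, precision-oriented classifier could look like for binary features: it evaluates the rule "predict 1 iff feature = 1" for each column and keeps the one with the best training precision. The class name, the min_support parameter, and the fit/predict interface are illustrative assumptions, not an existing library API.

# Sketch of a precision-oriented "one rule" classifier for binary features.
# Names and parameters are illustrative, not an existing API.
import numpy as np

class PrecisionOneRule:
    def __init__(self, min_support=50):
        self.min_support = min_support   # require enough predicted positives to trust the rule
        self.best_col_ = None

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        best_precision = -1.0
        for j in range(X.shape[1]):
            mask = X[:, j] == 1            # candidate rule: predict 1 iff feature j == 1
            if mask.sum() < self.min_support:
                continue
            precision = y[mask].mean()     # training precision of that rule
            if precision > best_precision:
                best_precision, self.best_col_ = precision, j
        return self

    def predict(self, X):
        X = np.asarray(X)
        if self.best_col_ is None:         # no feature had enough support
            return np.zeros(len(X), dtype=int)
        return (X[:, self.best_col_] == 1).astype(int)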

However, don't be surprised if that still gives you "strange" results. Precision suffers from the same problems as accuracy, sensitivity, specificity, F1, etc., as detailed in this thread: Why is accuracy not the best measure for assessing classification models?

Stephan Kolassa
  • Is there any Python implementation of an ML algorithm that maximizes precision, or any way to recast the problem so that the usual sklearn models achieve better precision? – alexprice Oct 27 '21 at 16:22
  • Nothing that I am aware of, sorry. (Also, I would very much recommend using probabilistic classifications and separating the classification aspect from the subsequent decision, rather than optimizing for precision: https://stats.stackexchange.com/q/312119/1352) – Stephan Kolassa Oct 27 '21 at 20:56
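
One way to act on that last suggestion, as a sketch: get probabilities from any probabilistic classifier and apply your own high cutoff when making the final 0/1 decision. The random forest and the 0.9 threshold below are arbitrary illustrations, not recommended values.

# Sketch of "probabilistic classification + explicit decision": take predicted
# probabilities from any model and choose your own cutoff for predicting 1.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score

clf = RandomForestClassifier(random_state=0).fit(df[['x']], df['y'])
proba = clf.predict_proba(df[['x']])[:, 1]   # estimated P(y = 1 | x)

threshold = 0.9                              # only predict 1 when the model is confident
pred = (proba >= threshold).astype(int)
print('precision at threshold 0.9:', precision_score(df['y'], pred, zero_division=0))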