I want to predict a continuous value between 0 and 1. Of my 100,000 samples, 99% have a true label of exactly zero, and the rest have labels between 0 and 1. What approaches can I take to beat a naive baseline that predicts 0 every time?

I have tried this method:

Training: First turn the problem into classification by treating target == 0 as class 0 and target > 0 as class 1. Then fit a regression model on the class 1 samples only.

Testing: First classify the sample as class 0 or class 1. If it is class 0, predict 0. If it is class 1, predict the continuous value between 0 and 1 using the regression model.
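The training and testing steps above can be sketched roughly as follows. This is only an illustration: `make_training_sets`, `two_stage_predict`, `classify` and `regress` are hypothetical names, the data are made up, and the two stand-in models are trivial rules used in place of whatever classifier and regressor were actually fitted.

```python
def make_training_sets(X, y):
    """Split the data the way the two-stage method needs it."""
    # Stage 1 labels: class 0 if target == 0, class 1 if target > 0.
    labels = [1 if t > 0 else 0 for t in y]
    # Stage 2 data: only the rows whose target is nonzero.
    X_pos = [x for x, t in zip(X, y) if t > 0]
    y_pos = [t for t in y if t > 0]
    return labels, X_pos, y_pos


def two_stage_predict(x, classify, regress):
    """Predict 0 for class 0, otherwise defer to the regression model."""
    return regress(x) if classify(x) == 1 else 0.0


# Toy data (made up for illustration only).
X = [[0.1], [0.2], [0.9], [0.8], [0.3]]
y = [0.0, 0.0, 0.7, 0.5, 0.0]
labels, X_pos, y_pos = make_training_sets(X, y)

# Hypothetical stand-ins for the fitted models.
mean_pos = sum(y_pos) / len(y_pos)


def classify(x):
    # Stand-in classifier: a simple threshold on the first feature.
    return 1 if x[0] > 0.5 else 0


def regress(x):
    # Stand-in regressor: always returns the mean of the nonzero targets.
    return mean_pos


preds = [two_stage_predict(x, classify, regress) for x in X]
print(labels, preds)
```

Any classifier and regressor with the same call shape could be substituted for the stand-ins.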

I am using MAE (mean absolute error) as the performance metric. The above method comes very close to the naive baseline (predicting 0 all the time gives an MAE of 0.005) but still cannot beat it.
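To see why the all-zero baseline is hard to beat under MAE, here is a small simulation. It assumes synthetic data (about 1% nonzero targets drawn uniformly from (0, 1)), not the real dataset, and it relies on the standard fact that the constant minimizing MAE is the median, which is 0 when 99% of the targets are 0.

```python
import random
import statistics


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


random.seed(0)
n = 100_000
# Assumption: roughly 1% of targets are nonzero, uniform on (0, 1).
y = [random.random() if random.random() < 0.01 else 0.0 for _ in range(n)]

# Three constant predictors: all zeros, the sample mean, the sample median.
mae_zero = mae(y, [0.0] * n)
mae_mean = mae(y, [statistics.fmean(y)] * n)
mae_median = mae(y, [statistics.median(y)] * n)

print(f"MAE predicting 0:      {mae_zero:.5f}")
print(f"MAE predicting mean:   {mae_mean:.5f}")
print(f"MAE predicting median: {mae_median:.5f}")
```

In this setup the median is 0, so the all-zero predictor is already the best constant under MAE. A model can only improve on it for samples where it is confident enough that the target is nonzero.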

nth-attempt
  • Can you explain your problem further? What sort of variables are you using as predictors? – mkt Jul 08 '17 at 18:52
  • I have some categorical features which, after one-hot encoding, yield up to 30,000 features. I also have 5 quantitative variables. I used 1000+ features, ran PCA on them, and found that 99.7% of the variance is explained along one dimension. So using 1000+ features gave me the same results as using the 2 features generated by PCA. – nth-attempt Jul 08 '17 at 19:12
  • Check https://stats.stackexchange.com/questions/283170/when-is-unbalanced-data-really-a-problem-in-machine-learning for interesting answers. Also notice that regression predicts the conditional mean. The conditional means in your case are most likely almost zero, so the model is probably doing what it is intended to do. – Tim Jul 08 '17 at 21:05
  • I don't know the answer to your question, but I know that Luis Torgo had some papers on imbalanced data and regression. This is one of them: https://arxiv.org/pdf/1505.01658. You may find some answers there. – Jacques Wainer Jul 08 '17 at 20:47
  • @JacquesWainer Thanks for the paper. Section 3.2, Metrics for Regression Tasks, describes the problem clearly, but the solutions all seem quite ad hoc to me. Is there any study comparing them? – Christabella Irwanto Aug 25 '20 at 10:54

1 Answer

Use a zero-inflated model. This is just a mixture model whose first part characterizes the probability of observing a 0. If the model predicts a nonzero result, a sequential or secondary prediction gives a continuous-range result.
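As a rough sketch of how the two parts of such a model can be combined into one prediction (the function names, the 0.5 threshold, and the example numbers are illustrative assumptions, not taken from any particular library):

```python
def expected_value(p_nonzero, mean_if_nonzero):
    """Conditional mean of a two-part model: E[y] = P(y > 0) * E[y | y > 0]."""
    return p_nonzero * mean_if_nonzero


def gated_prediction(p_nonzero, value_if_nonzero, threshold=0.5):
    """Sequential prediction: 0 unless the first part says 'nonzero'."""
    # The 0.5 threshold is an assumption and can be tuned.
    return value_if_nonzero if p_nonzero > threshold else 0.0


# Hypothetical outputs of the two parts for a typical sample and a rare one.
typical = {"p_nonzero": 0.01, "mean_if_nonzero": 0.5}
rare = {"p_nonzero": 0.9, "mean_if_nonzero": 0.5}

for name, s in (("typical", typical), ("rare", rare)):
    ev = expected_value(s["p_nonzero"], s["mean_if_nonzero"])
    gp = gated_prediction(s["p_nonzero"], s["mean_if_nonzero"])
    print(f"{name}: expected value = {ev:.3f}, gated prediction = {gp:.3f}")
```

The first function targets the conditional mean, while the second mirrors the sequential prediction described above. Which of the two scores better depends on the error metric being used.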

AdamO