
I am trying to build a model to predict the click-through rate for advertisements. I have the data in the following format:

    ad_id  | ad_feature_1 | ad_feature_2 | label (1/0) | count
    ad_xs. | 0.2          | 0.5          | 1           | 100
    ad_xs. | 0.2          | 0.5          | 0           | 10000
    ad_xz. | ..           | ..           | 1           | 10
 
and so on, where label = 1 indicates a click and label = 0 indicates that the ad was shown but not clicked. Count represents the number of times the row occurs in the data.

This is an imbalanced data set in which rows with identical feature values can have different labels. How can I build a classifier for this kind of data? Also, I currently do not have user or query data to enrich the features.
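
To make the count column concrete: each (features, label) row stands for `count` identical impressions, so rows that share features can be collapsed into clicks out of impressions. Below is a minimal sketch in plain Python that uses only the two `ad_xs.` rows from the table; the variable names are illustrative and not part of any existing pipeline.

```python
from collections import defaultdict

# (ad_id, ad_feature_1, ad_feature_2, label, count), copied from the ad_xs. rows above
rows = [
    ("ad_xs.", 0.2, 0.5, 1, 100),
    ("ad_xs.", 0.2, 0.5, 0, 10000),
]

# Collapse rows with identical features into [clicks, impressions]
totals = defaultdict(lambda: [0, 0])
for ad_id, f1, f2, label, count in rows:
    key = (ad_id, f1, f2)
    totals[key][0] += label * count  # label is 1 for clicks, so this adds clicks only
    totals[key][1] += count          # every row contributes to impressions

# Observed click-through rate per distinct feature vector
empirical_ctr = {key: clicks / shown for key, (clicks, shown) in totals.items()}
print(empirical_ctr)
```

Seen this way, the two rows are not contradictory observations but one feature vector with an observed click rate.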

NG_21
  • Identical features with different outcomes is completely normal, the difference is just random variation. All classifiers can handle that. Unbalanced classes are almost certainly not a problem: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) What is your question? – Stephan Kolassa Jul 27 '21 at 07:12
  • @StephanKolassa In such cases, how should we evaluate the model? Given a set of features, the trained model will always predict the same value, whereas in the dataset the outcomes can differ. Any resources that address these scenarios would be helpful. – NG_21 Jul 27 '21 at 08:15
  • Use probabilistic classifications. Then given a feature set, the prediction will be a constant vector of probabilities, like a probability for class 1 of 0.1. Evaluate these predictions using [proper scoring rules](https://stats.stackexchange.com/tags/scoring-rules/info). – Stephan Kolassa Jul 27 '21 at 08:31
  • This has been studied _extensively_ on this site. If you think that class imbalance is a problem you are using the wrong methods. – Frank Harrell Jul 27 '21 at 12:27
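
The suggestion in the comments (predict a probability, then score it with a proper scoring rule) could look roughly like the sketch below. It is plain Python and assumes the `count` column is used as a weight on each row; the function names `weighted_log_loss` and `weighted_brier` are made up for illustration, and the candidate probabilities 0.1 and 0.5 are arbitrary comparison points, not model output.

```python
import math

# (label, count) pairs for one feature vector, taken from the ad_xs. rows in the question
outcomes = [(1, 100), (0, 10000)]

def weighted_log_loss(p, outcomes):
    """Count-weighted mean log loss of a constant predicted click probability p."""
    total = sum(count for _, count in outcomes)
    loss = sum(
        count * -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, count in outcomes
    )
    return loss / total

def weighted_brier(p, outcomes):
    """Count-weighted mean squared error between p and the 0/1 labels."""
    total = sum(count for _, count in outcomes)
    return sum(count * (p - y) ** 2 for y, count in outcomes) / total

# Observed click rate for this feature vector
clicks = sum(count for y, count in outcomes if y == 1)
empirical_rate = clicks / sum(count for _, count in outcomes)

# Compare a few constant predictions; lower is better for both scores
for p in (empirical_rate, 0.1, 0.5):
    print(p, weighted_log_loss(p, outcomes), weighted_brier(p, outcomes))
```

Both scores are lowest at the observed click rate, which is the sense in which a single predicted probability is a sensible answer even though the individual outcomes for the same features differ.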
