How to use a variable for training a model but not for making predictions?

Question

I’m trying to predict the click-through rate (CTR) of a product listing. As an input to train my model, I want to use the position of the product in the listing (whether it’s the first product listed, the second, and so on), since that is an important component of CTR.

The problem is that I don’t have the position when making predictions for new products (because I don’t know where they will be placed in different search filters). Ideally, I would predict for all positions and use the median value of those predictions as the CTR result, but that’s not computationally feasible.

It seems like this should be a common problem when dealing with CTR forecasting. So I’d like to ask for suggestions about how to handle that situation, where I have a variable in my training data but I don’t know its value when making predictions.

Thanks!

Please say more about the nature of your model and the way that you account for list position in the model. With models that return things like coefficients for list position you might have a built-in solution that doesn't require re-computation. — EdM, Jul 14 '20 at 20:18
In my opinion, a model in which the product's list position chiefly determines the click-through rate of webshop visitors is too simple a model of CTR. I would recommend going back to the drawing board and modeling CTR from variables that are more meaningful and that are always known before the new product is launched in the webshop. — Match Maker EE, Jul 14 '20 at 22:31
@MatchMakerEE One important detail that I left out is the reason for not just outright removing the “position” variable. The CTR prediction is what will define the ordering of the listings (higher CTR, higher position). So if I don’t take the position into consideration, I’ll be introducing an undesirable feedback to my model in future re-trainings. Products that initially had higher CTR predictions and got a higher position as a result, will have higher actual CTR values, and that entails skewed predictions when I re-train the model using that data. — Celio, Jul 15 '20 at 14:01
@EdM I'm using a gradient-boosted tree model. Since the resulting model will be very much a black box, I don't think I can deal with the issue in the manner that you suggested, unfortunately. — Celio, Jul 15 '20 at 14:06

score 0 · Answer 1 · answered Jul 15 '20 at 22:13

I have no experience with modeling CTRs, but one thought that comes to mind is to use the flexibility of gradient boosting trees to handle missing values. As I understand it, the model building assigns directions at each node to missing values, even if there aren't any missing in the training set. CTR could still be predicted based on the other characteristics of the listing but without values for the list position. So if all of the new candidate listings that you want to rank are treated the same in this respect, that would provide a way to rank them for the initial ranking.
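To make the idea of a per-node direction for missing values concrete, here is a toy sketch in plain Python. It is not how xgboost or any other library is implemented; the tree, the feature names (`position`, `price`), the thresholds, and the leaf values are all made up for illustration. It only shows that a listing with no position can still be scored and ranked on its other features.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    value: float  # predicted CTR at this leaf (made-up numbers)


@dataclass
class Split:
    feature: str
    threshold: float
    left: "Node"
    right: "Node"
    missing_goes_left: bool  # direction taken when the feature is missing


Node = Union[Leaf, Split]


def predict(node, row):
    """Walk the tree; a missing feature (None) follows the node's default direction."""
    while isinstance(node, Split):
        value = row.get(node.feature)
        if value is None:
            node = node.left if node.missing_goes_left else node.right
        elif value < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value


# Hypothetical hand-built tree: split on position first, then on price.
tree = Split(
    "position", 2,
    left=Split("price", 50.0, Leaf(0.12), Leaf(0.08), True),
    right=Split("price", 50.0, Leaf(0.05), Leaf(0.03), True),
    missing_goes_left=True,
)

# New listings have no known position, so it is passed as missing for all of them.
new_listings = {
    "a": {"price": 30.0, "position": None},
    "b": {"price": 80.0, "position": None},
}
scores = {name: predict(tree, row) for name, row in new_listings.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because every new listing takes the same default branch at the position split, the scores differ only through the other features, which is what makes them usable for an initial ranking.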

You would have to do a good deal of testing, however, to see if this approach gives reasonable results for your application. As the page I linked above puts it:

Be careful if your scoring data has its missing values distributed differently from your training data. xgboost's missing handling is convenient but doesn't protect against masking.

Searching to try to understand that a bit better, I found this discussion: https://github.com/microsoft/LightGBM/issues/2921 So it seems like when you don't have null values in your training data, any null values are replaced by zero when making predictions (at least in LightGBM), sadly. — Celio, Jul 16 '20 at 16:15
@Celio even in that case, the prediction would be for the CTR if the candidate were at the head of the list (if you treat list position as numeric with 1 at the top of the list, a split by `position <2` would be the same for values of 0 or 1). That might not be a bad thing for determining initial ranking of new candidates. Or if you have a fixed number of list positions, renumber so that 0 is in a middle position. Or perhaps you could remove some fraction of list position values from the training data. Play a lot with your training data before you give up on this. — EdM, Jul 16 '20 at 16:34
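One of the suggestions in the comment above, removing some fraction of list position values from the training data so that the model sees missing positions during training, could be sketched as follows. This is a minimal illustration in plain Python; the helper name `mask_positions`, the row layout, and the 25% fraction are assumptions, and the right fraction would have to be found by experiment.

```python
import random


def mask_positions(rows, fraction, seed=0):
    """Return copies of the rows with `position` set to None in a random subset.

    Exactly round(fraction * len(rows)) rows are masked; the input is not modified.
    """
    rng = random.Random(seed)
    n_masked = round(fraction * len(rows))
    masked_idx = set(rng.sample(range(len(rows)), n_masked))
    return [
        {**row, "position": None} if i in masked_idx else dict(row)
        for i, row in enumerate(rows)
    ]


# Made-up training rows: price, list position (1 = top), and whether a click occurred.
training_rows = [
    {"price": 10.0 + i, "position": 1 + i % 5, "clicked": int(i % 5 == 0)}
    for i in range(20)
]

masked_rows = mask_positions(training_rows, fraction=0.25)
n_missing = sum(row["position"] is None for row in masked_rows)
```

The masked rows would then be passed to the model's training routine, with `None` converted to whatever missing-value marker that library expects.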
