Cost functions for contextual bandits

Question

I'm using vowpal wabbit to solve a contextual-bandit problem. I'm showing ads to users, and I have a fair bit of information about the context in which the ad is shown (e.g. who the user is, what site they're on, etc.). This seems to be a pretty classic contextual bandit problem, as described by John Langford.

In my situation, there are 2 main responses a user can have to an ad: clicking (possibly multiple times) or not clicking. I have about 1,000 ads I can choose between. Vowpal Wabbit requires a target variable in the form of action:cost:probability for each context. In my case, action and probability are easy to figure out: action is the ad I chose to display, and probability is the likelihood of choosing that ad given my current policy for showing ads.

However, I'm having trouble coming up with a good way to map my payoffs (clicks) to costs. Clicks are obviously good, and multiple clicks on the same ad are also better than single clicks on the same ad. However, not clicking on an ad is neutral: it doesn't actually cost me anything other than the missed opportunity for a click (I'm working in an odd advertising context).

Some ideas I've had are:

cost = -1 * sign(clicks) + 0 * (not clicked)
cost = -1 * clicks + 0 * (not clicked)
cost = -1 * sign(clicks) + 0.01 * (not clicked)
cost = -1 * clicks + 0.01 * (not clicked)

In the case of an action vector of (0, 1, 5, 0) the costs from these 4 functions would be:

(0, -1, -1, 0)
(0, -1, -5, 0)
(0.01, -1, -1, 0.01)
(0.01, -1, -5, 0.01)

There are obviously many other ways to represent that clicks=good and no clicks=bad. In general, how should I be modeling costs for contextual bandit problems in vowpal wabbit? Is it ok to represent benefits as negative costs, or should I re-scale everything such that all costs are positive? Is it ok for relatively neutral actions to have a zero cost, or should I give them a small positive cost to push the model towards the positive actions?

I'm confused by "there are 2 possible actions: a user can click on the ad or a user can not click on the ad." If you're trying to decide which ad to show, shouldn't the ads be the actions? — alto, Apr 02 '14 at 16:25
@alto: I think that should read "there are 2 possible responses we can record for a user". Does that make more sense? — Zach, Apr 02 '14 at 16:27
I'm not sure this really is a contextual bandit problem because I'm not sure what your goal is here. Solving the contextual bandit problem "tries to optimize a policy that chooses actions with minimum cost for the observed contexts." Are you trying to figure out how many ads to run? Trying to model consumer behavior? Something else? — shadowtalker, Feb 21 '15 at 22:23
@ssdecontrol I'm trying to figure out which ad to show, given a context. It's a commonly used example problem for contextual bandits, but I'm getting really bad results from vowpal-wabbit's contextual bandit solver. I was wondering if perhaps there's a different way I should be specifying the "cost" a click or not-click on an ad. — Zach, Feb 23 '15 at 15:18
Hi @Zach. Did you figure out the answer to your question? I have a similar situation and am unable to decide how best to implement it. Currently thinking that I should set cost = 1 if no click and cost = 0 if click. Could you share how you dealt with this problem? — Nik, Apr 13 '18 at 23:39
@nik I never really figured it out, and went with a regular multiclass model from VW instead. — Zach, Apr 16 '18 at 15:40
@Zach did the degression to a multiclass model equally solve your scenario? Curious what has been your cost function for the multiclass model. — matt, Jun 15 '18 at 04:35

score 1 · Answer 1 · answered Jun 15 '18 at 04:32

1

One should probably consult here, for initial guidance: https://arxiv.org/pdf/1802.04064.pdf

It's an empirical evaluation.

answered Jun 15 '18 at 04:32

matt

256
2
13

Cost functions for contextual bandits

1 Answers1

Linked