I'm using Vowpal Wabbit to solve a contextual bandit problem. I'm showing ads to users, and I have a fair bit of information about the context in which each ad is shown (e.g. who the user is, what site they're on, etc.). This seems to be a pretty classic contextual bandit problem, as described by John Langford.
In my situation, there are 2 main responses a user can have to an ad: clicking (possibly multiple times) or not clicking. I have about 1,000 ads I can choose between. Vowpal Wabbit requires a target variable in the form of `action:cost:probability` for each context. In my case, `action` and `probability` are easy to figure out: `action` is the ad I chose to display, and `probability` is the likelihood of choosing that ad given my current policy for showing ads.
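To make the label format concrete, here is a minimal sketch of how one logged impression could be turned into a line of Vowpal Wabbit contextual bandit input, i.e. `action:cost:probability | features`. The helper name `to_vw_cb_line` and the feature names are hypothetical, chosen only for illustration, and the cost value is a placeholder since choosing it is exactly what I'm unsure about.

```python
def to_vw_cb_line(action, cost, probability, features):
    """Format one logged impression as `action:cost:probability | features`.

    action      -- integer id of the ad that was shown (VW actions start at 1)
    cost        -- cost observed for that ad (placeholder; see the question)
    probability -- probability with which the logging policy chose this ad
    features    -- list of context feature strings (hypothetical names)
    """
    if action < 1:
        raise ValueError("action ids must be positive integers")
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return f"{action}:{cost}:{probability} | {' '.join(features)}"


# Example: ad 42 was shown with probability 0.05 and received a click.
line = to_vw_cb_line(42, -1.0, 0.05, ["user_returning", "site_news"])
print(line)
```

Only the ad that was actually shown gets a label, because that is the only ad whose outcome was observed.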
However, I'm having trouble coming up with a good way to map my payoffs (clicks) to costs. Clicks are obviously good, and multiple clicks on the same ad are better than a single click. However, not clicking on an ad is neutral: it doesn't actually cost me anything other than the missed opportunity for a click (I'm working in an odd advertising context).
Some ideas I've had are:
- cost = -1 * sign(clicks) + 0 * (not clicked)
- cost = -1 * clicks + 0 * (not clicked)
- cost = -1 * sign(clicks) + 0.01 * (not clicked)
- cost = -1 * clicks + 0.01 * (not clicked)
For example, if four displayed ads received click counts of (0, 1, 5, 0), the costs from these 4 functions would be:
(0, -1, -1, 0)
(0, -1, -5, 0)
(0.01, -1, -1, 0.01)
(0.01, -1, -5, 0.01)
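The four candidate cost functions can be sketched in Python to reproduce these numbers. The function names and the `no_click_cost` parameter are my own labels for illustration, not anything defined by Vowpal Wabbit.

```python
def sign(x):
    """Return 1 for positive x, -1 for negative x, and 0 for zero."""
    return (x > 0) - (x < 0)


def cost_binary(clicks, no_click_cost=0.0):
    """Cost of -1 for any click; `no_click_cost` when there were no clicks."""
    return -1 * sign(clicks) + no_click_cost * (clicks == 0)


def cost_count(clicks, no_click_cost=0.0):
    """Cost of -1 per click; `no_click_cost` when there were no clicks."""
    return -1 * clicks + no_click_cost * (clicks == 0)


clicks = [0, 1, 5, 0]

# The four ideas, in the same order as the list above.
idea_1 = [cost_binary(c) for c in clicks]
idea_2 = [cost_count(c) for c in clicks]
idea_3 = [cost_binary(c, no_click_cost=0.01) for c in clicks]
idea_4 = [cost_count(c, no_click_cost=0.01) for c in clicks]

for costs in (idea_1, idea_2, idea_3, idea_4):
    print(costs)
```

The only differences between the four are whether extra clicks count beyond the first, and whether a non-click is charged a small positive cost.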
There are obviously many other ways to represent that `clicks=good` and `no clicks=bad`.
In general, how should I be modeling costs for contextual bandit problems in Vowpal Wabbit? Is it OK to represent benefits as negative costs, or should I rescale everything so that all costs are positive? Is it OK for relatively neutral actions to have a zero cost, or should I give them a small positive cost to push the model towards the positive actions?