
This question is about how to deploy contextual bandits (CMAB) in the context of web site optimization and online experimentation. I have already implemented a context-free MAB (MAB). When I run a MAB experiment, I just run it standalone in the online experimentation platform, without setting up an A/B test to compare it to the current production version. Should I follow the same process for a CMAB? Or should I break the process into the following 3 steps:

  1. Expose the CMAB to production data and train it until it converges.
  2. Run an A/B test to measure the impact of the converged CMAB model vs. production (a sketch follows the list).
  3. If the CMAB model wins, deploy it in production.
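
For concreteness, here is a minimal sketch of what step 2 could look like; cmab_policy, production_policy, and their methods are hypothetical placeholders, not any particular platform's API:

    import random

    def assign_and_serve(context, cmab_policy, production_policy, cmab_share=0.5):
        """Route one request to either the frozen CMAB model or production.

        cmab_policy and production_policy are placeholders: anything that
        maps a context to an arm will do.
        """
        if random.random() < cmab_share:
            variant = "cmab"
            arm = cmab_policy.predict(context)  # frozen model: no training during the test
        else:
            variant = "production"
            arm = production_policy.choose(context)
        # Log (variant, context, arm, reward) so the A/B test can be read out.
        return variant, arm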
etang

1 Answer


Section 8.6 of the book by Slivkins (https://arxiv.org/abs/1904.07272) has a very nice and practical discussion that focuses exactly on the deployment of CMAB models.

Essentially, you have a learning loop that consists of four phases (a code sketch follows the list):

  • Exploration: Actions are chosen by the exploration policy, a fixed policy that runs in the front-end and combines exploration and exploitation.

  • Logging: Records the “data points”, i.e. the context, the chosen arm, and the reward (and also the propensities, if you use an Inverse Propensity Score (IPS) approach for policy evaluation/learning).

  • Learning: The main goal is to learn a better “default policy”, and perhaps also better exploration parameters. The policy-training algorithm should be online, allowing fast updates when new data points are received, so as to enable rapid iterations of the learning loop.

  • Policy deployment: New default policies and/or exploration parameters are deployed into the exploration policy, thereby completing the learning loop.
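
To make the loop concrete, here is a minimal sketch in Python. It assumes an epsilon-greedy exploration policy and a deterministic candidate policy for the IPS estimate; every name in it is illustrative, not taken from the book or any library:

    import random
    from dataclasses import dataclass

    @dataclass
    class LoggedPoint:
        """One logged data point: context, chosen arm, propensity, reward."""
        context: dict
        arm: int
        propensity: float  # probability the exploration policy gave the chosen arm
        reward: float

    class ExplorationPolicy:
        """Epsilon-greedy wrapper: exploit the current default policy,
        explore uniformly at random with probability epsilon."""

        def __init__(self, default_policy, n_arms, epsilon=0.1):
            self.default_policy = default_policy  # callable: context -> arm
            self.n_arms = n_arms
            self.epsilon = epsilon

        def choose(self, context):
            greedy = self.default_policy(context)
            if random.random() < self.epsilon:
                arm = random.randrange(self.n_arms)
            else:
                arm = greedy
            # Propensity of the chosen arm under epsilon-greedy.
            p = self.epsilon / self.n_arms + ((1 - self.epsilon) if arm == greedy else 0.0)
            return arm, p

    def ips_value(candidate_policy, log):
        """IPS estimate of a deterministic candidate policy's value,
        computed from data logged under the exploration policy."""
        return sum(
            pt.reward / pt.propensity
            for pt in log
            if candidate_policy(pt.context) == pt.arm
        ) / len(log)

    def learning_loop(exploration, train, requests, batch_size=1000):
        """Exploration -> logging -> learning -> deployment, repeated.

        requests is assumed to yield (context, get_reward) pairs from the
        front-end; train consumes the log and returns a new default policy.
        """
        log = []
        for context, get_reward in requests:
            arm, p = exploration.choose(context)                          # exploration
            log.append(LoggedPoint(context, arm, p, get_reward(arm)))     # logging
            if len(log) % batch_size == 0:
                new_default = train(log)                                  # learning (e.g. maximize ips_value)
                exploration.default_policy = new_default                  # deployment

The logged propensities are what make the learning phase possible off-policy: ips_value estimates how a candidate default policy would have performed, without having to deploy it first.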

For more details, please refer to Slivkins's book; it's really worth reading.

Apprentice