Exploiting features in a multiarmed bandit scenario

Question

I am facing a challenging problem:

Say I have shirts of three different colors (same price). And say I am running a strange kind of store into which people come one by one; I can show each of them only one shirt, and they decide whether or not to buy it before they leave.

I wish to optimize my sales.

Up to this point, this looks to me like a classic multiarmed bandit problem, which would be classified as a reinforcement-learning problem.
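For reference, the plain (feature-free) bandit described so far can be sketched with epsilon-greedy, one common bandit strategy. This is a minimal simulation under stated assumptions: `epsilon_greedy_store` is a hypothetical helper, and the true purchase rates passed in are made-up numbers used only to simulate customers; a real store would observe only the buy/no-buy outcomes.

```python
import random


def epsilon_greedy_store(purchase_probs, n_customers, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit choosing which shirt color to show.

    purchase_probs: hypothetical true purchase rate per color; unknown to the
    store and used here only to simulate each customer's decision.
    Returns (shows, sales): how often each color was shown and bought.
    """
    rng = random.Random(seed)
    k = len(purchase_probs)
    shows = [0] * k
    sales = [0] * k
    for _ in range(n_customers):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: show a random color
        else:
            # exploit: show the color with the best observed purchase rate so far
            rates = [sales[a] / shows[a] if shows[a] else 0.0 for a in range(k)]
            arm = max(range(k), key=lambda a: rates[a])
        shows[arm] += 1
        if rng.random() < purchase_probs[arm]:  # simulated customer decision
            sales[arm] += 1
    return shows, sales
```

For example, `epsilon_greedy_store([0.010, 0.012, 0.008], 100_000)` simulates rates near the 1% mentioned below; when the rates are that close, many customers are needed before the better color stands out.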

Now say that I have a number of features about each person that comes into my store: age/gender and so on (a fairly large mix of categorical and non-categorical measures).

How can I use this information to optimize my sales?

I thought about taking the following approach. Assuming that the expected purchase rate for each of the colors is low and, though NOT equal, about the same (say around 1%), I can look at this as a classification problem (supervised learning). I will forget about the people who did not buy a shirt and ask, among those who did buy a shirt, which of the three groups of buyers this new person resembles most. I'll use a classification algorithm and show the new customer the shirt of the color that my trained classifier predicts.
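A minimal sketch of this buyers-only idea, using a nearest-centroid classifier as a stand-in for "a classification algorithm" (none is named here). The helpers `fit_centroids` and `predict_color` and the two features (age and a 0/1 gender flag) are hypothetical; numeric features are assumed, and in practice they would need scaling so that age does not dominate the distance.

```python
import math
from collections import defaultdict


def fit_centroids(buyers):
    """Mean feature vector of the buyers of each color.

    buyers: list of (features, color) pairs, one per customer who bought;
    customers who did not buy are left out, as in the approach above.
    """
    sums = {}
    counts = defaultdict(int)
    for features, color in buyers:
        if color not in sums:
            sums[color] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[color][i] += value
        counts[color] += 1
    return {color: [s / counts[color] for s in total] for color, total in sums.items()}


def predict_color(centroids, features):
    """Color whose buyer centroid is nearest (Euclidean distance) to this customer."""
    return min(centroids, key=lambda color: math.dist(centroids[color], features))


if __name__ == "__main__":
    # Hypothetical features: [age, is_female]
    buyers = [([25, 1], "red"), ([30, 1], "red"), ([60, 0], "blue"),
              ([55, 0], "blue"), ([40, 1], "green")]
    centroids = fit_centroids(buyers)
    print(predict_color(centroids, [27, 1]))  # prints: red
```

Because non-buyers are dropped, the classifier only learns which color past buyers bought; it never sees which customers declined a given color.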

Even though I do not know whether the buyers would have bought a shirt of another color had they been shown one, my hope is that with a large number of buyers, and hopefully relevant features, a classifier can find similarities between the groups of buyers.

I know that this is far from a classic classification problem, but I think it can do better than ignoring the features I know about the people and just using a multiarmed-bandit-like approach.

My questions:

Is my approach reasonable?

Any different ideas to approach this?

This is a contextual multi-armed bandit problem. – alto, May 29 '14 at 18:03
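One standard contextual-bandit algorithm that uses the customer features directly is LinUCB (Li et al., 2010): it fits a linear model of the purchase outcome per shirt color and adds an exploration bonus that shrinks as evidence accumulates. The sketch below is a minimal, dependency-free version under stated assumptions: features are already numeric (categorical ones one-hot encoded), the reward is 1 for a purchase and 0 otherwise, and the class name and `alpha` default are illustrative choices, not a library API.

```python
import math


class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm (shirt color).

    Keeps A_inv = (I + sum x x^T)^-1 and b = sum r x per arm, updating A_inv
    with the Sherman-Morrison formula so no explicit matrix inversion is needed.
    """

    def __init__(self, n_arms, n_features, alpha=1.0):
        self.alpha = alpha  # width of the confidence bonus (more = more exploration)
        self.d = n_features
        self.A_inv = [[[1.0 if i == j else 0.0 for j in range(n_features)]
                       for i in range(n_features)] for _ in range(n_arms)]
        self.b = [[0.0] * n_features for _ in range(n_arms)]

    @staticmethod
    def _matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    def choose(self, x):
        """Return the arm with the highest upper confidence bound for context x."""
        best_arm, best_ucb = 0, -math.inf
        for a, (A_inv, b) in enumerate(zip(self.A_inv, self.b)):
            theta = self._matvec(A_inv, b)  # ridge estimate of this arm's weights
            Ax = self._matvec(A_inv, x)
            mean = sum(t * xi for t, xi in zip(theta, x))
            bonus = self.alpha * math.sqrt(sum(xi * axi for xi, axi in zip(x, Ax)))
            if mean + bonus > best_ucb:
                best_arm, best_ucb = a, mean + bonus
        return best_arm

    def update(self, arm, x, reward):
        """Record the reward (1 = bought, 0 = not) for showing `arm` in context x."""
        A_inv = self.A_inv[arm]
        Ax = self._matvec(A_inv, x)  # A_inv is symmetric, so x^T A_inv = Ax^T
        denom = 1.0 + sum(xi * axi for xi, axi in zip(x, Ax))
        for i in range(self.d):
            for j in range(self.d):
                A_inv[i][j] -= Ax[i] * Ax[j] / denom
        for i in range(self.d):
            self.b[arm][i] += reward * x[i]
```

Including a constant feature of 1.0 in every context vector lets each color keep its own baseline purchase rate alongside the effects of age, gender, and so on.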

score 2 · Answer 1 · answered Feb 02 '16 at 10:16

I think what you are looking for are Contextual Bandits. John Langford has a fairly good post on the topic here.

Basically you have 2 choices:

learn a bandit for every context (e.g. one for males, one for females)
define a set of "shirt choosing policies" and treat each of them as a separate arm
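The second option can be sketched as follows: each hypothetical shirt-choosing policy (a function from customer features to a color) is treated as one arm, and a standard UCB1 bandit picks which policy to apply to each arriving customer. The function name, the example policies, and the simulated `buys` callback are illustrative assumptions; with real customers, `buys` would be replaced by the observed purchase decision.

```python
import math
import random


def run_policy_bandit(policies, customers, buys, seed=0):
    """UCB1 over a fixed set of shirt-choosing policies; each policy is one arm.

    policies:  list of functions mapping a customer's features to a color
    customers: iterable of customer feature dicts, in arrival order
    buys:      callback (customer, color, rng) -> bool giving the purchase outcome
    Returns (pulls, sales): how often each policy was used and how many sales it made.
    """
    rng = random.Random(seed)
    n = len(policies)
    pulls = [0] * n
    sales = [0] * n
    for t, customer in enumerate(customers, start=1):
        if t <= n:
            p = t - 1  # try every policy once before using the UCB1 index
        else:
            # UCB1 index: observed sales rate plus an exploration bonus
            p = max(range(n), key=lambda i: sales[i] / pulls[i]
                    + math.sqrt(2 * math.log(t) / pulls[i]))
        color = policies[p](customer)
        pulls[p] += 1
        sales[p] += int(buys(customer, color, rng))
    return pulls, sales


if __name__ == "__main__":
    # Hypothetical policies and a made-up customer model, for illustration only.
    policies = [
        lambda c: "red",
        lambda c: "blue",
        lambda c: "red" if c["age"] < 40 else "blue",
    ]
    gen = random.Random(1)
    customers = [{"age": gen.randint(18, 70)} for _ in range(20_000)]

    def buys(customer, color, rng):
        preferred = "red" if customer["age"] < 40 else "blue"
        return rng.random() < (0.5 if color == preferred else 0.05)

    print(run_policy_bandit(policies, customers, buys))
```

The number of arms here grows with the number of policies, not with the number of feature combinations, which is what makes this option workable when the feature space is large.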

score 0 · Answer 2 · answered May 29 '14 at 17:07


I'd imagine the best (but maybe not the most sophisticated) way to solve this problem is simply to run concurrent experiments on each of your user segments and let the bandit optimize independently for each segment.
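A minimal sketch of per-segment bandits, here using Beta-Bernoulli Thompson sampling as the per-segment strategy (no particular strategy is named). The class name and the idea of passing a segment key such as `("female", "18-30")` are assumptions for illustration; each segment keeps its own independent posterior for every color, which amounts to learning one bandit per context.

```python
import random
from collections import defaultdict


class SegmentedThompson:
    """One independent Beta-Bernoulli Thompson-sampling bandit per user segment."""

    def __init__(self, colors, seed=0):
        self.colors = list(colors)
        self.rng = random.Random(seed)
        # (segment, color) -> [purchases + 1, non-purchases + 1]: a Beta(1, 1) prior
        self.params = defaultdict(lambda: [1, 1])

    def choose(self, segment):
        """Draw a plausible purchase rate per color for this segment; show the highest."""
        draws = {c: self.rng.betavariate(*self.params[(segment, c)]) for c in self.colors}
        return max(draws, key=draws.get)

    def update(self, segment, color, bought):
        """Record whether a customer in `segment` bought the `color` shirt shown."""
        counts = self.params[(segment, color)]
        counts[0 if bought else 1] += 1
```

Segments never seen before start from the uniform prior, so a new segment is explored from scratch; the trade-off is that data is split across segments, so each one learns more slowly than a single shared bandit would.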

answered by Splitforce
