I'm working on a multi-armed bandit problem where we do not have any information about the reward distribution.
I have found many papers that guarantee regret bounds either for distributions with a known bound or for general distributions with support in [0,1].
I would like to know whether there is a way to perform well in an environment where nothing is guaranteed about the support of the reward distribution. My idea is to compute a nonparametric tolerance limit and use that number to rescale the rewards into [0,1], so that I can apply Algorithm 2 from this paper (http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf). Does anyone think this approach will work?
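For concreteness, here is a minimal sketch of what I have in mind. The `lo`/`hi` limits stand in for the tolerance limits I would estimate beforehand (they are my assumption, not part of the paper); the Bernoulli-trial update is the device Algorithm 2 uses to handle non-binary rewards in [0,1] with a Beta posterior:

```python
import random

def thompson_sampling_scaled(pull, n_arms, horizon, lo, hi):
    """Thompson Sampling in the style of Agrawal-Goyal Algorithm 2,
    applied to rewards rescaled from an assumed support [lo, hi].
    `lo` and `hi` are placeholders for the tolerance limits I would
    estimate nonparametrically; they are an assumption here."""
    successes = [0] * n_arms  # Beta posterior counts S_i
    failures = [0] * n_arms   # Beta posterior counts F_i
    total = 0.0
    for _ in range(horizon):
        # Draw one sample from each arm's Beta(S_i + 1, F_i + 1) posterior
        # and play the arm with the largest sample.
        samples = [random.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        r = pull(arm)
        total += r
        # Rescale the raw reward into [0, 1] using the assumed support,
        # clipping anything that falls outside the estimated limits.
        r_scaled = min(max((r - lo) / (hi - lo), 0.0), 1.0)
        # Bernoulli trick from the paper: treat the scaled reward as a
        # success probability so the Beta update remains valid.
        if random.random() < r_scaled:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return total

# Toy usage: two Gaussian arms whose support is unbounded, scaled with
# guessed limits. With well-separated means the better arm dominates.
random.seed(0)
means = [2.0, 5.0]
def pull(i):
    return random.gauss(means[i], 1.0)

total_reward = thompson_sampling_scaled(pull, n_arms=2, horizon=2000,
                                        lo=-2.0, hi=9.0)
print(total_reward / 2000)
```

My worry is the clipping step: if the tolerance limit underestimates the true range, rewards get truncated and the posterior is updated on a distorted signal, which is exactly where I am unsure the regret guarantees still apply.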
If not, can anyone point me in the right direction?
Thanks a bunch!