This is a follow-up to this question, which concerns handling arbitrary (potentially unbounded) reward distributions in the multi-armed bandit problem. Given a sequence of observed rewards $r_t \in \mathbb{R}$ for arm $i$, one could approximate the true reward distribution $R_i$ with a Gaussian posterior. It also occurred to me that one could instead use adaptive kernel density estimation (KDE) to approximate the reward distribution. Has the application of adaptive KDE to Thompson sampling been studied? How does it compare empirically with the Gaussian approach, or with other possible approaches?
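
To make the idea concrete, here is roughly what I have in mind as a minimal sketch. A few assumptions baked in: SciPy's fixed-bandwidth `gaussian_kde` (Scott's rule) stands in for a true adaptive KDE, `kde_thompson_step` is just a name I made up, and treating a single draw from each arm's estimated reward density as a posterior-style sample is a loose, bootstrap-like interpretation of Thompson sampling rather than an exact one:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_thompson_step(rewards_per_arm):
    """One round of KDE-flavoured Thompson sampling: draw one value
    from each arm's fitted KDE and play the arm with the largest draw."""
    draws = []
    for i, rewards in enumerate(rewards_per_arm):
        if len(rewards) < 2:
            return i  # too little data to fit a KDE; force exploration
        kde = gaussian_kde(np.asarray(rewards))  # fixed bandwidth stand-in
        draws.append(kde.resample(1)[0, 0])     # one sample from the density
    return int(np.argmax(draws))

# Toy usage: two arms with heavy-tailed (Student-t) rewards.
rng = np.random.default_rng(0)
arms = [lambda: rng.standard_t(df=2) + 0.5,  # better arm
        lambda: rng.standard_t(df=2)]
history = [[], []]
for _ in range(200):
    a = kde_thompson_step(history)
    history[a].append(arms[a]())
print([len(h) for h in history])  # the better arm should be pulled more often
```

Replacing `gaussian_kde` with an adaptive-bandwidth estimator would presumably sharpen the density where data is plentiful while keeping heavy tails wide, which is what makes me wonder how it compares to a Gaussian posterior in practice.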