
A general framework to analyze stochastic linear bandit

Tuesday, April 21, 2020 - 4:30pm

Speaker:   Mohsen Bayati, Stanford Graduate School of Business

Abstract:   Multi-armed bandit (MAB) experiments have recently received significant attention in data-centric enterprises, given their promise of reducing the opportunity cost of experimentation. In this talk we consider the well-known stochastic linear bandit problem, which includes (as special cases) standard MAB experiments and their personalized (contextual) counterparts. In this setting a decision-maker sequentially chooses among a set of given actions in $\mathbb{R}^d$, observes a noisy reward for each choice, and aims to maximize her cumulative expected reward over a horizon of length $T$. We introduce a general family of algorithms for this problem that achieve the best possible performance (i.e., are rate optimal), and show that some well-known algorithms in the literature, such as optimism in the face of uncertainty for linear bandits (OFUL) and Thompson sampling (TS), are special cases of this family. We therefore obtain a unified proof of rate optimality for all of these algorithms. Our analysis also yields a number of new results and solves an open problem. For example, we show that TS can incur linear worst-case regret unless it uses posterior variances inflated by a factor of $\sqrt{d}$ at each step. We also show that TS can incur linear Bayesian regret if it does not use the correct prior or noise distribution.
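For readers less familiar with the setting, the sketch below simulates Thompson sampling on a stochastic linear bandit, including the $\sqrt{d}$ posterior-variance inflation mentioned in the abstract. It is a minimal illustration, not the paper's algorithm or experiments: it assumes a Gaussian prior, Gaussian noise, and a fixed finite action set, and the function name linear_ts and all parameter values are hypothetical choices made for this example.

```python
import numpy as np

def linear_ts(actions, theta_star, T, sigma=0.5, lam=1.0, inflate=True, seed=0):
    """Thompson sampling for a simulated stochastic linear bandit (illustrative sketch).

    actions    : (K, d) array of candidate actions, fixed across rounds.
    theta_star : (d,) unknown parameter, used here only to simulate rewards.
    inflate    : if True, scale the posterior covariance by sqrt(d), the
                 inflation factor mentioned in the abstract (an assumption
                 about how to read that statement, not code from the paper).
    Returns cumulative regret over T rounds.
    """
    rng = np.random.default_rng(seed)
    K, d = actions.shape
    precision = lam * np.eye(d)        # posterior precision, Gaussian prior N(0, I/lam)
    b = np.zeros(d)                    # accumulates r_t * a_t / sigma^2
    scale = np.sqrt(d) if inflate else 1.0
    mean_rewards = actions @ theta_star
    regret = 0.0
    for _ in range(T):
        cov = np.linalg.inv(precision)                # posterior covariance
        mu = cov @ b                                  # posterior mean
        theta_tilde = rng.multivariate_normal(mu, scale * cov)  # posterior sample
        idx = int(np.argmax(actions @ theta_tilde))   # greedy action for the sample
        a = actions[idx]
        r = a @ theta_star + sigma * rng.standard_normal()      # noisy observed reward
        precision += np.outer(a, a) / sigma**2        # Bayesian linear-regression update
        b += r * a / sigma**2
        regret += mean_rewards.max() - mean_rewards[idx]
    return regret

# Example run on a random instance with d = 5 and K = 50 actions.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
theta = rng.standard_normal(5)
print(linear_ts(A, theta, T=2000))
```

Setting inflate=False gives vanilla TS for this Gaussian model; the result highlighted in the abstract is that, without such inflation, TS can suffer linear worst-case regret on some problem instances.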

This talk is based on joint work with Nima Hamidi, and a preliminary draft is available on arXiv.