14:00 | Simple Bayesian Algorithms for Best Arm Identification SPEAKER: Daniel Russo ABSTRACT. This paper considers the optimal adaptive allocation of measurement effort in order to identify the best among a finite set of options or designs. An experimenter sequentially chooses designs to measure and observes noisy signals of their quality, with the goal of confidently identifying the best design after a small number of measurements. We propose two simple Bayesian algorithms for adaptively allocating measurement effort. One, which we call Top-Two sampling, computes the two designs with the highest posterior probability of being optimal, and then randomizes to select among these two. The other algorithm we propose is a modified version of Thompson sampling that is tailored for identifying the best design. We prove that these simple algorithms satisfy a strong optimality property. In a frequentist setting where the true quality of the designs is fixed, the posterior is said to be consistent if it correctly identifies the optimal design, in the sense that the posterior probability assigned to the event that some other design is optimal converges to zero as measurements are collected. We show that under the proposed algorithms this convergence occurs at an \emph{exponential rate}, and the corresponding exponent is the best possible among all allocation rules. |
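The Top-Two rule described in this abstract is concrete enough to sketch. Below is a minimal, illustrative Python version for Gaussian designs with unit noise and flat priors; the tuning parameter beta, the Monte Carlo estimate of the optimality probabilities, and the toy means are my assumptions for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_two_step(counts, sums, beta=0.5, n_mc=2000):
    """One allocation step of a Top-Two sampling sketch.

    Gaussian arms with unit noise and a flat prior, so the posterior of
    arm i is N(sums[i]/counts[i], 1/counts[i]).  Each arm's posterior
    probability of being optimal is estimated by Monte Carlo; we take
    the top two and randomize between them with parameter beta."""
    means = sums / counts
    stds = 1.0 / np.sqrt(counts)
    draws = rng.normal(means, stds, size=(n_mc, len(counts)))
    p_opt = np.bincount(draws.argmax(axis=1), minlength=len(counts)) / n_mc
    order = np.argsort(p_opt)
    first, second = order[-1], order[-2]
    return first if rng.random() < beta else second

# toy run: three designs with true means 0.5, 0.4, 0.0 (illustrative)
true_means = np.array([0.5, 0.4, 0.0])
counts = np.ones(3)                      # one initial measurement per design
sums = rng.normal(true_means, 1.0)
for _ in range(500):
    i = top_two_step(counts, sums)
    counts[i] += 1
    sums[i] += rng.normal(true_means[i], 1.0)
```

After a few hundred rounds the measurement effort concentrates on the two best designs, which is exactly the behavior the exponential posterior-convergence result rewards.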
14:20 | Optimal Best Arm Identification with Fixed Confidence SPEAKER: unknown ABSTRACT. We provide a complete characterization of the complexity of best-arm identification in one-parameter bandit problems. We prove a new, tight lower bound on the sample complexity. We propose the 'Track-and-Stop' strategy, which is proved to be asymptotically optimal. It consists of a new sampling rule (which tracks the optimal proportions of arm draws highlighted by the lower bound) and a stopping rule named after Chernoff, for which we give a new analysis. |
14:40 | Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem SPEAKER: Andrea Locatelli ABSTRACT. We consider the problem of \textit{best arm identification} with a \textit{fixed budget $T$}, in the $K$-armed stochastic bandit setting, with arm distributions defined on $[0,1]$. We prove that any bandit strategy, for at least one bandit problem characterized by a complexity $H$, will misidentify the best arm with probability lower bounded by $$\exp\Big(-\frac{T}{\log(K)H}\Big),$$ where $H$ is the sum over all sub-optimal arms of the inverse of the squared gaps. Our result formally disproves the general belief, coming from results in the fixed confidence setting, that there must exist an algorithm for this problem whose probability of error is upper bounded as $\exp(-T/H)$. This also proves that some existing strategies based on the Successive Rejection of the arms are optimal, therefore closing the current gap between upper and lower bounds for the fixed budget best arm identification problem. |
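For context, a Successive Rejects-style strategy of the kind the abstract refers to can be sketched in a few lines. The phase schedule below follows the standard Successive Rejects algorithm of Audibert et al. (my assumption about which strategy is meant); the `pull` interface and the Bernoulli toy instance are illustrative, not from the abstract.

```python
import numpy as np

rng = np.random.default_rng(1)

def successive_rejects(pull, K, T):
    """Sketch of Successive Rejects: the budget T is split into K-1
    phases; in each phase every surviving arm is pulled equally often
    and the empirically worst arm is permanently rejected.
    pull(i) returns one [0, 1]-valued reward from arm i."""
    log_bar_K = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = list(range(K))
    counts = np.zeros(K)
    sums = np.zeros(K)
    n_prev = 0
    for k in range(1, K):
        n_k = int(np.ceil((T - K) / (log_bar_K * (K + 1 - k))))
        for i in active:
            for _ in range(n_k - n_prev):
                sums[i] += pull(i)
                counts[i] += 1
        worst = min(active, key=lambda i: sums[i] / counts[i])
        active.remove(worst)
        n_prev = n_k
    return active[0]

# toy Bernoulli instance: arm 0 is clearly best (illustrative numbers)
true_means = [0.9, 0.1, 0.1, 0.1]
pull = lambda i: float(rng.random() < true_means[i])
best = successive_rejects(pull, K=4, T=2000)
```

The lower bound in the talk says that the extra $\log(K)$ factor this schedule pays, relative to the conjectured $\exp(-T/H)$ rate, is unavoidable.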
14:50 | Best-of-K Bandits SPEAKER: unknown ABSTRACT. This paper studies the Best-of-K Bandit game: At each time the player chooses a subset S among all N-choose-K possible options and observes reward max(X(i) : i in S) where X is a random vector drawn from a joint distribution. The objective is to identify the subset that achieves the highest expected reward with high probability using as few queries as possible. We present distribution-dependent lower bounds which force a learner to consider all N-choose-K subsets and match naive extensions of known upper bounds in the bandit setting obtained by treating each subset as a separate arm. Nevertheless, we present evidence that exhaustive search may be avoided for certain favorable distributions because the influence of higher-order correlations may be dominated by lower order statistics. Finally, we present an algorithm and analysis for independent arms, which mitigates the surprising non-trivial information occlusion that occurs due to only observing the max in the subset. This may inform strategies for more general dependent measures, and we complement these results with independent-arm lower bounds. |
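A small calculation illustrates why independent arms are the favorable case: for independent Bernoulli arms the expected reward of a subset has a closed form, $1 - \prod_{i \in S}(1 - p_i)$, so the best size-K subset is simply the K arms with the largest means. The sketch below (function name and toy numbers are mine, not the paper's) verifies this by brute force over all N-choose-K subsets:

```python
import numpy as np
from itertools import combinations

def best_subset_independent(p, K):
    """Brute-force search for the best size-K subset when arms are
    independent Bernoulli with means p: the expected reward of S is
    P(max = 1) = 1 - prod_{i in S}(1 - p_i).  With correlated arms this
    shortcut fails, which is why the general problem can force
    exhaustive search over all N-choose-K subsets."""
    value = lambda S: 1 - np.prod([1 - p[i] for i in S])
    return max(combinations(range(len(p)), K), key=value)

p = [0.5, 0.4, 0.3, 0.2]
best_subset_independent(p, 2)   # → (0, 1), the two largest means
```

The learning difficulty that remains even here is the information occlusion the abstract mentions: observing only the max of a subset hides which arms inside it would have succeeded.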
15:00 | Pure Exploration of Multi-armed Bandit Under Matroid Constraints SPEAKER: unknown ABSTRACT. We study the pure exploration problem subject to a matroid constraint (Best-Basis) in a stochastic multi-armed bandit game. In a Best-Basis instance, we are given $n$ stochastic arms with unknown reward distributions, as well as a matroid $M$ over the arms. Let the weight of an arm be the mean of its reward distribution. Our goal is to identify a basis of $M$ with the maximum total weight, using as few samples as possible. The problem is a significant generalization of the best arm identification problem and the top-$k$ arm identification problem, which have attracted significant attention in recent years. We study both the exact and PAC versions of Best-Basis, and provide algorithms with nearly-optimal sample complexities for these versions. Our results generalize and/or improve on several previous results for the top-$k$ arm identification problem and the combinatorial pure exploration problem (when the combinatorial constraint is a matroid). |
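The offline core of Best-Basis is classical: on a matroid, the greedy algorithm over arm weights is exact, so once the means are estimated well enough the max-weight basis falls out greedily. The sketch below uses a naive uniform sampling allocation and a rank-2 uniform matroid as the independence oracle; both are illustrative stand-ins, not the paper's adaptive algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

def greedy_basis(emp_means, is_independent):
    """Greedy max-weight basis: scan arms in decreasing order of
    empirical mean, keeping an arm whenever the kept set stays
    independent in the matroid.  Greedy is exact on matroids, so this
    recovers the best basis once the empirical ordering is correct."""
    basis = []
    for i in np.argsort(emp_means)[::-1]:
        if is_independent(basis + [i]):
            basis.append(i)
    return sorted(basis)

# naive baseline: sample every arm m times, then run greedy (toy numbers)
true_means = np.array([0.9, 0.8, 0.3, 0.2, 0.1])
m = 200
emp = rng.binomial(m, true_means) / m
uniform_rank2 = lambda S: len(S) <= 2   # uniform matroid: any <= 2 arms
b = greedy_basis(emp, uniform_rank2)    # recovers the top-2 arms here
```

The sample-complexity question the talk addresses is how to allocate samples adaptively so that only the arms near the greedy decision boundary are measured precisely.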
15:30 | Maximin Action Identification: A New Bandit Framework for Games SPEAKER: unknown ABSTRACT. We study an original problem of pure exploration in a strategic bandit model motivated by Monte Carlo Tree Search. It consists of identifying the best action in a game, when the player may sample random outcomes of sequentially chosen pairs of actions. We propose two strategies for the fixed-confidence setting: Maximin-LUCB, based on lower and upper confidence bounds; and Maximin-Racing, which operates by successively eliminating the sub-optimal actions. We discuss the sample complexity of both methods and compare their performance empirically. We sketch a lower bound analysis, and possible connections to an optimal algorithm. |
15:40 | Instance-dependent Regret Bounds for Dueling Bandits SPEAKER: unknown ABSTRACT. We study the multi-armed dueling bandit problem in which feedback is provided in the form of relative comparisons between pairs of actions, with the goal of eventually learning to select actions that are close to the best. Following Dudik et al. (2015), we aim for algorithms whose performance approaches that of the optimal randomized choice of actions, the von Neumann winner, expressly avoiding more restrictive assumptions, for instance, regarding the existence of a single best action (a Condorcet winner). In this general setting, the best known algorithms achieve regret O(\sqrt{KT}) in T rounds with K actions. In this paper, we present the first instance-dependent regret bounds for the general problem, focusing particularly on when the von Neumann winner is sparse. Specifically, we propose a new algorithm whose regret, relative to a unique von Neumann winner with sparsity s, is at most O(\sqrt{sT}), plus an instance-dependent constant. Thus, when the sparsity is much smaller than the total number of actions, our result indicates that learning can be substantially faster. |
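The comparison target here, the von Neumann winner, is the maximin strategy of the zero-sum game with payoff matrix $P - 1/2$, where $P[i,j]$ is the probability that action $i$ beats action $j$; it can be computed offline by linear programming (the paper's contribution, learning it from dueling feedback, is not sketched). The preference matrix below has a Condorcet winner, which is exactly the sparsity-1 case; all numbers are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def von_neumann_winner(P):
    """Maximin strategy of the zero-sum game with payoffs M = P - 1/2:
    maximize v subject to (M^T w)_j >= v for all j, sum(w) = 1, w >= 0.
    Variables are (w_1..w_K, v)."""
    K = P.shape[0]
    M = P - 0.5
    c = np.zeros(K + 1)
    c[-1] = -1.0                                   # maximize v
    A_ub = np.hstack([-M.T, np.ones((K, 1))])      # v - (M^T w)_j <= 0
    b_ub = np.zeros(K)
    A_eq = np.hstack([np.ones((1, K)), np.zeros((1, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * K + [(None, None)])
    return res.x[:K]

# action 0 beats every other action with probability 0.6 (Condorcet winner)
P = np.array([[0.5, 0.6, 0.6],
              [0.4, 0.5, 0.5],
              [0.4, 0.5, 0.5]])
w = von_neumann_winner(P)   # concentrates on action 0
```

When no Condorcet winner exists the same LP returns a genuinely mixed $w$, and the sparsity $s$ of that mixture is what the new $O(\sqrt{sT})$ bound exploits.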
15:50 | An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives SPEAKER: unknown ABSTRACT. We consider a contextual version of multi-armed bandit problem with global knapsack constraints. In each round, the outcome of pulling an arm is a scalar reward and a resource consumption vector, both dependent on the context, and the global knapsack constraints require the total consumption for each resource to be below some pre-fixed budget. The learning agent competes with an arbitrary set of context-dependent policies. This problem was introduced by Badanidiyuru et al. (2014), who gave a computationally inefficient algorithm with near optimal regret bounds for it. We give a "computationally efficient" algorithm for this problem with slightly better regret bounds, by generalizing the approach of Agarwal et al. (2014) for the non-constrained version of the problem. The computational time of our algorithm scales "logarithmically" in the size of the policy space. This answers the main open question of Badanidiyuru et al. (2014). We also extend our results to a variant where there are no knapsack constraints but the objective is an arbitrary Lipschitz concave function of the sum of outcome vectors. |
16:00 | Regret Analysis of the Finite-Horizon Gittins Index Strategy for Multi-Armed Bandits SPEAKER: Tor Lattimore ABSTRACT. I prove near-optimal frequentist regret guarantees for the finite-horizon Gittins index strategy for multi-armed bandits with Gaussian noise and prior. Along the way I derive finite-time bounds on the Gittins index that are asymptotically exact and may be of independent interest. I also discuss computational issues and present experimental results suggesting that a particular version of the Gittins index strategy is an improvement on existing algorithms with finite-time regret guarantees such as UCB and Thompson sampling. |
16:10 | An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits SPEAKER: unknown ABSTRACT. We present an algorithm that achieves almost optimal pseudo-regret bounds against adversarial and stochastic bandits. Against adversarial bandits the pseudo-regret is $O(\sqrt{Kn\log n})$ and against stochastic bandits the pseudo-regret is $O(\sum_i (\log n)/\Delta_i)$. We also show that no algorithm with $O(\log n)$ pseudo-regret against stochastic bandits can achieve $\tilde{O}(\sqrt{n})$ expected regret against adaptive adversarial bandits. This complements previous results of [BS12] that show $\tilde{O}(\sqrt{n})$ expected adversarial regret with $O((\log n)^2)$ stochastic pseudo-regret. |