View: session overviewtalk overview
09:00 | The entropic barrier: a simple and optimal universal self-concordant barrier SPEAKER: unknown ABSTRACT. We prove that the Fenchel dual of the log-Laplace transform of the uniform measure on a convex body in $\R^n$ is a $(1+o(1)) n$-self-concordant barrier, improving a seminal result of Nesterov and Nemirovski. This gives the first explicit construction of a universal barrier for convex bodies with optimal self-concordance parameter. The proof is based on basic geometry of log-concave distributions, and elementary duality in exponential families. The result also gives a new perspective on the minimax regret for the linear bandit problem. |
09:20 | Escaping the Local Minima via Simulated Annealing: Optimization of Approximately Convex Functions SPEAKER: unknown ABSTRACT. We consider the problem of optimizing an approximately convex function over a bounded convex set in $\mathbb{R}^n$ using only function evaluations. The problem is reduced to sampling from an \emph{approximately} log-concave distribution using the Hit-and-Run method, with query complexity of $\mathcal{O}^*(n^{4.5})$. In the context of zeroth order stochastic convex optimization, the proposed method produces an $\epsilon$-minimizer after $\mathcal{O}^*(n^{7.5}\epsilon^{-2})$ noisy function evaluations by inducing a $\mathcal{O}(\epsilon/n)$-approximately log concave distribution. We also consider the case when the ``amount of non-convexity'' can be large away from the minimum but decays towards the optimum of the function. Other applications of the random walk method include private computation of empirical risk minimizers, two-stage stochastic programming, and approximate dynamic programming for online learning. |
09:40 | Improved Sum-of-Squares Lower Bounds for Hidden Clique and Hidden Submatrix Problems SPEAKER: unknown ABSTRACT. Given a large data matrix $A\in\reals^{n\times n}$, we consider the problem of determining whether its entries are i.i.d. with some known marginal distribution $A_{ij}\sim P_0$, or instead $A$ contains a principal submatrix $A_{\sC,\sC}$ whose entries have marginal distribution $A_{ij}\sim P_1\neq P_0$. As a special case, the hidden (or planted) clique problem is finding a planted clique in an otherwise uniformly random graph. Assuming unbounded computational resources, this hypothesis testing problem is statistically solvable provided $|\sC|\ge C \log n$ for a suitable constant $C$. However, despite substantial effort, no polynomial time algorithm is known that succeeds with high probability when $|\sC| = o(\sqrt{n})$. Recently, \cite{meka2013association} proposed a method to establish lower bounds for the hidden clique problem within the Sum of Squares (SOS) semidefinite hierarchy. Here we consider the degree-$4$ SOS relaxation, and study the construction of \cite{meka2013association} to prove that SOS fails unless $k\ge C\, n^{1/3}/\log n$. An argument presented by \cite{BarakLectureNotes} implies that this lower bound cannot be substantially improved unless the witness construction is changed in the proof. Our proof uses the moment method to bound the spectrum of a certain random association scheme, i.e. a symmetric random matrix whose rows and columns are indexed by the edges of an Erd\"os-Renyi random graph. |
10:00 | Sequential Information Maximization: When is Greedy Near-optimal? SPEAKER: unknown ABSTRACT. Optimal information gathering is a central challenge in machine learning and science in general. A common objective that quantifies the usefulness of observations is Shannon's mutual information, defined w.r.t. a probabilistic model. Greedily selecting observations that maximize the mutual information is the method of choice in numerous applications, ranging from Bayesian experimental design to automated diagnosis, to active learning in Bayesian models. Despite its importance and widespread use in applications, little is known about the theoretical properties of sequential information maximization, in particular under noisy observations. In this paper, we analyze the widely used greedy policy for this task, and identify problem instances where it provides provably near-maximal utility, even in the challenging setting of persistent noise. Our results depend on a natural separability condition associated with a channel injecting noise into the observations. We also show that this separability condition is necessary: If it does not hold, the greedy policy can fail to select informative observations. |
10:05 | Low Rank Matrix Completion with Exponential Family Noise SPEAKER: Jean Lafond ABSTRACT. The matrix completion problem consists in reconstructing a matrix from a sample of entries observed with noise. A popular class of estimator, known as nuclear norm penalized estimators, are based on minimizing the sum of a data fitting term and a nuclear norm penalization. Here, we investigate the case where the noise distribution belongs to the exponential family, is sub-exponential and consider a general sampling scheme. We first consider an estimator defined as the minimizer of the sum of a log-likelihood term and a nuclear norm penalization and prove an upper bound on the Frobenius prediction risk. The rate obtained improves on previous works on exponential family completion. When the sampling distribution is known, we propose a second estimator and prove an oracle inequality on the Kullback-Leibler divergence risk, which translates immediatly into an upper bound on the Frobenius risk. Finally, we show that all the rates obtained are minimax optimal up to a logarithmic factor. |
10:10 | Fast Exact Matrix Completion with Finite Samples SPEAKER: unknown ABSTRACT. Matrix completion is the problem of recovering a low rank matrix by observing a small fraction of its entries. A series of recent works \citep{Keshavan2012,JainNS2013,Hardt2013} have proposed fast non-convex optimization based iterative algorithms to solve this problem. However, the sample complexity in all these results is sub-optimal in its dependence on the rank, condition number and the desired accuracy. In this paper, we present a fast iterative algorithm that solves the matrix completion problem by observing $\order{nr^5 \log^3 n}$ entries, which is independent of the condition number and the desired accuracy. The run time of our algorithm is $\order{nr^7\log^3 n\log 1/\epsilon}$ which is near linear in the dimension of the matrix. To the best of our knowledge, this is the first near linear time algorithm for exact matrix completion with finite sample complexity (i.e. independent of $\epsilon$). Our algorithm is based on a well known projected gradient descent method, where the projection is onto the (non-convex) set of low rank matrices. There are two key ideas in our result: 1) our argument is based on a $\ell_\infty$ norm potential function (as opposed to the spectral norm) and provides a novel way to obtain perturbation bounds for it. 2) we prove and use a natural extension of the Davis-Kahan theorem to obtain perturbation bounds on the best low rank approximation of matrices with good eigen-gap. Both of these ideas may be of independent interest. |
10:15 | Exp-Concavity of Proper Composite Losses SPEAKER: unknown ABSTRACT. The goal of on-line prediction with expert advice is to find a decision strategy which will perform almost as well as the best expert in a given pool of experts, on any sequence of outcomes. This problem has been widely studied and O(T^0.5) and O(log T) regret bounds can be achieved for convex losses (Zinkevich (2003)) and strictly convex losses with bounded first and second derivatives (Hazan et al. (2007)) respectively. In special cases like the Aggregating Algorithm (Vovk (1995)) with mixable losses and the Weighted Average Algorithm (Kivinen and Warmuth (1999)) with exp-concave losses, it is possible to achieve O(1) regret bounds. van Erven (2012) has argued that mixability and exp-concavity are roughly equivalent under certain conditions. Thus by understanding the underlying relationship between these two notions we can gain the best of both algorithms (strong theoretical performance guarantees of the Aggregating Algorithm and the computational efficiency of the Weighted Average Algorithm). In this paper we provide a complete characterization of the exp-concavity of any proper composite loss. Using this characterization and the mixability condition of proper losses (Van Erven et al. (2012)), we show that it is possible to transform (re-parameterize) any beta-mixable proper loss into a beta-exp-concave composite loss with the same beta. |
10:20 | Vector-Valued Property Elicitation SPEAKER: unknown ABSTRACT. The elicitation of a statistic, or property of a distribution, is the task of devising proper scoring rules, equivalently proper losses, which incentivize an agent or algorithm to truthfully estimate the desired property of the underlying probability distribution. Exploiting the connection of such elicitation to convex analysis, we address the vector-valued property case, which has received relatively little attention in the literature despite its applications to both machine learning and statistics. We first provide a very general characterization of linear and ratio-of-linear properties, the first of which resolves an open problem by unifying and strengthening several previous characterizations in machine learning and statistics. We then ask which vectors of properties admit nonseparable scores, which cannot be expressed as a sum of scores for each coordinate separately, a natural desideratum for machine learning. We show that linear and ratio-of-linear do admit nonseparable scores, and provide evidence for the conjecture that these are the only such properties (up to link functions). Finally, we provide a general method for producing identification functions and address an open problem by showing that convex maximal level sets are insufficient for elicitability in general. |
10:25 | Generalized Mixability via Entropic Duality SPEAKER: unknown ABSTRACT. Mixability is a property of a loss which characterizes when fast convergence is possible in the game of prediction with expert advice. We show that a key property of mixability generalizes, and the $\exp$ and $\log$ operations present in the usual theory are not as special as one might have thought. In doing so we introduce a more general notion of $\Phi$-mixability where $\Phi$ is a general entropy (\ie, any convex function on probabilities). We show how a property shared by the convex dual of any such entropy yields a natural algorithm (the minimizer of a regret bound) which, analogous to the classical Aggregating Algorithm, is guaranteed a constant regret when used with $\Phi$-mixable losses. We characterize which $\Phi$ have non-trivial $\Phi$-mixable losses and relate $\Phi$-mixability and its associated Aggregating Algorithm to potential-based methods, a Blackwell-like condition, mirror descent, and risk measures from finance. We also define a notion of ``dominance'' between different entropies in terms of bounds they guarantee and conjecture that classical mixability gives optimal bounds, for which we provide some supporting empirical evidence. |
10:30 | On Consistent Surrogate Risk Minimization and Property Elicitation SPEAKER: unknown ABSTRACT. Surrogate risk minimization is a popular framework for supervised learning; property elicitation is a widely studied area in probability forecasting, machine learning, statistics and economics. In this paper, we connect these two themes by showing that calibrated surrogate losses in supervised learning can essentially be viewed as eliciting or estimating |
10:35 | Label optimal regret bounds for online local learning SPEAKER: unknown ABSTRACT. We resolve an open question from Christiano (2014b) posed in COLT’14 regarding the optimal dependency of the regret achievable for online local learning on the size of the label set. In this framework the algorithm is shown a pair of items at each step, chosen from a set of n items. The learner then predicts a label for each item, from a label set of size L and receives a real valued payoff. This is a natural framework which captures many interesting scenarios such as collaborative filtering, online gambling, and online max cut among others. Christiano (2014a) designed an efficient online learning algorithm for this problem achieving a regret of O(\sqrt{nL^3T}), where T is the number of rounds. Information theoretically, one can achieve a regret of O(\sqrt{n log LT}). One of the main open questions left in this framework concerns closing the above gap. In this work, we provide a complete answer to the question above via two main results. We show, via a tighter analysis, that the semi-definite programming based algorithm of Christiano (2014a), in fact achieves a regret of O(\sqrt{n L T}). Second, we show a matching computational lower bound. Namely, we show that a polynomial time algorithm for online local learning with lower regret would imply a polynomial time algorithm for the planted clique problem in regimes widely believed to be hard. We prove a similar hardness result under a related conjecture concerning planted dense subgraphs that we put forth. Unlike planted clique, the dense subgraph version does not have quasi-polynomial time algorithms. Computational lower bounds for online learning are relatively rare, and we hope that the ideas developed in this work will lead to lower bounds for other online learning scenarios as well. |
11:10 | Applications of Learning Theory in Algorithmic Game Theory SPEAKER: Tim Roughgarden ABSTRACT. Algorithmic game theory is a field that uses and extends tools from economics and game theory to reason about fundamental computer science problems. The field is important both for its applications, which span the gamut from network routing to online advertising, and for its remarkably diverse and rich connections to other areas of theoretical computer science, including complexity theory and approximation algorithms. In this talk, we survey two ways in which definitions and tools from learning theory have been crucial to recent advances in algorithmic game theory. First, we outline a theory of robust bounds on the "price of anarchy" --- meaning approximation guarantees for game-theoretic equilibria --- that apply to all outcome sequences generated by no-regret learners playing a multi-player game. Second, we explain how to use concepts from learning theory to make traditional (Bayesian) optimal auction theory operational, replacing the practically problematic "common prior" assumption with a data-driven approach. |
15:10 | Minimax Fixed-Design Linear Regression SPEAKER: unknown ABSTRACT. We consider a linear regression game in which the covariates are known in advance: at each round, the learner predicts a real-value, the adversary reveals a label, and the learner incurs a squared error loss. The aim is to minimize the regret with respect to linear predictions. For a variety of constraints on the adversary's labels, we show that the minimax optimal strategy is linear, with a parameter choice that is reminiscent of ordinary least squares (and as easy to compute). The predictions depend on all covariates, past and future, with a particular weighting assigned to future covariates corresponding to the role that they play in the minimax regret. We study two families of label sequences: box constraints (under a covariate compatibility condition), and a weighted 2-norm constraint that emerges naturally from the analysis. The strategy is adaptive in the sense that it requires no knowledge of the constraint set. We obtain an explicit expression for the minimax regret for these games. For the case of uniform box constraints, we show that, with worst case covariate sequences, the regret is $O(d\log T)$, with no dependence on the scaling of the covariates. |
15:15 | A Chaining Algorithm for Online Nonparametric Regression SPEAKER: unknown ABSTRACT. We consider the problem of online nonparametric regression with arbitrary deterministic sequences. Using ideas from the chaining technique, we design an algorithm that achieves a Dudley-type regret bound similar to the one obtained in a non-constructive fashion by Rakhlin and Sridharan (2014). Our regret bound is expressed in terms of the metric entropy in the sup norm, which yields optimal guarantees when the metric and sequential entropies are of the same order of magnitude. In particular our algorithm is the first one that achieves optimal rates for online regression over Hölder balls. In addition we show for this example how to adapt our chaining algorithm to get a reasonable computational efficiency with similar regret guarantees (up to a log factor). |
15:20 | First-order regret bounds for combinatorial semi-bandits SPEAKER: Gergely Neu ABSTRACT. We consider the problem of online combinatorial optimization under semi-bandit feedback, where a learner has to repeatedly pick actions from a combinatorial decision set in order to minimize the total losses associated with its decisions. After making each decision, the learner observes the losses associated with its action, but not other losses. The performance of the learner is measured in terms of its total expected regret against the best fixed action after $T$ rounds. In this paper, we propose a computationally efficient algorithm that improves existing worst-case guarantees of $O(\sqrt{T})$ to $O(\sqrt{L_T^*})$, where $L_T^*$ is the total loss of the best action. Our algorithm is among the first to achieve such guarantees in a partial-feedback scheme, and the first one to do so in a combinatorial setting. |
15:25 | Online Learning with Feedback Graphs: Beyond Bandits SPEAKER: unknown ABSTRACT. We study a general class of online learning problems where the feedback is specified by a graph. This class includes online prediction with expert advice and the multi-armed bandit problem, but also several learning problems where the online player does not necessarily observe his own loss. We analyze how the structure of the feedback graph controls the inherent difficulty of the induced $T$-round learning problem. Specifically, we show that any feedback graph belongs to one of three classes: \emph{strongly observable} graphs, \emph{weakly observable} graphs, and \emph{unobservable} graphs. We prove that the first class induces learning problems with $\widetilde\Theta(\alpha^{1/2} T^{1/2})$ minimax regret, where $\alpha$ is the independence number of the underlying graph; the second class induces problems with $\widetilde\Theta(\delta^{1/3}T^{2/3})$ minimax regret, where $\delta$ is the domination number of a certain portion of the graph; and the third class induces problems with linear minimax regret. Our results subsume much of the previous work on learning with feedback graphs and reveal new connections to partial monitoring games. We also show how the regret is affected if the graphs are allowed to vary with time. |
15:30 | Regret Lower Bound and Optimal Algorithm in Dueling Bandit Problem SPEAKER: unknown ABSTRACT. We study the $K$-armed dueling bandit problem, a variation of the standard stochastic bandit problem where the feedback is limited to relative comparisons of a pair of arms. We introduce a tight asymptotic regret lower bound that is based on the information divergence. An algorithm that is inspired by the Deterministic Minimum Empirical Divergence algorithm (Honda and Takemura, 2010) is proposed, and its regret is analyzed. The proposed algorithm is found to be the first one with a regret upper bound that matches the lower bound. Experimental comparisons of dueling bandit algorithms show that the proposed algorithm significantly outperforms existing ones. |
15:35 | Contextual Dueling Bandits SPEAKER: unknown ABSTRACT. We consider the problem of learning to choose actions using contextual information when provided with limited feedback in the form of relative pairwise comparisons. We study this problem in the dueling-bandits framework of Yue et al. (2009), which we extend to incorporate context. Roughly, the learner’s goal is to find the best policy, or way of behaving, in some space of policies, although “best” is not always so clearly defined. Here, we propose a new and natural solution concept, rooted in game theory, called a von Neumann winner, a randomized policy that beats or ties every other policy. We show that this notion overcomes important limitations of existing solutions, particularly the Condorcet winner which has typically been used in the past, but which requires strong and often unrealistic assumptions. We then present three efficient algorithms for online learning in our setting, and for approximating a von Neumann winner from batch-like data. The first of these algorithms achieves particularly low regret, even when data is adversarial, although its time and space requirements are linear in the size of the policy space. The other two algorithms require time and space only logarithmic in the size of the policy space when provided access to an oracle for solving classification problems on the space. |