COLT 2015: 28TH ANNUAL CONFERENCE ON LEARNING THEORY
PROGRAM FOR SUNDAY, JULY 5TH

09:00-10:40 Session 11: Optimization, Online Learning, Loss Functions
09:00
The entropic barrier: a simple and optimal universal self-concordant barrier
SPEAKER: unknown

ABSTRACT. We prove that the Fenchel dual of the log-Laplace transform of the uniform measure on a convex body in $\mathbb{R}^n$ is a $(1+o(1))\,n$-self-concordant barrier, improving a seminal result of Nesterov and Nemirovski. This gives the first explicit construction of a universal barrier for convex bodies with optimal self-concordance parameter. The proof is based on basic geometry of log-concave distributions and elementary duality in exponential families. The result also gives a new perspective on the minimax regret for the linear bandit problem.
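
In the notation of the abstract (with $K \subset \mathbb{R}^n$ the convex body; symbol names are mine), the construction can be summarized as

$$ f(\theta) = \log \int_{K} e^{\langle \theta, x \rangle}\, dx, \qquad f^*(x) = \sup_{\theta \in \mathbb{R}^n} \big\{ \langle \theta, x \rangle - f(\theta) \big\}, $$

and the theorem states that $f^*$ is a $(1+o(1))\,n$-self-concordant barrier on the interior of $K$.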

09:20
Escaping the Local Minima via Simulated Annealing: Optimization of Approximately Convex Functions
SPEAKER: unknown

ABSTRACT. We consider the problem of optimizing an approximately convex function over a bounded convex set in $\mathbb{R}^n$ using only function evaluations. The problem is reduced to sampling from an \emph{approximately} log-concave distribution using the Hit-and-Run method, with query complexity of $\mathcal{O}^*(n^{4.5})$. In the context of zeroth-order stochastic convex optimization, the proposed method produces an $\epsilon$-minimizer after $\mathcal{O}^*(n^{7.5}\epsilon^{-2})$ noisy function evaluations by inducing an $\mathcal{O}(\epsilon/n)$-approximately log-concave distribution. We also consider the case when the "amount of non-convexity" can be large away from the minimum but decays towards the optimum of the function. Other applications of the random walk method include private computation of empirical risk minimizers, two-stage stochastic programming, and approximate dynamic programming for online learning.
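
As a rough illustration of the sampling primitive involved, here is a minimal sketch of one Hit-and-Run step targeting an (approximately) log-concave density. The function names, the discretized chord, and all parameters are illustrative assumptions, not the paper's algorithm:

    import numpy as np

    def hit_and_run_step(x, log_density, in_body, grid=200, width=4.0):
        # One Hit-and-Run step: draw a uniformly random direction through x,
        # then resample x from the target density restricted to the chord of
        # the body along that direction (here crudely discretized).
        d = np.random.randn(x.size)
        d /= np.linalg.norm(d)                      # uniform random direction
        ts = np.linspace(-width, width, grid)       # points on the line through x
        pts = x[None, :] + ts[:, None] * d[None, :]
        inside = np.array([in_body(p) for p in pts])
        if not inside.any():
            return x
        logp = np.array([log_density(p) for p in pts[inside]])
        w = np.exp(logp - logp.max())               # shift for numerical stability
        idx = np.random.choice(np.flatnonzero(inside), p=w / w.sum())
        return pts[idx]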

09:40
Improved Sum-of-Squares Lower Bounds for Hidden Clique and Hidden Submatrix Problems
SPEAKER: unknown

ABSTRACT. Given a large data matrix $A\in\mathbb{R}^{n\times n}$, we consider the problem of determining whether its entries are i.i.d. with some known marginal distribution $A_{ij}\sim P_0$, or instead $A$ contains a principal submatrix $A_{\mathcal{C},\mathcal{C}}$ whose entries have marginal distribution $A_{ij}\sim P_1\neq P_0$. As a special case, the hidden (or planted) clique problem is that of finding a planted clique in an otherwise uniformly random graph.
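
Written out, the hypothesis test in question is

$$ H_0:\ A_{ij} \stackrel{\text{i.i.d.}}{\sim} P_0 \qquad \text{vs.} \qquad H_1:\ A_{ij} \sim P_1 \ \text{for } i,j \in \mathcal{C},\ |\mathcal{C}| = k,\ \text{and } A_{ij} \sim P_0 \ \text{otherwise}. $$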

Assuming unbounded computational resources, this hypothesis testing problem is statistically solvable provided $|\mathcal{C}|\ge C \log n$ for a suitable constant $C$. However, despite substantial effort, no polynomial time algorithm is known that succeeds with high probability when $|\mathcal{C}| = o(\sqrt{n})$. Recently, Meka and Wigderson (2013) proposed a method to establish lower bounds for the hidden clique problem within the Sum of Squares (SOS) semidefinite hierarchy.

Here we consider the degree-$4$ SOS relaxation, and study the construction of Meka and Wigderson (2013) to prove that SOS fails unless $k\ge C\, n^{1/3}/\log n$. An argument presented by Barak (lecture notes) implies that this lower bound cannot be substantially improved unless the witness construction is changed in the proof. Our proof uses the moment method to bound the spectrum of a certain random association scheme, i.e. a symmetric random matrix whose rows and columns are indexed by the edges of an Erdős-Rényi random graph.

10:00
Sequential Information Maximization: When is Greedy Near-optimal?
SPEAKER: unknown

ABSTRACT. Optimal information gathering is a central challenge in machine learning and science in general. A common objective that quantifies the usefulness of observations is Shannon's mutual information, defined w.r.t. a probabilistic model. Greedily selecting observations that maximize the mutual information is the method of choice in numerous applications, ranging from Bayesian experimental design to automated diagnosis, to active learning in Bayesian models. Despite its importance and widespread use in applications, little is known about the theoretical properties of sequential information maximization, in particular under noisy observations. In this paper, we analyze the widely used greedy policy for this task, and identify problem instances where it provides provably near-maximal utility, even in the challenging setting of persistent noise. Our results depend on a natural separability condition associated with a channel injecting noise into the observations. We also show that this separability condition is necessary: If it does not hold, the greedy policy can fail to select informative observations.
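
Schematically, the greedy policy analyzed here repeatedly selects the observation with the largest marginal gain in mutual information. In this sketch, `mutual_information` is an assumed oracle returning $I(\text{target}; S)$ under the probabilistic model, and all names are illustrative:

    def greedy_information_maximization(candidates, mutual_information, budget):
        # Greedy policy: at each round, add the observation whose inclusion
        # maximizes the mutual information between the observed set and the
        # target, as computed by the (assumed) model-based oracle.
        selected, remaining = [], list(candidates)
        for _ in range(min(budget, len(candidates))):
            best = max(remaining, key=lambda c: mutual_information(selected + [c]))
            remaining.remove(best)
            selected.append(best)
        return selected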

10:05
Low Rank Matrix Completion with Exponential Family Noise
SPEAKER: Jean Lafond

ABSTRACT. The matrix completion problem consists in reconstructing a matrix from a sample of entries observed with noise. A popular class of estimators, known as nuclear norm penalized estimators, is based on minimizing the sum of a data fitting term and a nuclear norm penalization. Here, we investigate the case where the noise distribution belongs to the exponential family and is sub-exponential, and we consider a general sampling scheme. We first consider an estimator defined as the minimizer of the sum of a log-likelihood term and a nuclear norm penalization and prove an upper bound on the Frobenius prediction risk. The rate obtained improves on previous works on exponential family completion. When the sampling distribution is known, we propose a second estimator and prove an oracle inequality on the Kullback-Leibler divergence risk, which translates immediately into an upper bound on the Frobenius risk. Finally, we show that all the rates obtained are minimax optimal up to a logarithmic factor.
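
Schematically (notation mine), the first estimator has the form of a penalized maximum likelihood problem:

$$ \widehat{M} \in \operatorname*{arg\,min}_{M} \Big\{ -\sum_{(i,j)\ \text{observed}} \log p\big(Y_{ij} \mid M_{ij}\big) \;+\; \lambda\, \|M\|_{*} \Big\}, $$

where $p(\cdot \mid \cdot)$ is the exponential-family likelihood and $\|M\|_*$ the nuclear norm.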

10:10
Fast Exact Matrix Completion with Finite Samples
SPEAKER: unknown

ABSTRACT. Matrix completion is the problem of recovering a low rank matrix by observing a small fraction of its entries. A series of recent works (Keshavan, 2012; Jain et al., 2013; Hardt, 2013) have proposed fast non-convex optimization based iterative algorithms to solve this problem. However, the sample complexity in all these results is sub-optimal in its dependence on the rank, condition number and the desired accuracy.

In this paper, we present a fast iterative algorithm that solves the matrix completion problem by observing $O(nr^5 \log^3 n)$ entries, which is independent of the condition number and the desired accuracy. The run time of our algorithm is $O(nr^7\log^3 n \log(1/\epsilon))$, which is near-linear in the dimension of the matrix. To the best of our knowledge, this is the first near-linear time algorithm for exact matrix completion with finite sample complexity (i.e. independent of $\epsilon$).

Our algorithm is based on a well known projected gradient descent method, where the projection is onto the (non-convex) set of low rank matrices. There are two key ideas in our result: 1) Our argument is based on an $\ell_\infty$ norm potential function (as opposed to the spectral norm) and provides a novel way to obtain perturbation bounds for it. 2) We prove and use a natural extension of the Davis-Kahan theorem to obtain perturbation bounds on the best low rank approximation of matrices with good eigen-gap. Both of these ideas may be of independent interest.
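
A minimal sketch of the projected gradient descent template the paper builds on, with projection onto rank-$r$ matrices via truncated SVD; the step size, initialization, and iteration count below are illustrative assumptions, not the paper's tuned choices:

    import numpy as np

    def project_rank_r(X, r):
        # Projection onto the (non-convex) set of rank-r matrices: truncated SVD.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return (U[:, :r] * s[:r]) @ Vt[:r, :]

    def complete_matrix(M_obs, mask, r, n_iters=500):
        # Projected gradient descent on 0.5 * ||P_Omega(X - M)||_F^2, where
        # mask is the 0/1 observation pattern Omega; the step is rescaled by
        # the observation rate (an illustrative choice).
        X = np.zeros_like(M_obs, dtype=float)
        eta = 1.0 / mask.mean()
        for _ in range(n_iters):
            X = project_rank_r(X - eta * mask * (X - M_obs), r)
        return X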

10:15
Exp-Concavity of Proper Composite Losses
SPEAKER: unknown

ABSTRACT. The goal of on-line prediction with expert advice is to find a decision strategy which will perform almost as well as the best expert in a given pool of experts, on any sequence of outcomes. This problem has been widely studied, and $O(\sqrt{T})$ and $O(\log T)$ regret bounds can be achieved for convex losses (Zinkevich (2003)) and strictly convex losses with bounded first and second derivatives (Hazan et al. (2007)) respectively. In special cases like the Aggregating Algorithm (Vovk (1995)) with mixable losses and the Weighted Average Algorithm (Kivinen and Warmuth (1999)) with exp-concave losses, it is possible to achieve $O(1)$ regret bounds. van Erven (2012) has argued that mixability and exp-concavity are roughly equivalent under certain conditions. Thus by understanding the underlying relationship between these two notions we can gain the best of both algorithms (strong theoretical performance guarantees of the Aggregating Algorithm and the computational efficiency of the Weighted Average Algorithm). In this paper we provide a complete characterization of the exp-concavity of any proper composite loss. Using this characterization and the mixability condition of proper losses (Van Erven et al. (2012)), we show that it is possible to transform (re-parameterize) any $\beta$-mixable proper loss into a $\beta$-exp-concave composite loss with the same $\beta$.
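
For reference, a loss $\ell$ is $\beta$-exp-concave when, for every outcome $y$,

$$ p \;\mapsto\; e^{-\beta\, \ell(p, y)} \quad \text{is concave,} $$

while $\ell$ is $\beta$-mixable if for every distribution $\pi$ over predictions there is a single prediction $p^*$ with $\ell(p^*, y) \le -\frac{1}{\beta}\log \mathbb{E}_{p\sim\pi}\, e^{-\beta \ell(p,y)}$ for all $y$. Exp-concavity implies mixability with the same $\beta$; the paper's contribution runs in the reverse direction via re-parameterization.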

10:20
Vector-Valued Property Elicitation
SPEAKER: unknown

ABSTRACT. The elicitation of a statistic, or property of a distribution, is the task of devising proper scoring rules, equivalently proper losses, which incentivize an agent or algorithm to truthfully estimate the desired property of the underlying probability distribution. Exploiting the connection of such elicitation to convex analysis, we address the vector-valued property case, which has received relatively little attention in the literature despite its applications to both machine learning and statistics.

We first provide a very general characterization of linear and ratio-of-linear properties, the first of which resolves an open problem by unifying and strengthening several previous characterizations in machine learning and statistics. We then ask which vectors of properties admit nonseparable scores, which cannot be expressed as a sum of scores for each coordinate separately, a natural desideratum for machine learning. We show that linear and ratio-of-linear properties do admit nonseparable scores, and provide evidence for the conjecture that these are the only such properties (up to link functions). Finally, we provide a general method for producing identification functions and address an open problem by showing that convex maximal level sets are insufficient for elicitability in general.

10:25
Generalized Mixability via Entropic Duality
SPEAKER: unknown

ABSTRACT. Mixability is a property of a loss which characterizes when fast convergence is possible in the game of prediction with expert advice. We show that a key property of mixability generalizes, and the $\exp$ and $\log$ operations present in the usual theory are not as special as one might have thought. In doing so we introduce a more general notion of $\Phi$-mixability where $\Phi$ is a general entropy (i.e., any convex function on probabilities). We show how a property shared by the convex dual of any such entropy yields a natural algorithm (the minimizer of a regret bound) which, analogous to the classical Aggregating Algorithm, is guaranteed a constant regret when used with $\Phi$-mixable losses. We characterize which $\Phi$ have non-trivial $\Phi$-mixable losses and relate $\Phi$-mixability and its associated Aggregating Algorithm to potential-based methods, a Blackwell-like condition, mirror descent, and risk measures from finance. We also define a notion of "dominance" between different entropies in terms of bounds they guarantee and conjecture that classical mixability gives optimal bounds, for which we provide some supporting empirical evidence.

10:30
On Consistent Surrogate Risk Minimization and Property Elicitation
SPEAKER: unknown

ABSTRACT. Surrogate risk minimization is a popular framework for supervised learning; property elicitation is a widely studied area in probability forecasting, machine learning, statistics and economics. In this paper, we connect these two themes by showing that calibrated surrogate losses in supervised learning can essentially be viewed as eliciting or estimating certain properties of the underlying conditional label distribution that are sufficient to construct an optimal classifier under the target loss of interest. Our study helps to shed light on the design of convex calibrated surrogates. We also give a new framework for designing convex calibrated surrogates under low-noise conditions by eliciting properties that allow one to construct 'coarse' estimates of the underlying distribution.

10:35
Label optimal regret bounds for online local learning
SPEAKER: unknown

ABSTRACT. We resolve an open question from Christiano (2014b), posed in COLT'14, regarding the optimal dependency of the regret achievable for online local learning on the size of the label set. In this framework the algorithm is shown a pair of items at each step, chosen from a set of $n$ items. The learner then predicts a label for each item, from a label set of size $L$, and receives a real valued payoff. This is a natural framework which captures many interesting scenarios such as collaborative filtering, online gambling, and online max cut, among others. Christiano (2014a) designed an efficient online learning algorithm for this problem achieving a regret of $O(\sqrt{nL^3T})$, where $T$ is the number of rounds. Information theoretically, one can achieve a regret of $O(\sqrt{nT\log L})$. One of the main open questions left in this framework concerns closing the above gap. In this work, we provide a complete answer to the question above via two main results. We show, via a tighter analysis, that the semi-definite programming based algorithm of Christiano (2014a) in fact achieves a regret of $O(\sqrt{nLT})$. Second, we show a matching computational lower bound. Namely, we show that a polynomial time algorithm for online local learning with lower regret would imply a polynomial time algorithm for the planted clique problem in regimes widely believed to be hard. We prove a similar hardness result under a related conjecture concerning planted dense subgraphs that we put forth. Unlike planted clique, the dense subgraph version does not have quasi-polynomial time algorithms. Computational lower bounds for online learning are relatively rare, and we hope that the ideas developed in this work will lead to lower bounds for other online learning scenarios as well.

10:40-11:10 Coffee Break
11:10-12:10 Session 12: Invited Talk
11:10
Applications of Learning Theory in Algorithmic Game Theory

ABSTRACT. Algorithmic game theory is a field that uses and extends tools from economics and game theory to reason about fundamental computer science problems. The field is important both for its applications, which span the gamut from network routing to online advertising, and for its remarkably diverse and rich connections to other areas of theoretical computer science, including complexity theory and approximation algorithms. In this talk, we survey two ways in which definitions and tools from learning theory have been crucial to recent advances in algorithmic game theory. First, we outline a theory of robust bounds on the "price of anarchy" --- meaning approximation guarantees for game-theoretic equilibria --- that apply to all outcome sequences generated by no-regret learners playing a multi-player game. Second, we explain how to use concepts from learning theory to make traditional (Bayesian) optimal auction theory operational, replacing the practically problematic "common prior" assumption with a data-driven approach.

12:40-14:30 Lunch Break
14:30-15:10 Session 14: Estimation, Generative Models
14:30
Learning the dependence structure of rare events: a non-asymptotic study
SPEAKER: unknown

ABSTRACT. Assessing the probability of occurrence of extreme events is a crucial issue in various fields such as finance, insurance, telecommunications, and environmental science. In a multivariate framework, the tail dependence is characterized by the so-called stable tail dependence function (STDF). Learning this structure is the keystone of multivariate extremes. Although extensive studies have proved consistency and asymptotic normality for the empirical version of the STDF, non-asymptotic bounds are still missing. The main purpose of this paper is to fill this gap. Taking advantage of adapted VC-type concentration inequalities, upper bounds are derived with expected rate of convergence in $O(k^{-1/2})$. The concentration tools involved in this analysis rely on a more general study of maximal deviations in low probability regions, and thus directly apply to the classification of extreme data.
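
For reference, one standard definition of the STDF of a random vector $X=(X_1,\dots,X_d)$ with marginal distribution functions $F_1,\dots,F_d$ is

$$ \ell(x_1, \dots, x_d) \;=\; \lim_{t \to 0^+} \frac{1}{t}\, \mathbb{P}\Big( F_1(X_1) > 1 - t x_1 \ \text{ or } \ \cdots \ \text{ or } \ F_d(X_d) > 1 - t x_d \Big); $$

the empirical version replaces the $F_j$ and the probability by empirical counterparts, with $k$ the number of extreme observations retained (consistent with the $O(k^{-1/2})$ rate above).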

14:50
On Learning Distributions from their Samples
SPEAKER: unknown

ABSTRACT. One of the most natural and important questions in statistical learning is how well a distribution can be approximated from its samples. Surprisingly, this question has so far been resolved for only a few approximation measures, for example the KL-divergence, and even then the answer is ad hoc and not well understood. We resolve the question for three more important approximation measures, $\ell_1$, $\ell_2^2$, and $\chi^2$, and if the probabilities are bounded away from zero, we resolve the question for all smooth $f$-divergence approximation measures, thereby providing a coherent understanding of the rate at which a distribution can be approximated from its samples.
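
For discrete distributions $p, q$, the three measures resolved here are (in one standard convention)

$$ \ell_1(p,q) = \sum_x |p_x - q_x|, \qquad \ell_2^2(p,q) = \sum_x (p_x - q_x)^2, \qquad \chi^2(p \,\|\, q) = \sum_x \frac{(p_x - q_x)^2}{q_x}. $$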

14:55
Optimum Statistical Estimation with Strategic Data Sources
SPEAKER: unknown

ABSTRACT. We propose an optimum mechanism for providing monetary incentives to the data sources of a statistical estimator such as linear regression, so that high quality data is provided at low cost, in the sense that the weighted sum of payments and estimation error is minimized. The mechanism applies to a broad range of estimators, including linear and polynomial regression, kernel regression, and, under some additional assumptions, ridge regression. It also generalizes to several objectives, including minimizing estimation error subject to budget constraints. Besides our concrete results for regression problems, we contribute a mechanism design framework through which to design and analyze statistical estimators whose examples are supplied by workers who incur a cost for labeling those examples.

15:00
Learning Overcomplete Latent Variable Models through Tensor Methods
SPEAKER: unknown

ABSTRACT. We provide guarantees for learning latent variable models, with an emphasis on the overcomplete regime, where the dimensionality of the latent space exceeds the observed dimensionality. In particular, we consider multiview mixtures, ICA, and sparse coding models. Our main tool is a new algorithm for tensor decomposition that works in the overcomplete regime.

In the semi-supervised setting, we exploit label information to get a rough estimate of the model parameters, and then refine it using the tensor method on unlabeled samples. We establish learning guarantees when the number of components scales as $k=o(d^{p/2})$, where $d$ is the observed dimension, and $p$ is the order of the observed moment employed in the tensor method (usually $p=3,4$). In the unsupervised setting, a simple initialization algorithm based on SVD of the tensor slices is proposed, and the guarantees are provided under the stricter condition that $k\le \beta d$ (where the constant $\beta$ can be larger than $1$). For the learning applications, we provide tight sample complexity bounds through novel covering arguments.
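
In such models, the tensor method operates on an order-$p$ observed moment that, in the population limit, decomposes as

$$ M_p \;=\; \sum_{j=1}^{k} w_j\, a_j^{\otimes p}, $$

where the $a_j \in \mathbb{R}^d$ are the latent components and $k > d$ in the overcomplete regime; this schematic form is the standard one for such models rather than a quotation from the paper.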

15:05
Efficient Sampling for Gaussian Graphical Models via Spectral Sparsification
SPEAKER: unknown

ABSTRACT. Motivated by a sampling problem basic to computational statistical inference, we develop a toolset based on spectral sparsification for a family of fundamental problems involving Gaussian sampling, matrix functionals, and reversible Markov chains. Drawing on the connection between Gaussian graphical models and the recent breakthroughs in spectral graph theory, we give the first nearly linear time algorithm for the following basic matrix problem: given an $n \times n$ Laplacian matrix $M$ and a constant $-1 \le p \le 1$, provide efficient access to a sparse $n \times n$ linear operator $C$ such that $M^p \approx C C^\top$, where $\approx$ denotes spectral similarity. When $p$ is set to $-1$, this gives the first parallel sampling algorithm that is essentially optimal both in total work and randomness for Gaussian random fields with symmetric diagonally dominant (SDD) precision matrices. It only requires nearly linear work and $2n$ i.i.d. random univariate Gaussian samples to generate an $n$-dimensional i.i.d. Gaussian random sample in polylogarithmic depth.
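
The connection to sampling is the usual factor trick: given such a $C$ for $p = -1$ and $z$ a vector of i.i.d. standard Gaussians,

$$ x = Cz \quad \Longrightarrow \quad \mathrm{Cov}(x) = C C^\top \approx M^{-1}, $$

so $x$ is (approximately) a sample from the Gaussian graphical model with precision matrix $M$, and the sparsity of $C$ makes the matrix-vector product cheap.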

The key ingredient of our approach is an integration of spectral sparsification with a multilevel method: our algorithms are based on factoring $M^p$ into a product of well-conditioned matrices, then introducing powers and replacing dense matrices with sparse approximations. We give two sparsification methods for this approach that may be of independent interest. The first invokes Maclaurin series on the factors, while the second builds on our new nearly linear time spectral sparsification algorithm for random-walk matrix polynomials. We expect these algorithmic advances will also help to strengthen the connection between machine learning and spectral graph theory, two of the most active fields in understanding large data and networks.

15:10-15:40 Session 15: On-Line Learning & Bandits II
15:10
Minimax Fixed-Design Linear Regression
SPEAKER: unknown

ABSTRACT. We consider a linear regression game in which the covariates are known in advance: at each round, the learner predicts a real value, the adversary reveals a label, and the learner incurs a squared error loss. The aim is to minimize the regret with respect to linear predictions. For a variety of constraints on the adversary's labels, we show that the minimax optimal strategy is linear, with a parameter choice that is reminiscent of ordinary least squares (and as easy to compute). The predictions depend on all covariates, past and future, with a particular weighting assigned to future covariates corresponding to the role that they play in the minimax regret. We study two families of label sequences: box constraints (under a covariate compatibility condition), and a weighted 2-norm constraint that emerges naturally from the analysis. The strategy is adaptive in the sense that it requires no knowledge of the constraint set. We obtain an explicit expression for the minimax regret for these games. For the case of uniform box constraints, we show that, with worst case covariate sequences, the regret is $O(d\log T)$, with no dependence on the scaling of the covariates.
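
Concretely, with covariates $x_1,\dots,x_T$ known in advance, the regret of predictions $\hat y_t$ against labels $y_t$ is

$$ R_T \;=\; \sum_{t=1}^{T} (\hat y_t - y_t)^2 \;-\; \min_{w \in \mathbb{R}^d} \sum_{t=1}^{T} \big( \langle w, x_t \rangle - y_t \big)^2 . $$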

15:15
A Chaining Algorithm for Online Nonparametric Regression
SPEAKER: unknown

ABSTRACT. We consider the problem of online nonparametric regression with arbitrary deterministic sequences. Using ideas from the chaining technique, we design an algorithm that achieves a Dudley-type regret bound similar to the one obtained in a non-constructive fashion by Rakhlin and Sridharan (2014). Our regret bound is expressed in terms of the metric entropy in the sup norm, which yields optimal guarantees when the metric and sequential entropies are of the same order of magnitude. In particular our algorithm is the first one that achieves optimal rates for online regression over Hölder balls. In addition we show, for this example, how to adapt our chaining algorithm to achieve reasonable computational efficiency with similar regret guarantees (up to a log factor).
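
Schematically, a Dudley-type bound of the kind referenced controls regret by an entropy integral in the sup norm, of the rough shape

$$ R_T \;\lesssim\; \inf_{\gamma > 0} \Big\{ \gamma T \;+\; \sqrt{T} \int_{\gamma}^{1} \sqrt{\log \mathcal{N}_\infty(\mathcal{F}, \epsilon)}\; d\epsilon \Big\}, $$

where $\mathcal{N}_\infty(\mathcal{F},\epsilon)$ is the sup-norm covering number of the comparison class; the constants and integration range here are schematic, not quoted from the paper.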

15:20
First-order regret bounds for combinatorial semi-bandits
SPEAKER: Gergely Neu

ABSTRACT. We consider the problem of online combinatorial optimization under semi-bandit feedback, where a learner has to repeatedly pick actions from a combinatorial decision set in order to minimize the total losses associated with its decisions. After making each decision, the learner observes the losses associated with its action, but not other losses. The performance of the learner is measured in terms of its total expected regret against the best fixed action after $T$ rounds. In this paper, we propose a computationally efficient algorithm that improves existing worst-case guarantees of $O(\sqrt{T})$ to $O(\sqrt{L_T^*})$, where $L_T^*$ is the total loss of the best action. Our algorithm is among the first to achieve such guarantees in a partial-feedback scheme, and the first one to do so in a combinatorial setting.

15:25
Online Learning with Feedback Graphs: Beyond Bandits
SPEAKER: unknown

ABSTRACT. We study a general class of online learning problems where the feedback is specified by a graph. This class includes online prediction with expert advice and the multi-armed bandit problem, but also several learning problems where the online player does not necessarily observe his own loss. We analyze how the structure of the feedback graph controls the inherent difficulty of the induced $T$-round learning problem. Specifically, we show that any feedback graph belongs to one of three classes: strongly observable graphs, weakly observable graphs, and unobservable graphs. We prove that the first class induces learning problems with $\widetilde\Theta(\alpha^{1/2} T^{1/2})$ minimax regret, where $\alpha$ is the independence number of the underlying graph; the second class induces problems with $\widetilde\Theta(\delta^{1/3}T^{2/3})$ minimax regret, where $\delta$ is the domination number of a certain portion of the graph; and the third class induces problems with linear minimax regret. Our results subsume much of the previous work on learning with feedback graphs and reveal new connections to partial monitoring games. We also show how the regret is affected if the graphs are allowed to vary with time.
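
A small sketch of the trichotomy as a graph check, following the definitions as I understand them (edge $u \to v$ meaning that playing $u$ reveals the loss of $v$); the function and its conventions are my own illustration, not the paper's code:

    def classify_feedback_graph(n, edges):
        # Edge (u, v) means: playing action u reveals the loss of action v.
        in_nbrs = {v: set() for v in range(n)}
        for u, v in edges:
            in_nbrs[v].add(u)
        # v is observable if some action reveals its loss; strongly observable
        # if v has a self-loop or every other action reveals v's loss.
        observable = all(in_nbrs[v] for v in range(n))
        strongly = all(v in in_nbrs[v] or len(in_nbrs[v] - {v}) == n - 1
                       for v in range(n))
        if strongly:
            return "strongly observable"   # ~ (alpha T)^(1/2) minimax regret
        if observable:
            return "weakly observable"     # ~ delta^(1/3) T^(2/3) minimax regret
        return "unobservable"              # linear minimax regret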

15:30
Regret Lower Bound and Optimal Algorithm in Dueling Bandit Problem
SPEAKER: unknown

ABSTRACT. We study the $K$-armed dueling bandit problem, a variation of the standard stochastic bandit problem where the feedback is limited to relative comparisons of a pair of arms. We introduce a tight asymptotic regret lower bound that is based on the information divergence. An algorithm that is inspired by the Deterministic Minimum Empirical Divergence algorithm (Honda and Takemura, 2010) is proposed, and its regret is analyzed. The proposed algorithm is found to be the first one with a regret upper bound that matches the lower bound. Experimental comparisons of dueling bandit algorithms show that the proposed algorithm significantly outperforms existing ones.

15:35
Contextual Dueling Bandits
SPEAKER: unknown

ABSTRACT. We consider the problem of learning to choose actions using contextual information when provided with limited feedback in the form of relative pairwise comparisons. We study this problem in the dueling-bandits framework of Yue et al. (2009), which we extend to incorporate context. Roughly, the learner’s goal is to find the best policy, or way of behaving, in some space of policies, although “best” is not always so clearly defined. Here, we propose a new and natural solution concept, rooted in game theory, called a von Neumann winner, a randomized policy that beats or ties every other policy. We show that this notion overcomes important limitations of existing solutions, particularly the Condorcet winner which has typically been used in the past, but which requires strong and often unrealistic assumptions. We then present three efficient algorithms for online learning in our setting, and for approximating a von Neumann winner from batch-like data. The first of these algorithms achieves particularly low regret, even when data is adversarial, although its time and space requirements are linear in the size of the policy space. The other two algorithms require time and space only logarithmic in the size of the policy space when provided access to an oracle for solving classification problems on the space.
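
One way to write the von Neumann winner condition (my paraphrase of the abstract, not a quotation): with $M(\pi,\pi')$ the skew-symmetric margin by which policy $\pi$ beats $\pi'$, a von Neumann winner is a distribution $p$ over policies satisfying

$$ \sum_{\pi} p(\pi)\, M(\pi, \pi') \;\ge\; 0 \qquad \text{for every policy } \pi', $$

which always exists by the minimax theorem.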

15:40-16:00 Coffee Break