COLT 2015: 28TH ANNUAL CONFERENCE ON LEARNING THEORY
PROGRAM FOR FRIDAY, JULY 3RD

09:00-09:40 Session 3: Computational Learning
09:00
An Almost Optimal PAC Algorithm (Best Paper Award!)
SPEAKER: Hans Simon

ABSTRACT. The best currently known general lower and upper bounds on the number of labeled examples needed for learning a concept class in the PAC framework (the realizable case) do not perfectly match: they leave a gap of order $\log(1/\epsilon)$ (resp.~a gap which is logarithmic in another one of the relevant parameters). It is an unresolved question whether there exists an ``optimal PAC algorithm'' which establishes a general upper bound with precisely the same order of magnitude as the general lower bound. According to a result of Auer and Ortner, there is no way to show that arbitrary consistent algorithms are optimal, because they can provably differ from optimality by a factor of $\log(1/\epsilon)$. In contrast to this result, we show that every consistent algorithm $L$ (even a provably suboptimal one) induces a family $(L_K)_{K\ge1}$ of PAC algorithms (with $2K-1$ calls of $L$ as a subroutine) which come very close to optimality: the number of labeled examples needed by $L_K$ exceeds the general lower bound only by a factor of $\ell_K(1/\epsilon)$, where $\ell_K$ denotes (a truncated version of) the $K$-times iterated logarithm. Moreover, $L_K$ is applicable to any concept class $C$ of finite VC-dimension, and it can be implemented efficiently whenever the consistency problem for $C$ is feasible. We show furthermore that, for every consistent algorithm $L$, $L_2$ is an optimal PAC algorithm for precisely the same concept classes which were used by Auer and Ortner to show the existence of suboptimal consistent algorithms. This can be seen as an indication that $L_K$ may have an even better performance than is suggested by our worst-case analysis.
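
As a loose illustration of the construction (a hypothetical reading, since the abstract does not spell out how the $2K-1$ calls are combined), one can picture $L_K$ as a majority vote over hypotheses produced by the consistent learner on subsamples; consistent_learner below is a stand-in for any algorithm returning a hypothesis consistent with its input.

    import random

    def L_K(consistent_learner, sample, K):
        # Hypothetical sketch: aggregate 2K - 1 calls of a consistent
        # learner L by majority vote over hypotheses trained on random
        # subsamples. The paper's actual construction of L_K may differ;
        # this only illustrates the "L as a subroutine" pattern.
        m = len(sample)
        hypotheses = [consistent_learner(random.sample(sample, m // 2))
                      for _ in range(2 * K - 1)]
        def majority(x):
            votes = sum(h(x) for h in hypotheses)  # labels in {0, 1}
            return int(2 * votes > len(hypotheses))
        return majority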

09:20
Cortical Learning via Prediction
SPEAKER: unknown

ABSTRACT. What is the mechanism of learning in the brain? Despite breathtaking advances in neuroscience, we do not seem close to an answer. We introduce PJOIN (for “predictive join”), a primitive that combines and extends the operations of JOIN and LINK that are the basis of Valiant’s computational theory of cortex. We show that PJOIN can be implemented naturally in Valiant’s neuroidal model, a conservative formal model of cortical computation. Using PJOIN — and almost nothing else — we give a simple algorithm for unsupervised learning of arbitrary ensembles of binary patterns. This algorithm relies crucially on prediction, and entails significant downward traffic (“feedback”) while parsing stimuli. Prediction and feedback are well-known features of neural cognition and, as far as we know, this is the first theoretical prediction of their essential role in learning.

09:40-10:40 Session 4: Optimization I
09:40
On the Complexity of Learning with Kernels
SPEAKER: unknown

ABSTRACT. A well-recognized limitation of kernel learning is the requirement to handle a kernel matrix, whose size is quadratic in the number of training examples. Many methods have been proposed to reduce this computational cost, mostly by using a subset of the kernel matrix entries, or some form of low-rank matrix approximation, or a random projection method. In this paper, we study lower bounds on the error attainable by such methods as a function of the number of entries observed in the kernel matrix or the rank of an approximate kernel matrix. We show that there are kernel learning problems where no such method will lead to non-trivial computational savings. Our results also quantify how the problem difficulty depends on parameters such as the nature of the loss function, the regularization parameter, the norm of the desired predictor, and the kernel matrix rank. Our results also suggest cases where more efficient kernel learning might be possible.

10:00
Escaping From Saddle Points --- Online Stochastic Gradient for Tensor Decomposition
SPEAKER: unknown

ABSTRACT. We analyze stochastic gradient descent for optimizing non-convex functions. For non-convex functions, it is often good enough to find a reasonable local minimum, and the main concern is that gradient updates may become trapped at saddle points. In this paper we identify a strict saddle property for non-convex problems that allows for efficient optimization, and show that stochastic gradient descent converges to a local minimum in a polynomial number of iterations. To the best of our knowledge, this is the first work that gives global convergence guarantees for stochastic gradient descent on non-convex functions with exponentially many local minima and saddle points.

Our analysis can be applied to orthogonal tensor decomposition, which is widely used in learning a rich class of latent variable models. We propose a new optimization formulation for the tensor decomposition problem that has strict saddle property. As a result we get the first online algorithm for orthogonal tensor decomposition with convergence guarantee.
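
A minimal sketch of the kind of noisy stochastic gradient method analyzed above; the isotropic noise added at each step is what allows the iterates to escape strict saddle points. The toy objective, step size, and noise scale are illustrative assumptions, not the paper's tensor-decomposition formulation.

    import numpy as np

    def noisy_sgd(grad, x0, eta=0.01, noise=0.1, iters=10000, seed=0):
        # SGD with isotropic noise injected at every step, so that strict
        # saddle points (where the Hessian has a direction of negative
        # curvature) are escaped with high probability.
        rng = np.random.default_rng(seed)
        x = np.array(x0, dtype=float)
        for _ in range(iters):
            x -= eta * (grad(x) + noise * rng.standard_normal(x.shape))
        return x

    # Toy objective f(x, y) = x^4/4 - x^2/2 + y^2/2: the origin is a
    # strict saddle; the minima are at (+/-1, 0).
    grad = lambda v: np.array([v[0] ** 3 - v[0], v[1]])
    print(noisy_sgd(grad, [0.0, 0.0]))  # lands near a minimum, not the saddle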

10:20
Max vs Min: Tensor Decomposition and ICA with nearly Linear Sample Complexity
SPEAKER: unknown

ABSTRACT. We present a simple, general technique for reducing the sample complexity of matrix and tensor decomposition algorithms applied to distributions. We use the technique to give a polynomial-time algorithm for standard ICA with sample complexity nearly linear in the dimension, thereby improving substantially on previous bounds. The analysis is based on properties of random polynomials, namely the spacings of an ensemble of polynomials. Our technique also applies to other applications of tensor decompositions, including spherical Gaussian mixture models.

10:25
Adaptive recovery of signals by convex optimization
SPEAKER: unknown

ABSTRACT. We present a theoretical framework for adaptive estimation and prediction of signals of unknown structure in the presence of noise. It addresses two intertwined challenges: (i) designing optimal statistical estimators; (ii) designing efficient numerical algorithms. In particular, we establish oracle inequalities for performance of adaptive procedures, which rely upon convex optimization and thus can be efficiently implemented. We illustrate some potential applications of the proposed approach.

10:30
Competing with the Empirical Risk Minimizer in a Single Pass
SPEAKER: unknown

ABSTRACT. Many optimization problems that arise in science and engineering are those in which we only have a stochastic approximation to the underlying objective (e.g. estimation problems such as linear regression). That is, given some distribution $\mathcal{D}$ over functions $\psi$, we wish to minimize $P(x) = \mathbb{E}_{\psi \sim \mathcal{D}}[\psi(x)]$, using as few samples from $\mathcal{D}$ as possible. In the absence of computational constraints, the empirical risk minimizer (ERM) -- the minimizer on a sample average of observed data -- is widely regarded as the estimation strategy of choice due to its desirable statistical convergence properties. Our goal is to do as well as the empirical risk minimizer, on every problem, while minimizing the use of computational resources such as running time and space usage.

This work provides a simple streaming algorithm with performance guarantees competitive with that of the ERM. In particular, under standard regularity assumptions on $\mathcal{D}$, our algorithm enjoys the following properties:

1. The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample.

2. The algorithm achieves the same statistical rate of convergence as the empirical risk minimizer on every problem $\mathcal{D}$ (even considering constant factors).

3. The algorithm's performance depends on the initial error at a rate that decreases super-polynomially.

Moreover, we quantify the rate of convergence of both our algorithm and the ERM, showing that, after a number of samples that depends only on the regularity parameters, the streaming algorithm rapidly approaches the leading-order rate of the ERM in expectation.
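
A minimal single-pass sketch in the spirit of these guarantees: streaming SGD with iterate averaging for least squares, using one pass and O(d) memory. The step-size schedule and the squared-loss objective are illustrative assumptions, not necessarily the paper's algorithm.

    import numpy as np

    def single_pass_regression(stream, d, eta=0.1):
        # One pass over (x, y) samples using O(d) memory: SGD on the
        # squared loss plus a running average of the iterates; the
        # averaged iterate is what one would compare against the ERM.
        w = np.zeros(d)
        w_avg = np.zeros(d)
        for t, (x, y) in enumerate(stream, start=1):
            g = (w @ x - y) * x          # gradient of 0.5 * (w.x - y)^2
            w -= eta / np.sqrt(t) * g    # decaying step size (an assumption)
            w_avg += (w - w_avg) / t     # running average, still O(d) space
        return w_avg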

10:35
Lower and Upper Bounds on the Generalization of Stochastic Exponentially Concave Optimization
SPEAKER: unknown

ABSTRACT. In this paper we derive \textit{high probability} lower and upper bounds on the excess risk of stochastic optimization of exponentially concave loss functions. Exponentially concave loss functions encompass several fundamental problems in machine learning, such as squared loss in linear regression, logistic loss in classification, and negative logarithm loss in portfolio management. We demonstrate an $O(d \log T/T)$ upper bound on the excess risk of the stochastic online Newton step algorithm, and an $O(d/T)$ lower bound on the excess risk of general stochastic exponentially concave optimization methods, indicating that the obtained upper bound is optimal up to a logarithmic factor. The analysis of the upper bound is based on a novel concentration inequality for bounding martingales, which is interesting in its own right, and the proof of the lower bound uses a probabilistic method and relies on an information-theoretic minimax analysis.
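
A minimal sketch of the online Newton step update referenced above, for generic exp-concave losses; the parameter choices and the omitted projection back onto the feasible domain are simplifying assumptions.

    import numpy as np

    def online_newton_step(loss_grads, d, gamma=0.5, eps=1.0):
        # Online Newton Step for exp-concave losses: accumulate the outer
        # products of observed gradients in A and take a preconditioned
        # gradient step. loss_grads yields, each round, a callable giving
        # the loss gradient at the current iterate.
        x = np.zeros(d)
        A = eps * np.eye(d)
        for grad in loss_grads:
            g = grad(x)
            A += np.outer(g, g)
            x -= (1.0 / gamma) * np.linalg.solve(A, g)
        return x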

10:40-11:10 Coffee Break
11:10-12:30 Session 5: On-Line Learning & Bandits I
11:10
From Averaging to Acceleration, There is Only a Step-size
SPEAKER: unknown

ABSTRACT. We show that accelerated gradient descent, averaged gradient descent and the heavy-ball method for non-strongly convex problems may be reformulated as constant-parameter second-order difference equation algorithms, where stability of the system is equivalent to convergence at rate O(1/n^2), where n is the number of iterations. We provide a detailed analysis of the eigenvalues of the corresponding linear dynamical system, showing various oscillatory and non-oscillatory behaviors, together with a sharp stability result with explicit constants. We also consider the situation where noisy gradients are available, for which we extend our general convergence result; this suggests an alternative algorithm (i.e., with different step sizes) that exhibits the good aspects of both averaging and acceleration.
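
The reformulation can be made concrete: for a fixed gradient oracle, averaged, accelerated and heavy-ball gradient descent all fit the same constant-coefficient two-step recursion, differing only in the scalar coefficients. A minimal sketch (the coefficient choices are left to the caller; the paper's tuned values are not reproduced here):

    import numpy as np

    def two_step_method(grad, x0, step, momentum, iters=100):
        # Generic constant-coefficient second-order difference equation:
        #   x_{n+1} = x_n - step * grad(x_n) + momentum * (x_n - x_{n-1}).
        # momentum = 0 recovers plain gradient descent; other (step,
        # momentum) choices give heavy-ball / acceleration-style methods.
        x_prev = np.asarray(x0, dtype=float)
        x = x_prev.copy()
        for _ in range(iters):
            x, x_prev = x - step * grad(x) + momentum * (x - x_prev), x
        return x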

11:30
On-Line Learning Algorithms for Path Experts with Non-Additive Losses
SPEAKER: unknown

ABSTRACT. We consider two broad families of non-additive loss functions covering a large number of applications: rational losses and tropical losses. We give new algorithms extending the Follow-the-Perturbed-Leader (FPL) algorithm to both of these families of loss functions and similarly give new algorithms extending the Randomized Weighted Majority (RWM) algorithm to both of these families. We prove that the time complexity of our extensions to rational losses of both FPL and RWM is polynomial and present regret bounds for both. We further show that these algorithms can play a critical role in improving performance in applications such as structured prediction.

11:50
Achieving All with No Parameters: Adaptive NormalHedge
SPEAKER: unknown

ABSTRACT. We study the classic online learning problem of predicting with expert advice, and propose a truly parameter-free and adaptive algorithm that achieves several objectives simultaneously without using any prior information. The main component of this work is an improved version of the NormalHedge.DT algorithm (Luo and Schapire, 2014), called AdaNormalHedge. On the one hand, this new algorithm ensures small regret when the competitor has small loss and almost constant regret when the losses are stochastic. On the other hand, the algorithm is able to compete with any convex combination of the experts simultaneously, with a regret in terms of the relative entropy of the prior and the competitor. This resolves an open problem proposed by Chaudhuri et al. (2009) and Chernov and Vovk (2010). Moreover, we extend the results to the sleeping expert setting and provide two applications to illustrate the power of AdaNormalHedge: 1) competing with time-varying unknown competitors, and 2) predicting almost as well as the best pruning tree. Our results on these applications significantly improve on previous work in several respects, and a special case of the first application resolves another open problem proposed by Warmuth and Koolen (2014) on whether one can simultaneously achieve optimal shifting regret for both adversarial and stochastic losses.
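
For concreteness, a sketch of the parameter-free weighting as we understand it from the paper; the potential and weight formulas below are our reading of AdaNormalHedge and should be checked against the paper before use.

    import numpy as np

    def adanormalhedge_weights(R, C, prior):
        # Our reading of the AdaNormalHedge weighting: with potential
        #   Phi(R, C) = exp(max(R, 0)^2 / (3 C)),
        # expert i gets weight proportional to
        #   prior_i * (Phi(R_i + 1, C_i + 1) - Phi(R_i - 1, C_i + 1)) / 2,
        # where R_i is its cumulative regret and C_i its cumulative
        # absolute regret; no learning rate is tuned anywhere.
        Phi = lambda r, c: np.exp(np.maximum(r, 0.0) ** 2 / (3.0 * c))
        w = prior * (Phi(R + 1, C + 1) - Phi(R - 1, C + 1)) / 2.0
        s = w.sum()
        return w / s if s > 0 else prior / prior.sum()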

11:55
Second-order Quantile Methods for Experts and Combinatorial Games
SPEAKER: unknown

ABSTRACT. We aim to design strategies for sequential decision making that adjust to the difficulty of the learning problem. We study this question both in the setting of prediction with expert advice, and for more general combinatorial decision tasks. We are not satisfied with just guaranteeing minimax regret rates, but we want our algorithms to perform significantly better on easy data. Two popular ways to formalize such adaptivity are second-order regret bounds and quantile bounds. The underlying notions of `easy data', which may be paraphrased as ``the learning problem has small variance'' and ``multiple decisions are useful'', are synergetic. But even though there are sophisticated algorithms that exploit one of the two, no existing algorithm is able to adapt to both.

In this paper we outline a new method for obtaining such adaptive algorithms, based on a potential function that aggregates a range of learning rates (which are essential tuning parameters). By choosing the right prior we construct efficient algorithms and show that they reap both benefits by proving the first bounds that are both second-order and incorporate quantiles.

12:00
Online Density Estimation of Bradley-Terry Models
SPEAKER: unknown

ABSTRACT. We consider an online density estimation problem for Bradley-Terry models, which determine the probability of a match result between any pair in a set of $n$ teams. An annoying issue is that the loss function is not convex. A standard solution to avoid the non-convexity is to change variables so that the new loss function w.r.t.~the new variables is convex. But then the radius $K$ of the new domain might be huge or infinite in general. When $K$ is regarded as a constant, the standard algorithms OGD (online gradient descent) and ONS (online Newton step) have regret bounds $O(n^{\frac{1}{2}}(\ln K)\sqrt{T})$ and $O(n^{\frac{3}{2}}K\ln T)$, respectively.

First, we derive a $\Theta(T^{n-1})$ bound on the radius $K$ of the reparameterized convex domain under a mild assumption. The bound implies a worst case where the regret bounds of OGD and ONS are $O(n^{\frac{3}{2}}\sqrt{T}\ln T)$ and $\tilde{O}(n^{\frac{3}{2}}T^{n-1})$, respectively.

Then, we propose a version of Follow the Regularized Leader (FTRL) whose regret bound is close to the minimum of those of OGD and ONS. Therefore, our algorithm is competitive both when $K$ is constant and when it is $O(T^{n-1})$. In particular, with an appropriate parameter, our algorithm achieves a better regret bound of $O(n^{\frac{5}{2}}T^{\frac{1}{3}} \ln T)$ in the worst case. In addition, our algorithm works without any knowledge of the radius, which is a practical advantage.
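
For intuition about the convexifying change of variables: setting team $i$'s strength to $\exp(v_i)$ makes the match probability a logistic function of $v_i - v_j$, so the negative log-likelihood is convex in $v$. A minimal sketch of one OGD step on this loss (the step size is an illustrative assumption):

    import numpy as np

    def bt_ogd_step(v, i, j, won, eta=0.1):
        # One OGD step on the Bradley-Terry log-loss after the change of
        # variables theta_i = exp(v_i), under which P(i beats j) is
        # sigmoid(v_i - v_j) and the loss is convex in v.
        p = 1.0 / (1.0 + np.exp(v[j] - v[i]))  # P(i beats j)
        g = p - float(won)                     # gradient w.r.t. (v_i - v_j)
        v = v.copy()
        v[i] -= eta * g
        v[j] += eta * g
        return v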

12:05
Hierarchies of Relaxations for Online Prediction Problems with Evolving Constraints
SPEAKER: unknown

ABSTRACT. We propose a framework for online prediction with a benchmark defined through an evolving constraint set. We analyze a method based on random playout. While computing the optimal decision in hindsight might be hard due to the combinatorial constraints, we provide polynomial-time prediction algorithms based on the Lasserre semidefinite hierarchy. Since the predictions are improper, the algorithm only needs the value of an online relaxation and not the integral solution. We provide a generic regret bound based on the integrality gap of the Lasserre hierarchy at level $r$, thus establishing a link between the time required for solving the semidefinite relaxation and the associated regret bound in terms of Rademacher complexity. Our results are applicable to problems involving prediction over evolving graphs with stochastic side information.

12:10
On the Complexity of Bandit Linear Optimization
SPEAKER: Ohad Shamir

ABSTRACT. We study the attainable regret for online linear optimization problems with bandit feedback, where unlike the full-information setting, the player can only observe its own loss rather than the full loss vector. We show that the price of bandit information in this setting can be as large as $d$, disproving the well-known conjecture (\cite{dani2007price}) that the regret for bandit linear optimization is at most $\sqrt{d}$ times the full-information regret. Surprisingly, this is shown using ``trivial'' modifications of standard domains, which have no effect in the full-information setting. This and other results we present highlight some interesting differences between full-information and bandit learning, which were not considered in previous literature.

12:15
Bandit Convex Optimization: $\sqrt{T}$ Regret in One Dimension
SPEAKER: unknown

ABSTRACT. We analyze the minimax regret of the adversarial bandit convex optimization problem. Focusing on the one-dimensional case, we prove that the minimax regret is $\widetilde\Theta(\sqrt{T})$ and partially resolve a decade-old open problem. Our analysis is non-constructive, as we do not present a concrete algorithm that attains this regret rate. Instead, we use minimax duality to reduce the problem to a Bayesian setting, where the convex loss functions are drawn from a worst-case distribution, and then we solve the Bayesian version of the problem with a variant of Thompson Sampling. Our analysis features a novel use of convexity, formalized as a ``local-to-global'' property of convex functions, that may be of independent interest.

12:20
Batched Bandit Problems
SPEAKER: unknown

ABSTRACT. Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic multi-armed bandits under the constraint that the employed policy must split trials into a small number of batches. Our results show that a very small number of batches already gives regret bounds close to minimax optimal, and we also evaluate the number of trials in each batch. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits.
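
A minimal sketch of the batching constraint: pulls are committed one whole batch at a time, and statistics are updated only at batch boundaries. The elimination rule used below is a standard illustrative policy, not necessarily the paper's.

    import numpy as np

    def batched_elimination(pull, n_arms, batch_sizes, delta=0.05):
        # Stochastic bandits in a fixed number of batches: within a batch,
        # pulls are split evenly over the arms still in play, and arms are
        # eliminated only at batch boundaries. pull(a) returns a reward in
        # [0, 1]; batch_sizes fixes the (small) number of batches.
        active = list(range(n_arms))
        sums = np.zeros(n_arms)
        counts = np.zeros(n_arms)
        for size in batch_sizes:
            per_arm = max(size // len(active), 1)
            for a in active:              # the whole batch is committed up front
                for _ in range(per_arm):
                    sums[a] += pull(a)
                    counts[a] += 1
            means = sums[active] / counts[active]
            radius = np.sqrt(np.log(2 * n_arms / delta) / (2 * counts[active]))
            keep = means + radius >= (means - radius).max()
            active = [a for a, k in zip(active, keep) if k]
        return max(active, key=lambda a: sums[a] / counts[a])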

12:25
Learnability of Solutions to Conjunctive Queries: The Full Dichotomy
SPEAKER: unknown

ABSTRACT. The problem of learning the solution space of an unknown formula has been studied in multiple embodiments in computational learning theory. In this article, we study a family of such learning problems; this family contains, for each relational structure, the problem of learning the solution space of an unknown conjunctive query evaluated on the structure. A progression of results has aimed to classify the learnability of each of the problems in this family; its culmination thus far was a positive learnability result generalizing all previous ones. This article completes the classification program towards which this progression of results strived, by presenting a negative learnability result that complements the aforementioned positive one. In order to obtain our negative result, we make use of universal-algebraic concepts, and our result is phrased in terms of the varietal property of non-congruence modularity.

12:30-14:30 Lunch Break
14:30-15:40 Session 6: Classification
14:30
MCMC Learning
SPEAKER: unknown

ABSTRACT. The theory of learning under the uniform distribution is rich and deep, with connections to cryptography, computational complexity, and the analysis of Boolean functions, to name a few areas. This theory, however, is very limited due to the fact that the uniform distribution and the corresponding Fourier basis are rarely encountered as a statistical model.

A family of distributions that vastly generalizes the uniform distribution on the Boolean cube is that of distributions represented by Markov Random Fields (MRF). Markov Random Fields are one of the main tools for modeling high dimensional data in many areas of statistics and machine learning.

In this paper we initiate the investigation of extending central ideas, methods and algorithms from the theory of learning under the uniform distribution to the setup of learning concepts given examples from MRF distributions. In particular, our results establish a novel connection between properties of MCMC sampling of MRFs and learning under the MRF distribution.
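
A minimal sketch of the MCMC side of this connection: a Gibbs sampler for a simple pairwise Ising-type MRF on $\{-1,+1\}^n$, which could supply the examples a learner sees in place of uniform samples. The model and its parameters are illustrative assumptions.

    import numpy as np

    def gibbs_ising(J, steps, rng=None):
        # Gibbs sampler for an Ising MRF on {-1, +1}^n with coupling
        # matrix J: resample one coordinate at a time from its conditional
        # distribution. The stationary law is p(x) ~ exp(x' J x / 2).
        rng = rng or np.random.default_rng(0)
        n = J.shape[0]
        x = rng.choice([-1, 1], size=n)
        for _ in range(steps):
            i = rng.integers(n)
            field = J[i] @ x - J[i, i] * x[i]   # influence of the other spins
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
            x[i] = 1 if rng.random() < p_plus else -1
        return x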

14:50
Learning and inference in the presence of corrupted inputs
SPEAKER: unknown

ABSTRACT. We consider a model where, given an uncorrupted input, an adversary can corrupt it to one of $m$ corrupted inputs. We model the classification and inference problems as a zero-sum game between a learner, minimizing the expected error, and an adversary, maximizing the expected error. The value of this game is the optimal error rate achievable.

For learning with a limited hypothesis class $\mathcal{H}$ over corrupted inputs, we give an efficient algorithm that, given an uncorrupted sample, returns a hypothesis $h \in \mathcal{H}$ whose error on adversarially corrupted inputs is near optimal. Our algorithm uses as a black box an oracle that solves the ERM problem for the hypothesis class $\mathcal{H}$. We provide a generalization bound for our setting, showing that for a sufficiently large sample, the performance on the sample and on future unseen corrupted inputs will be similar. This gives an efficient learning algorithm for our adversarial setting, based on an ERM oracle.

We also consider an inference-related setting of the problem, where, given a corrupted input, the learner queries the target function on various uncorrupted inputs and generates a prediction regarding the given corrupted input. There is no limitation on the prediction function the learner may generate, so implicitly the hypothesis class includes all possible hypotheses. In this setting we characterize the optimal learner policy as a minimum vertex cover in a given bipartite graph, and the optimal adversary policy as a maximum matching in the same bipartite graph. We design efficient local algorithms for approximating minimum vertex cover in bipartite graphs, which implies an efficient near-optimal algorithm for the learner.
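
The two optimal policies in the last paragraph are the two sides of König's theorem: in a bipartite graph, maximum matching and minimum vertex cover have the same size, and the cover can be read off from the matching. A minimal sketch with networkx (the small graph is a placeholder for the corruption graph described above):

    import networkx as nx

    # Hypothetical corruption graph: left nodes are uncorrupted inputs,
    # right nodes are corrupted inputs, an edge means "can be corrupted into".
    G = nx.Graph()
    left = ["u0", "u1", "u2"]
    G.add_edges_from([("u0", "c0"), ("u0", "c1"), ("u1", "c1"), ("u2", "c2")])

    # Maximum matching (optimal adversary) and, via Koenig's theorem, the
    # minimum vertex cover (optimal learner) derived from it.
    matching = nx.bipartite.maximum_matching(G, top_nodes=left)
    cover = nx.bipartite.to_vertex_cover(G, matching, top_nodes=left)
    print(len(matching) // 2, sorted(cover))  # matching size == cover size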

15:10
A PTAS for Agnostically Learning Halfspaces
SPEAKER: Amit Daniely

ABSTRACT. We present a PTAS for agnostically learning halfspaces w.r.t. the uniform distribution on the $d$-dimensional sphere. Namely, we show that for every $\mu>0$ there is an algorithm that runs in time $\mathrm{poly}(d, \frac{1}{\epsilon})$ and is guaranteed to return a classifier with error at most $(1+\mu)\mathrm{OPT}+\epsilon$, where $\mathrm{OPT}$ is the error of the best halfspace classifier. This improves on Awasthi, Balcan and Long, who showed an algorithm with an (unspecified) constant approximation ratio. Our algorithm combines the classical technique of polynomial regression (e.g. LMN89, KKMS05) with the new localization technique of ABL14.

15:15
Convex Risk Minimization and Conditional Probability Estimation
SPEAKER: unknown

ABSTRACT. This manuscript strengthens the link between convex risk minimization and conditional probability estimation, a connection already notable for establishing consistency results (Friedman et al., 2000; Zhang, 2004b; Bartlett et al., 2006). Specifically, this manuscript first shows that a loss function, linear space of predictors, and probability measure together define a unique optimal conditional probability model, moreover one which may be attained by the usual convex risk minimization. This result is proved in infinite dimensions, and thus gives a concrete convergence target for unregularized methods like boosting which can fail to have minimizers. Second, this convergence result is refined in finitely many dimensions to hold for empirical risk minimization. This uniform convergence result exhibits no dependence on the norms of its predictors, and thus can justify the practical effectiveness of minimally-regularized optimization schemes.

15:20
Efficient Learning of Linear Separators under Bounded Noise
SPEAKER: unknown

ABSTRACT. We study the learnability of linear separators in $\mathbb{R}^d$ in the presence of bounded (a.k.a. Massart) noise. This is a realistic generalization of the random classification noise model, where the adversary can flip each example $x$ with probability $\eta(x) \leq \eta$. We provide the first polynomial-time algorithm that can learn linear separators to arbitrarily small excess error in this noise model under the uniform distribution over the unit sphere in $\mathbb{R}^d$. While widely studied in the statistical learning theory community in the context of getting faster convergence rates, computationally efficient algorithms in this model had remained elusive. Our work provides the first evidence that one can indeed design algorithms achieving arbitrarily small excess error in polynomial time under this realistic noise model, and thus opens up a new and exciting line of research.

We additionally provide lower bounds showing that popular algorithms such as hinge loss minimization and averaging cannot achieve arbitrarily small excess error under Massart noise, even under the uniform distribution. Our work instead makes use of a margin-based technique developed in the context of active learning. As a result, our algorithm is also an active learning algorithm, with only a logarithmic dependence on the desired excess error $\epsilon$.

15:25
Optimally Combining Classifiers Using Unlabeled Data
SPEAKER: unknown

ABSTRACT. We develop a worst-case analysis of aggregation of classifier ensembles for binary classification. The task of predicting to minimize error is formulated as a game played over a given set of unlabeled data (a transductive setting), where prior label information is encoded as constraints on the game. The minimax solution of this game identifies cases where a weighted combination of the classifiers can perform significantly better than any single classifier.

15:30
$S^2$: An Efficient Graph Based Active Learning Algorithm with Application to Nonparametric Classification
SPEAKER: unknown

ABSTRACT. This paper investigates the problem of active learning for binary label prediction on a graph. We introduce a simple and label-efficient algorithm called $S^2$ for this task. At each step, $S^2$ selects the vertex to be labeled based on the structure of the graph and all previously gathered labels. Specifically, $S^2$ queries the label of the vertex that bisects the {\em shortest shortest} path between any pair of oppositely labeled vertices. We present a theoretical estimate of the number of queries $S^2$ needs in terms of a novel parametrization of the complexity of binary functions on graphs. We also present experimental results demonstrating the performance of $S^2$ on both real and synthetic data. While other graph-based active learning algorithms have shown promise in practice, our algorithm is the first with both good performance and theoretical guarantees. Finally, we demonstrate the implications of the $S^2$ algorithm for the theory of nonparametric active learning. In particular, we show that $S^2$ achieves near minimax-optimal excess risk for an important class of nonparametric classification problems.
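
A minimal sketch of the query rule at the heart of $S^2$: among all pairs of oppositely labeled vertices, find the shortest connecting path and query the vertex that bisects it. The random-sampling phase and stopping conditions of the full algorithm are omitted.

    import itertools
    import networkx as nx

    def s2_query(G, labels):
        # Return the next vertex S^2 would query: the midpoint of the
        # shortest among all shortest paths between oppositely labeled
        # vertices (the "shortest shortest" path), or None if no such
        # path with an unlabeled interior remains.
        pos = [v for v, y in labels.items() if y == 1]
        neg = [v for v, y in labels.items() if y == 0]
        best = None
        for u, v in itertools.product(pos, neg):
            try:
                path = nx.shortest_path(G, u, v)
            except nx.NetworkXNoPath:
                continue
            if best is None or len(path) < len(best):
                best = path
        if best is None or len(best) <= 2:   # endpoints already adjacent
            return None
        return best[len(best) // 2]          # vertex bisecting the path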

15:35
Hierarchical label queries with data-dependent partitions
SPEAKER: unknown

ABSTRACT. Given a joint distribution $P_{X, Y}$ over a space $\mathcal{X}$ and a label set $\mathcal{Y} = \{0, 1\}$, we consider the problem of recovering the labels of an unlabeled sample with as few label queries as possible. Recovered labels can be passed to a passive learner, thus turning the procedure into an active learning approach.

We analyze a family of labeling procedures based on a hierarchical clustering of the data. While such labeling procedures have been studied in the past, we provide a new parametrization of $P_{X, Y}$ that captures their behavior in general low-noise settings, and which accounts for data-dependent clustering, thus providing new theoretical underpinning to practically used tools.

15:40-16:00 Coffee Break
18:45-23:00 Session: Cocktail on top of Zamansky Tower

Cocktails and light fare will be served on the top floor of a tall university building. We have booked the room until 11 PM, so there is no rush, but note that this is not really a dinner.