COLT 2016: THE 29TH ANNUAL CONFERENCE ON LEARNING THEORY
PROGRAM FOR SATURDAY, JUNE 25TH

09:00-10:00 Session 13: Online Learning I
09:00
Provably manipulation-resistant reputation systems (Best Student Paper Award)

ABSTRACT. Reputation and reliability play a central role in a wide range of applications, from online marketplaces to review aggregators to ridesharing services. Many reputation systems are vulnerable to manipulation, and protected only by keeping algorithms secret, avoiding high-stakes applications, or using side information to identify malicious users. The current situation is reminiscent of pre-modern cryptography, characterized by a patchwork of ad hoc techniques with minimal formal understanding of their security.

We propose a reputation system which provably achieves a very strong correctness guarantee under extremely pessimistic assumptions—it works even given a supermajority of malicious users, converges to optimal behavior after a constant number of interactions per user, does not require repeated interactions, and accommodates time-varying quality of resources.

Our formal model is simple but general. In each period, a user is given an opportunity to inter- act with a resource, and must accept or reject the proposed interaction. If they accept, they receive a payoff in [−1, 1]. Ideally all users would behave honestly, pooling their data and quickly learn- ing which resources are worth interacting with. Our protocol essentially matches this performance when all users are honest, while guaranteeing that adding malicious users or users with varying tastes does very little damage.

We also extend our results to a more challenging setting where users interact with each other rather than with static resources, and where the two parties to an interaction may receive different payoffs.

09:20
On the capacity of information processing systems

ABSTRACT. We propose and analyze a family of information processing systems, where a finite set of experts or servers are employed to extract information about a stream of incoming jobs. Each job is associated with a hidden label drawn from some prior distribution. An inspection by an expert produces a noisy outcome that depends both on the job's hidden label and the type of the expert, and occupies the expert for a finite time duration. A decision maker's task is to dynamically assign inspections so that the resulting outcomes can be used to accurately recover the labels of all jobs, while keeping the system stable. Among our chief motivations are applications in crowd-sourcing, diagnostics, and experiment designs, where one wishes to efficiently discover the nature of a large number of items, using a finite pool of computational resources or human agents.

We focus on the capacity of such an information processing system. Given a level of accuracy guarantee, we ask how many experts are needed in order to stabilize the system, and through what inspection architecture. Our main result provides an adaptive inspection policy that is asymptotically optimal in the following sense: the ratio between the required number of experts under our policy and the theoretical optimal converges to one, as the probability of error in label recovery tends to zero.

09:40
Online learning in repeated auctions

ABSTRACT. Motivated by online advertising auctions, we consider repeated Vickrey auctions where goods of unknown value are sold sequentially and bidders only learn (potentially noisy) information about a good's value once it is purchased. We adopt an online learning approach with bandit feedback to model this problem and derive bidding strategies for two models: stochastic and adversarial. In the stochastic model, the observed values of the goods are random variables centered around the true value of the good. In this case, logarithmic regret is achievable when competing against well behaved adversaries. In the adversarial model, the goods need not be identical. Comparing our performance against that of the best fixed bid in hindsight, we show that sublinear regret is also achievable in this case. For both the stochastic and adversarial models, we prove matching minimax lower bounds showing our strategies to be optimal up to lower-order terms. To our knowledge, this is the first complete set of strategies for bidders participating in auctions of this type.
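In the stochastic model, one natural strategy (sketched here with assumed notation; the paper's exact algorithm may differ) is UCB-style bidding: bid an optimistic estimate of the good's value, and update the estimate only on wins, since in a second-price auction the winner pays the second-highest bid:

```python
import math
import random

random.seed(1)
true_value, noise = 0.7, 0.1   # hypothetical good value and observation noise
mean, n, wins = 0.0, 0, 0

for t in range(1, 5001):
    # Optimistic bid: empirical mean plus a confidence bonus
    # (bid high when we have no data, to force exploration).
    bid = 1.0 if n == 0 else mean + math.sqrt(2 * math.log(t) / n)
    competitor = random.random()        # highest competing bid this round
    if bid > competitor:                # win and pay the second price
        v = true_value + random.gauss(0.0, noise)   # noisy observed value
        mean, n, wins = (mean * n + v) / (n + 1), n + 1, wins + 1
```

Because overbidding in a Vickrey auction only hurts through unprofitable wins, optimism is cheap here, which is what makes logarithmic regret plausible in this regime.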

10:00-10:20 Coffee Break
10:20-11:10 Session 14: Statistical inference
10:20
Monte Carlo Markov Chain Algorithms for Sampling Strongly Rayleigh distributions and Determinantal Point Processes

ABSTRACT. Strongly Rayleigh distributions are natural generalizations of product and determinantal probability distributions and satisfy the strongest form of negative dependence properties. We show that the ``natural'' Monte Carlo Markov Chain (MCMC) algorithm mixes rapidly in the support of a homogeneous strongly Rayleigh distribution. As a byproduct, our proof implies that Markov chains can be used to efficiently generate approximate samples of a $k$-determinantal point process. This answers an open question raised by Deshpande and Rademacher which was studied recently by Kang, Li-Jegelka-Sra, and Rebeschini-Karbasi.
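The ``natural'' chain in question is the swap (base-exchange) walk. A minimal NumPy sketch for the special case of a $k$-determinantal point process, where $P(S) \propto \det(L_S)$ for a PSD kernel $L$ (a hypothetical toy kernel below):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 8, 3
A = rng.normal(size=(N, N))
L = A @ A.T + np.eye(N)                 # toy PSD kernel for the k-DPP

def det_S(S):
    idx = sorted(S)
    return float(np.linalg.det(L[np.ix_(idx, idx)]))

S = set(range(k))                       # start from an arbitrary size-k set
for _ in range(5000):
    i = int(rng.choice(sorted(S)))                   # element to drop
    j = int(rng.choice(sorted(set(range(N)) - S)))   # element to add
    T = (S - {i}) | {j}
    # Metropolis filter targets P(S) proportional to det(L_S).
    if rng.random() < min(1.0, det_S(T) / det_S(S)):
        S = T
```

The chain always keeps $|S| = k$ (homogeneity); the paper's contribution is showing this walk mixes rapidly for the whole strongly Rayleigh class.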

10:30
When Can We Rank Well from Comparisons of $O(n\log n)$ Non-Actively Chosen Pairs?

ABSTRACT. Ranking from pairwise comparisons is a ubiquitous problem and has been studied in disciplines ranging from statistics to operations research and from theoretical computer science to machine learning. Here we consider a general setting where outcomes of pairwise comparisons between items $i$ and $j$ are drawn probabilistically by flipping a coin with unknown bias $P_{ij}$, and ask under what conditions on these unknown probabilities one can learn a good ranking from comparisons of only $O(n\log n)$ non-actively chosen pairs. Recent work has established this is possible under the Bradley-Terry-Luce (BTL) and noisy permutation (NP) models. Here we introduce a broad family of conditions on the probabilities $P_{ij}$ under which the resulting preference matrix $P$ has low rank under some link function, and show these encompass the BTL and Thurstone classes as special cases (but are considerably more general). We then give a new algorithm called low-rank pairwise ranking (LRPR) which provably learns a good ranking from only $O(n\log n)$ randomly chosen comparisons under such low-rank models. Our algorithm and corresponding analysis make use of tools from the theory of low-rank matrix approximation, and provide a new perspective on the problem of ranking from pairwise comparisons in non-active settings.
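A rough sketch of the low-rank idea (a simplification for illustration, not the LRPR algorithm itself): estimate $P_{ij}$ from a few random comparisons, map through an assumed logit link, truncate to low rank via an SVD, and rank items by Borda-style row sums:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
s = rng.normal(size=n)                                # hypothetical latent skills
P = 1.0 / (1.0 + np.exp(-(s[:, None] - s[None, :])))  # BTL-style P_ij

# Non-active sampling: O(n log n) pairs, a few coin flips each.
wins = np.zeros((n, n))
cnt = np.zeros((n, n))
for i, j in rng.integers(n, size=(int(40 * n * np.log(n)), 2)):
    if i != j:
        wins[i, j] += rng.random() < P[i, j]
        cnt[i, j] += 1
Phat = np.where(cnt > 0, wins / np.maximum(cnt, 1.0), 0.5)

# Inverse link, then a rank-2 truncation (the matrix s_i - s_j has rank 2).
C = np.clip(Phat, 0.05, 0.95)
M = np.log(C / (1.0 - C))
U, sig, Vt = np.linalg.svd(M)
Mlow = (U[:, :2] * sig[:2]) @ Vt[:2]
ranking = np.argsort(-Mlow.sum(axis=1))               # Borda-style scores
```

The low-rank truncation is what pools information across pairs, so each pair needs only a constant number of flips.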

10:40
Asymptotic behavior of $\ell_q$-based Laplacian regularization in semi-supervised learning

ABSTRACT. Given a weighted graph with $n$ vertices, consider a real-valued regression problem in a semi-supervised setting, where one observes $n_\ell$ labeled vertices, and the task is to label the remaining ones. We present a theoretical study of $\ell_q$-based Laplacian regularization under a $d$-dimensional geometric random graph model. We provide a variational characterization of the performance of this regularized learner as $n$ grows to infinity while $n_\ell$ stays constant; the associated optimality conditions lead to a partial differential equation that must be satisfied by the associated function estimate $\widehat f$. From this formulation we derive several predictions on the limiting behavior of the $d$-dimensional function $\widehat f$, including (a) a phase transition in its smoothness at the threshold $q = d + 1$; and (b) a tradeoff between smoothness and sensitivity to the underlying unlabeled data distribution $P$. Thus, over the range $q \leq d$, the function estimate $\widehat f$ is degenerate and ``spiky,'' whereas for $q\geq d+1$, the function estimate $\widehat f$ is smooth. We show that the effect of the underlying density vanishes monotonically with $q$, such that in the limit $q = \infty$, corresponding to the so-called Absolutely Minimal Lipschitz Extension, the estimate $\widehat f$ is independent of the distribution $P$. Under the assumption of semi-supervised smoothness, ignoring $P$ can lead to poor statistical performance; in particular, we construct a specific example for $d=1$ to demonstrate that $q=2$ has lower risk than $q=\infty$ due to the former penalty adapting to $P$ and the latter ignoring it. We also provide simulations that verify the accuracy of our predictions for finite sample sizes. Together, these properties show that $q = d+1$ is an optimal choice, yielding a function estimate $\widehat f$ that is both smooth and non-degenerate, while remaining maximally sensitive to $P$.

10:50
Optimal rates for total variation denoising

ABSTRACT. Motivated by its practical success, we show that the two-dimensional total variation denoiser satisfies a sharp oracle inequality that leads to near optimal rates for a large class of image models such as isotonic, Hölder smooth and cartoons. Our analysis hinges on properties of the unnormalized Laplacian of the two-dimensional grid, such as eigenvector delocalization and spectral decay. It can be extended to several random graph models with bounded degree.

11:00
Aggregation of supports along the Lasso path

ABSTRACT. In linear regression with fixed design, we propose two procedures that aggregate a data-driven collection of supports. The collection is a subset of the $2^p$ possible supports and both its cardinality and its elements can depend on the data. The procedures satisfy oracle inequalities with no assumption on the design matrix. Then we use these procedures to aggregate the supports that appear on the regularization path of the Lasso in order to construct an estimator that mimics the best Lasso estimator. If the restricted eigenvalue condition on the design matrix is satisfied, then this estimator achieves optimal prediction bounds. Finally, we discuss the computational cost of these procedures.

11:10-11:30 Break
12:30-14:30 Lunch Break
14:30-15:40 Session 16: Supervised learning
14:30
Adaptive Learning with Robust Generalization Guarantees

ABSTRACT. The traditional notion of generalization---i.e., learning a hypothesis whose empirical error is close to its true error---is surprisingly brittle. As has recently been noted (Dwork et al., 2015c), even if several algorithms have this guarantee in isolation, the guarantee need not hold if the algorithms are composed adaptively. In this paper, we study three notions of generalization---increasing in strength---that are robust to postprocessing and amenable to adaptive composition, and examine the relationships between them. We call the weakest such notion Robust Generalization. A second, intermediate, notion is the stability guarantee known as differential privacy. The strongest guarantee we consider we call Perfect Generalization. We prove that every hypothesis class that is PAC learnable is also PAC learnable in a robustly generalizing fashion, albeit with an exponential blowup in sample complexity. We conjecture that a stronger version of this theorem also holds (and, in fact, it would, subject to a longstanding conjecture (Littlestone and Warmuth, 1986; Warmuth, 2003)). It was previously known that differentially private algorithms satisfy robust generalization. In this paper, we show that robust generalization is a strictly weaker concept, and that there is a learning task that can be carried out subject to robust generalization guarantees, yet cannot be carried out subject to differential privacy, answering an open question of Dwork et al. (2015a). We also show that perfect generalization is a strictly stronger guarantee than differential privacy, but that, nevertheless, many learning tasks can be carried out subject to the guarantees of perfect generalization.

14:40
Interactive Algorithms: from Pool to Stream

ABSTRACT. We consider interactive algorithms in the pool-based setting, and in the stream-based setting. Interactive algorithms observe suggested elements (representing actions or queries), and interactively select some of them and receive responses. Pool-based algorithms can select elements in any order, while stream-based algorithms observe elements in sequence, and can only select elements immediately after observing them. We assume that the suggested elements are generated independently from some source distribution, and ask what is the stream size required for emulating a pool algorithm with a given pool size. We provide algorithms and matching lower bounds for general pool algorithms, and for utility-based pool algorithms. We further show that a maximal gap between the two settings exists also in the special case of active learning for binary classification.

14:50
The Extended Littlestone’s Dimension for Learning with Mistakes and Abstentions

ABSTRACT. This paper studies classification with an abstention option in the online setting. In this setting, examples arrive sequentially, the learner is given a hypothesis class $\mathcal{H}$, and the goal of the learner is to either predict a label on each example or abstain, while ensuring that it does not make more than a pre-specified number of mistakes when it does predict a label.

Previous work on this problem has left open two main challenges. First, not much is known about the optimality of algorithms, and in particular, about what an optimal algorithmic strategy is for any individual hypothesis class. Second, while the realizable case has been studied, the more realistic non-realizable scenario is not well-understood. In this paper, we address both challenges. First, we provide a novel measure, called the Extended Littlestone's Dimension, which captures the number of abstentions needed to ensure a certain number of mistakes. Second, we explore the non-realizable case, and provide upper and lower bounds on the number of abstentions required by an algorithm to guarantee a specified number of mistakes.

15:00
Learning Combinatorial Functions from Pairwise Comparisons

ABSTRACT. A large body of work in machine learning has focused on the problem of learning a close approximation to an underlying combinatorial function, given a small set of labeled examples. However, for real-valued functions, cardinal labels might not be accessible, or it may be difficult for an expert to consistently assign real-valued labels over the entire set of examples. For instance, it is notoriously hard for consumers to reliably assign values to bundles of merchandise. Instead, it might be much easier for a consumer to report which of two bundles she likes better.

With this motivation in mind, we consider an alternative learning model, wherein the algorithm must learn the underlying function up to pairwise comparisons, from pairwise comparisons. In this model, we present a series of novel algorithms that learn over a wide variety of combinatorial function classes. These range from graph functions to broad classes of valuation functions that are fundamentally important in microeconomic theory, social networks, and machine learning, such as coverage, submodular, XOS, and subadditive functions, as well as functions with sparse Fourier support.

15:10
Learning Simple Auctions

ABSTRACT. This paper presents a general framework for proving sample complexity bounds on the problem of learning a revenue-maximizing simple auction from sampled multiparameter valuations. We propose a two-step technique to bound the pseudo-dimension (a real-valued analogue of the VC dimension) of a class of well-structured auctions. First, one analyzes the auctions' allocation rules and shows that they are linearly separable, or compressible, or, given a sample of size $m$, can label that sample with at most $m^\alpha$ distinct allocations. Second, fixing some output of the allocation function on the sample, if the class of revenue functions for all auctions consistent with that allocation function on the sample has pseudo-dimension $d$ on any such sample, then the entire class of auctions will have pseudo-dimension $\tilde{O}(d\alpha)$. We show this framework can be used to analyze many simple auction classes from the mechanism design literature, giving simple proofs of small sample complexity for learning revenue-optimal grand bundle pricings, item pricings, and second-price item auctions with item-specific reserve prices.

15:20
Preference-based Teaching

ABSTRACT. We introduce a new model of teaching named ``preference-based teaching'' and a corresponding complexity parameter---the preference-based teaching dimension (PBTD)---representing the worst-case number of examples needed to teach any concept in a given concept class. Although the PBTD coincides with the well-known recursive teaching dimension (RTD) on finite classes, it is radically different on infinite ones: the RTD becomes infinite already for trivial infinite classes (such as half-intervals), whereas the PBTD evaluates to reasonably small values for a wide collection of infinite classes, including classes consisting of so-called closed sets w.r.t.~a given closure operator, various classes related to linear sets over $\mathbb{N}_0$ (whose RTD has been studied quite recently), and the class of Euclidean half-spaces (and some other geometric classes). On top of presenting these concrete results, we provide the reader with a theoretical framework (of a combinatorial flavor) which helps to derive bounds on the PBTD.

15:40-16:10 Coffee Break
16:10-17:20 Session 17: Optimization
16:10
Dropping Convexity for Faster Semi-definite Optimization

ABSTRACT. In this paper, we study the minimization of a convex function $f(X)$ over the space of $n\times n$ positive semidefinite matrices $X\succeq 0$, but when the problem is recast as the non-convex problem $\min_U g(U)$ where $g(U) := f(UU^\top)$, with $U$ an $n\times r$ matrix and $r\leq n$ (the two problems are equivalent when $r=n$). We study Factored Gradient Descent (FGD), i.e., gradient descent applied directly to the factored objective $g$.

We provide a rule for selecting the step size, and with this choice show that the local convergence rate of FGD mirrors that of standard gradient descent on the original $f$ -- the error after $k$ steps is $O(1/k)$ for smooth $f$, and exponentially small in $k$ when $f$ is (restricted) strongly convex. Note that $g$ is {\em not} locally convex. In addition, we provide a procedure to initialize FGD for (restricted) strongly convex objectives and when one only has access to $f$ via a first-order oracle.

FGD and similar procedures are widely used in practice for problems that can be posed as matrix factorization; to the best of our knowledge, ours is the first paper to provide precise convergence rate guarantees for general convex functions under standard convex assumptions.
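A minimal sketch of the factored approach for the simplest instance, $f(X) = \tfrac12\|X - M\|_F^2$ (the step size and initialization below are ad hoc stand-ins for the paper's rules):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 10, 2
B = rng.normal(size=(n, r))
M = B @ B.T                               # ground-truth rank-r PSD matrix

def grad_f(X):                            # gradient of f(X) = 0.5*||X - M||_F^2
    return X - M

# FGD on g(U) = f(U U^T): grad g(U) = (grad_f(UU^T) + grad_f(UU^T)^T) U.
U = 0.5 * B + 0.1 * rng.normal(size=(n, r))   # crude initialization (ad hoc)
eta = 0.05 / np.linalg.norm(M, 2)             # conservative step size (ad hoc)
for _ in range(2000):
    G = grad_f(U @ U.T)
    U -= eta * (G + G.T) @ U
err = np.linalg.norm(U @ U.T - M)
```

Only $nr$ parameters are updated per step and the iterate $UU^\top$ is PSD by construction, which is the payoff of dropping convexity.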

16:30
Efficient approaches for escaping higher order saddle points in non-convex optimization

ABSTRACT. Local search heuristics for non-convex optimizations are popular in applied machine learning. However, in general it is hard to guarantee that such algorithms even converge to a local minimum, due to the existence of complicated saddle point structures in high dimensions. Many functions have degenerate saddle points such that the first and second order derivatives cannot distinguish them from local optima. In this paper we use higher order derivatives to escape these saddle points: we design the first efficient algorithm guaranteed to converge to a third order local optimum (while existing techniques are at most second order). We also show that it is NP-hard to extend this further to finding fourth order local optima.

16:40
Gradient Descent only Converges to Minimizers

ABSTRACT. We show that gradient descent generically does not converge to saddle points. This is proved using the Stable Manifold theorem from dynamical systems theory.
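The phenomenon is easy to see numerically. On $f(x,y) = \tfrac12 x^2 + \tfrac14 y^4 - \tfrac12 y^2$ (a toy example chosen for illustration, with a strict saddle at the origin and minimizers at $(0,\pm 1)$), gradient descent reaches the saddle only from the measure-zero stable manifold $\{y=0\}$:

```python
import numpy as np

def gd(x, y, eta=0.1, steps=500):
    """Gradient descent on f(x, y) = x^2/2 + y^4/4 - y^2/2."""
    for _ in range(steps):
        x, y = x - eta * x, y - eta * (y**3 - y)
    return x, y

rng = np.random.default_rng(0)
x1, y1 = gd(*rng.normal(size=2))   # generic start: escapes to a minimizer
x2, y2 = gd(0.5, 0.0)              # start on the stable manifold: hits the saddle
```

The Stable Manifold theorem formalizes exactly this picture: the set of initializations attracted to a strict saddle has measure zero.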

16:50
First-order Methods for Geodesically Convex Optimization

ABSTRACT. Geodesic convexity generalizes the notion of (vector space) convexity to nonlinear metric spaces. But unlike convex optimization, geodesically convex (g-convex) optimization is much less developed. In this paper we contribute to the understanding of g-convex optimization by developing iteration complexity analysis for several first-order algorithms on Hadamard manifolds. Specifically, we prove upper bounds for the global complexity of deterministic and stochastic (sub)gradient methods for optimizing smooth and nonsmooth g-convex functions, both with and without strong g-convexity. Our analysis also reveals how the manifold geometry, especially \emph{sectional curvature}, impacts convergence rates. To the best of our knowledge, our work is the first to provide global complexity analysis for first-order algorithms for general g-convex optimization.

17:00
A Light Touch for Heavily Constrained SGD

ABSTRACT. Minimizing empirical risk subject to a set of constraints can be a useful strategy for learning restricted classes of functions, such as monotonic functions, submodular functions, classifiers that guarantee a certain class label for some subset of examples, etc. However, these restrictions may result in a very large number of constraints. Projected stochastic gradient descent (SGD) is often the default choice for large-scale optimization in machine learning, but requires a projection after each update. For heavily-constrained objectives, we propose an efficient extension of SGD that stays close to the feasible region while only applying constraints probabilistically at each iteration. Theoretical analysis shows a compelling trade-off between per-iteration work and the number of iterations needed on problems with a large number of constraints.
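The flavor of the approach can be sketched as follows (a simplification with hypothetical problem data, not the paper's algorithm): take a plain gradient step on the objective, then enforce only one randomly sampled constraint per iteration instead of projecting onto all of them:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 200
b = 3.0 * rng.normal(size=d)          # minimize ||w - b||^2 ...
A = rng.normal(size=(m, d))           # ... subject to m halfspaces a_i . w <= 1
A /= np.linalg.norm(A, axis=1, keepdims=True)

w = np.zeros(d)
for t in range(1, 4001):
    w -= (1.0 / (2.0 * t)) * 2.0 * (w - b)    # gradient step, 1/(mu*t) schedule
    i = rng.integers(m)                        # sample ONE constraint...
    v = float(A[i] @ w) - 1.0
    if v > 0.0:
        w -= v * A[i]                          # ...and project onto it if violated
violation = max(0.0, float((A @ w - 1.0).max()))
```

Each iteration touches a single constraint rather than the intersection of all $m$, trading a mild increase in iteration count for far cheaper steps.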

17:10
Highly-Smooth Zero-th Order Online Optimization

ABSTRACT. The minimization of convex functions which are only available through partial and noisy information is a key methodological problem in many disciplines. In this paper we consider convex optimization with noisy zero-th order information, that is, noisy function evaluations at any desired point. We focus on problems with high degrees of smoothness, such as logistic regression. We show that as opposed to gradient-based algorithms, high-order smoothness may be used to improve estimation rates, with a precise dependence of our upper bounds on the degree of smoothness. In particular, we show that for infinitely differentiable functions, we recover the same dependence on sample size as gradient-based algorithms, with an extra dimension-dependent factor. This is done for both convex and strongly-convex functions, with finite horizon and anytime algorithms. Finally, we also recover similar results in the online optimization setting.
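For intuition, the classical two-point estimator on which zero-th order methods build (a generic sketch, not the paper's exact estimator) recovers the gradient from function evaluations alone:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Smooth test function: log(1 + exp(sum x)), a logistic-style loss."""
    return float(np.log1p(np.exp(x.sum())))

def zo_gradient(f, x, delta=1e-4, n_samples=4000):
    """Average d/(2*delta) * (f(x + delta*u) - f(x - delta*u)) * u over
    random unit directions u; its expectation is grad f(x) + O(delta^2)."""
    d = x.size
    g = np.zeros(d)
    for _ in range(n_samples):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        g += d * (f(x + delta * u) - f(x - delta * u)) / (2.0 * delta) * u
    return g / n_samples

x = np.array([0.2, -0.1, 0.3, 0.1])
est = zo_gradient(f, x)
true = np.full(4, 1.0 / (1.0 + np.exp(-x.sum())))   # analytic gradient
```

Higher-order smoothness lets one replace this crude difference with higher-order interpolation schemes, which is the lever the paper exploits for faster rates.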

17:20-17:40 Break