09:00 | Provably manipulation-resistant reputation systems (Best Student Paper Award) SPEAKER: Paul Christiano ABSTRACT. Reputation and reliability play a central role in a wide range of applications, from online marketplaces to review aggregators to ridesharing services. Many reputation systems are vulnerable to manipulation, and protected only by keeping algorithms secret, avoiding high-stakes applications, or using side information to identify malicious users. The current situation is reminiscent of pre-modern cryptography, characterized by a patchwork of ad hoc techniques with minimal formal understanding of their security. We propose a reputation system which provably achieves a very strong correctness guarantee under extremely pessimistic assumptions: it works even given a supermajority of malicious users, converges to optimal behavior after a constant number of interactions per user, does not require repeated interactions, and accommodates time-varying quality of resources. Our formal model is simple but general. In each period, a user is given an opportunity to interact with a resource, and must accept or reject the proposed interaction. If they accept, they receive a payoff in [−1, 1]. Ideally all users would behave honestly, pooling their data and quickly learning which resources are worth interacting with. Our protocol essentially matches this performance when all users are honest, while guaranteeing that adding malicious users or users with varying tastes does very little damage. We also extend our results to a more challenging setting where users interact with each other rather than with static resources, and where the two parties to an interaction may receive different payoffs. |
09:20 | On the capacity of information processing systems SPEAKER: unknown ABSTRACT. We propose and analyze a family of information processing systems, where a finite set of experts or servers are employed to extract information about a stream of incoming jobs. Each job is associated with a hidden label drawn from some prior distribution. An inspection by an expert produces a noisy outcome that depends both on the job's hidden label and the type of the expert, and occupies the expert for a finite time duration. A decision maker's task is to dynamically assign inspections so that the resulting outcomes can be used to accurately recover the labels of all jobs, while keeping the system stable. Among our chief motivations are applications in crowd-sourcing, diagnostics, and experimental design, where one wishes to efficiently discover the nature of a large number of items, using a finite pool of computational resources or human agents. We focus on the capacity of such an information processing system. Given a level of accuracy guarantee, we ask how many experts are needed in order to stabilize the system, and through what inspection architecture. Our main result provides an adaptive inspection policy that is asymptotically optimal in the following sense: the ratio between the required number of experts under our policy and the theoretical optimal converges to one, as the probability of error in label recovery tends to zero. |
09:40 | Online learning in repeated auctions SPEAKER: unknown ABSTRACT. Motivated by online advertising auctions, we consider repeated Vickrey auctions where goods of unknown value are sold sequentially and bidders only learn (potentially noisy) information about a good's value once it is purchased. We adopt an online learning approach with bandit feedback to model this problem and derive bidding strategies for two models: stochastic and adversarial. In the stochastic model, the observed values of the goods are random variables centered around the true value of the good. In this case, logarithmic regret is achievable when competing against well behaved adversaries. In the adversarial model, the goods need not be identical. Comparing our performance against that of the best fixed bid in hindsight, we show that sublinear regret is also achievable in this case. For both the stochastic and adversarial models, we prove matching minimax lower bounds showing our strategies to be optimal up to lower-order terms. To our knowledge, this is the first complete set of strategies for bidders participating in auctions of this type. |
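The stochastic model above admits a natural optimism-based strategy: bid an upper confidence bound on the good's value, so that losing rounds (which yield no feedback) become rare exactly when the value estimate is tight. The sketch below is a minimal illustration of this idea, not the paper's algorithm; the confidence-bonus form, the uniform competing bids, and the Gaussian noise are illustrative assumptions:

```python
import math
import random

def ucb_bidding(true_value, rounds=2000, seed=0):
    """Repeated Vickrey auctions with bandit feedback: a noisy value
    signal is observed only in rounds where the auction is won, so we
    bid an upper confidence bound on the estimated value."""
    rng = random.Random(seed)
    est, wins, total = 1.0, 0, 0.0          # optimistic initial estimate in [0, 1]
    for t in range(1, rounds + 1):
        bonus = math.sqrt(2 * math.log(t + 1) / max(wins, 1))
        bid = min(est + bonus, 1.0)         # optimism in the face of uncertainty
        price = rng.uniform(0, 1)           # competing bid (illustrative model)
        if bid >= price:                    # win at the second price
            obs = true_value + rng.gauss(0, 0.05)   # noisy value observation
            wins += 1
            total += obs
            est = total / wins
        # on a loss nothing is observed, so the estimate is not updated
    return est, wins
```

Because the bonus shrinks as wins accumulate, the bid converges to the true value from above, which is what keeps the regret logarithmic in the stochastic model.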
10:20 | Monte Carlo Markov Chain Algorithms for Sampling Strongly Rayleigh distributions and Determinantal Point Processes SPEAKER: unknown ABSTRACT. Strongly Rayleigh distributions are natural generalizations of product and determinantal probability distributions and satisfy the strongest form of negative dependence properties. We show that the ``natural'' Monte Carlo Markov Chain (MCMC) algorithm mixes rapidly in the support of a homogeneous strongly Rayleigh distribution. As a byproduct, our proof implies Markov chains can be used to efficiently generate approximate samples of a $k$-determinantal point process. This answers an open question raised by Deshpande and Rademacher which was studied recently by Kang, Li-Jegelka-Sra, and Rebeschini-Karbasi. |
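For a homogeneous distribution over size-$k$ sets, the natural chain is an exchange walk: swap one element in the current set for one outside it, accepting with a Metropolis ratio. A toy version for a $k$-DPP, whose stationary probability of a set $S$ is proportional to $\det(L_S)$, can be sketched as follows; the small dense kernel and the pure-Python determinant are for self-containment only, and nothing here reproduces the paper's mixing-time analysis:

```python
import random

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[p][i]) < 1e-12:
            return 0.0
        if p != i:
            a[i], a[p] = a[p], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d

def sub_det(L, S):
    S = sorted(S)
    return det([[L[i][j] for j in S] for i in S])

def kdpp_exchange(L, k, steps=300, seed=0):
    """Metropolis exchange walk over k-subsets; the symmetric proposal
    makes the stationary distribution proportional to det(L_S)."""
    rng = random.Random(seed)
    n = len(L)
    S = set(rng.sample(range(n), k))
    cur = sub_det(L, S)
    for _ in range(steps):
        i = rng.choice(sorted(S))
        j = rng.choice([x for x in range(n) if x not in S])
        T = (S - {i}) | {j}
        new = sub_det(L, T)
        if cur <= 0 or rng.random() < min(1.0, new / cur):
            S, cur = T, new
    return S
```

With a diagonal kernel the subset probabilities are just products of diagonal entries, which makes the chain's stationary behavior easy to check empirically.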
10:30 | When Can We Rank Well from Comparisons of $O(n\log n)$ Non-Actively Chosen Pairs? SPEAKER: unknown ABSTRACT. Ranking from pairwise comparisons is a ubiquitous problem and has been studied in disciplines ranging from statistics to operations research and from theoretical computer science to machine learning. Here we consider a general setting where outcomes of pairwise comparisons between items $i$ and $j$ are drawn probabilistically by flipping a coin with unknown bias $P_{ij}$, and ask under what conditions on these unknown probabilities one can learn a good ranking from comparisons of only $O(n\log n)$ non-actively chosen pairs. Recent work has established that this is possible under the Bradley-Terry-Luce (BTL) and noisy permutation (NP) models. Here we introduce a broad family of conditions on the probabilities $P_{ij}$ under which the resulting preference matrix $P$ has low rank under some link function, and show that these encompass the BTL and Thurstone classes as special cases (but are considerably more general). We then give a new algorithm called low-rank pairwise ranking (LRPR) which provably learns a good ranking from only $O(n\log n)$ randomly chosen comparisons under such low-rank models. Our algorithm and corresponding analysis make use of tools from the theory of low-rank matrix approximation, and provide a new perspective on the problem of ranking from pairwise comparisons in non-active settings. |
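The non-active sampling regime is easy to simulate: draw random pairs, flip the $P_{ij}$ coin for each, and score items by empirical win rate. The counting baseline below is a deliberately simple stand-in, not the LRPR algorithm (it ignores the low-rank structure entirely); the BTL instance in the usage is an illustrative assumption:

```python
import math
import random

def rank_from_sparse_pairs(n, P, m=None, seed=0):
    """Rank n items from m random (non-actively chosen) pairwise comparisons.

    P[i][j] is the probability that item i beats item j. Items are scored
    by empirical win rate: a counting baseline, not the LRPR algorithm."""
    rng = random.Random(seed)
    if m is None:
        m = int(10 * n * math.log(n))    # O(n log n) sampled pairs
    wins = [0] * n
    plays = [0] * n
    for _ in range(m):
        i, j = rng.sample(range(n), 2)   # a non-actively chosen pair
        winner = i if rng.random() < P[i][j] else j
        wins[winner] += 1
        plays[i] += 1
        plays[j] += 1
    return sorted(range(n), key=lambda v: wins[v] / max(plays[v], 1), reverse=True)
```

For a BTL instance with weights $w = (8, 4, 2, 1)$, so $P_{ij} = w_i/(w_i + w_j)$, the win-rate gaps are large enough that even this naive score recovers the true order from a few thousand comparisons.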
10:40 | Asymptotic behavior of $\ell_q$-based Laplacian regularization in semi-supervised learning SPEAKER: unknown ABSTRACT. Given a weighted graph with $n$ vertices, consider a real-valued regression problem in a semi-supervised setting, where one observes $n_\ell$ labeled vertices, and the task is to label the remaining ones. We present a theoretical study of $\ell_q$-based Laplacian regularization under a $d$-dimensional geometric random graph model. We provide a variational characterization of the performance of this regularized learner as $n$ grows to infinity while $n_\ell$ stays constant; the associated optimality conditions lead to a partial differential equation that must be satisfied by the associated function estimate $\widehat f$. From this formulation we derive several predictions on the limiting behavior of the $d$-dimensional function $\widehat f$, including (a) a phase transition in its smoothness at the threshold $q = d + 1$; and (b) a tradeoff between smoothness and sensitivity to the underlying unlabeled data distribution $P$. Thus, over the range $q \leq d$, the function estimate $\widehat f$ is degenerate and ``spiky,'' whereas for $q\geq d+1$, the function estimate $\widehat f$ is smooth. We show that the effect of the underlying density vanishes monotonically with $q$, such that in the limit $q = \infty$, corresponding to the so-called Absolutely Minimal Lipschitz Extension, the estimate $\widehat f$ is independent of the distribution $P$. Under the assumption of semi-supervised smoothness, ignoring $P$ can lead to poor statistical performance; in particular, we construct a specific example for $d=1$ to demonstrate that $q=2$ has lower risk than $q=\infty$ due to the former penalty adapting to $P$ and the latter ignoring it. We also provide simulations that verify the accuracy of our predictions for finite sample sizes.
Together, these properties show that $q = d+1$ is an optimal choice, yielding a function estimate $\widehat f$ that is both smooth and non-degenerate, while remaining maximally sensitive to $P$. |
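For concreteness, the classical $q = 2$ case already has a simple algorithmic form: the minimizer of $\sum_{(u,v) \in E} (f_u - f_v)^2$ subject to the observed labels is harmonic at every unlabeled vertex, so it can be computed by repeatedly replacing each unlabeled value with the average of its neighbors. A sketch on an unweighted graph (the dict-of-adjacency-lists input format is an assumption of this sketch):

```python
def harmonic_extension(adj, labels, iters=500):
    """q = 2 Laplacian regularization on an unweighted graph: repeated
    neighborhood averaging (Gauss-Seidel sweeps) converges to the unique
    harmonic extension of the observed labels."""
    f = {v: labels.get(v, 0.0) for v in adj}
    for _ in range(iters):
        for v in adj:
            if v not in labels:          # labeled vertices stay clamped
                f[v] = sum(f[u] for u in adj[v]) / len(adj[v])
    return f
```

On a path graph with the endpoints labeled 0 and 1, the harmonic extension is exactly the linear interpolation between the labels, which gives an easy correctness check.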
10:50 | Optimal rates for total variation denoising SPEAKER: unknown ABSTRACT. Motivated by its practical success, we show that the two-dimensional total variation denoiser satisfies a sharp oracle inequality that leads to near optimal rates for a large class of image models such as isotonic, Hölder smooth and cartoons. Our analysis hinges on properties of the unnormalized Laplacian of the two-dimensional grid, such as eigenvector delocalization and spectral decay. It can be extended to several random graph models with bounded degree. |
11:00 | Aggregation of supports along the Lasso path SPEAKER: Pierre C. Bellec ABSTRACT. In linear regression with fixed design, we propose two procedures that aggregate a data-driven collection of supports. The collection is a subset of the $2^p$ possible supports and both its cardinality and its elements can depend on the data. The procedures satisfy oracle inequalities with no assumption on the design matrix. Then we use these procedures to aggregate the supports that appear on the regularization path of the Lasso in order to construct an estimator that mimics the best Lasso estimator. If the restricted eigenvalue condition on the design matrix is satisfied, then this estimator achieves optimal prediction bounds. Finally, we discuss the computational cost of these procedures. |
16:10 | Dropping Convexity for Faster Semi-definite Optimization SPEAKER: Srinadh Bhojanapalli ABSTRACT. In this paper, we study the minimization of a convex function $f(X)$ over the space of $n\times n$ positive semidefinite matrices $X\succeq 0$, but when the problem is recast as the non-convex problem $\min_U g(U)$ where $g(U) := f(UU^\top)$, with $U$ being an $n\times r$ matrix and $r\leq n$\footnote{The two problems are equivalent when $r=n$; taking $r<n$ restricts the search to rank-$r$ solutions.}. We study the performance of gradient descent on $g$, which we refer to as Factored Gradient Descent (FGD). We provide a rule for selecting the step size, and with this choice show that the local convergence rate of FGD mirrors that of standard gradient descent on the original $f$ -- the error after $k$ steps is $O(1/k)$ for smooth $f$, and exponentially small in $k$ when $f$ is (restricted) strongly convex. Note that $g$ is {\em not} locally convex. In addition, we provide a procedure to initialize FGD for (restricted) strongly convex objectives and when one only has access to $f$ via a first-order oracle.
FGD and similar procedures are widely used in practice for problems that can be posed as matrix factorization; to the best of our knowledge, ours is the first paper to provide precise convergence rate guarantees for general convex functions under standard convex assumptions. |
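A minimal FGD-style sketch for the simplest instance, $f(X) = \|X - M\|_F^2$ with a PSD target $M$: gradient descent runs directly on the factor $U$, using $\nabla_U g(U) = 4(UU^\top - M)U$. The fixed step size here replaces the paper's step-size rule, and the tiny pure-Python matrix helpers are only for self-containment:

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def fgd(M, r, step=0.01, iters=3000, seed=0):
    """Factored gradient descent on g(U) = f(U U^T) for f(X) = ||X - M||_F^2.
    For this f (with M symmetric), grad_U g(U) = 4 (U U^T - M) U."""
    rng = random.Random(seed)
    n = len(M)
    U = [[rng.gauss(0, 0.1) for _ in range(r)] for _ in range(n)]  # random init
    for _ in range(iters):
        X = matmul(U, transpose(U))
        R = [[X[i][j] - M[i][j] for j in range(n)] for i in range(n)]  # residual
        G = matmul(R, U)                                              # R U
        U = [[U[i][k] - 4 * step * G[i][k] for k in range(r)] for i in range(n)]
    return U
```

For a rank-1 target the iterates escape the saddle at $U = 0$ from a generic random initialization and converge to a factor with $UU^\top \approx M$, even though $g$ is non-convex.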
16:30 | Efficient approaches for escaping higher order saddle points in non-convex optimization SPEAKER: unknown ABSTRACT. Local search heuristics for non-convex optimization are popular in applied machine learning. However, in general it is hard to guarantee that such algorithms even converge to a local minimum, due to the existence of complicated saddle point structures in high dimensions. Many functions have degenerate saddle points such that the first and second order derivatives cannot distinguish them from local optima. In this paper we use higher order derivatives to escape these saddle points: we design the first efficient algorithm guaranteed to converge to a third order local optimum (while existing techniques are at most second order). We also show that it is NP-hard to extend this further to finding fourth order local optima. |
16:40 | Gradient Descent only Converges to Minimizers SPEAKER: unknown ABSTRACT. We show that gradient descent generically does not converge to saddle points. This is proved using the Stable Manifold theorem from dynamical systems theory. |
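The phenomenon is easy to see numerically. For $f(x, y) = (x^2 - 1)^2 + y^2$ the origin is a saddle point and $(\pm 1, 0)$ are the minima; only initializations on the measure-zero stable manifold $x = 0$ converge to the saddle, while any generic initialization ends at a minimizer. The example function is ours for illustration, not from the paper:

```python
def gradient_descent(grad, x0, step=0.05, iters=1000):
    """Plain gradient descent with a fixed step size."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

def grad_f(p):
    # f(x, y) = (x^2 - 1)^2 + y^2: saddle at (0, 0), minima at (+-1, 0)
    return [4 * p[0] * (p[0] ** 2 - 1), 2 * p[1]]
```

Starting at $(10^{-3}, 0.5)$, the tiny $x$-component is amplified away from the saddle and the iterates converge to $(1, 0)$; starting exactly on the line $x = 0$, they converge to the saddle itself, illustrating why the exceptional set has measure zero.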
16:50 | First-order Methods for Geodesically Convex Optimization SPEAKER: unknown ABSTRACT. Geodesic convexity generalizes the notion of (vector space) convexity to nonlinear metric spaces. But unlike convex optimization, geodesically convex (g-convex) optimization is much less developed. In this paper we contribute to the understanding of g-convex optimization by developing iteration complexity analysis for several first-order algorithms on Hadamard manifolds. Specifically, we prove upper bounds for the global complexity of deterministic and stochastic (sub)gradient methods for optimizing smooth and nonsmooth g-convex functions, both with and without strong g-convexity. Our analysis also reveals how the manifold geometry, especially \emph{sectional curvature}, impacts convergence rates. To the best of our knowledge, our work is the first to provide global complexity analysis for first-order algorithms for general g-convex optimization. |
17:00 | A Light Touch for Heavily Constrained SGD SPEAKER: unknown ABSTRACT. Minimizing empirical risk subject to a set of constraints can be a useful strategy for learning restricted classes of functions, such as monotonic functions, submodular functions, classifiers that guarantee a certain class label for some subset of examples, etc. However, these restrictions may result in a very large number of constraints. Projected stochastic gradient descent (SGD) is often the default choice for large-scale optimization in machine learning, but requires a projection after each update. For heavily-constrained objectives, we propose an efficient extension of SGD that stays close to the feasible region while only applying constraints probabilistically at each iteration. Theoretical analysis shows a compelling trade-off between per-iteration work and the number of iterations needed on problems with a large number of constraints. |
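One toy version of the idea (a simplification, not the paper's algorithm): take an unconstrained SGD step, then sample a single random constraint and project onto it only if it is violated, so per-iteration work is independent of the total number of constraints. The quadratic objective and halfspace constraints below are illustrative assumptions:

```python
import random

def lightly_constrained_sgd(c, constraints, steps=3000, seed=0):
    """Minimize f(x) = ||x - c||^2 subject to many halfspaces a.x <= b.
    Rather than projecting onto the full feasible set each iteration,
    enforce one randomly sampled constraint, and only when violated."""
    rng = random.Random(seed)
    x = [0.0] * len(c)
    for t in range(1, steps + 1):
        lr = 1.0 / t
        x = [xi - lr * 2 * (xi - ci) for xi, ci in zip(x, c)]   # SGD step
        a, b = constraints[rng.randrange(len(constraints))]     # sample one
        viol = sum(ai * xi for ai, xi in zip(a, x)) - b
        if viol > 0:                     # project onto the single halfspace
            nrm2 = sum(ai * ai for ai in a)
            x = [xi - viol * ai / nrm2 for ai, xi in zip(a, x)]
    return x
```

Because each violated-constraint projection moves the iterate back toward the feasible region while the step size decays, the iterate hovers near the constrained optimum without ever paying for a full projection.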
17:10 | Highly-Smooth Zero-th Order Online Optimization SPEAKER: unknown ABSTRACT. The minimization of convex functions which are only available through partial and noisy information is a key methodological problem in many disciplines. In this paper we consider convex optimization with noisy zero-th order information, that is, noisy function evaluations at any desired point. We focus on problems with high degrees of smoothness, such as logistic regression. We show that as opposed to gradient-based algorithms, high-order smoothness may be used to improve estimation rates, with a precise dependence of our upper bounds on the degree of smoothness. In particular, we show that for infinitely differentiable functions, we recover the same dependence on sample size as gradient-based algorithms, with an extra dimension-dependent factor. This is done for both convex and strongly-convex functions, with finite horizon and anytime algorithms. Finally, we also recover similar results in the online optimization setting. |
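The basic zero-th order primitive is a two-point estimator: query $f$ at $x \pm \delta u$ for a random direction $u$ and form $g = d \, \frac{f(x + \delta u) - f(x - \delta u)}{2\delta} \, u$, a gradient estimate whose bias shrinks with the smoothness of $f$. The sketch below uses noiseless evaluations for determinism; the paper's setting adds noise and exploits higher-order smoothness, which this simple central difference does not:

```python
import math
import random

def zeroth_order_gd(f, x0, steps=4000, delta=1e-3, lr=0.05, seed=0):
    """Gradient-free descent from two-point random-direction estimates:
    g = d * (f(x + delta*u) - f(x - delta*u)) / (2*delta) * u,
    with u drawn uniformly from the unit sphere."""
    rng = random.Random(seed)
    d = len(x0)
    x = list(x0)
    for _ in range(steps):
        u = [rng.gauss(0, 1) for _ in range(d)]          # random direction
        nrm = math.sqrt(sum(ui * ui for ui in u))
        u = [ui / nrm for ui in u]
        fd = (f([xi + delta * ui for xi, ui in zip(x, u)])
              - f([xi - delta * ui for xi, ui in zip(x, u)])) / (2 * delta)
        x = [xi - lr * d * fd * ui for xi, ui in zip(x, u)]
    return x
```

On a smooth quadratic the central difference equals the exact directional derivative, so the iterates contract toward the minimizer in expectation despite never seeing a gradient.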