Registration
09:00 | An Optimal Learning Algorithm for Online Unconstrained Submodular Maximization ABSTRACT. We consider a basic problem at the interface of two fundamental fields: submodular optimization and online learning. In the online unconstrained submodular maximization (online USM) problem, there is a universe $[n]=\{1,2,\ldots,n\}$ and a sequence of $T$ nonnegative (not necessarily monotone) submodular functions arrive over time. The goal is to design a computationally efficient online algorithm, which chooses a subset of $[n]$ at each time step as a function only of the past, such that the accumulated value of the chosen subsets is as close as possible to the maximum total value of a fixed subset in hindsight. Our main result is a polynomial-time no-$\tfrac 12$-regret algorithm for this problem, meaning that for every sequence of nonnegative submodular functions, the algorithm's expected total value is at least $\tfrac{1}{2}$ times that of the best subset in hindsight, up to an error term sublinear in $T$. The factor of $\tfrac 12$ cannot be improved upon by any polynomial-time online algorithm, unless $NP = RP$. Prior to our work, the best result known was that picking a subset uniformly at random in every time step achieves no $\tfrac 14$-regret. A byproduct of our techniques is an explicit subroutine for the two-experts problem that has an unusually strong regret guarantee: the total value of its choices is comparable to twice the total value of either expert on rounds it did not pick that expert. This subroutine may be of independent interest. |
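Note: as a point of reference, the following is a minimal sketch of the uniform-random baseline mentioned at the end of the abstract (each element of $[n]$ included independently with probability 1/2 at every step). The functions passed in are placeholders, and this is not the paper's no-1/2-regret algorithm.

    import random

    def uniform_random_subset(n):
        # Include each element of {0, ..., n-1} independently with probability 1/2.
        return {i for i in range(n) if random.random() < 0.5}

    def run_random_baseline(n, submodular_fns):
        # submodular_fns: callables mapping a subset of range(n) to a nonnegative
        # value; stand-ins for the adversary's functions f_1, ..., f_T.
        total = 0.0
        for f in submodular_fns:
            S = uniform_random_subset(n)  # chosen independently of the past
            total += f(S)
        return total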
09:10 | Black-Box Reductions for Parameter-free Online Learning in Banach Spaces ABSTRACT. We introduce several new black-box reductions that significantly improve the design of adaptive and parameter-free online learning algorithms by simplifying analysis, improving regret guarantees, and sometimes even improving runtime. We reduce parameter-free online learning to online exp-concave optimization, we reduce optimization in a Banach space to one-dimensional optimization, and we reduce optimization over a constrained domain to unconstrained optimization. All of our reductions run as fast as online gradient descent. We use our new techniques to improve upon the previously best regret bounds for parameter-free learning, and do so for arbitrary norms. |
09:20 | The Many Faces of Exponential Weights in Online Learning ABSTRACT. A standard introduction to online learning might place Online Gradient Descent at its center and then proceed to develop generalizations and extensions like Online Mirror Descent and second-order methods. Here we explore the alternative approach of putting exponential weights (EW) first. We show that many standard methods and their regret bounds then follow as a special case by plugging in suitable surrogate losses and playing the EW posterior mean. For instance, we easily recover Online Gradient Descent by using EW with a Gaussian prior on linearized losses, and, more generally, all instances of Online Mirror Descent based on regular Bregman divergences also correspond to EW with a prior that depends on the mirror map. Furthermore, appropriate quadratic surrogate losses naturally give rise to Online Gradient Descent for strongly convex losses and to Online Newton Step. We further interpret several recent adaptive methods (iProd, Squint, and a variation of Coin Betting for experts) as a series of closely related reductions to exp-concave surrogate losses that are then handled by Exponential Weights. Finally, a benefit of our EW interpretation is that it opens up the possibility of sampling from the EW posterior distribution instead of playing the mean. As already observed by Bubeck and Eldan (2015), this recovers the best-known rate in Online Bandit Linear Optimization. |
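Note: to illustrate the first example in the abstract, the standard computation below (with a Gaussian prior $\mathcal{N}(w_1, \sigma^2 I)$, linearized losses $\langle g_t, w\rangle$, and learning rate $\eta$) shows that the EW posterior mean is a gradient-descent iterate. After round $t$ the EW posterior is
\[
\pi_{t+1}(w) \;\propto\; \exp\Big(-\eta \sum_{s \le t} \langle g_s, w\rangle\Big)\,\exp\Big(-\tfrac{1}{2\sigma^2}\|w - w_1\|^2\Big),
\]
which is again Gaussian, with mean
\[
w_{t+1} \;=\; w_1 - \eta\sigma^2 \sum_{s \le t} g_s \;=\; w_t - \eta\sigma^2 g_t,
\]
i.e., (lazy) online gradient descent with step size $\eta\sigma^2$.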
09:30 | Smoothed Online Convex Optimization in High Dimensions via Online Balanced Descent ABSTRACT. We study smoothed online convex optimization, a version of online convex optimization where the learner incurs a penalty for changing her actions between rounds. Given an $\Omega(\sqrt{d})$ lower bound on the competitive ratio of any online algorithm, where $d$ is the dimension of the action space, we ask under what conditions this bound can be beaten. We introduce a novel algorithmic framework for this problem, Online Balanced Descent (OBD), which works by iteratively projecting the previous point onto a carefully chosen level set of the current cost function so as to balance the switching costs and hitting costs. We demonstrate the generality of the OBD framework by showing how, with different choices of “balance,” OBD can improve upon state-of-the-art performance guarantees for both competitive ratio and regret; in particular, OBD is the first algorithm to achieve a dimension-free competitive ratio, $3 + O(1/\alpha)$, for locally polyhedral costs, where $\alpha$ measures the “steepness” of the costs. We also prove bounds on the dynamic regret of OBD when the balance is performed in the dual space that are dimension-free and imply that OBD has sublinear static regret. |
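Note: a hedged sketch of the level-set projection step described in the abstract, using a generic constrained solver and a simple bisection over the level. The balance rule, tolerances, and solver choice here are illustrative, not the exact OBD rule from the paper.

    import numpy as np
    from scipy.optimize import minimize

    def project_onto_level_set(x_prev, f, level):
        # Euclidean projection of x_prev onto the sublevel set {x : f(x) <= level},
        # computed with a generic constrained solver (illustrative only).
        cons = {"type": "ineq", "fun": lambda x: level - f(x)}
        res = minimize(lambda x: np.sum((x - x_prev) ** 2),
                       x0=np.asarray(x_prev, dtype=float),
                       constraints=[cons], method="SLSQP")
        return res.x

    def obd_step(x_prev, f, beta, tol=1e-6, max_iter=50):
        # One balanced-descent-style step: bisect over the level l of f so that the
        # movement cost ||x - x_prev|| is roughly beta times the hitting cost l,
        # then play the projection onto that level set.  f is assumed convex with
        # nonnegative minimum value.
        x_prev = np.asarray(x_prev, dtype=float)
        lo, hi = 0.0, float(f(x_prev))
        x = x_prev
        for _ in range(max_iter):
            level = 0.5 * (lo + hi)
            x = project_onto_level_set(x_prev, f, level)
            if np.linalg.norm(x - x_prev) > beta * level:
                lo = level   # too much movement: accept a higher hitting cost
            else:
                hi = level   # movement is cheap: tighten the level set
            if hi - lo < tol:
                break
        return x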
09:40 | Online learning over a finite action set with limited switching SPEAKER: Jason Altschuler ABSTRACT. We study the value of switching actions in the Prediction From Experts (PFE) problem and Adversarial Multi-Armed Bandits (MAB) problem. First, we revisit the well-studied setting of PFE with switching costs. Many algorithms are known to achieve the optimal minimax order of $O(\sqrt{T \log n})$ in expectation for both regret and number of switches, where $T$ is the number of iterations and $n$ the number of actions. However, no high probability guarantees are known. Our main technical contribution is an algorithm whose regret and number of switches are both of this optimal order with high probability. This settles an open problem of Devroye, Lugosi and Neu (2015), and also directly implies the first high probability guarantees for several problems of interest. Next, to investigate the value of switching actions at a more granular level, we introduce the setting of switching budgets, in which the algorithm is limited to $S \leq T$ switches between actions. This entails a limited number of free switches, in contrast to the unlimited number of expensive switches allowed in the switching cost setting. Using the above result and several reductions, we unify previous work and completely characterize the complexity of this switching budget setting up to small polylogarithmic factors: for both the PFE and MAB problems, for all switching budgets $S \in [1, T]$, and for both expectation and high probability guarantees. For PFE, we show that the optimal rate is of order $\tilde{\Theta}(\sqrt{T\log n})$ for $S = \Omega(\sqrt{T\log n})$, and $\min(\tilde{\Theta}(\tfrac{T\log n}{S}), T)$ for $S = O(\sqrt{T \log n})$. Interestingly, the bandit setting does not have such a phase transition; instead we show the minimax rate decays steadily as $\min(\tilde{\Theta}(\tfrac{T\sqrt{n}}{\sqrt{S}}), T)$ for all ranges of $S \in [1,T]$. These results recover and generalize the known minimax rates for the (arbitrary) switching cost setting. |
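Note: for intuition about the switching-budget setting, here is a generic blocking reduction that respects a budget of at most S switches by running Hedge on block-aggregated losses. This is only a simple baseline sketch, not the (near-optimal) algorithm analyzed in the paper.

    import numpy as np

    def blocked_hedge(losses, switch_budget, eta, seed=0):
        # losses: (T, n) array of per-round losses for n actions (placeholder data).
        # Rounds are grouped into at most switch_budget + 1 blocks; a single action
        # is played per block, so at most switch_budget switches occur in total.
        rng = np.random.default_rng(seed)
        T, n = losses.shape
        block_len = int(np.ceil(T / (switch_budget + 1)))
        log_weights = np.zeros(n)
        total_loss = 0.0
        for start in range(0, T, block_len):
            probs = np.exp(log_weights - log_weights.max())
            probs /= probs.sum()
            action = rng.choice(n, p=probs)            # one action for the whole block
            block = losses[start:start + block_len]
            total_loss += block[:, action].sum()
            log_weights -= eta * block.sum(axis=0)     # Hedge update on block totals
        return total_loss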
09:50 | Online Learning: Sufficient Statistics and the Burkholder Method ABSTRACT. We uncover a fairly general principle: If regret can be (approximately) expressed as a function of certain "sufficient statistics" for the data sequence, then there exists a special Burkholder function that only depends on these sufficient statistics, not the entire data sequence. This function can be used algorithmically to achieve the regret bound, and the online strategy is only required to keep these sufficient statistics in memory. The characterization is achieved by bringing the full power of the Burkholder Method --- originally developed for certifying probabilistic martingale inequalities --- to bear on the online learning setting. To demonstrate the power of the proposed method, we develop a novel online strategy for matrix prediction that attains a regret bound corresponding to the variance term in matrix concentration inequalities. We present a linear-time/space prediction strategy for parameter-free supervised learning with linear classes and general smooth norms. |
10:30 | Learning Single Index Models in Gaussian Space SPEAKER: Rishabh Dudeja ABSTRACT. We consider regression problems where the response is a smooth but non-linear function of a $k$-dimensional projection of $p$ normally-distributed covariates, contaminated with additive Gaussian noise. The goal is to recover the range of the $k$-dimensional projection, i.e., the index space. This model is called the multi-index model, and the $k=1$ case is called the single-index model. For the single-index model, we characterize the population landscape of a natural semi-parametric maximum likelihood objective in terms of the link function and prove that it has no spurious local minima. We also propose and analyze an efficient iterative procedure that recovers the index space up to error $\epsilon$ using a sample size $\tilde{O}(p^{O(R^2/\mu)} + p/\epsilon^2)$, where $R$ and $\mu$, respectively, parameterize the smoothness of the link function and the signal strength. When a multi-index model is incorrectly specified as a single-index model, we prove that essentially the same procedure, with sample size $\tilde{O}(p^{O(kR^2/\mu)} + p/\epsilon^2)$, returns a vector that is $\epsilon$-close to being completely in the index space. |
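Note: in symbols, the single-index ($k=1$) case described above can be written (with hypothetical notation for the direction $u$ and link $f$) as
\[
y_i \;=\; f(\langle u, x_i\rangle) + \epsilon_i, \qquad x_i \sim \mathcal{N}(0, I_p), \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2),
\]
with both the smooth link $f$ and the direction $u \in \mathbb{R}^p$ unknown; the goal is to recover $\mathrm{span}(u)$ up to error $\epsilon$.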
10:40 | L1 Regression using Lewis Weights Preconditioning and Stochastic Gradient Descent ABSTRACT. We present preconditioned stochastic gradient descent (SGD) algorithms for the $\ell_1$ minimization problem $\min_{x}\|A x - b\|_1$ in the overdetermined case, where there are far more constraints than variables. Specifically, we have $A \in \mathbb{R}^{n \times d}$ for $n \gg d$. Commonly known as the Least Absolute Deviations problem, $\ell_1$ regression can be used to solve many important combinatorial problems, such as minimum cut and shortest path. SGD-based algorithms are appealing for their simplicity and practical efficiency. Our algorithms precondition the matrix $A$ and then solve the problem for the resulting matrix $\tilde{A}$ using gradient descent techniques. Our primary insight is that careful preprocessing can yield preconditioned matrices $\tilde{A}$ with strong properties (besides good condition number and low dimension) that allow for faster convergence of gradient descent. In particular, we precondition using Lewis weights to obtain an isotropic matrix with fewer rows and strong upper bounds on all row norms. We leverage these conditions to find a good initialization, which we use along with recent smoothing reductions and accelerated stochastic gradient descent algorithms to achieve $\epsilon$ relative error in $\tilde{O}(nnz(A) + d^{2.5} \epsilon^{-2})$ time with high probability, where $nnz(A)$ is the number of non-zeros in $A$. This improves over the previous best result using gradient descent for $\ell_1$ regression. We also match the best known running times for interior point methods in several settings. Finally, we also show that if our original matrix $A$ is approximately isotropic and the row norms are approximately equal, we can give an algorithm that avoids using fast matrix multiplication and obtains a running time of $\tilde{O}(nnz(A) + s d^{1.5}\epsilon^{-2} + d^2\epsilon^{-2})$, where $s$ is the maximum number of non-zeros in a row of $A$. In this setting, we beat the best interior point methods for certain parameter regimes. |
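Note: for context, a bare-bones stochastic subgradient method for $\min_{x}\|Ax - b\|_1$ is sketched below; the paper's algorithm additionally preconditions $A$ via Lewis weights and applies smoothing and acceleration, none of which appears in this baseline.

    import numpy as np

    def sgd_l1_regression(A, b, n_iters=100000, step=1e-3, seed=0):
        # Plain stochastic subgradient descent for the least-absolute-deviations
        # objective: at each step sample a row i and move along
        # -sign(a_i^T x - b_i) * a_i.  Unpreconditioned baseline only.
        rng = np.random.default_rng(seed)
        n, d = A.shape
        x = np.zeros(d)
        x_avg = np.zeros(d)
        for t in range(1, n_iters + 1):
            i = rng.integers(n)
            g = np.sign(A[i] @ x - b[i]) * A[i]   # subgradient of |a_i^T x - b_i|
            x -= step * g
            x_avg += (x - x_avg) / t              # running average of the iterates
        return x_avg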
10:50 | Optimal Errors and Phase Transitions in High-Dimensional Generalized Linear Models ABSTRACT. Generalized linear models (GLMs) arise in high-dimensional machine learning, statistics, communications and signal processing. In this paper we analyze GLMs when the data matrix is random, as relevant in problems such as compressed sensing, error-correcting codes or benchmark models of neural networks. We evaluate the mutual information (or “free entropy”) from which we deduce the Bayes-optimal inference and generalization errors. Our analysis applies to the high-dimensional limit where both the number of samples and dimensions are large and their ratio is fixed. Non-rigorous predictions for the optimal inference and generalization errors existed for special cases of GLMs, e.g. for the perceptron in the field of statistical physics, based on the so-called replica method. Our present paper rigorously establishes those decades-old conjectures and brings forward their algorithmic interpretation in terms of the performance of the generalized approximate message-passing algorithm. Furthermore, we tightly characterize, for many learning problems, the regions of parameters for which this algorithm achieves optimal performance, and locate the associated sharp phase transitions separating learnable and non-learnable regions. |
11:00 | Marginal Singularity, and the Benefits of Labels in Covariate-Shift SPEAKER: Guillaume Martinet ABSTRACT. We present new minimax results that concisely capture the relative benefits of source and target labeled data under covariate-shift. Namely, we show that the benefits of target labels are controlled by a transfer-exponent $\gamma$ that encodes how singular the target marginal $Q$ is locally w.r.t. the source marginal $P$, and interestingly allows situations where transfer did not seem possible under previous insights. In fact, our new minimax analysis -- in terms of $\gamma$ -- reveals a continuum of regimes, ranging from situations where target labels have little benefit to regimes where target labels dramatically improve classification. We then show that a recently proposed semi-supervised procedure can be extended to adapt to unknown $\gamma$, and therefore requests target labels only when beneficial, while achieving minimax transfer rates. |
11:10 | Learning Mixtures of Linear Regressions with Nearly Optimal Complexity ABSTRACT. Mixtures of Linear Regressions (MLR) is an important mixture model with many applications. In this model, each observation is generated from one of several unknown linear regression components, where the identity of the generating component is also unknown. Previous works either make strong assumptions on the data distribution or have high complexity. This paper proposes a fixed-parameter tractable algorithm for the problem under general conditions, which achieves global convergence and whose sample complexity scales nearly linearly in the dimension. In particular, unlike previous works that require the data to be drawn from the standard Gaussian, the algorithm allows the data to come from Gaussians with different covariances. When the condition number of the covariances and the number of components are fixed, the algorithm has nearly optimal sample complexity $N = \tilde{O}(d)$ as well as nearly optimal computational complexity $\tilde{O}(Nd)$, where $d$ is the dimension of the data space. To the best of our knowledge, this approach provides the first such recovery guarantee for this general setting. |
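Note: one common way to formalize the model described above (notation ours, and the exact noise assumptions vary across variants) is
\[
y_i \;=\; \langle w_{z_i}, x_i\rangle + \epsilon_i, \qquad z_i \in \{1,\dots,k\}\ \text{unobserved}, \quad x_i \sim \mathcal{N}(0, \Sigma_{z_i}),
\]
where the regression vectors $w_1,\dots,w_k$, the component labels $z_i$, and the covariances $\Sigma_1,\dots,\Sigma_k$ are all unknown.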
11:20 | Efficient Algorithms for Outlier-Robust Regression ABSTRACT. We give the first polynomial-time algorithm for performing linear or polynomial regression resilient to adversarial corruptions in both examples and labels. Given a sufficiently large (polynomial-size) training set drawn i.i.d. from a distribution $D$ and subsequently corrupted on some fraction of points, our algorithm outputs a linear function whose squared error is close to the squared error of the best-fitting linear function with respect to $D$, assuming that the marginal distribution of $D$ over the input space is _certifiably_ hypercontractive. This natural property is satisfied by many well-studied distributions, such as Gaussians, affine transformations of isotropic strongly log-concave distributions, and the uniform distribution on the hypercube, among others. We also give a simple statistical lower bound showing that some distributional assumption is necessary to succeed in this setting. These results are the first of their kind and were not known to be even information-theoretically possible prior to our work. Our approach is based on the sum-of-squares (SoS) method and is inspired by the recent applications of the method to parameter recovery problems in unsupervised learning. Our algorithm can be seen as a natural convex relaxation of the following conceptually simple non-convex optimization problem: find a linear function and a large subset of the corrupted sample such that the least squares loss of the function over the subset is minimized over all possible large subsets. |
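Note: the non-convex problem mentioned in the last sentence can be stated explicitly as follows, where $(x_i, y_i)_{i=1}^n$ is the corrupted sample and $\alpha$ is the required fraction of retained points (notation ours):
\[
\min_{w,\; S \subseteq [n]:\, |S| \ge \alpha n} \;\; \frac{1}{|S|} \sum_{i \in S} \big(\langle w, x_i\rangle - y_i\big)^2 .
\]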
11:30 | Restricted Eigenvalue from Stable Rank with Applications to Sparse Linear Regression ABSTRACT. High-dimensional settings, where the data dimension ($d$) far exceeds the number of observations ($n$), are common in many statistical and machine learning applications. Methods based on $\ell_1$-relaxation, such as the Lasso, are very popular for sparse recovery in these settings. The Restricted Eigenvalue (RE) condition is among the weakest, and hence the most general, conditions in the literature imposed on the Gram matrix that guarantee nice statistical properties for the Lasso estimator. It is natural to ask: what families of matrices satisfy the RE condition? Following a line of work in this area, we construct a new broad ensemble of dependent random design matrices that have an explicit RE bound. Our construction starts with a fixed (deterministic) matrix $X \in \mathbb{R}^{n \times d}$ satisfying a simple stable rank condition, and we show that a matrix drawn from the distribution $X \Phi^\top \Phi$, where $\Phi \in \mathbb{R}^{m \times d}$ is a subgaussian random matrix, satisfies the RE condition with high probability. This construction allows incorporating a fixed matrix that has an easily verifiable condition into the design process, and allows for the generation of compressed design matrices that have a lower storage requirement than a standard design matrix. We give two applications of this construction to sparse linear regression problems, including one to a compressed sparse regression setting where the regression algorithm only has access to a compressed representation of a fixed design matrix $X$. |
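Note: for reference, one standard formulation of the RE condition (the exact constants and the cone parameter $\alpha$ vary across papers) requires that, for every support $S$ with $|S| \le s$,
\[
\frac{1}{n}\,\|X v\|_2^2 \;\ge\; \kappa\, \|v\|_2^2 \qquad \text{for all } v \in \mathcal{C}(S, \alpha) := \{v : \|v_{S^c}\|_1 \le \alpha\, \|v_S\|_1\}.
\]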
11:40 | Logistic Regression: The Importance of Being Improper ABSTRACT. Learning linear predictors with the logistic loss --- both in stochastic and online settings --- is a fundamental task in learning and statistics, with direct connections to classification and boosting. Existing "fast rates" for this setting exhibit exponential dependence on the predictor norm, and Hazan et al. (2014) showed that this is unfortunately unimprovable. Starting with the simple observation that the logistic loss is $1$-mixable, we design a new efficient improper learning algorithm for online logistic regression that circumvents the aforementioned lower bound with a regret bound exhibiting a doubly-exponential improvement in dependence on the predictor norm. This provides a positive resolution to a variant of the COLT 2012 open problem of McMahan and Streeter (2012) when improper learning is allowed. This improvement is obtained both in the online setting and, with some extra work, in the batch statistical setting with high probability. We show that the improved dependency on the predictor norm is also near-optimal. Leveraging this improved dependency on the predictor norm also yields the following applications: (a) we give algorithms for online bandit multiclass learning with the logistic loss with an $\tilde{O}(\sqrt{n})$ relative mistake bound across essentially all parameter ranges, thus providing a solution to the COLT 2009 open problem of Abernethy and Rakhlin (2009), and (b) we give an adaptive algorithm for online multiclass boosting with optimal sample complexity, thus partially resolving an open problem of Beygelzimer et al. (2015) and Jung et al. (2017). Finally, we give information-theoretic bounds on the optimal rates for improper logistic regression with general function classes, thereby characterizing the extent to which our improvement for linear classes extends to other parametric or even nonparametric classes. |
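Note: the mixability property invoked at the start of the abstract has the following standard definition (stated generically; the abstract uses $\eta = 1$ for the logistic loss): a loss $\ell$ is $\eta$-mixable if for every distribution $\pi$ over predictions there is a single prediction $z_\pi$ with
\[
\ell(z_\pi, y) \;\le\; -\frac{1}{\eta}\,\log\, \mathbb{E}_{z \sim \pi}\big[e^{-\eta\,\ell(z, y)}\big] \qquad \text{for every outcome } y.
\]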
11:50 | Reducibility and Computational Lower Bounds for Problems with Planted Sparse Structure ABSTRACT. Recently, research in unsupervised learning has gravitated towards exploring statistical-computational gaps induced by sparsity. A recent line of work initiated by Berthet and Rigollet (2013) has aimed to explain these gaps through reductions to conjecturally hard problems in computer science. However, the delicate nature of average-case reductions has limited the development of techniques and often led to weaker hardness results that only apply to algorithms that are robust to different noise distributions or that do not need to know the parameters of the problem. We introduce several new techniques to give a web of average-case reductions showing strong computational lower bounds based on the planted clique conjecture. Our new lower bounds include:
1. Planted Independent Set: We show tight lower bounds for recovering a planted independent set of size $k$ in a sparse Erdős-Rényi graph of size $n$ with edge density $\tilde{\Theta}(n^{-\alpha})$.
2. Planted Dense Subgraph: If $p > q$ are the edge densities inside and outside of the community, we show the first lower bounds for the general regime $q = \tilde{\Theta}(n^{-\alpha})$ and $p - q = \tilde{\Theta}(n^{-\gamma})$ where $\gamma \ge \alpha$, matching the lower bounds predicted by Chen and Xu (2016). Our lower bounds apply to a deterministic community size $k$, resolving a question raised by Hajek, Wu, and Xu (2015).
3. Biclustering: We show strong lower bounds for Gaussian biclustering as a simple hypothesis testing problem to detect a uniformly random planted flat $k \times k$ submatrix.
4. Sparse Rank-1 Submatrix: We show that sparse rank-1 submatrix detection is often harder than biclustering, and obtain two different tight lower bounds for these problems with different reductions from planted clique.
5. Sparse PCA: We give a reduction between rank-1 submatrix and sparse PCA to obtain tight lower bounds in the less sparse regime $k \gg \sqrt{n}$, when the spectral algorithm is optimal over the natural SDP. This yields the first tight characterization of a computational barrier for sparse PCA over an entire parameter regime. We also give an alternate reduction recovering the lower bounds of Berthet and Rigollet (2013) and Gao et al. (2017) in the canonical simple hypothesis testing variant of sparse PCA, the spiked covariance model.
6. New Models: We demonstrate a subtlety in the complexity of sparse PCA and planted dense subgraph by introducing two variants of these problems, biased sparse PCA and the planted stochastic block model, and showing that they have different hard regimes than the originals.
Our results demonstrate that, despite the delicate nature of average-case reductions, using natural problems as intermediates can often be beneficial, as is the case for reductions between deterministic problems. Our main technical contribution is to introduce a set of cloning techniques that maintain the level of signal in an instance of a problem while increasing the size of its planted structure. We also give algorithms matching our lower bounds and identify the information-theoretic limits of the models we introduce. |
Poster Session