Registration
09:00 | An Optimal Learning Algorithm for Online Unconstrained Submodular Maximization ABSTRACT. We consider a basic problem at the interface of two fundamental fields: submodular optimization and online learning. In the online unconstrained submodular maximization (online USM) problem, there is a universe $[n]=\{1,2,\ldots,n\}$ and a sequence of $T$ nonnegative (not necessarily monotone) submodular functions arrive over time. The goal is to design a computationally efficient online algorithm, which chooses a subset of $[n]$ at each time step as a function only of the past, such that the accumulated value of the chosen subsets is as close as possible to the maximum total value of a fixed subset in hindsight. Our main result is a polynomial-time no-$\tfrac 12$-regret algorithm for this problem, meaning that for every sequence of nonnegative submodular functions, the algorithm's expected total value is at least $\tfrac{1}{2}$ times that of the best subset in hindsight, up to an error term sublinear in $T$. The factor of $\tfrac 12$ cannot be improved upon by any polynomial-time online algorithm, unless $NP = RP$. Prior to our work, the best result known was that picking a subset uniformly at random in every time step achieves no $\tfrac 14$-regret. A byproduct of our techniques is an explicit subroutine for the two-experts problem that has an unusually strong regret guarantee: the total value of its choices is comparable to twice the total value of either expert on rounds it did not pick that expert. This subroutine may be of independent interest. |
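Note: as a point of reference, the following is a minimal sketch of the uniform-random baseline mentioned at the end of the abstract (each element of $[n]$ included independently with probability 1/2 at every step). The functions passed in are placeholders, and this is not the paper's no-1/2-regret algorithm.

    import random

    def uniform_random_subset(n):
        # Include each element of {0, ..., n-1} independently with probability 1/2.
        return {i for i in range(n) if random.random() < 0.5}

    def run_random_baseline(n, submodular_fns):
        # submodular_fns: callables mapping a subset of range(n) to a nonnegative
        # value; stand-ins for the adversary's functions f_1, ..., f_T.
        total = 0.0
        for f in submodular_fns:
            S = uniform_random_subset(n)  # chosen independently of the past
            total += f(S)
        return total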
09:10 | Black-Box Reductions for Parameter-free Online Learning in Banach Spaces ABSTRACT. We introduce several new black-box reductions that significantly improve the design of adaptive and parameter-free online learning algorithms by simplifying analysis, improving regret guarantees, and sometimes even improving runtime. We reduce parameter-free online learning to online exp-concave optimization, we reduce optimization in a Banach space to one-dimensional optimization, and we reduce optimization over a constrained domain to unconstrained optimization. All of our reductions run as fast as online gradient descent. We use our new techniques to improve upon the previously best regret bounds for parameter-free learning, and do so for arbitrary norms. |
09:20 | The Many Faces of Exponential Weights in Online Learning ABSTRACT. A standard introduction to online learning might place Online Gradient Descent at its center and then proceed to develop generalizations and extensions like Online Mirror Descent and second-order methods. Here we explore the alternative approach of putting exponential weights (EW) first. We show that many standard methods and their regret bounds then follow as a special case by plugging in suitable surrogate losses and playing the EW posterior mean. For instance, we easily recover Online Gradient Descent by using EW with a Gaussian prior on linearized losses, and, more generally, all instances of Online Mirror Descent based on regular Bregman divergences also correspond to EW with a prior that depends on the mirror map. Furthermore, appropriate quadratic surrogate losses naturally give rise to Online Gradient Descent for strongly convex losses and to Online Newton Step. We further interpret several recent adaptive methods (iProd, Squint, and a variation of Coin Betting for experts) as a series of closely related reductions to exp-concave surrogate losses that are then handled by Exponential Weights. Finally, a benefit of our EW interpretation is that it opens up the possibility of sampling from the EW posterior distribution instead of playing the mean. As already observed by Bubeck and Eldan (2015), this recovers the best-known rate in Online Bandit Linear Optimization. |
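Note: to illustrate the first example in the abstract, the standard computation below (with a Gaussian prior $\mathcal{N}(w_1, \sigma^2 I)$, linearized losses $\langle g_t, w\rangle$, and learning rate $\eta$) shows that the EW posterior mean is a gradient-descent iterate. After round $t$ the EW posterior is
\[
\pi_{t+1}(w) \;\propto\; \exp\Big(-\eta \sum_{s \le t} \langle g_s, w\rangle\Big)\,\exp\Big(-\tfrac{1}{2\sigma^2}\|w - w_1\|^2\Big),
\]
which is again Gaussian, with mean
\[
w_{t+1} \;=\; w_1 - \eta\sigma^2 \sum_{s \le t} g_s \;=\; w_t - \eta\sigma^2 g_t,
\]
i.e., (lazy) online gradient descent with step size $\eta\sigma^2$.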
09:30 | Smoothed Online Convex Optimization in High Dimensions via Online Balanced Descent ABSTRACT. We study smoothed online convex optimization, a version of online convex optimization where the learner incurs a penalty for changing her actions between rounds. Given an $\Omega(\sqrt{d})$ lower bound on the competitive ratio of any online algorithm, where $d$ is the dimension of the action space, we ask under what conditions this bound can be beaten. We introduce a novel algorithmic framework for this problem, Online Balanced Descent (OBD), which works by iteratively projecting the previous point onto a carefully chosen level set of the current cost function so as to balance the switching costs and hitting costs. We demonstrate the generality of the OBD framework by showing how, with different choices of “balance,” OBD can improve upon state-of-the-art performance guarantees for both competitive ratio and regret; in particular, OBD is the first algorithm to achieve a dimension-free competitive ratio, $3 + O(1/\alpha)$, for locally polyhedral costs, where $\alpha$ measures the “steepness” of the costs. We also prove bounds on the dynamic regret of OBD when the balance is performed in the dual space that are dimension-free and imply that OBD has sublinear static regret. |
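Note: a hedged sketch of the level-set projection step described in the abstract, using a generic constrained solver and a simple bisection over the level. The balance rule, tolerances, and solver choice here are illustrative, not the exact OBD rule from the paper.

    import numpy as np
    from scipy.optimize import minimize

    def project_onto_level_set(x_prev, f, level):
        # Euclidean projection of x_prev onto the sublevel set {x : f(x) <= level},
        # computed with a generic constrained solver (illustrative only).
        cons = {"type": "ineq", "fun": lambda x: level - f(x)}
        res = minimize(lambda x: np.sum((x - x_prev) ** 2),
                       x0=np.asarray(x_prev, dtype=float),
                       constraints=[cons], method="SLSQP")
        return res.x

    def obd_step(x_prev, f, beta, tol=1e-6, max_iter=50):
        # One balanced-descent-style step: bisect over the level l of f so that the
        # movement cost ||x - x_prev|| is roughly beta times the hitting cost l,
        # then play the projection onto that level set.  f is assumed convex with
        # nonnegative minimum value.
        x_prev = np.asarray(x_prev, dtype=float)
        lo, hi = 0.0, float(f(x_prev))
        x = x_prev
        for _ in range(max_iter):
            level = 0.5 * (lo + hi)
            x = project_onto_level_set(x_prev, f, level)
            if np.linalg.norm(x - x_prev) > beta * level:
                lo = level   # too much movement: accept a higher hitting cost
            else:
                hi = level   # movement is cheap: tighten the level set
            if hi - lo < tol:
                break
        return x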
09:40 | Online learning over a finite action set with limited switching SPEAKER: Jason Altschuler ABSTRACT. We study the value of switching actions in the Prediction From Experts (PFE) problem and Adversarial Multi-Armed Bandits (MAB) problem. First, we revisit the well-studied setting of PFE with switching costs. Many algorithms are known to achieve the optimal minimax order of $O(\sqrt{T \log n})$ in expectation for both regret and number of switches, where $T$ is the number of iterations and $n$ the number of actions. However, no high probability guarantees are known. Our main technical contribution is an algorithm whose regret and number of switches are both of this optimal order with high probability. This settles an open problem of Devroye, Lugosi and Neu (2015), and also directly implies the first high probability guarantees for several problems of interest. Next, to investigate the value of switching actions at a more granular level, we introduce the setting of switching budgets, in which the algorithm is limited to $S \leq T$ switches between actions. This entails a limited number of free switches, in contrast to the unlimited number of expensive switches allowed in the switching cost setting. Using the above result and several reductions, we unify previous work and completely characterize the complexity of this switching budget setting up to small polylogarithmic factors: for both the PFE and MAB problems, for all switching budgets $S \in [1, T]$, and for both expectation and high probability guarantees. For PFE, we show that the optimal rate is of order $\tilde{\Theta}(\sqrt{T\log n})$ for $S = \Omega(\sqrt{T\log n})$, and $\min(\tilde{\Theta}(\tfrac{T\log n}{S}), T)$ for $S = O(\sqrt{T \log n})$. Interestingly, the bandit setting does not have such a phase transition; instead we show the minimax rate decays steadily as $\min(\tilde{\Theta}(\tfrac{T\sqrt{n}}{\sqrt{S}}), T)$ for all ranges of $S \in [1,T]$. These results recover and generalize the known minimax rates for the (arbitrary) switching cost setting. |
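Note: for intuition about the switching-budget setting, here is a generic blocking reduction that respects a budget of at most S switches by running Hedge on block-aggregated losses. This is only a simple baseline sketch, not the (near-optimal) algorithm analyzed in the paper.

    import numpy as np

    def blocked_hedge(losses, switch_budget, eta, seed=0):
        # losses: (T, n) array of per-round losses for n actions (placeholder data).
        # Rounds are grouped into at most switch_budget + 1 blocks; a single action
        # is played per block, so at most switch_budget switches occur in total.
        rng = np.random.default_rng(seed)
        T, n = losses.shape
        block_len = int(np.ceil(T / (switch_budget + 1)))
        log_weights = np.zeros(n)
        total_loss = 0.0
        for start in range(0, T, block_len):
            probs = np.exp(log_weights - log_weights.max())
            probs /= probs.sum()
            action = rng.choice(n, p=probs)            # one action for the whole block
            block = losses[start:start + block_len]
            total_loss += block[:, action].sum()
            log_weights -= eta * block.sum(axis=0)     # Hedge update on block totals
        return total_loss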
09:50 | Online Learning: Sufficient Statistics and the Burkholder Method ABSTRACT. We uncover a fairly general principle: If regret can be (approximately) expressed as a function of certain "sufficient statistics" for the data sequence, then there exists a special Burkholder function that only depends on these sufficient statistics, not the entire data sequence. This function can be used algorithmically to achieve the regret bound, and the online strategy is only required to keep these sufficient statistics in memory. The characterization is achieved by bringing the full power of the Burkholder Method --- originally developed for certifying probabilistic martingale inequalities --- to bear on the online learning setting. To demonstrate the power of the proposed method, we develop a novel online strategy for matrix prediction that attains a regret bound corresponding to the variance term in matrix concentration inequalities. We present a linear-time/space prediction strategy for parameter-free supervised learning with linear classes and general smooth norms. |
10:30 | Learning Single Index Models in Gaussian Space SPEAKER: Rishabh Dudeja ABSTRACT. We consider regression problems where the response is a smooth but non-linear function of a $k$-dimensional projection of $p$ normally-distributed covariates, contaminated with additive Gaussian noise. The goal is to recover the range of the $k$-dimensional projection, i.e., the index space. This model is called the multi-index model, and the $k=1$ case is called the single-index model. For the single-index model, we characterize the population landscape of a natural semi-parametric maximum likelihood objective in terms of the link function and prove that it has no spurious local minima. We also propose and analyze an efficient iterative procedure that recovers the index space up to error $\epsilon$ using a sample size $\tilde{O}(p^{O(R^2/\mu)} + p/\epsilon^2)$, where $R$ and $\mu$, respectively, parameterize the smoothness of the link function and the signal strength. When a multi-index model is incorrectly specified as a single-index model, we prove that essentially the same procedure, with sample size $\tilde{O}(p^{O(kR^2/\mu)} + p/\epsilon^2)$, returns a vector that is $\epsilon$-close to being completely in the index space. |
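Note: in symbols, the single-index ($k=1$) case described above can be written (with hypothetical notation for the direction $u$ and link $f$) as
\[
y_i \;=\; f(\langle u, x_i\rangle) + \epsilon_i, \qquad x_i \sim \mathcal{N}(0, I_p), \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2),
\]
with both the smooth link $f$ and the direction $u \in \mathbb{R}^p$ unknown; the goal is to recover $\mathrm{span}(u)$ up to error $\epsilon$.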
10:40 | L1 Regression using Lewis Weights Preconditioning and Stochastic Gradient Descent ABSTRACT. We present preconditioned stochastic gradient descent (SGD) algorithms for the $\ell_1$ minimization problem $\min_{x}\|A x - b\|_1$ in the overdetermined case, where there are far more constraints than variables. Specifically, we have $A \in \mathbb{R}^{n \times d}$ for $n \gg d$. Commonly known as the Least Absolute Deviations problem, $\ell_1$ regression can be used to solve many important combinatorial problems, such as minimum cut and shortest path. SGD-based algorithms are appealing for their simplicity and practical efficiency. Our algorithms precondition the matrix $A$ and then solve the problem for the resulting matrix $\tilde{A}$ using gradient descent techniques. Our primary insight is that careful preprocessing can yield preconditioned matrices $\tilde{A}$ with strong properties (besides good condition number and low dimension) that allow for faster convergence of gradient descent. In particular, we precondition using Lewis weights to obtain an isotropic matrix with fewer rows and strong upper bounds on all row norms. We leverage these conditions to find a good initialization, which we use along with recent smoothing reductions and accelerated stochastic gradient descent algorithms to achieve $\epsilon$ relative error in $\tilde{O}(nnz(A) + d^{2.5} \epsilon^{-2})$ time with high probability, where $nnz(A)$ is the number of non-zeros in $A$. This improves over the previous best result using gradient descent for $\ell_1$ regression. We also match the best known running times for interior point methods in several settings. Finally, we also show that if our original matrix $A$ is approximately isotropic and the row norms are approximately equal, we can give an algorithm that avoids using fast matrix multiplication and obtains a running time of $\tilde{O}(nnz(A) + s d^{1.5}\epsilon^{-2} + d^2\epsilon^{-2})$, where $s$ is the maximum number of non-zeros in a row of $A$. In this setting, we beat the best interior point methods for certain parameter regimes. |
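Note: for context, a bare-bones stochastic subgradient method for $\min_{x}\|Ax - b\|_1$ is sketched below; the paper's algorithm additionally preconditions $A$ via Lewis weights and applies smoothing and acceleration, none of which appears in this baseline.

    import numpy as np

    def sgd_l1_regression(A, b, n_iters=100000, step=1e-3, seed=0):
        # Plain stochastic subgradient descent for the least-absolute-deviations
        # objective: at each step sample a row i and move along
        # -sign(a_i^T x - b_i) * a_i.  Unpreconditioned baseline only.
        rng = np.random.default_rng(seed)
        n, d = A.shape
        x = np.zeros(d)
        x_avg = np.zeros(d)
        for t in range(1, n_iters + 1):
            i = rng.integers(n)
            g = np.sign(A[i] @ x - b[i]) * A[i]   # subgradient of |a_i^T x - b_i|
            x -= step * g
            x_avg += (x - x_avg) / t              # running average of the iterates
        return x_avg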
10:50 | Optimal Errors and Phase Transitions in High-Dimensional Generalized Linear Models ABSTRACT. Generalized linear models (GLMs) arise in high-dimensional machine learning, statistics, communications and signal processing. In this paper we analyze GLMs when the data matrix is random, as relevant in problems such as compressed sensing, error-correcting codes or benchmark models of neural networks. We evaluate the mutual information (or “free entropy”) from which we deduce the Bayes-optimal inference and generalization errors. Our analysis applies to the high-dimensional limit where both the number of samples and dimensions are large and their ratio is fixed. Non-rigorous predictions for the optimal inference and generalization errors existed for special cases of GLMs, e.g. for the perceptron in the field of statistical physics, based on the so-called replica method. Our present paper rigorously establishes those decades-old conjectures and brings forward their algorithmic interpretation in terms of the performance of the generalized approximate message-passing algorithm. Furthermore, we tightly characterize, for many learning problems, the regions of parameters for which this algorithm achieves optimal performance, and locate the associated sharp phase transitions separating learnable and non-learnable regions. |
11:00 | Marginal Singularity, and the Benefits of Labels in Covariate-Shift SPEAKER: Guillaume Martinet ABSTRACT. We present new minimax results that concisely capture the relative benefits of source and target labeled data under covariate-shift. Namely, we show that the benefits of target labels are controlled by a transfer-exponent $\gamma$ that encodes how singular the target marginal $Q$ is locally w.r.t. the source marginal $P$, and interestingly allows situations where transfer did not seem possible under previous insights. In fact, our new minimax analysis -- in terms of $\gamma$ -- reveals a continuum of regimes, ranging from situations where target labels have little benefit to regimes where target labels dramatically improve classification. We then show that a recently proposed semi-supervised procedure can be extended to adapt to unknown $\gamma$, and therefore requests target labels only when beneficial, while achieving minimax transfer rates. |
11:10 | Learning Mixtures of Linear Regressions with Nearly Optimal Complexity ABSTRACT. Mixtures of Linear Regressions (MLR) is an important mixture model with many applications. In this model, each observation is generated from one of several unknown linear regression components, where the identity of the generating component is also unknown. Previous works either make strong assumptions on the data distribution or have high complexity. This paper proposes a fixed-parameter tractable algorithm for the problem under general conditions, which achieves global convergence and whose sample complexity scales nearly linearly in the dimension. In particular, unlike previous works that require the data to be drawn from the standard Gaussian, the algorithm allows the data to come from Gaussians with different covariances. When the condition number of the covariances and the number of components are fixed, the algorithm has nearly optimal sample complexity $N = \tilde{O}(d)$ as well as nearly optimal computational complexity $\tilde{O}(Nd)$, where $d$ is the dimension of the data space. To the best of our knowledge, this approach provides the first such recovery guarantee for this general setting. |
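Note: one common way to formalize the model described above (notation ours, and the exact noise assumptions vary across variants) is
\[
y_i \;=\; \langle w_{z_i}, x_i\rangle + \epsilon_i, \qquad z_i \in \{1,\dots,k\}\ \text{unobserved}, \quad x_i \sim \mathcal{N}(0, \Sigma_{z_i}),
\]
where the regression vectors $w_1,\dots,w_k$, the component labels $z_i$, and the covariances $\Sigma_1,\dots,\Sigma_k$ are all unknown.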
11:20 | Efficient Algorithms for Outlier-Robust Regression ABSTRACT. We give the first polynomial-time algorithm for performing linear or polynomial regression resilient to adversarial corruptions in both examples and labels. Given a sufficiently large (polynomial-size) training set drawn i.i.d. from a distribution $D$ and subsequently corrupted on some fraction of points, our algorithm outputs a linear function whose squared error is close to the squared error of the best-fitting linear function with respect to $D$, assuming that the marginal distribution of $D$ over the input space is _certifiably_ hypercontractive. This natural property is satisfied by many well-studied distributions, such as Gaussians, affine transformations of isotropic strongly log-concave distributions, and the uniform distribution on the hypercube, among others. We also give a simple statistical lower bound showing that some distributional assumption is necessary to succeed in this setting. These results are the first of their kind and were not known to be even information-theoretically possible prior to our work. Our approach is based on the sum-of-squares (SoS) method and is inspired by the recent applications of the method to parameter recovery problems in unsupervised learning. Our algorithm can be seen as a natural convex relaxation of the following conceptually simple non-convex optimization problem: find a linear function and a large subset of the corrupted sample such that the least squares loss of the function over the subset is minimized over all possible large subsets. |
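Note: the non-convex problem mentioned in the last sentence can be stated explicitly as follows, where $(x_i, y_i)_{i=1}^n$ is the corrupted sample and $\alpha$ is the required fraction of retained points (notation ours):
\[
\min_{w,\; S \subseteq [n]:\, |S| \ge \alpha n} \;\; \frac{1}{|S|} \sum_{i \in S} \big(\langle w, x_i\rangle - y_i\big)^2 .
\]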
11:30 | Restricted Eigenvalue from Stable Rank with Applications to Sparse Linear Regression ABSTRACT. High-dimensional settings, where the data dimension ($d$) far exceeds the number of observations ($n$), are common in many statistical and machine learning applications. Methods based on $\ell_1$-relaxation, such as the Lasso, are very popular for sparse recovery in these settings. The Restricted Eigenvalue (RE) condition is among the weakest, and hence the most general, conditions in the literature imposed on the Gram matrix that guarantee nice statistical properties for the Lasso estimator. It is natural to ask: what families of matrices satisfy the RE condition? Following a line of work in this area, we construct a new broad ensemble of dependent random design matrices that have an explicit RE bound. Our construction starts with a fixed (deterministic) matrix $X \in \mathbb{R}^{n \times d}$ satisfying a simple stable rank condition, and we show that a matrix drawn from the distribution $X \Phi^\top \Phi$, where $\Phi \in \mathbb{R}^{m \times d}$ is a subgaussian random matrix, satisfies the RE condition with high probability. This construction allows incorporating a fixed matrix that has an easily verifiable condition into the design process, and allows for the generation of compressed design matrices that have a lower storage requirement than a standard design matrix. We give two applications of this construction to sparse linear regression problems, including one to a compressed sparse regression setting where the regression algorithm only has access to a compressed representation of a fixed design matrix $X$. |
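Note: for reference, one standard formulation of the RE condition (the exact constants and the cone parameter $\alpha$ vary across papers) requires that, for every support $S$ with $|S| \le s$,
\[
\frac{1}{n}\,\|X v\|_2^2 \;\ge\; \kappa\, \|v\|_2^2 \qquad \text{for all } v \in \mathcal{C}(S, \alpha) := \{v : \|v_{S^c}\|_1 \le \alpha\, \|v_S\|_1\}.
\]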
11:40 | Logistic Regression: The Importance of Being Improper ABSTRACT. Learning linear predictors with the logistic loss --- both in stochastic and online settings --- is a fundamental task in learning and statistics, with direct connections to classification and boosting. Existing "fast rates" for this setting exhibit exponential dependence on the predictor norm, and Hazan et al. (2014) showed that this is unfortunately unimprovable. Starting with the simple observation that the logistic loss is $1$-mixable, we design a new efficient improper learning algorithm for online logistic regression that circumvents the aforementioned lower bound with a regret bound exhibiting a doubly-exponential improvement in dependence on the predictor norm. This provides a positive resolution to a variant of the COLT 2012 open problem of McMahan and Streeter (2012) when improper learning is allowed. This improvement is obtained both in the online setting and, with some extra work, in the batch statistical setting with high probability. We show that the improved dependency on the predictor norm is also near-optimal. Leveraging this improved dependency on the predictor norm also yields the following applications: (a) we give algorithms for online bandit multiclass learning with the logistic loss with an $\tilde{O}(\sqrt{n})$ relative mistake bound across essentially all parameter ranges, thus providing a solution to the COLT 2009 open problem of Abernethy and Rakhlin (2009), and (b) we give an adaptive algorithm for online multiclass boosting with optimal sample complexity, thus partially resolving an open problem of Beygelzimer et al. (2015) and Jung et al. (2017). Finally, we give information-theoretic bounds on the optimal rates for improper logistic regression with general function classes, thereby characterizing the extent to which our improvement for linear classes extends to other parametric or even nonparametric classes. |
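Note: the mixability property invoked at the start of the abstract has the following standard definition (stated generically; the abstract uses $\eta = 1$ for the logistic loss): a loss $\ell$ is $\eta$-mixable if for every distribution $\pi$ over predictions there is a single prediction $z_\pi$ with
\[
\ell(z_\pi, y) \;\le\; -\frac{1}{\eta}\,\log\, \mathbb{E}_{z \sim \pi}\big[e^{-\eta\,\ell(z, y)}\big] \qquad \text{for every outcome } y.
\]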
11:50 | Reducibility and Computational Lower Bounds for Problems with Planted Sparse Structure ABSTRACT. Recently, research in unsupervised learning has gravitated towards exploring statistical-computational gaps induced by sparsity. A recent line of work initiated by Berthet and Rigollet (2013) has aimed to explain these gaps through reductions to conjecturally hard problems in computer science. However, the delicate nature of average-case reductions has limited the development of techniques and often led to weaker hardness results that only apply to algorithms that are robust to different noise distributions or that do not need to know the parameters of the problem. We introduce several new techniques to give a web of average-case reductions showing strong computational lower bounds based on the planted clique conjecture. Our new lower bounds include:
1. Planted Independent Set: We show tight lower bounds for recovering a planted independent set of size $k$ in a sparse Erdős-Rényi graph of size $n$ with edge density $\tilde{\Theta}(n^{-\alpha})$.
2. Planted Dense Subgraph: If $p > q$ are the edge densities inside and outside of the community, we show the first lower bounds for the general regime $q = \tilde{\Theta}(n^{-\alpha})$ and $p - q = \tilde{\Theta}(n^{-\gamma})$ where $\gamma \ge \alpha$, matching the lower bounds predicted by Chen and Xu (2016). Our lower bounds apply to a deterministic community size $k$, resolving a question raised by Hajek, Wu, and Xu (2015).
3. Biclustering: We show strong lower bounds for Gaussian biclustering as a simple hypothesis testing problem to detect a uniformly random planted flat $k \times k$ submatrix.
4. Sparse Rank-1 Submatrix: We show that sparse rank-1 submatrix detection is often harder than biclustering, and obtain two different tight lower bounds for these problems with different reductions from planted clique.
5. Sparse PCA: We give a reduction between rank-1 submatrix and sparse PCA to obtain tight lower bounds in the less sparse regime $k \gg \sqrt{n}$, when the spectral algorithm is optimal over the natural SDP. This yields the first tight characterization of a computational barrier for sparse PCA over an entire parameter regime. We also give an alternate reduction recovering the lower bounds of Berthet and Rigollet (2013) and Gao et al. (2017) in the canonical simple hypothesis testing variant of sparse PCA, the spiked covariance model.
6. New Models: We demonstrate a subtlety in the complexity of sparse PCA and planted dense subgraph by introducing two variants of these problems, biased sparse PCA and the planted stochastic block model, and showing that they have different hard regimes than the originals.
Our results demonstrate that, despite the delicate nature of average-case reductions, using natural problems as intermediates can often be beneficial, as is the case for reductions between deterministic problems. Our main technical contribution is to introduce a set of cloning techniques that maintain the level of signal in an instance of a problem while increasing the size of its planted structure. We also give algorithms matching our lower bounds and identify the information-theoretic limits of the models we introduce. |
Poster Session