Adaptivity to Noise Parameters in Nonparametric Active Learning
ABSTRACT. The present work addresses some open questions in the theory of active learning for nonparametric
classification. Our contributions are both statistical and algorithmic:
• We establish new minimax rates for active learning under common noise conditions. These rates display
interesting transitions – due to the interaction between noise smoothness and margin – not present
in the passive setting. Some such transitions were previously conjectured but remained unconfirmed.
• We present a generic algorithmic strategy for adaptivity to unknown noise smoothness and margin;
our strategy achieves optimal rates in many general situations; furthermore, unlike in previous work,
we avoid the need for adaptive confidence sets, resulting in strictly milder distributional requirements.
Computationally Efficient Robust Estimation of Sparse Functionals
ABSTRACT. Many conventional statistical procedures are extremely sensitive to seemingly minor deviations from modeling assumptions. This problem is exacerbated in modern high-dimensional settings, where the problem dimension can grow with and possibly exceed the sample size. We consider the problem of robust estimation of sparse functionals, and provide a computationally and statistically efficient algorithm in the high-dimensional setting. Our theory identifies a unified set of deterministic conditions under which our algorithm guarantees accurate recovery. By further establishing that these deterministic conditions hold with high probability for a wide range of statistical models, our theory applies to many problems of considerable interest, including sparse mean and covariance estimation, sparse linear regression, and sparse generalized linear models. In certain settings, such as the detection and estimation of sparse principal components in the spiked covariance model, our general theory does not yield optimal sample complexity, and we provide a novel algorithm based on the same intuition which is able to take advantage of further structure of the problem to achieve nearly optimal rates.
Robust Proper Learning for Mixtures of Gaussians via Systems of Polynomial Inequalities
ABSTRACT. Learning a Gaussian mixture model (GMM) is a fundamental statistical problem.
One common notion of learning a GMM is proper learning: here, the goal is to find a mixture of $k$ Gaussians $\mixpdf$ that is close to the unknown density $f$ from which we draw samples.
The distance between $\mixpdf$ and $f$ is often measured in the total variation / $L_1$-distance.
Our main result is an algorithm for learning a mixture of $k$ univariate Gaussians that is \emph{nearly-optimal} for any fixed $k$.
It is well known that the sample complexity of properly learning a univariate $k$-GMM is $\Otilde(k / \eps^2)$.
However, the best prior \emph{running time} for this problem is $\Otilde(1 / \eps^{3k-1})$; in particular, the dependence between $1/\eps$ and $k$ is exponential.
In this paper, we significantly improve this dependence by replacing the $1/\eps$ term with $\log 1/\eps$, while only increasing the exponent moderately.
Specifically, the running time of our algorithm is $(k \cdot \log 1/ \eps)^{O(k^4)} + \Otilde(k / \eps^2)$.
For any fixed $k$, the $\Otilde(k / \eps^2)$ term dominates our running time, and thus our algorithm runs in time which is \emph{nearly-linear} in the number of samples drawn.
Achieving a running time of $\text{poly}(k, 1 / \eps)$ for proper learning of $k$-GMMs has recently been stated as an open problem by multiple researchers, and we make progress on this question.
Our main algorithmic ingredient is a new connection between proper learning of parametric distributions and systems of polynomial inequalities.
We leverage results for piecewise polynomial approximation of GMMs and reduce the learning problem to a much smaller sub-problem.
While this sub-problem is still non-convex, its size depends only logarithmically on the final accuracy $\eps$.
Hence we can invoke computationally expensive methods for solving the sub-problem.
We show that our connection is also useful in the multivariate setting, where we get new results for learning a mixture of two spherical Gaussians.
A variant of our approach is also within reach of modern computer algebra systems.
Experiments for learning a 2-GMM show promising results: our algorithm improves over the popular Expectation-Maximization (EM) algorithm in the noisy setting.
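For context, the Expectation-Maximization baseline referenced above can be sketched for a univariate 2-GMM. This is a textbook EM implementation (not the paper's method), with an illustrative initialization at the data extremes:

```python
import math
import random

def em_2gmm(xs, iters=50):
    """Textbook EM for a univariate mixture of two Gaussians."""
    mu = [min(xs), max(xs)]          # illustrative init at the data extremes
    var = [1.0, 1.0]
    pi = [0.5, 0.5]                  # mixing weights
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for x in xs:
            p = [pi[j] / math.sqrt(2 * math.pi * var[j])
                 * math.exp(-(x - mu[j]) ** 2 / (2 * var[j])) for j in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: responsibility-weighted parameter updates
        for j in range(2):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, xs)) / nj + 1e-6
    return pi, mu, var

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(250)] + [random.gauss(10, 1) for _ in range(250)]
pi, mu, var = em_2gmm(xs)
print(sorted(round(m, 1) for m in mu))
```

On well-separated components like these, EM recovers the means; the abstract's point is that EM can degrade in noisy settings where the proposed algorithm remains competitive.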
Ignoring Is a Bliss: Learning with Large Noise Through Reweighting-Minimization
ABSTRACT. We consider learning in the presence of arbitrary noise that can overwhelm the
signal in terms of magnitude on a fraction of data points observed (aka outliers). Standard approaches based on minimizing empirical loss can fail
miserably and lead to arbitrarily bad solutions in this setting. We propose an
approach that iterates between finding a solution with minimal empirical loss and re-weighting the data, reinforcing data points where the previous solution works well. We show that our approach can handle arbitrarily large noise, is robust in the sense of having a non-trivial breakdown point, and converges linearly under certain conditions.
The intuitive idea of our approach is to automatically exclude ``difficult''
data points from model fitting. More importantly (and perhaps surprisingly), we validate this intuition by establishing guarantees for generalization and iteration complexity that essentially ignore the presence of outliers.
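The alternate-and-reweight idea can be illustrated on a toy robust location problem. The following is a minimal sketch with a hard keep-the-best-fraction weighting rule (an illustrative simplification, not the paper's actual weighting scheme or guarantees):

```python
def reweighted_mean(xs, rounds=20, keep=0.7):
    """Alternate between fitting a weighted mean and re-weighting:
    keep full weight only on the `keep` fraction of points with the
    smallest loss under the current fit (i.e., ignore the rest)."""
    n = len(xs)
    w = [1.0] * n
    for _ in range(rounds):
        mu = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
        by_loss = sorted(range(n), key=lambda i: (xs[i] - mu) ** 2)
        w = [0.0] * n
        for i in by_loss[: int(keep * n)]:
            w[i] = 1.0
    return mu

# 7 inliers near 0 plus 3 gross outliers near 100
data = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 0.05, 100.0, 101.0, 99.0]
print(round(reweighted_mean(data), 2))   # → 0.05 (the plain mean is ≈ 30)
```

Because the fit is eventually computed only on points the current model explains well, the gross outliers stop influencing the estimate, mirroring the non-trivial breakdown point discussed above.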
ABSTRACT. We consider the problem of outlier robust PCA (\orpca) where the goal is to recover principal components despite the presence of outlier data points. That is, given a data matrix $\Mo$, where $(1-\alpha)$ fraction of the points are noisy samples from a low-dimensional subspace while $\alpha$ fraction of the points can be arbitrary outliers, the goal is to recover the subspace accurately. Existing results for \orpca~have serious drawbacks: while some results are quite weak in the presence of noise, other results have runtime quadratic in dimension, rendering them impractical for large scale applications.
In this work, we provide a novel thresholding based iterative algorithm with per-iteration complexity at most linear in the data size. Moreover, the fraction of outliers, $\alpha$, that our method can handle is tight up to constants while providing nearly optimal computational complexity for a general noise setting. For the special case where the inliers are obtained from a low-dimensional subspace with additive Gaussian noise, we show that a modification of our thresholding based method leads to significant improvement in recovery error (of the subspace) even in the presence of a large fraction of outliers.
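The thresholding idea can be sketched as follows (an illustrative simplification, not the paper's algorithm or its guarantees): alternately fit a rank-$r$ subspace to the currently trusted points and discard the $\alpha$ fraction of points with the largest residuals. The synthetic noiseless inliers below are illustrative:

```python
import numpy as np

def threshold_pca(M, r, alpha, iters=10):
    """M: d x n data matrix. Alternate a truncated SVD fit on the trusted
    columns with thresholding out the worst-residual columns; returns an
    orthonormal d x r basis for the estimated inlier subspace."""
    d, n = M.shape
    keep = np.arange(n)                                  # initially trust every point
    for _ in range(iters):
        U = np.linalg.svd(M[:, keep], full_matrices=False)[0][:, :r]
        resid = np.linalg.norm(M - U @ (U.T @ M), axis=0)   # per-point residual
        keep = np.argsort(resid)[: int((1 - alpha) * n)]    # drop worst alpha fraction
    return U

rng = np.random.default_rng(0)
d, n, r = 20, 200, 2
B = np.linalg.qr(rng.standard_normal((d, r)))[0]         # true inlier subspace
inliers = B @ rng.standard_normal((r, 180))              # noiseless inliers
outliers = 10 * rng.standard_normal((d, 20))             # arbitrary large outliers
M = np.hstack([inliers, outliers])
U = threshold_pca(M, r, alpha=0.1)
err = np.linalg.norm(B - U @ (U.T @ B))                  # ≈ 0 iff span(U) contains span(B)
print(err < 1e-6)
```

Each iteration here costs a truncated SVD on the trusted columns; the paper's algorithm achieves per-iteration cost at most linear in the data size while handling a near-optimal outlier fraction.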
Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality
ABSTRACT. A number of statistical estimation problems can be addressed by semidefinite programs (SDP). While SDPs are solvable in polynomial time using interior point methods, in practice generic SDP solvers do not scale well to high-dimensional problems. In order to cope with this problem, Burer and Monteiro proposed a non-convex rank-constrained formulation, which has good performance in practice but is still poorly understood theoretically.
In this paper we study the rank-constrained version of SDPs arising in MaxCut and in synchronization problems. We establish a Grothendieck-type inequality that proves that all the local maxima and dangerous saddle points are within a small multiplicative gap from the global maximum. We use this structural information to prove that SDPs can be solved within a known accuracy, by applying the Riemannian trust-region method to this non-convex problem, while constraining the rank to be of order one. Finally, we study the estimation error achieved by local maximizers. We prove that it undergoes a phase transition at the same threshold as for information-theoretically optimal methods.
As a side result, we revisit the MaxCut problem. We prove that any local maximizer of the rank-constrained SDP provides a $(1 - 1/(k-1)) \times 0.878$ approximation of the MaxCut, when the rank is fixed to $k$.
Learning-Theoretic Foundations of Algorithm Configuration for Combinatorial Partitioning Problems
ABSTRACT. Max-cut, clustering, and many other partitioning problems that are of significant importance to machine learning and other scientific fields are NP-hard, a reality that has motivated researchers to develop a wealth of approximation algorithms and heuristics. Although the best algorithm to use typically depends on the specific application domain, a worst-case analysis is often used to compare algorithms. This may be misleading if worst-case instances occur infrequently, and thus there is a demand for optimization methods which return the algorithm configuration best suited for the given application's typical inputs. We address this problem for clustering, max-cut, and other partitioning problems, such as integer quadratic programming, by designing computationally efficient and sample-efficient learning algorithms which receive samples from an application-specific distribution over problem instances and learn a partitioning algorithm with high expected performance. Our algorithms learn over common integer quadratic programming and clustering algorithm families: SDP rounding algorithms and agglomerative clustering algorithms with dynamic programming. For our sample complexity analysis, we provide tight bounds on the pseudo-dimension of these algorithm classes, and show that surprisingly, even for classes of algorithms parameterized by a single parameter, the pseudo-dimension is superconstant. In this way, our work both contributes to the foundations of algorithm configuration and pushes the boundaries of learning theory, since the algorithm classes we analyze consist of multi-stage optimization procedures and are significantly more complex than classes typically studied in learning theory.
Greed Is Good: Near-Optimal Submodular Maximization via Greedy Optimization
ABSTRACT. It is known that greedy methods perform well for maximizing monotone submodular functions. At the same time, such methods perform poorly in the face of non-monotonicity. In this paper, we
show, perhaps surprisingly, that invoking the classical greedy algorithm O(k) times leads
to the (currently) fastest deterministic algorithm, called REPEATEDGREEDY, for maximizing a
general submodular function subject to k-independent system constraints. REPEATEDGREEDY achieves a (1 + O(1/√k))k approximation using O(nrk) function evaluations (here, n and r denote the size of the ground set and the maximum size of a feasible solution, respectively). We then show that by a careful sampling procedure, we can run the greedy algorithm only once and obtain the (currently) fastest randomized algorithm, called SAMPLEGREEDY, for maximizing a submodular function subject to k-extendible system constraints (a subclass of k-independent system constraints). SAMPLEGREEDY achieves a (k + 3)-approximation with only O(nr/k) function evaluations. Finally, we derive an almost matching lower bound, and show that no polynomial time algorithm can have an approximation ratio smaller than k + 1/2 − ε. To further support our theoretical results, we compare the performance of REPEATEDGREEDY and SAMPLEGREEDY with prior art in a concrete application (movie recommendation). We consistently observe that while SAMPLEGREEDY achieves practically the same utility as the best baseline, it performs at least two orders of magnitude faster.
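The classical greedy subroutine that REPEATEDGREEDY invokes can be sketched for a cardinality constraint (a simple special case of a k-independent system), here maximizing a weighted-coverage function; the sets and weights below are illustrative toy data:

```python
def greedy(ground, f, r):
    """Classical greedy: repeatedly add the element with the largest
    positive marginal gain, up to r elements."""
    S = set()
    for _ in range(r):
        best, gain = None, 0.0
        for e in ground - S:
            g = f(S | {e}) - f(S)
            if g > gain:
                best, gain = e, g
        if best is None:          # no element improves the value
            break
        S.add(best)
    return S

# Weighted coverage (monotone submodular): f(S) = total weight covered by S.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}
weight = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 2.0, 6: 1.0}

def f(S):
    covered = set().union(*(sets[s] for s in S)) if S else set()
    return sum(weight[i] for i in covered)

S = greedy(set(sets), f, 2)
print(sorted(S), f(S))            # → ['a', 'c'] 7.0
```

Roughly speaking, per the abstract, REPEATEDGREEDY calls such a subroutine O(k) times on progressively restricted ground sets and returns the best solution found, while SAMPLEGREEDY runs greedy only once, on a careful random subsample.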
ABSTRACT. We consider the problem of maximizing a monotone submodular function under noise. Since the 1970s there has been a great deal of work on optimization of submodular functions under various constraints, resulting in algorithms that provide desirable approximation guarantees. In many applications, however, we do not have access to the submodular function we aim to optimize, but rather to some erroneous or noisy version of it. This raises the question of whether provable guarantees are obtainable in presence of error and noise. We provide initial answers by focusing on the problem of maximizing a monotone submodular function under a cardinality constraint when given access to a noisy oracle of the function. We show that there is an algorithm whose approximation ratio is arbitrarily close to the optimal $1-1/e$ when the cardinality is sufficiently large. The algorithm can be applied to optimize approximately submodular functions, optimization with correlated noise distributions, and various other noise models. When the noise is adversarial we show that no non-trivial approximation guarantee can be obtained.
ABSTRACT. This article studies the correspondence retrieval problem: a set of $k$ distinct but unknown points $x_1, x_2, \ldots, x_k \in \mathbb{R}^d$ are to be recovered from the unordered collection of projection values $\langle w_i,x_1 \rangle, \langle w_i, x_2 \rangle, \ldots, \langle w_i, x_k \rangle$ onto $n$ known measurement vectors $w_1, w_2, \ldots, w_n$. Importantly, the correspondence of the $k$ projections $\{\langle w_i, x_j \rangle\}_{j=1}^k$ across different measurements is unknown. A special case of this problem is the well-studied problem of (real-valued) phase retrieval. In the case of independent standard Gaussian measurement vectors, the main algorithm proposed in this work requires $n = d+1$ measurements to correctly return the $k$ unknown points with high probability. This number of measurements is optimal, and it is smaller than the number of measurements required for a stronger ``for all'' guarantee even in the phase retrieval setting. The algorithm is based on reductions to the Shortest Vector Problem on certain random lattices, and employs the Lenstra, Lenstra, \& Lov{\'a}sz (1982) basis reduction algorithm in a manner similar to the Lagarias \& Odlyzko (1985) algorithm for solving random instances of Subset Sum. Another algorithm, essentially due to Yi, Caramanis, \& Sanghavi (2016), based on higher-order moments and tensor decompositions is shown to work in a setting where the projection values are corrupted by additive Gaussian noise, but it requires a significantly larger number of measurements.
Online Learning Without Prior Information (Best Student Paper Award)
ABSTRACT. The vast majority of optimization and online learning algorithms today require
some prior information about the data (often in the form of bounds on gradients or on the optimal parameter value). When this information is not available, these algorithms require laborious manual tuning of various hyperparameters, motivating the search for algorithms that can adapt to the data with no prior information. We describe a frontier of new lower bounds on the performance of such algorithms, reflecting a tradeoff between a term that depends on the optimal parameter value and a term that depends on the gradients' rate of growth. Further, we construct a family of algorithms whose performance matches any desired point on this frontier, which no previous algorithm reaches.
On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities
ABSTRACT. We study an equivalence of (i) deterministic pathwise statements appearing in the online learning literature (termed \emph{regret bounds}), (ii) high-probability tail bounds for the supremum of a collection of martingales (of a specific form arising from uniform laws of large numbers), and (iii) in-expectation bounds for the supremum. By virtue of the equivalence, we prove exponential tail bounds for norms of Banach space valued martingales via deterministic regret bounds for the online mirror descent algorithm with an adaptive step size. We show that the phenomenon extends beyond the setting of online linear optimization and present the equivalence for the supervised online learning setting.
Fast rates for online learning in Linearly Solvable Markov Decision Processes
ABSTRACT. We study the problem of online learning in a class of Markov decision processes known as linearly solvable MDPs. In the stationary version of this problem, a learner interacts with its environment by directly controlling the state transitions, attempting to balance a fixed state-dependent cost and a certain smooth cost penalizing extreme control inputs. In the current paper, we consider an online setting where the state costs may change arbitrarily between consecutive rounds, and the learner only observes the costs at the end of each respective round. We are interested in constructing algorithms for the learner that guarantee small regret against the best stationary control policy chosen in full knowledge of the cost sequence. Our main result shows that the smoothness of the control cost enables the simple algorithm of following the leader to achieve a regret of order $\log^2 T$ after $T$ rounds, vastly improving on the best known regret bound of order $T^{3/4}$ for this setting.
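The follow-the-leader scheme analyzed above is simple to state: at each round play the minimizer of the cumulative loss so far. A toy instance with quadratic (hence smooth and strongly convex) losses, hypothetical and far simpler than the MDP setting, illustrates the logarithmic regret behavior:

```python
def ftl_regret(costs):
    """Losses f_t(x) = (x - c_t)^2. FTL plays the minimizer of the past
    cumulative loss (the running mean of the targets); regret is measured
    against the best fixed x in hindsight (the overall mean)."""
    best_x = sum(costs) / len(costs)
    total, cum = 0.0, 0.0
    for t, c in enumerate(costs):
        x = cum / t if t > 0 else 0.0        # leader after t rounds
        total += (x - c) ** 2 - (best_x - c) ** 2
        cum += c
    return total

# an alternating target sequence over 1000 rounds
costs = [1.0 if t % 2 == 0 else -1.0 for t in range(1000)]
print(ftl_regret(costs))                     # grows like log T, not sqrt(T)
```

For general convex losses this same alternating sequence famously forces FTL to incur linear regret; it is the curvature (here) or smoothness (in the paper) that makes following the leader competitive.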
ZIGZAG: A new approach to adaptive online learning
ABSTRACT. We develop a novel family of algorithms for the online learning setting with regret against any data sequence bounded by the empirical Rademacher complexity of that sequence. To develop a general theory of when this type of adaptive regret bound is achievable we establish a connection to the theory of decoupling inequalities for martingales in Banach spaces. When the hypothesis class is a set of linear functions bounded in some norm, such a regret bound is achievable if and only if the norm satisfies certain decoupling inequalities for martingales. Donald Burkholder’s celebrated geometric characterization of decoupling inequalities states that such an inequality holds if and only if there exists a special function called a Bellman function satisfying certain restricted concavity properties. Our online learning algorithms are efficient in terms of queries to this function.
We realize our general theory by giving novel efficient algorithms for classes including lp norms, spectral norm, group norms, operator norms, and reproducing kernel Hilbert spaces. The empirical Rademacher complexity regret bound implies — when used in the i.i.d. setting — a data-dependent complexity bound for excess risk after online-to-batch conversion. To obtain such adaptive methods, we introduce novel machinery, and the resulting algorithms are not based on the standard tools of online convex optimization.
Efficient Co-Training of Linear Separators under Weak Dependence
ABSTRACT. We develop the first polynomial-time algorithm for co-training of
homogeneous linear separators under a relaxation of the independence
condition known as weak dependence, and assuming an
inverse-polynomial margin.
ABSTRACT. The abundance of unlabeled data makes semisupervised learning (SSL) an attractive approach
for improving the accuracy of learning systems. However, we are still far from a complete analysis of the benefits of this learning scenario in terms of sample complexity. In particular, for many natural learning settings it can in fact be shown that SSL does not improve sample complexity. Thus far, the only case where SSL provably helps, without compatibility assumptions, is a recent combinatorial construction of \cite{darnstadt2013unlabeled}. Deriving similar theoretical guarantees for more commonly used approaches to SSL remains a challenge. Here, we provide the first analysis of manifold-based SSL, where there is a provable gap between supervised learning and SSL, and this gap can be arbitrarily large. Proving the required lower bound is a technical challenge, involving tools from geometric measure theory. The algorithm analyzed is similar to subspace clustering, thus endowing this method with strong theoretical guarantees.
Quadratic Upper Bound for Recursive Teaching Dimension of Finite VC Classes
ABSTRACT. In this work we study the quantitative relation between the recursive teaching dimension (RTD) and the VC dimension (VCD) of concept classes of finite sizes. The RTD of a concept class C ⊆ {0, 1}^n , introduced by Zilles et al. (2011), is a combinatorial complexity measure characterized by the worst-case number of examples necessary to identify a concept in C according to the recursive teaching model.
For any finite concept class C ⊆ {0, 1}^n with VCD(C) = d, Simon and Zilles (2015) posed the open problem of whether RTD(C) = O(d), i.e., whether RTD is linearly upper bounded by VCD. Previously, the best known result was an exponential upper bound RTD(C) = O(d·2^d), due to Chen et al. (2016). In this paper, we show a quadratic upper bound: RTD(C) = O(d^2), much closer to an answer to the open problem. We also discuss the challenges in fully solving the problem.
ABSTRACT. Let $\FF$ be a set of boolean functions. We give an algorithm for learning $\FF_\vee:=\{\vee_{f\in S}f\ |\ S\subseteq \FF\}$ from membership queries. Our algorithm asks at most $|\FF|\cdot \OPT(\FF_\vee)$ membership queries where $\OPT(\FF_\vee)$ is the minimum worst-case number of membership queries for learning $\FF_\vee$. When $\FF$ is a set of halfspaces over a constant-dimensional space or a set of variable inequalities, our algorithm runs in polynomial time.
The problem we address has practical importance in the field of program synthesis, where the goal is to synthesize a program meeting some requirements. Program synthesis has become popular especially in settings aimed at helping end users. In such settings, the requirements are not provided upfront and the synthesizer can only learn them by posing membership queries to the end user. Our work equips such synthesizers with the ability to learn the exact requirements while bounding the number of membership queries.