Generalization for Adaptively-chosen Estimators via Stable Median
ABSTRACT. In this work, we consider the problem of answering adaptively-chosen queries about some unknown data distribution given only a small i.i.d. sample from that distribution. This problem was introduced by Dwork et al. (2014) to investigate algorithms that generalize even when data is reused in multiple data analysis steps, where each analysis may depend on the outcomes of previous analyses. We show how to answer arbitrary real-valued queries in this setting with accuracy guarantees that are distribution-specific. Previous techniques were limited to queries with low worst-case sensitivity, and their accuracy guarantees do not adapt to the data distribution (Bassily et al. 2016). We also give a version of our algorithm that can be used to verify answers to such general queries (as in the reusable holdout technique of Dwork et al. (2015)).
Our algorithm is based on a simple approximate median algorithm that satisfies the strong stability guarantees of differential privacy. Our techniques provide a new approach for analysis of the generalization guarantees of differentially private algorithms.
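The abstract does not spell out the median subroutine, so the following is only a minimal sketch of one standard way to obtain a differentially private approximate median: the exponential mechanism run over a finite grid of candidate values, with a score that rewards balanced splits. The grid, score, and parameter names here are our illustrative assumptions, not the paper's algorithm.
\begin{verbatim}
import numpy as np

def private_approx_median(x, candidates, eps, rng=None):
    # Score a candidate t by how balanced the split of the data at t is;
    # adding or removing one data point changes the score by at most 1.
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    scores = np.array([-abs(np.sum(x < t) - np.sum(x > t)) for t in candidates])
    # Exponential mechanism: P(t) proportional to exp(eps * score / 2).
    logits = eps * scores / 2.0
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Usage: a private approximate median of 1000 standard normal draws.
data = np.random.default_rng(0).normal(size=1000)
grid = np.linspace(-3, 3, 61)
print(private_approx_median(data, grid, eps=0.5))
\end{verbatim}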
ABSTRACT. We consider learning a predictor which is non-discriminatory with respect to a ``protected attribute'' according to the notion of ``equalized odds'' proposed by Hardt et al. (2016). We study the problem of learning such a non-discriminatory predictor from a finite training set, both statistically and computationally. We show that a post-hoc correction approach, as suggested by Hardt et al., can be highly suboptimal, present a nearly-optimal statistical procedure, argue that the associated computational problem is intractable, and suggest a second moment relaxation of the non-discrimination definition for which learning is tractable.
ABSTRACT. In the differentially private top-k selection problem, we are given a dataset $X \in \{-1,+1\}^{n \times d}$, in which each row belongs to an individual and each column corresponds to some binary attribute, and our goal is to find a set of $k \ll d$ columns whose means are approximately as large as possible. Differential privacy requires that our choice of these $k$ columns does not depend too much on any one individual's data. This problem can be solved using the well-known exponential mechanism and composition properties of differential privacy. In the high-accuracy regime, where we require the error of the selection procedure to be smaller than the so-called sampling error $\alpha \sim \sqrt{\ln(d)/n}$, this procedure succeeds given a dataset of size $n \sim k \ln(d)$.
We prove a matching lower bound, showing that a dataset of size $n \sim k \ln(d)$ is necessary for private top-k selection in this high-accuracy regime. Our lower bound shows that selecting the $k$ largest columns requires more data than simply estimating the values of those $k$ columns, which can be done using a dataset of size just $n \sim k$.
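For concreteness, here is a hedged sketch of the peeling procedure the abstract alludes to: select the $k$ columns one at a time via the exponential mechanism, splitting the privacy budget across the $k$ selections by composition. The function name and the uniform budget split are our assumptions; the paper's contribution is the lower bound, not this baseline.
\begin{verbatim}
import numpy as np

def private_top_k(X, k, eps, rng=None):
    # X in {-1,+1}^{n x d}; utility of column j is its mean.
    # Changing one row moves any column mean by at most 2/n,
    # so the utility has sensitivity 2/n.
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    means = X.mean(axis=0)
    available = list(range(d))
    chosen = []
    for _ in range(k):
        u = means[available]
        # Exponential mechanism with per-round budget eps/k:
        # P(j) proportional to exp((eps/k) * u_j / (2 * (2/n))).
        logits = (eps / k) * u / (2 * (2.0 / n))
        p = np.exp(logits - logits.max())
        p /= p.sum()
        j = available[rng.choice(len(available), p=p)]
        chosen.append(j)
        available.remove(j)
    return chosen
\end{verbatim}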
ABSTRACT. In recent years crowdsourcing has become the method of choice for gathering labeled training data for learning algorithms. Standard approaches to crowdsourcing view the process of acquiring labeled data through crowdsourcing separately from the process of learning a classifier from the gathered data. This can give rise to computational and statistical challenges. For example, often there are no known computationally efficient learning algorithms that are robust to the high level of noise that exists in crowdsourced data, and efforts to eliminate noise through voting often require a large number of queries per example.
In this paper, we show how, by interleaving the process of labeling and learning, we can attain computational efficiency with much less overhead in labeling cost. In particular, we consider a realizable setting where there exists a true target function in $F$ and consider a pool of labelers. When a noticeable fraction of the labelers are perfect, we show that any $F$ that can be learned in the traditional realizable PAC model can be learned in a computationally efficient manner by querying the crowd, despite high amounts of noise in the labeler responses. Moreover, we show that this can be done while each labeler only labels a constant number of examples and the number of labels requested per example, on average, is a constant. When no perfect labelers exist, a related task is to find a set of labelers which are good but not perfect. We show that we can identify all good labelers when a majority of the labelers are good.
A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics (Best Paper Award)
ABSTRACT. We study the Stochastic Gradient Langevin Dynamics (SGLD) algorithm for non-convex optimization. The algorithm performs a stochastic gradient update, where in each step it injects appropriately scaled Gaussian noise into the update. We analyze the algorithm's hitting time to an arbitrary subset of the parameter space. Two results follow from our general theory: First, we prove that for empirical risk minimization, if the empirical risk is point-wise close to the (smooth) population risk, then the algorithm achieves an approximate local minimum of the population risk in polynomial time, escaping suboptimal local minima that only exist in the empirical risk. Second, we show that SGLD improves on one of the best known learnability results for learning linear classifiers under the zero-one loss.
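As a concrete picture of the update just described, here is a minimal SGLD sketch; the inverse-temperature parameter beta and the noise scale sqrt(2*eta/beta) follow the standard formulation and are our assumptions, not details taken from the paper.
\begin{verbatim}
import numpy as np

def sgld(grad_estimate, theta0, eta, beta, n_iters, rng=None):
    # Each step: stochastic gradient update plus Gaussian noise,
    # scaled so the chain targets exp(-beta * risk) in the limit.
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iters):
        g = grad_estimate(theta)  # unbiased estimate of the gradient
        noise = rng.normal(size=theta.shape)
        theta = theta - eta * g + np.sqrt(2.0 * eta / beta) * noise
    return theta
\end{verbatim}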
Non-Convex Learning via Stochastic Gradient Langevin Dynamics: A Nonasymptotic Analysis
ABSTRACT. Stochastic Gradient Langevin Dynamics (SGLD) is a popular variant of Stochastic Gradient Descent, where properly scaled isotropic Gaussian noise is added to an unbiased estimate of the gradient at each iteration. This modest change allows SGLD to escape local minima and suffices to guarantee asymptotic convergence to global minimizers for sufficiently regular non-convex objectives (Gelfand & Mitter, 1991).
The present work provides a nonasymptotic analysis in the context of non-convex learning problems: SGLD requires $O(\epsilon^{-4})$ iterations to sample $O(\epsilon)$-approximate minimizers of both empirical and population risk, where $O(\cdot)$ hides polynomial dependence on a temperature parameter, the model dimension, and a certain spectral gap parameter.
As in the asymptotic setting, our analysis relates the discrete-time SGLD Markov chain to a continuous-time diffusion process. A new tool that drives the results is the use of weighted transportation cost inequalities to quantify the rate of convergence of SGLD to a stationary distribution in the Euclidean $2$-Wasserstein distance.
Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent
ABSTRACT. In this paper, we revisit the recently established theoretical guarantees
for the convergence of the Langevin Monte Carlo algorithm of sampling from
a smooth and (strongly) log-concave density. We improve the existing results
when the convergence is measured in the Wasserstein distance and provide
further insights on the very tight relations between, on the one hand, the
Langevin Monte Carlo for sampling and, on the other hand, the gradient
descent for optimization. Finally, we also establish guarantees for the
convergence of a version of the Langevin Monte Carlo algorithm that is based
on noisy evaluations of the gradient.
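The analogy emphasized above is visible directly in the updates: gradient descent and Langevin Monte Carlo share the same drift term, and LMC adds Gaussian noise of scale $\sqrt{2h}$. A minimal sketch, where the step size $h$ and function names are illustrative:
\begin{verbatim}
import numpy as np

def gd_step(x, grad, h):
    # Gradient descent: x' = x - h * grad f(x); minimizes f.
    return x - h * grad(x)

def lmc_step(x, grad, h, rng):
    # Langevin Monte Carlo: the same drift plus sqrt(2h) Gaussian noise,
    # so the chain samples from the density proportional to exp(-f).
    return x - h * grad(x) + np.sqrt(2.0 * h) * rng.normal(size=x.shape)
\end{verbatim}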
Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo
ABSTRACT. This paper presents a detailed theoretical analysis of the Langevin Monte Carlo sampling algorithm recently introduced in
(Durmus et al. 2016) when applied to log-concave probability
distributions that are restricted to a convex body K. This method relies on
a regularisation procedure involving the Moreau-Yosida envelope of the
indicator function associated with K. Explicit
convergence bounds in total variation norm and in Wasserstein
distance of order $1$ are established. In particular, we show that
the complexity of this algorithm given a first order oracle is
polynomial in the dimension of the state space. Finally, some
numerical experiments are presented to compare our method with
competing MCMC approaches from the literature.
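A hedged sketch of the overall scheme: smooth the constraint by replacing the indicator of $K$ with its Moreau-Yosida envelope, whose gradient at $x$ is $(x - \mathrm{proj}_K(x))/\lambda$, and run standard LMC on the smoothed potential. The projection oracle and parameter names below are assumptions of this illustration, not the paper's exact algorithm.
\begin{verbatim}
import numpy as np

def proximal_lmc(grad_f, proj_K, x0, h, lam, n_iters, rng=None):
    # Moreau-Yosida smoothing of the indicator of K: the gradient of
    # the envelope at x is (x - proj_K(x)) / lam, which pushes iterates
    # back toward K. We then run plain LMC on the smoothed potential.
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_iters):
        drift = grad_f(x) + (x - proj_K(x)) / lam
        x = x - h * drift + np.sqrt(2.0 * h) * rng.normal(size=x.shape)
        samples.append(x.copy())
    return np.array(samples)
\end{verbatim}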
Fast Rates for Empirical Risk Minimization of Strict Saddle Problems
ABSTRACT. We derive bounds on the sample complexity of empirical risk
minimization (ERM) in the context of minimizing non-convex risks that admit
the strict saddle property. Recent progress in non-convex optimization
has yielded efficient algorithms for minimizing such functions. Our
results imply that these efficient algorithms are statistically stable
and also generalize well. In particular, we derive fast rates which resemble the bounds
that are often attained in the strongly convex setting. We instantiate our
bounds for Principal Component Analysis and
Independent Component Analysis. Our results and techniques may pave
the way for statistical analyses of additional strict saddle problems.
ABSTRACT. The problem of population recovery refers to estimating a distribution based on incomplete or corrupted samples. Consider a random poll of sample size $n$ conducted on a population of individuals, where each pollee is asked to answer $d$ binary questions. We consider one of the following two polling impediments:
\begin{itemize}
\item in lossy population recovery, a pollee may skip each question with probability $\epsilon$;
\item in noisy population recovery, a pollee may lie on each question with probability $\epsilon$.
\end{itemize}
Given $n$ lossy or noisy samples, the goal is to estimate the probabilities of
all $2^d$ binary vectors simultaneously within accuracy $\delta$ with high probability.
This paper settles the sample complexity of population recovery. For the lossy model, the optimal sample complexity is $\tilde\Theta(\delta^{-2\max\{\frac{\epsilon}{1-\epsilon},1\}})$, improving the state of the art of
\cite{MS13} in several ways: a lower bound is established, the upper bound is improved, and the result is dimension-free. Surprisingly, the sample complexity undergoes a phase transition from a parametric to a nonparametric rate when $\epsilon$ exceeds $1/2$. For the noisy model, the sharp sample complexity turns out to be dimension-dependent and scales as $\exp(\Theta(d^{1/3} \log^{2/3}(1/\delta)))$, except for the trivial cases of $\epsilon=0$, $1/2$, or $1$.
For both models, our estimators simply compute the empirical mean of a certain
function, which is found by pre-solving a linear program (LP). Curiously, the dual LP can be understood as Le Cam's method for lower-bounding the minimax risk, thus establishing the statistical optimality of the proposed estimators. The value of the LP is determined by complex-analytic methods.
ABSTRACT. The noisy population recovery problem is a statistical inference problem and a special case of the problem
of learning mixtures of product distributions. Given an unknown distribution on $n$-bit strings with support of size $k$,
and given access only to noisy samples from it, where each bit is flipped independently with some unknown noise probability,
the goal is to estimate from a few samples the underlying parameters of the model. Previous work (De et al.)
designed polynomial-time algorithms which work under the assumption that the noise parameters are exactly known. In this work, we remove this assumption and show how to
recover all the underlying parameters, even when the noise is unknown, in quasi-polynomial time.
ABSTRACT. We study the problem of estimating multivariate log-concave probability density functions.
We prove the first sample complexity upper
bound for learning log-concave densities on $\R^d$, for all $d \geq 1$.
Prior to our work, no upper bound on the sample complexity of this learning problem
was known for the case of $d>3$.
In more detail, we give an estimator that, for any $d \ge 1$ and $\eps>0$,
draws $\tilde{O}_d \left( (1/\eps)^{(d+5)/2} \right)$ samples from an unknown target log-concave density
on $\R^d$, and outputs a hypothesis that (with high probability) is $\eps$-close to the target,
in total variation distance.
Our upper bound on the sample complexity comes close to the known lower bound of
$\Omega_d \left( (1/\eps)^{(d+1)/2} \right)$ for this problem.
Ten Steps of EM Suffice for Mixtures of Two Gaussians
ABSTRACT. The Expectation-Maximization (EM) algorithm is a widely used method for maximum likelihood estimation in models with latent variables. For estimating mixtures of Gaussians, its iteration can be viewed as a soft version of the k-means clustering algorithm. Despite its wide use and applications, there are essentially no known convergence guarantees for this method. We provide global convergence guarantees for mixtures of two Gaussians with known covariance matrices. We show that the population version of EM, where the algorithm is given access to infinitely many samples from the mixture, converges geometrically to the correct mean vectors, and provide simple, closed-form expressions for the convergence rate. As a simple illustration, we show that, in one dimension, ten steps of the EM algorithm initialized at infinity estimate the means with less than 1\% error. In the finite sample regime, we show that, under a random initialization, $\tilde{\Omega}(d/\epsilon^2)$ samples suffice to compute the unknown vectors to within $\epsilon$ in Mahalanobis distance.
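For intuition, consider the balanced symmetric special case $\frac{1}{2}N(\mu, I) + \frac{1}{2}N(-\mu, I)$, where the E- and M-steps collapse into the single fixed-point update $\mu \leftarrow \frac{1}{n}\sum_i \tanh(\langle x_i, \mu\rangle)\, x_i$. A minimal sketch of this special case follows; the paper's known-covariance setting is more general, and the default of ten steps below merely echoes the title.
\begin{verbatim}
import numpy as np

def em_two_gaussians(X, mu0, n_steps=10):
    # Balanced symmetric mixture 0.5*N(mu, I) + 0.5*N(-mu, I).
    # E-step: posterior sign weights 2*P(z=+1|x) - 1 = tanh(<x, mu>).
    # M-step: mu <- average of tanh(<x_i, mu>) * x_i.
    mu = np.array(mu0, dtype=float)
    for _ in range(n_steps):
        w = np.tanh(X @ mu)
        mu = (w[:, None] * X).mean(axis=0)
    return mu
\end{verbatim}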
ABSTRACT. We introduce the following hidden hubs model H(n,k,\sigma_0, \sigma_1): the input is an n by n random matrix A with a subset S of k special rows (hubs); entries in rows outside S are generated from the Gaussian distribution p_0 = N(0,\sigma_0^2), while for each row in S, some k of its entries are generated from p_1 = N(0,\sigma_1^2), \sigma_1>\sigma_0, and the rest of the entries from p_0. The special rows with higher variance entries can be viewed as hidden higher-degree hubs. The problem we address is to identify them efficiently. The planted Gaussian Submatrix Model is the case where the special entries must all lie in a k by k submatrix. If k\geq c\sqrt{n\ln n}, just the row sums are sufficient to find S in the general model. For the Gaussian submatrix problem (and the related planted clique problem), this can be improved by a \sqrt{\ln n} factor to k \ge c\sqrt{n} by spectral or combinatorial methods.
We give a polynomial-time algorithm to identify all the hidden hubs with high probability for k \ge n^{0.5-\delta} for some \delta >0, when \sigma_1^2>2\sigma_0^2. The algorithm extends to the setting where planted entries might have different variances, each greater than 2\sigma_0^2.
We also show a nearly matching lower bound: for \sigma_1^2 \le 2\sigma_0^2, there is no polynomial-time Statistical Query algorithm for distinguishing between a matrix whose entries are all from N(0,\sigma_0^2) and a matrix with k=n^{0.5-\delta} hidden hubs for any \delta >0. The lower bound as well as the algorithm are related to whether the chi-squared distance of the two distributions diverges. At the critical value \sigma_1^2=2\sigma_0^2, we show that the hidden hubs problem can be solved for k\geq c\sqrt n(\ln n)^{1/4}, improving on the naive row sum-based method.
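As a concrete reading of the row-sum baseline mentioned above: a hub row has $k$ entries of variance $\sigma_1^2$, so its sum of squared entries is inflated by roughly $k(\sigma_1^2 - \sigma_0^2)$, which dominates the $O(\sigma_0^2\sqrt{n\ln n})$ fluctuations of ordinary rows once $k \ge c\sqrt{n\ln n}$. A minimal sketch using squared-entry row sums (our illustration of the regime, not the paper's exact test):
\begin{verbatim}
import numpy as np

def hubs_by_row_sums(A, k):
    # Sum of squared entries per row: about n*sigma0^2 for ordinary
    # rows, inflated by roughly k*(sigma1^2 - sigma0^2) for hub rows.
    stats = (A ** 2).sum(axis=1)
    # Return the indices of the k rows with the largest statistic.
    return np.argsort(stats)[-k:]
\end{verbatim}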
ABSTRACT. In the classical multi-armed bandit problem, $d$ arms are available to the decision maker, who pulls them sequentially in order to maximise his cumulative reward. Guarantees can be obtained on a relative quantity called regret, which scales linearly with $d$ (or with $\sqrt{d}$ in the minimax sense). We here consider the sparse case of this classical problem in the sense that only a small number of arms, namely $s < d$, have a positive expected reward. We are able to leverage this additional assumption to provide an algorithm whose regret scales with $s$ instead of $d$. Moreover, we prove that this algorithm is optimal by providing a matching lower bound, at least for a wide and pertinent range of parameters that we determine, and by evaluating its performance on simulated data.
An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits
ABSTRACT. We present a new strategy for gap estimation in randomized algorithms for multiarmed bandits and combine it with the EXP3++ algorithm of Seldin and Slivkins (2014). In the stochastic regime the strategy reduces the dependence of the regret on the time horizon from $(\ln t)^3$ to $(\ln t)^2$ and replaces an additive factor of order $\Delta e^{1/\Delta^2}$ by an additive factor of order $1/\Delta^7$, where $\Delta$ is the minimal gap of a problem instance. In the adversarial regime the regret guarantee remains unchanged.
ABSTRACT. We study the problem of combining multiple bandit algorithms (that is, online learning algorithms with partial feedback) with the goal of creating a master algorithm that performs almost as well as the best base algorithm if it were to be run on its own. The main challenge is that when run with a master, base algorithms unavoidably receive much less feedback, and it is thus critical that the master not starve a base algorithm that might perform uncompetitively initially but would eventually outperform others if given enough feedback. We address this difficulty by devising a version of Online Mirror Descent with a special mirror map together with a sophisticated learning rate scheme. We show that this approach manages to achieve a more delicate balance between exploiting and exploring base algorithms than previous works, yielding superior regret bounds.
Our results are applicable to many settings, such as multi-armed bandits, contextual bandits, and convex bandits. As examples, we present two main applications. The first is to create an algorithm that enjoys worst-case robustness while at the same time performing much better when the environment is relatively easy. The second is to create an algorithm that works simultaneously under different assumptions of the environment, such as different priors or different loss structures.
Lower Bounds on Regret for Noisy Gaussian Process Bandit Optimization
ABSTRACT. In this paper, we consider the problem of sequentially optimizing a black-box function $f$ based on noisy samples and bandit feedback. We assume that $f$ is smooth in the sense of having a bounded norm in some reproducing kernel Hilbert space (RKHS), a commonly-considered non-Bayesian form of Gaussian process optimization. We provide algorithm-independent lower bounds on the instantaneous regret, measuring the suboptimality of a single point reported after $T$ rounds, and on the cumulative regret, measuring the sum of regrets over all the chosen points.
For the isotropic squared-exponential kernel in $d$ dimensions, we find that an average instantaneous regret of $\epsilon$ requires $T = \Omega\big(\frac{1}{\epsilon^2} (\log\frac{1}{\epsilon})^{d/2}\big)$, and the average cumulative regret is at least $\Omega\big( \sqrt{T(\log T)^d} \big)$, thus matching existing upper bounds up to the replacement of $d/2$ by $d+O(1)$ in both cases. For the Mat\'ern-$\nu$ kernel, we give analogous bounds of the form $\Omega\big( (\frac{1}{\epsilon})^{2+d/\nu}\big)$ and $\Omega\big( T^{\frac{\nu + d}{2\nu + d}} \big)$, and discuss the resulting gaps to the existing upper bounds.
Towards Instance Optimal Bounds for Best Arm Identification
ABSTRACT. In the classical best arm identification (\bestarm) problem, we are given $n$ stochastic bandit arms, each associated with a reward distribution with an unknown mean. Upon each play of an arm, we get a reward sampled i.i.d. from its reward distribution. We would like to identify the arm with the largest mean with probability at least $1-\delta$, using as few samples as possible. The problem has a long history, and understanding its sample complexity has attracted significant attention over the last decade. However, the optimal sample complexity of the problem is still unknown.
Recently, Chen and Li made an interesting conjecture, called the gap-entropy conjecture, concerning the instance optimal sample complexity of \bestarm~\cite{chen2016open}. Given a \bestarm\ instance, let $\mu_{[i]}$ denote the $i$th largest mean and $\Gap{i}=\mu_{[1]}-\mu_{[i]}$ denote the corresponding gap. Let $H(I)=\sum_{i=2}^{n}\Gap{i}^{-2}$ denote the complexity of the instance. The gap-entropy conjecture states that for any instance $I$, $\Omega\left(H(I)\cdot\left(\ln\delta^{-1} + \arment(I)\right)\right)$ is an instance lower bound, where $\arment(I)$ is an entropy-like term completely determined by the $\Gap{i}$'s, and that there is a \CORRECT\ algorithm for \bestarm{} with sample complexity $O\left(H(I)\cdot\left(\ln\delta^{-1} + \arment(I)\right)+\Gap{2}^{-2}\ln\ln \Gap{2}^{-1}\right)$. We note that $\Gap{2}^{-2}\ln\ln \Gap{2}^{-1}$ is necessary and sufficient to solve the two-arm instance with the best and second best arms. If the conjecture is true, we would have a complete understanding of the instance-wise sample complexity of \bestarm.
In this paper, we make significant progress towards a complete resolution of the gap-entropy conjecture, from both the upper and lower bound sides. For the upper bound, we provide a highly nontrivial \CORRECT\ algorithm which requires, for any instance $I$, at most $$O\left(H(I)\cdot\left(\ln\delta^{-1} + \arment(I)\right)+\Gap{2}^{-2}\ln\ln \Gap{2}^{-1}\polylog(n,\delta^{-1})\right)$$ samples in expectation. For the lower bound, we show that for any Gaussian \bestarm\ instance with all gaps of the form $2^{-k}$ ($k\in \mathbb{N}$), any \CORRECT\ monotone algorithm requires at least $$\Omega\left(H(I)\cdot\left(\ln\delta^{-1} + \arment(I)\right)\right)$$ samples in expectation. Here, a monotone algorithm is one which uses no more samples (in expectation) on $I'$ than on $I$, if $I'$ is a sub-instance of $I$ obtained by removing some sub-optimal arms.
ABSTRACT. We extend the Multi-armed Bandit model with unit switching cost to
incorporate a metric between the actions.
We consider the case where the metric
over the actions can be modeled by a complete binary tree, and the distance between
two leaves is the size of the subtree of their least common
ancestor, which abstracts the case that the actions are points on the
continuous interval $[0,1]$ and the switching cost is their
distance.
In this setting, we give a new algorithm that achieves a regret of $\tO(\sqrt{kT} + T/k)$, where $k$ is the number of actions and $T$ is the time horizon.
When the set of actions corresponds to the whole $[0,1]$ interval, we can exploit our method for the task of bandit learning with Lipschitz loss functions: discretizing with $k \sim T^{1/3}$ balances the two terms, and our algorithm achieves an optimal regret rate of $\tTheta(T^{2/3})$, which is the same rate one obtains when there is no penalty for movements.
As our main application, we use our new algorithm to solve an adaptive pricing problem.
Specifically, we consider the case of a single seller faced with a stream of patient buyers. Each buyer has a private value and a window of time
in which they are interested in buying, and they buy at the lowest price offered during that window, provided it is below their value.
We show that with an appropriate discretization of the prices, the seller can achieve a regret of $\tO(T^{2/3})$ compared to the best fixed price in hindsight, which improves on the previous regret bound of $\tO(T^{3/4})$ for this problem.
Tight Bounds for Bandit Combinatorial Optimization
ABSTRACT. We revisit the study of optimal regret rates in bandit combinatorial optimization---a fundamental framework for sequential decision making under uncertainty that abstracts numerous combinatorial prediction problems.
We prove that the attainable regret in this setting grows as $\widetilde{\Theta}(k^{3/2}\sqrt{dT})$, where $d$ is the dimension of the problem and $k$ is a bound on the maximal instantaneous loss, disproving a conjecture of Audibert, Bubeck and Lugosi (2013), who argued that the optimal rate should be of the form $\widetilde{\Theta}(k\sqrt{dT})$. Our bounds apply to several important instances of the framework, and in particular, imply a tight bound for the well-studied bandit shortest path problem. In doing so, we also resolve an open problem posed by Cesa-Bianchi and Lugosi (2012).
Online Nonparametric Learning, Chaining, and the Role of Partial Feedback
ABSTRACT. We investigate contextual online learning with nonparametric (Lipschitz) comparison classes under different assumptions on losses and feedback information. For full information feedback and Lipschitz losses, we characterize the minimax regret up to log factors by proving an upper bound matching a previously known lower bound. In a partial feedback model motivated by second-price auctions, we prove upper bounds for Lipschitz and semi-Lipschitz losses that improve on the known bounds for standard bandit feedback. Our analysis combines novel results for contextual second-price auctions with a novel algorithmic approach based on chaining. When the context space is Euclidean, our chaining approach is efficient and delivers an even better regret bound.
Open Problem: First-Order Regret Bounds for Contextual Bandits
ABSTRACT. We describe two open problems related to first order regret bounds for contextual bandits. The first asks for an algorithm with a regret bound of $\widetilde{\mathcal{O}}(\sqrt{L_\star K \ln N})$ where there are $K$ actions, $N$ policies, and $L_\star$ is the cumulative loss of the best policy. The second asks for an optimization-oracle-efficient algorithm with regret $\widetilde{\mathcal{O}}(L_\star^{2/3}\textrm{poly}(K, \ln(N/\delta)))$. We describe some positive results, such as an inefficient algorithm for the second problem, and some partial negative results.
Open Problem: Meeting Times for Learning Random Automata
ABSTRACT. Learning automata is a foundational problem in computational learning theory. However, even learning random DFAs efficiently is hard. A natural restriction of this problem is learning random DFAs under the uniform distribution. To date, no non-trivial lower bounds are known for this problem, nor any algorithms faster than brute force. In this note, we propose a method to find faster algorithms for this problem. We reduce the learning problem to a conjecture about meeting times of random walks over random DFAs, which may be of independent interest.