previous day
next day
all days

View: session overviewtalk overview

09:00-10:00 Session 16
Langevin Monte Carlo and JKO splitting

ABSTRACT. Algorithms based on discretizing Langevin diffusion are popular tools for sampling from high-dimensional distributions. We develop novel connections between such Monte Carlo algorithms, the theory of Wasserstein gradient flow, and the operator splitting approach to solving PDEs. In particular, we show that a proximal version of the Unadjusted Langevin Algorithm corresponds to a scheme that alternates between solving the gradient flows of two specific functionals on the space of probability measures. Using this perspective, we derive some new non-asymptotic results on the convergence properties of this algorithm.

Log-concave sampling: Metropolis-Hastings algorithms are fast!

ABSTRACT. We consider the problem of sampling from a strongly log-concave density in R^d, and prove a non- asymptotic upper bound on the mixing time of the Metropolis-adjusted Langevin algorithm (MALA). The method draws samples by running a Markov chain obtained from the discretization of an appropriate Langevin diffusion, combined with an accept-reject step to ensure the correct stationary distribution. Relative to known guarantees for the unadjusted Langevin algorithm (ULA), our bounds reveal that the use of an accept-reject step in MALA leads to an exponentially improved dependence on the error-tolerance. Concretely, in order to obtain samples with TV error at most δ for a density with condition number κ, we show that MALA requires O(κd log(1/δ)) steps, as compared to the O(κ^2d/δ^2) steps established in past work on ULA. We also demonstrate the gains of MALA over ULA for weakly log-concave densities. Furthermore, we derive mixing time bounds for a zeroth-order method Metropolized random walk (MRW) and show that it mixes O(κd) slower than MALA. We provide numerical examples that support our theoretical findings, and demonstrate the potential gains of Metropolis-Hastings adjustment for Langevin-type algorithms.

Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem

ABSTRACT. We study sampling as optimization in the space of measures. We focus on gradient flow-based optimization with the Langevin dynamics as a case study. We investigate the source of the bias of the unadjusted Langevin algorithm (ULA) in discrete time, and consider how to remove or reduce the bias. We point out the difficulty is that the heat flow is exactly solvable, but neither its forward nor backward method is implementable in general, except for Gaussian data. We propose the symmetrized Langevin algorithm, which should have a smaller bias than ULA, at the price of implementing a proximal gradient step in space. We show SLA is in fact consistent for Gaussian target measure, whereas ULA is not. We also illustrate various algorithms explicitly for Gaussian target measure.

Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints

ABSTRACT. We study the generalization errors of non-convex regularized ERM procedures using Stochastic Gradient Langevin Dynamics (SGLD). Two theories are proposed with non-asymptotic discrete-time analysis, using stability and PAC-Bayesian theory respectively. The stability-based theory obtains a bound of O(L/n*sqrt(b*T_n)) where L is Lipschitz parameter, b is inverse temperature, and T_N is the sum of step sizes. For PAC-Bayesian theory, though the bound has a slower O(1/sqrt(n)) rate, the contribution of each step decays exponentially through time, and the uniform Lipschitz constant is also replaced by actual norms of gradients along the optimization trajectory. Our bounds have reasonable dependence on aggregated step sizes, and do not explicitly depend on dimensions, norms or other capacity measures of parameter, which characterize how the algorithm itself controls the statistical learning behavior in non-convex problems.

Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability
SPEAKER: Belinda Tzen

ABSTRACT. We study the detailed path-wise behavior of the discrete-time Langevin algorithm for non-convex Empirical Risk Minimization (ERM) through the lens of metastability, adopting some techniques from Berglund and Gentz (2003).
For a particular local optimum of the empirical risk, with an arbitrary initialization, we show that, with high probability, at least one of the following two events will occur: (1) the Langevin trajectory ends up somewhere outside the $\varepsilon$-neighborhood of this particular optimum within a short recurrence time; (2) it enters this $\varepsilon$-neighborhood by the recurrence time and stays there until a potentially exponentially long escape time. We call this phenomenon empirical metastability.
This two-timescale characterization aligns nicely with the existing literature in the following two senses. First, the effective recurrence time (i.e., number of iterations multiplied by stepsize) is dimension-independent, and resembles the convergence time of continuous-time deterministic Gradient Descent (GD). However unlike GD, the Langevin algorithm does not require strong conditions on local initialization, and has the possibility of eventually visiting all optima. Second, the scaling of the escape time is consistent with the Eyring-Kramers law, which states that the Langevin scheme will eventually visit all local minima, but it will take an exponentially long time to transit among them. We apply this path-wise concentration result in the context of statistical learning to examine local notions of generalization and optimality.

Underdamped Langevin MCMC: A non-asymptotic analysis

ABSTRACT. We study the underdamped Langevin diffusion when the log of the target distribution is smooth and strongly concave. We present a MCMC algorithm based on its discretization and show that it achieves $\varepsilon$ error (in 2-Wasserstein distance) in $\mathcal{O}(\sqrt{d}/\varepsilon)$ steps. This is a significant improvement over the best known rate for overdamped Langevin MCMC, which is $\mathcal{O}(d/\varepsilon^2)$ steps under the same smoothness/concavity assumptions. The underdamped Langevin MCMC scheme can be viewed as a version of Hamiltonian Monte Carlo (HMC) which has been observed to outperform overdamped Langevin MCMC methods in a number of application areas. We provide quantitative rates that support this empirical wisdom.

10:00-10:30Coffee Break
10:30-12:00 Session 17
Incentivizing Exploration by Heterogeneous Users

ABSTRACT. We consider the problem of incentivizing exploration with heterogeneous agents. In this problem, bandit arms provide vector-valued outcomes equal to an unknown arm-specific attribute vector, perturbed by independent normally distributed noise. Agents arrive sequentially and choose arms to pull based on their own private and heterogeneous linear utility functions over attributes and the estimates of the arms' attribute vectors derived from observations of other agents' past pulls. Agents are myopic and selfish and thus would choose the arm with maximum estimated utility. A principal, knowing only the distribution from which agents' preferences are drawn, but not the specific draws, can offer incentive payments for pulling specific arms in order to encourage agents to explore underplayed arms. The principal seeks to minimize the total expected cumulative regret incurred by agents relative to their best arms, while also making a small expected cumulative payment.

We propose an algorithm that incentivizes arms played infrequently in the past whose probability of being played in the next round would be small without incentives. Under the assumption that each arm is preferred by at least a fraction $p>0$ of agents, we show that this algorithm achieves expected cumulative regret of $O(N\exp(2/p) + N \log^3(T))$, using expected cumulative payments of $O(N^2\exp(2/p))$. If $p$ is known or the distribution over agent preferences is discrete, the exponential term $\exp(2/p)$ can be replaced with suitable polynomials in $N$ and $1/p$. For discrete preferences, the regret dependence on $T$ can be eliminated entirely, giving constant (depending only polynomially on $N$ and $1/p$ expected regret and payments. This constant regret stands in contrast to the $\Theta(\log(T))$ dependence of regret in standard multi-armed bandit problems. It arises because even unobserved heterogeneity in agent preferences allows exploitation of arms to also explore arms fully; succinctly, heterogeneity provides free exploration.

A General Approach to Multi-Armed Bandits Under Risk Criteria

ABSTRACT. Different risk-related criteria have received recent interest in learning problems, where typically each case is treated in a customized manner. In this paper we provide a more systematic approach to analyzing such risk criteria within a stochastic multi-armed bandit (MAB) formulation. We identify a set of general conditions that yield a simple characterization of the oracle rule (which serves as the regret benchmark), and facilitate the design of upper confidence bound (UCB) learning policies. The conditions are derived from problem primitives, primarily focusing on the relation between the arm reward distributions and the (risk criteria) performance metric. Among other things, the work highlights some (possibly non-intuitive) subtleties that differentiate various criteria in conjunction with statistical properties of the arms. Our main findings are illustrated on several widely used objectives such as conditional value-at-risk, mean-variance, Sharpe-ratio, and more.

Nonstochastic Bandits with Composite Anonymous Feedback

ABSTRACT. We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over at most d consecutive steps in an adversarial way. This implies that the instantaneous loss observed by the player at the end of each round is a sum of as many as d loss components of previously played actions. Hence, unlike the standard bandit setting with delayed feedback, here the player cannot observe the individual delayed losses, but only their sum. Our main contribution is a general reduction transforming a standard bandit algorithm into one that can operate in this harder setting. We also show how the regret of the transformed algorithm can be bounded in terms of the regret of the original algorithm. Our reduction cannot be improved in general: we prove a lower bound on the regret of any bandit algorithm in this setting that matches (up to log factors) the upper bound obtained via our reduction. Finally, we show how our reduction can be extended to more complex bandit settings, such as combinatorial linear bandits and online bandit convex optimization.

More Adaptive Algorithms for Adversarial Bandits

ABSTRACT. We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or more generally the combinatorial semi-bandit problem). When instantiated differently, our algorithm achieves various new data-dependent regret bounds improving previous work. Examples include: 1) a regret bound depending on the variance of only the best arm; 2) a regret bound depending on the first-order path-length of only the best arm; 3) a regret bound depending on the sum of the first-order path-lengths of all arms as well as an important negative term, which together lead to faster convergence rates for some normal form games with partial feedback; 4) a regret bound that simultaneously implies small regret when the best arm has small loss {\it and} logarithmic regret when there exists an arm whose expected loss is always smaller than those of other arms by a fixed gap (e.g. the classic i.i.d. setting). In some cases, such as the last two results, our algorithm is completely parameter-free.

The main idea of our algorithm is to apply the optimism and adaptivity techniques to the well-known Online Mirror Descent framework with a special log-barrier regularizer. The challenges are to come up with appropriate optimistic predictions and correction terms in this framework. Some of our results also crucially rely on using a sophisticated increasing learning rate schedule.

Best of both worlds: Stochastic & adversarial best-arm identification

ABSTRACT. We study bandit best-arm identification with arbitrary and potentially adversarial rewards. A simple random uniform learner obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. Therefore, we ask: \emph{Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards?} First, we show that designing such a learner is impossible in general. In particular, to be robust to adversarial rewards, we can only guarantee optimal rates of error on a subset of the stochastic problems. We give a lower bound that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches (up to log factors) the lower bound in stochastic problems, and it is also robust to adversarial ones.

Efficient Contextual Bandits in Non-stationary Worlds

ABSTRACT. Most contextual bandit algorithms minimize regret against the best fixed policy, a questionable benchmark for non-stationary environments that are ubiquitous in applications. In this work, we develop several efficient contextual bandit algorithms for non-stationary environments by equipping existing methods for i.i.d. problems with sophisticated statistical tests so as to dynamically adapt to a change in distribution.

We analyze various standard notions of regret suited to non-stationary environments for these algorithms, including interval regret, switching regret, and dynamic regret. When competing with the best policy at each time, one of our algorithms achieves regret $\mathcal{O}(\sqrt{ST})$ if there are $T$ rounds with $S$ stationary periods, or more generally $\mathcal{O}(\Delta^{1/3}T^{2/3})$ where $\Delta$ is some non-stationarity measure. These results almost match the optimal guarantees achieved by an inefficient baseline that is a variant of the classic Exp4 algorithm. The dynamic regret result is also the first one for efficient and fully adversarial contextual bandit.

Furthermore, while the results above require tuning a parameter based on the unknown quantity $S$ or $\Delta$, we also develop a parameter free algorithm achieving regret $\min\{S^{1/4}T^{3/4}, \Delta^{1/5}T^{4/5}\}$. This improves and generalizes the best existing result $\Delta^{0.18}T^{0.82}$ by Karnin and Anava (2016) which only holds for the two-armed bandit problem.

Adaptivity to Smoothness in X-armed bandits

ABSTRACT. We study the stochastic continuum-armed bandit problem from the angle of adaptivity to unknown regularity of the reward function f. We prove that there exists no strategy for the cumulative regret that adapts optimally to the smoothness of f. We show however that such minimax optimal adaptive strategies exist if the learner is given extra-information about f. Finally, we complement our positive results with matching lower bounds.

Small-loss bounds for online learning with partial information

ABSTRACT. We consider the problem of adversarial (non-stochastic) online learning with partial information feedback, where at each round, a decision maker selects an action from a finite set of alternatives. We develop a black-box approach for such problems where the learner observes as feedback only losses of a subset of the actions that includes the selected action. When losses of actions are non-negative, under the graph-based feedback model introduced by Mannor and Shamir, we offer algorithms that attain the so called "small-loss'' $o(\alpha L^{\star})$ regret bounds with high probability, where $\alpha$ is the independence number of the graph, and $L^{\star}$ is the loss of the best action. Prior to our work, there was no data-dependent guarantee for general feedback graphs even for pseudo-regret (without dependence on the number of actions, i.e. utilizing the increased information feedback). Taking advantage of the black-box nature of our technique, we extend our results to many other applications such as semi-bandits (including routing in networks), contextual bandits (even with an infinite comparator class), as well as learning with slowly changing (shifting) comparators.

In the special case of classical bandit and semi-bandit problems, we provide optimal small-loss, high-probability guarantees of $\widetilde{\bigO}(\sqrt{dL^{\star}})$ for actual regret, where $d$ is the number of actions, answering open questions of Neu. Previous bounds for bandits and semi-bandits were known only for pseudo-regret and only in expectation. We also offer an optimal $\widetilde{\bigO}(\sqrt{\kappa L^{\star}})$ regret guarantee for fixed feedback graphs with clique-partition number at most $\kappa$.

Information Directed Sampling and Bandits with Heteroscedastic Noise

ABSTRACT. In the stochastic bandit problem, the goal is to maximize an unknown function via a sequence of noisy evaluations. Typically, the observation noise is assumed to be independent of the evaluation point and to satisfy a tail bound uniformly on the domain; a restrictive assumption for many applications. In this work, we consider bandits with heteroscedastic noise, where we explicitly allow the noise distribution to depend on the evaluation point. We show that this leads to new trade-offs for information and regret, which are not taken into account by existing approaches like upper confidence bound algorithms (UCB) or Thompson Sampling. To address these shortcomings, we introduce a frequentist regret analysis framework, that is similar to the Bayesian framework of Russo and Van Roy (2014), and we prove a new high-probability regret bound for general, possibly randomized policies, which depends on a quantity we refer to as regret-information ratio. From this bound, we define a frequentist version of Information Directed Sampling (IDS) to minimize the regret-information ratio over all possible action sampling distributions. This further relies on concentration inequalities for online least squares regression in separable Hilbert spaces, which we generalize to the case of heteroscedastic noise. We then formulate several variants of IDS for linear and reproducing kernel Hilbert space response functions, yielding novel algorithms for Bayesian optimization. We also prove frequentist regret bounds, which in the homoscedastic case recover known bounds for UCB, but can be much better when the noise is heteroscedastic. Empirically, we demonstrate in a linear setting with heteroscedastic noise, that some of our methods can outperform UCB and Thompson Sampling, while staying competitive when the noise is homoscedastic.

12:00-14:00Lunch Break
14:00-15:00 Session 18: Invited talk: Johan Hastad
Some problems that I like connected to small-depth circuits and hardness of approximation

ABSTRACT. The talk consists of two separate parts.

In the first part we describe some theorems about small-depth circuits that were proved in the last five years but should have been proved 30 years ago. This includes sharp bounds for the correlation of parity with small depths circuits and an average case hierarchy theorem. In the second half I describe, what I feel are three useful problems in hardness of approximation. They are Label Cover, approximating constraint satisfaction problems with the Hadamard predicate (which I define, this is a result due to Siu On Chan) and a promise variant of k-Sat.

15:20-16:20 Session 19
Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations

ABSTRACT. We show that the (stochastic) gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations.

Concretely, we show that given $\tilde{O}(dr^{2})$ random linear measurements of a rank $r$ positive semidefinite matrix $X^*$, we can recover $X^*$ by parameterizing it by $UU^\top$ with $U \in \R^{d\times d}$ and minimizing the squared loss, even if $r$ is much less than $d$. We prove that starting from a small initialization, gradient descent recovers $X^*$ in $\tilde{O}(\sqrt{r})$ iterations approximately. The results solve the conjecture of Gunasekar et al. 17 under the restricted isometry property.

The technique can be applied to analyzing neural networks with quadratic activations with some technical modifications.

Global Guarantees for Enforcing Deep Generative Priors by Empirical Risk

ABSTRACT. We examine the theoretical properties of enforcing priors provided by generative deep neural networks via empirical risk minimization. In particular we consider two models, one in which the task is to invert a generative neural network given access to its last layer and another in which the task is to invert a generative neural network given only compressive linear observations of its last layer. We establish that in both cases, in suitable regimes of network layer sizes and a randomness assumption on the network weights, that the non-convex objective function given by empirical risk minimization does not have any spurious stationary points. That is, we establish that with high probability, at any point away from small neighborhoods around two scalar multiples of the desired solution, there is a descent direction. Hence, there are no local minima, saddle points, or other stationary points outside these neighborhoods. These results constitute the first theoretical guarantees which establish the favorable global geometry of these non-convex optimization problems, and they bridge the gap between the empirical success of enforcing deep generative priors and a rigorous understanding of non-linear inverse problems.

Size-Independent Sample Complexity of Neural Networks

ABSTRACT. We study the sample complexity of learning neural networks, by providing new bounds on their Rademacher complexity assuming norm constraints on the parameter matrix of each layer. Compared to previous work, these bounds have improved dependence on the network depth, and under some additional assumptions, are fully independent of the network size (both depth and width). These results are derived using some novel techniques, which may be of independent interest.

Optimal approximation of continuous functions by very deep ReLU networks

ABSTRACT. We prove that deep ReLU neural networks with conventional fully-connected architectures with $W$ weights can approximate continuous $\nu$-variate functions $f$ with uniform error not exceeding $a_\nu\omega_f(c_\nu W^{-2/\nu}),$ where $\omega_f$ is the modulus of continuity of $f$ and $a_\nu, c_\nu$ are some $\nu$-dependent constants. This bound is tight. Our construction is inherently deep and nonlinear: the obtained approximation rate cannot be achieved by networks with fewer than $\Omega(W/\ln W)$ layers or by networks with weights continuously depending on $f$.

16:30-17:30 Session 20
Subpolynomial trace reconstruction for random strings and arbitrary deletion probability

ABSTRACT. The deletion-insertion channel takes as input a string $x$ of $n$ bits and outputs a string where bits have been deleted and inserted independently at random. The trace reconstruction problem is to recover $x$ from many independent outputs (called ``traces'') of the deletion-insertion channel applied to $x$. We show that if $x$ is chosen uniformly at random, then $\exp(O(\log^{1/3} n))$ traces suffice to reconstruct $x$ with high probability. The earlier upper bounds were $\exp(O(\log^{1/2} n))$ for the deletion channel with deletion probability $q <1/2$, and $\exp(O( n^{1/3}))$ for the deletion channel with $q \ge 1/2$ or when insertions are allowed.

A key ingredient in our proof is a delicate two-step alignment procedure where we estimate the location in each trace corresponding to a given bit of $x$. The alignment is done by viewing the strings as random walks, and comparing the increments in the walk associated with the input string and the trace, respectively.

Detection limits in the high-dimensional spiked rectangular model

ABSTRACT. We study the problem of detecting the presence of a single unknown spike in a rectangular data matrix, in a high-dimensional regime where the spike has fixed strength and the aspect ratio of the matrix converges to a finite limit. This situation comprises Johnstone's spiked covariance model. We analyze the likelihood ratio of the spiked model against an ``all noise" null model of reference, and show it has asymptotically Gaussian fluctuations in a region below---but in general not up to---the so-called BBP threshold from random matrix theory. Our result parallels earlier findings of Onatski et al.\ (2013) and Johnstone-Onatski (2015) for spherical spikes. We present a probabilistic approach capable of treating generic product priors. In particular, sparsity in the spike is allowed. Our approach operates through the general principle of the cavity method from spin-glass theory. The question of the maximal parameter region where asymptotic normality is expected to hold is left open. This region, not necessarily given by BBP, is shaped by the prior in a non-trivial way. We conjecture that this is the entire paramagnetic phase of an associated spin-glass model, and is defined by the vanishing of the replica-symmetric solution of Lesieur et al.\ (2015).

Hidden Integrality of SDP Relaxations for Sub-Gaussian Mixture Models

ABSTRACT. We consider the problem of finding discrete clustering structures under Sub-Gaussian Mixture Models. We establish a hidden integrality property of a semidefinite programming (SDP) relaxation for this problem: while the optimal solutions to the SDP are not integer-valued in general, their estimation errors can be upper bounded by the error of an idealized integer program. The error of the integer program, and hence that of the SDP, are further shown to decay exponentially in the signal-to-noise ratio. To the best of our knowledge, this is the first exponentially decaying error bound for convex relaxations of mixture models. A special case of this result shows that in certain regimes the SDP solutions are in fact integral and exact, improving on existing exact recovery results for convex relaxations. More generally, our result establishes sufficient conditions for the SDP to correctly recover the cluster memberships of at least $(1-\delta)$ fraction of the points for any $\delta\in(0,1)$. Error bounds for estimating cluster centers can also be derived directly from our results.

An Analysis of the t-SNE Algorithm for Data Visualization

ABSTRACT. A first line of attack in exploratory data analysis is data visualization, i.e., generating a 2-dimensional representation of data that makes clusters of similar points visually identifiable. Standard Johnson-Lindenstrauss dimensionality reduction does not produce data visualizations. The t-SNE heuristic of van der Maaten and Hinton, which is based on non-convex optimization, has become the de facto standard for visualization in a wide range of applications.

This work gives a formal framework for the problem of data visualization - finding a 2-dimensional embedding of clusterable data that correctly separates individual clusters to make them visually identifiable. We then give a rigorous analysis of the performance of t-SNE under a natural, deterministic condition on the "ground-truth" clusters (similar to conditions assumed in earlier analyses of clustering) in the underlying data. These are the first provable guarantees on t-SNE for constructing good data visualizations.

We show that our deterministic condition is satisfied by considerably general probabilistic generative models for clusterable data such as mixtures of well-separated log-concave distributions. Finally, we give theoretical evidence that t-SNE provably succeeds in partially recovering cluster structure even when the above deterministic condition is not met.

Fitting a putative manifold to noisy data

ABSTRACT. In the present work, we give a solution to the following question from manifold learning. Suppose data belonging to a high dimensional Euclidean space is drawn independently, identically distributed from a measure supported on a low dimensional twice differentiable embedded manifold, and corrupted by a small amount of gaussian noise. How can we produce a manifold whose Hausdorff distance to the true manifold is small and whose reach is not much smaller than the reach of the true manifold?

Polynomial Time and Sample Complexity for Non-Gaussian Component Analysis: Spectral Methods

ABSTRACT. The problem of Non-Gaussian Component Analysis (NGCA) is about finding a maximal low-dimensional subspace $E$ in $\R^n$ so that data points projected onto $E$ follow a non-Gaussian distribution. Vempala and Xiao (2011) proposed a local search algorithm, and showed that it was able to estimate $E$ accurately with polynomial time and sample complexity, if the dimension of $E$ is treated as a constant and with the assumption that all one-dimensional marginals of the non-Gaussian distribution over $E$ have non-Gaussian moments. In this paper, we propose a simple spectral algorithm called \textsc{Reweighted PCA}, and prove that it possesses the same guarantee. The principle that underlies this approach is a new characterization of multivariate Gaussian distributions.

17:30-19:30 Session 21

Poster Session

19:30-23:00 Banquet

The banquet will be at artipelag:
It takes about 30 minutes to go there by bus. The busses will be leaving at 7:30 PM at the beginning of Malvinasväg (or equivalently Osquldasväg) on Drottning Kristinas väg: