The Externalities of Exploration and How Data Diversity Helps Exploitation

ABSTRACT. Online learning algorithms, widely used to power search and content optimization on the web, must balance exploration and exploitation, potentially sacrificing the experience of current users in order to gain information that will lead to better decisions in the future. Recently, concerns have been raised about whether the process of exploration could be viewed as unfair, placing too much burden on certain individuals or groups. Motivated by these concerns, we initiate the study of the externalities of exploration---the undesirable side effects that the presence of one party may impose on another---under the linear contextual bandits model. We introduce the notion of a group externality, measuring the extent to which the presence one population of users (the majority) impacts the rewards of another (the minority). We show that this impact can, in some cases, be negative, and that, in a certain sense, no algorithm can avoid it. We then move on to study externalities at the individual level, interpreting the act of exploration as an externality imposed on the current user of a system by future users. This drives us to ask under what conditions inherent diversity in the data makes explicit exploration unnecessary. We build on a recent line of work on the smoothed analysis of the greedy algorithm that always chooses the action that currently looks optimal. We improve on prior results to show that a greedy approach almost matches the best possible Bayesian regret rate of any other algorithm on the same problem instance whenever the diversity conditions hold, and that this regret is at most $\tilde{O}(T^{1/3})$. Returning to group-level effects, we show that under the same conditions, negative group externalities essentially vanish if one runs the greedy algorithm. Together, our results uncover a sharp contrast between the high externalities that exist in the worst case, and the ability to remove all externalities if the data is sufficiently diverse.

Learning Without Mixing: Towards A Sharp Analysis of Linear System Identification

ABSTRACT. We prove that the ordinary least-squares (OLS) estimator attains nearly minimax optimal performance for the identification of linear dynamical systems from a single observed trajectory. Our upper bound analysis relies on a generalization of Mendelson’s small-ball method to dependent data, eschewing the use of standard mixing-time arguments. Our lower bounds reveal that these upper bounds match up to logarithmic factors. In particular, we capture the correct signal-to-noise behavior of the problem, showing that more unstable linear systems are easier to estimate. This behavior is qualitatively different from arguments which rely on mixing-time calculations that suggest that unstable systems are more difficult to estimate. We generalize our technique to provide bounds for a more general class of linear response time-series.

Testing Symmetric Markov Chains From a Single Trajectory

ABSTRACT. Classical distribution testing assumes access to i.i.d. samples from the distribution that is being tested. We initiate the study of Markov chain testing, assuming access to a single trajectory of a Markov Chain. In particular, we observe a single trajectory X_0 ,...,X_t ,... of an unknown, symmetric, and finite state Markov Chain M. We do not control the starting state X_0 , and we cannot restart the chain. Given our single trajectory, the goal is to test whether M is identical
to a model Markov Chain M_0 , or far from it under an appropriate notion of difference.
We propose a measure of difference between two Markov chains, motivated by the early work of Kazakos [Kaz78], which captures the scaling behavior of the total variation distance between trajectories sampled from the Markov chains as the length of these trajectories grows. We provide efficient testers and information-theoretic lower bounds for testing identity of symmetric Markov chains under our proposed measure of difference, which are tight up to logarithmic factors if the hitting times of the model chain M_0 is O(n) in the size of the state space n.

Action-Constrained Markov Decision Processes With Kullback-Leibler Cost

ABSTRACT. This paper concerns computation of optimal policies in which the one-step cost function contains a term that models Kullback-Leibler divergence with respect to nominal dynamics. This technique was introduced by Todorov in 2007, where it was shown under general conditions that the solution to the average-cost optimality equations reduce to a simple eigenvector problem. Since then many authors have sought to apply this technique to control problems and models of bounded rationality in economics.

A crucial assumption is that the input process is essentially unconstrained. For example, if the nominal dynamics include randomness from nature (e.g., the impact of wind on a moving vehicle), then the optimal control solution does not respect the exogenous nature of this disturbance.

This paper introduces a technique to solve a more general class of action-constrained MDPs. The main idea is to solve an entire parameterized family of MDPs, in which the parameter is a scalar weighting the one-step cost or reward function. The approach is new and practical even in the original unconstrained formulation.

A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation

ABSTRACT. Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process. Although TD is one of the most widely used algorithms in reinforcement learning, its theoretical analysis has proved challenging and few guarantees on its statistical efficiency are available. In this work, we provide a \emph{simple and explicit finite time analysis} of temporal difference learning with linear function approximation. Except for a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature.

Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning

ABSTRACT. Two-timescale Stochastic Approximation (SA) algorithms are widely used in Reinforcement Learning (RL). Their iterates have two parts that are updated using distinct stepsizes. In this work, we develop a novel recipe for their finite sample analysis. Using this, we provide a concentration bound, which is the first such result for a two-timescale SA. The type of bound we obtain is known as ``lock-in probability''. We also introduce a new projection scheme, in which the time between successive projections increases exponentially. This scheme allows one to elegantly transform a lock-in probability into a convergence rate result for projected two-timescale SA. From this latter result, we then extract key insights on stepsize selection. As an application, we finally obtain convergence rates for the projected two-timescale RL algorithms GTD(0), GTD2, and TDC.

A Faster Approximation Algorithm for the Gibbs Partition Function

ABSTRACT. We consider the problem of estimating the partition function $Z(\beta)=\sum_x \exp(-\beta H(x))$ of a Gibbs distribution with a Hamilton $H(\cdot)$, or more precisely the logarithm of the ratio $q=\ln Z(0)/Z(\beta)$. It has been recently shown how to approximate $q$ with high probability assuming the existence of an oracle that produces samples from the Gibbs distribution for a given parameter value in $[0,\beta]$. The current best known approach due to Huber (2015) uses $O(q\ln n\cdot[\ln q + \ln \ln n+\varepsilon^{-2}])$ oracle calls on average where $\varepsilon$ is the desired accuracy of approximation and $H(\cdot)$ is assumed to lie in $\{0\}\cup[1,n]$. We improve the complexity to $O(q\ln n\cdot\varepsilon^{-2})$ oracle calls. We also show that the same complexity can be achieved if exact oracles are replaced with approximate sampling oracles that are within $O(\frac{\varepsilon^2}{q\ln n})$ variation distance from exact oracles.Finally, we prove a lower bound of $\Omega(q\cdot \varepsilon^{-2})$ oracle calls under a natural model of computation.

The Mean-Field Approximation: Information Inequalities, Algorithms, and Complexity

ABSTRACT. The mean field approximation to the Ising model is a canonical variational tool that is used for analysis and inference in Ising models. We provide a simple and optimal bound for the KL error of the mean field approximation for Ising models on general graphs, and extend it to higher order Markov random fields. Our bound improves on previous bounds obtained in work in the graph limit literature by Borgs, Chayes, Lov{\'a}sz, S{\'o}s, and Vesztergombi and another recent work by Basak and Mukherjee. Our bound is tight up to lower order terms.

Building on the methods used to prove the bound, along with techniques from combinatorics and optimization,
we study the algorithmic problem of estimating the (variational) free energy for Ising models and general Markov random fields. For a graph $G$ on $n$ vertices and interaction matrix $J$ with Frobenius norm $\| J \|_F$, we provide algorithms that approximate the free energy within an additive error of $\epsilon n \|J\|_F$ in time $\exp(poly(1/\epsilon))$. We also show that approximation within $(n \|J\|_F)^{1-\delta}$ is NP-hard for every $\delta > 0$. Finally, we provide more efficient approximation algorithms,
which find the optimal mean field approximation, for ferromagnetic Ising models and for Ising models satisfying Dobrushin's condition.

The Vertex Sample Complexity of Free Energy is Polynomial

ABSTRACT. The free energy is a key quantity which is associated to Markov random fields.
Classical results in statistical physics show how, given an analytic formula of the free energy, it is possible to compute many key quantities associated with Markov random fields including quantities such as magnetization and the location of various phase transitions.
Given a massive Markov random field on $n$ nodes, can a small sample from it provide a rough approximation to the free energy $\F_n = \log{Z_n}$?

Results in graph limit literature by Borgs, Chayes, Lov{\'a}sz, S{\'o}s, and Vesztergombi show that
for Ising models on $n$ nodes and interactions of strength $\Theta(1/n)$, an $\epsilon$ approximation to $\log Z_n / n$ can be achieved by sampling a randomly induced model on $2^{O(1/\epsilon^2)}$ nodes. We show that the sampling complexity of this problem is {\em polynomial in }$1/\eps$. We further show a polynomial dependence on $\epsilon$ cannot be avoided.

Our results are very general as they apply to higher order Markov random fields. For Markov random fields of order $r$, we obtain an algorithm that achieves $\epsilon$ approximation using a number of samples polynomial in $r$ and $1/\epsilon$ and running time that is $2^{O(1/\epsilon^2)}$ up to polynomial factors in $r$ and $\epsilon$. For ferromagnetic Ising models, the running time is polynomial in $1/\epsilon$.

Our results are intimately connected to recent research on the regularity lemma and property testing, where the interest is in finding which properties can tested within $\epsilon$ error in time polynomial in $1/\epsilon$. In particular, our proofs build on results from a recent work by Alon, de la Vega, Kannan and Karpinski, who also introduced the notion of polynomial vertex sample complexity. Another critical ingredient of the proof is an effective bound by the authors of the paper relating the variational free energy and the free energy.

Optimal Single Sample Tests for Structured versus Unstructured Network Data

ABSTRACT. We study the problem of testing, using only a single sample, between mean field distributions (like Curie-Weiss, Erd\H{o}s-R\'enyi) and structured Gibbs distributions (like Ising model on sparse graphs and Exponential Random Graphs).
Our goal is to test without knowing the parameter values of the underlying models: only the \emph{structure} of dependencies is known.
We develop a new approach that applies to both the Ising and Exponential Random Graph settings based on a general and natural statistical test. The test can distinguish the hypotheses with high probability above a certain threshold in the (inverse) temperature parameter, and is optimal in that below the threshold no test can distinguish the hypotheses.

The thresholds do not correspond to the presence of long-range order in the models. By aggregating information at a global scale, our test works even at very high temperatures.
The proofs are based on distributional approximation and sharp concentration of quadratic forms, when restricted to Hamming spheres. The restriction to Hamming spheres is necessary, since otherwise any scalar statistic is useless without explicit knowledge of the temperature parameter. At the same time, this restriction radically changes the behavior of the functions under consideration, resulting in a much smaller variance than in the independent setting; this makes it impossible to directly apply standard methods (i.e., Stein's method) for concentration of weakly dependent variables. Instead, we carry out an additional tensorization argument using a Markov chain that respects the symmetry of the Hamming sphere.

Fundamental Limits of Weak Recovery with Applications to Phase Retrieval

ABSTRACT. In phase retrieval we want to recover an unknown signal $x\in\mathbb C^d$ from $n$ quadratic measurements of the form $y_i = ||^2+w_i$ where $a_i\in \mathbb C^d$ are known sensing vectors and $w_i$ is measurement noise. We ask the following weak recovery question: what is the minimum number of measurements $n$ needed to produce an estimator $\hat{x}(y)$ that is positively correlated with
the signal $x$?
We consider the case of Gaussian vectors $a_i$. We prove that -- in the high-dimensional limit -- a sharp phase transition takes place, and we locate the threshold in the regime of vanishingly small noise. For $n\le d-o(d)$ no estimator can do significantly better than random and achieve a strictly positive correlation. For $n\ge d+o(d)$ a simple spectral estimator achieves a positive correlation. Surprisingly, numerical simulations with the same spectral estimator demonstrate promising performance with realistic sensing matrices. Spectral methods are used to initialize non-convex optimization algorithms in phase retrieval, and our approach can boost the performance in this setting as well.

Our impossibility result is based on classical information-theory arguments. The spectral algorithm computes the leading eigenvector of a weighted empirical covariance matrix. We obtain a sharp characterization of the spectral properties of this random matrix using tools from free probability and generalizing a recent result by Lu and Li.
Both the upper and lower bound generalize beyond phase retrieval to measurements $y_i$ produced according to a generalized linear model. As a byproduct of our analysis, we compare the threshold of the proposed spectral method with that of a message passing algorithm.

Non-Convex Matrix Completion Against a Semi-Random Adversary

ABSTRACT. Matrix completion is a well-studied problem with many machine learning applications. In practice, the problem is often solved by non-convex optimization algorithms. However, the current theoretical analysis for non-convex algorithms relies heavily on the assumption that every entry is observed with exactly the same probability $p$, which is not realistic in practice.

In this paper, we investigate a more realistic semi-random model, where the probability of observing each entry is at least $p$. Even with this mild semi-random perturbation, we can construct counter-examples where existing non-convex algorithms get stuck in bad local optima.

In light of the negative results, we propose a pre-processing step that tries to re-weight the semi-random input, so that it becomes ``similar'' to a random input. We give a nearly-linear time algorithm for this problem, and show that after our pre-processing, all the local minima of the non-convex objective can be used to approximately recover the underlying ground-truth matrix.

ABSTRACT. Applied researchers often construct a network from data that has been collected from a random sample of nodes, with the goal to infer properties of the parent network from the sampled version. Two of the most widely used sampling schemes are \emph{subgraph sampling}, where we sample each vertex independently with probability $p$ and observe the subgraph induced by the sampled vertices, and \emph{neighborhood sampling}, where we additionally observe the edges between the sampled vertices and their neighbors.

Despite their ubiquity, theoretical understanding of these sampling models in the context of statistical estimation has been lacking. In this paper, we study the problem of estimating the number of motifs as induced subgraphs under both models from a statistical perspective. We show that: for parent graph $G$ with maximal degree $d$, for any connected motif $h$ on $k$ vertices, to estimate the number of copies of $h$ in $G$, denoted by $s=\s(h,G)$, with a multiplicative error of $\epsilon$,
\begin{itemize}
\item For subgraph sampling, the optimal sampling ratio $p$ is $\Theta_{k}\ppth{\max\ssth{ \ppth{s\epsilon^2}^{-\frac{1}{k}}, \; \frac{d^{k-1}}{s\epsilon^{2}} }}$, which only depends on the size of the motif but \emph{not} its actual topology. Furthermore, we show that Horvitz-Thompson type estimators are universally optimal for any connected motifs.

\item For neighborhood sampling, we propose a family of estimators, encompassing and outperforming the Horvitz-Thompson estimator and achieve the sampling ratio $O_{k}\ppth{\min\ssth{ \ppth{\frac{d}{s\epsilon^2}}^{\frac{1}{k-1}}, \; \sqrt{\frac{d^{k-2}}{s\epsilon^2}}}}$, which again only depends on the size of $h$. This is shown to be optimal for all motifs up to $4$ vertices and cliques of all sizes.
\end{itemize}
The matching minimax lower bounds are established using certain algebraic properties of subgraph counts. These theoretical results allow us to quantify how much more informative neighborhood sampling is than subgraph sampling, as empirically verified by experiments on synthetic and real-world data. We also address the issue of adaptation to the unknown maximum degree, and study specific problems for parent graphs with additional structures, e.g., trees or planar graphs.

Breaking the $1/\sqrt{n}$ Barrier: Faster Rates for Permutation-based Models in Polynomial Time

ABSTRACT. Many applications, including rank aggregation and crowd-labeling, can be modeled in terms of a bivariate isotonic matrix with unknown permutations acting on its rows and columns. We consider the problem of estimating such a matrix based on noisy observations of a subset of its entries, and design and analyze polynomial-time algorithms that improve upon the state of the art. In particular, our results imply that any such $n \times n$ matrix can be estimated efficiently in the normalized Frobenius norm at rate $\widetilde O(n^{-3/4})$, thus narrowing the gap between $\widetilde O(n^{-1})$ and $\widetilde O(n^{-1/2})$, which were hitherto the rates of the most statistically and computationally efficient methods, respectively.

Geometric Lower Bounds for Distributed Parameter Estimation under Communication Constraints

ABSTRACT. We consider parameter estimation in distributed networks, where each node in the network observes an independent sample from an underlying distribution and has $k$ bits to communicate its sample to a centralized processor which computes an estimate of a desired parameter of the distribution. We develop lower bounds for the minimax risk of estimating the underlying parameter under squared $\ell_2$ loss for a large class of distributions. Our results show that under mild regularity conditions, the communication constraint reduces the effective sample size by a factor of $d$ when $k$ is small, where $d$ is the dimension of the estimated parameter. Furthermore, this penalty reduces at most exponentially with increasing $k$, which is the case for some models, e.g., estimating high-dimensional distributions. For other models however, we show that the sample size reduction is re-mediated only linearly with increasing $k$ when some sub-Gaussian structure is available. We apply our results to the distributed setting with product Bernoulli model, multinomial model, and both dense/sparse Gaussian location models which recover or strengthen existing results.

Our approach significantly deviates from existing approaches for developing information-theoretic lower bounds for communication-efficient estimation. We circumvent the need for strong data processing inequalities used in prior work and develop a geometric approach which builds on a new representation of the communication constraint. This approach allows us to strengthen and generalize existing results with tight logarithmic factors, as well as simpler and more transparent proofs.

ABSTRACT. In reinforcement learning (RL), problems with long planning horizons are perceived as very challenging. The recent advances in PAC RL, however, show that the sample complexity of RL does not depend on planning horizon except at a superficial level. How can we explain such a difference? In this document, we argue that the technical assumptions in the upper bounds might have hidden away the challenges of long horizons. We therefore ask the question of whether one can prove a lower bound with horizon dependence when such assumptions are removed, and provide a few observations on the desired characteristics of the lower bound construction.

ABSTRACT. We ask whether there exists an efficient unsupervised learning algorithm for mixture of Gaussians in the over-complete case (number of mixtures is larger than the dimension). The notion of learning is taken to be worst-case compression-based, to allow for improper
learning.

Mobile Health: Adventures in Sequential Experimentation and Reinforcement Learning!

ABSTRACT. The field of mobile health or digital health provides an exciting and challenging testbed for sequential experimentation and reinforcement learning.
This is particularly the case when the goal is develop mobile health interventions for individuals suffering from chronic, relapsing disorders.
Some of the challenges include treatment actions that tend to have positive effects on immediate rewards but with great potential for negative impact on future rewards, non-stationary reward functions, immature domain science concerning the system dynamics, complex data arriving at different time scales, multiple classes of treatment actions occurring at differing time scales and targeted at different immediate rewards. The most natural immediate rewards may be delayed. Expensive data collection means that data resulting from sequential experimentation & use of RL must allow for a variety of subsequent causal inference analyses.
We discuss some of our solutions including sequences of clinical trials, use of prediction algorithms to ensure constraints on the burdensome treatment actions yet still permit learning, sequential causal learning on each individual in a trial and batch learning after a clinical trial is over. Throughout we illustrate the issues with completed/ in progress or in planning clinical trials.

An Estimate Sequence for Geodesically Convex Optimization

ABSTRACT. In this paper, we propose a Riemannian version of Nesterov's Accelerated Gradient algorithm (R-AGD), and show that for geodesically smooth and strongly convex problems, within a neighborhood whose radius depends on the condition number as well as the sectional curvature of the manifold, R-AGD converges to the minimum with acceleration. Unlike the algorithm in (Liu et al., 2017) which requires the exact solution to a nonlinear equation which in turn may be intractable, our algorithm is constructive and efficiently implementable. Our proof exploits a new estimate sequence and a novel bound on the nonlinear metric distortion, both ideas may be of independent interest.

Averaged Stochastic Gradient Descent on Riemannian Manifolds

ABSTRACT. We consider the minimization of a function defined on a Riemannian manifold M only accessible through unbiased estimates of its gradients. We develop a geometric framework to transform a sequence of slowly converging iterates generated from stochastic gradient descent (SGD) on M to an averaged iterate sequence with a robust and fast O(1/n) convergence rate. We then present an application of our framework to geodesically-strongly-convex (and possibly Euclidean non-convex) problems. Finally, we show how these ideas apply to the case of streaming k-PCA, where we show how to accelerate the slow rate of the randomized power method (without requiring knowledge of the eigengap) into a robust algorithm achieving the optimal rate of convergence.

Accelerating Stochastic Gradient Descent for Least Squares Regression

ABSTRACT. There is widespread sentiment that fast gradient methods (e.g. Nesterov's acceleration, conjugate gradient, heavy ball) are not effective for the purposes of stochastic optimization due to their instability and error accumulation. Numerous works have attempted to quantify these instabilities in the face of either statistical or non-statistical errors [Paige71, Proakis74, Polyak87, Greenbaum89, Roy and Shynk90, Sharma et al. 98, dAspremont08, Devolder et al. 13,14, Yuan et al. 16]. This work considers these issues for the special case of stochastic approximation for the least squares regression problem, and our main result refutes this conventional wisdom by showing that acceleration can be made robust to statistical errors. In particular, this work introduces an accelerated stochastic gradient method that provably achieves the minimax optimal statistical risk faster than stochastic gradient descent. Critical to the analysis is a sharp characterization of accelerated stochastic gradient descent as a stochastic process. We hope this characterization gives insights towards the broader question of designing simple and effective accelerated stochastic methods for more general convex and non-convex optimization problems.

Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

ABSTRACT. Nesterov's accelerated gradient descent (AGD), an instance of the general family of ``momentum methods,'' provably achieves faster convergence rate than gradient descent (GD) in the convex setting. While these methods are widely used in modern nonconvex applications, including training of deep neural networks, whether they are provably superior to GD in the nonconvex setting remains open. This paper studies a simple variant of Nesterov's AGD, and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, matching the best known convergence rate, which is faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first direct acceleration (single-loop) algorithm that is provably faster than GD in general nonconvex setting---all previous nonconvex accelerated algorithms rely on more complex mechanisms such as nested loops and proximal terms. Our analysis is based on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases on each step even for nonconvex functions, and (2) a novel framework called improve or localize, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.

Exponential convergence of testing error for stochastic gradient methods

ABSTRACT. We consider binary classification problems with positive definite kernels and square loss, and study the convergence rates of stochastic gradient methods. We show that while the excess testing loss (squared loss) converges slowly to zero as the number of observations (and thus iterations) goes to infinity, the testing error (classification error) converges exponentially fast if low-noise conditions are assumed. To achieve these rates of convergence we show sharper high-probability bounds with respect to the number of observations for stochastic gradient descent.

Cutting plane methods can be extended into nonconvex optimization

ABSTRACT. We show that it is possible to obtain an $O(\epsilon^{-4/3})$ rate (including computational cost) for finding stationary points of nonconvex functions using cutting plane methods. This improves on the best known rate achieved by cubic regularization of $O(\epsilon^{-3/2})$. Our techniques utilize the convex until proven guilty principle proposed by Carmon, Duchi, Hinder, and Sidford.

Online Variance Reduction for Stochastic Optimization

ABSTRACT. Modern stochastic optimization methods often rely on uniform sampling which is agnostic to the underlying characteristics of the data. This might degrade the convergence by yielding estimates that suffer from a high variance. A possible remedy is to employ non-uniform importance sampling techniques, which take the structure of the dataset into account. In this work, we investigate a recently proposed setting which poses variance reduction as an online optimization problem with bandit feedback. We devise a novel and efficient algorithm for this setting that finds a sequence of importance sampling distributions competitive with the best fixed distribution in hindsight, the first result of this kind. While we present our method for sampling datapoints, it naturally extends to selecting coordinates or even blocks of thereof. Empirical validations underline the benefits of our method in several settings.

Minimax Bounds on Stochastic Batched Convex Optimization

ABSTRACT. We study the stochastic batched convex optimization problem, in which we use many \emph{parallel} observations to optimize a convex function given limited rounds of interaction. In each of M rounds, an algorithm may query for information at n points, and after issuing all n queries, it receives unbiased noisy function and/or (sub)gradient evaluations at the n points. After M such rounds, the algorithm must output an estimator. We provide lower and upper bounds on the performance of such batched convex optimization algorithms in zeroth and first-order settings for the collections of Lipschitz convex and smooth strongly convex
functions. The rates we provide exhibit two interesting phenomena: (1) in terms of the batch size n, the rate of convergence of batched algorithms (nearly) achieves the conventional fully sequential rate once M = O(d log log n), where d is the dimension of the domain, while (2) the rate may exponentially degrade as the dimension d increases, in distinction from fully sequential settings.

Iterate Averaging as Regularization for Stochastic Gradient Descent

ABSTRACT. We propose and analyze a variant of the classic Polyak--Ruppert averaging scheme, broadly used in stochastic gradient methods. Rather than a uniform average of the iterates, we consider a weighted average, with weights decaying in a geometric fashion. In the context of linear least squares regression, we show that this averaging scheme has a the same regularizing effect, and indeed is asymptotically equivalent, to ridge regression. In particular, we derive-finite sample bounds for the proposed approach that match the best known results for regularized stochastic gradient methods.